In this report, we discuss the various ways of building probabilistic language models, specifically N-grams. Using the given corpora from three different domains, we first evaluate the reference Unigram implementation provided in the starter code in Section 2. We then propose our Trigram approach in Section 3, with the implementation details explained in Section 3.1. In Sections 3.2 and 3.3, we show that our approach outperforms the Unigram baseline on almost every performance metric. We then explore the possibility of adapting our language model from one corpus to another, and demonstrate a significant improvement in perplexity in Section 4. Finally, we conclude our report in Section 5.
For the Content-aware Language Model experiments, we implemented a generic N-gram model with the following optimizations:
- Interpolation: The probability estimates from the N-gram down to the unigram are mixed and weighted (3.1.1), and the weights λ are dynamically tuned using the EM algorithm (3.1.2); see the interpolation/EM sketch after this list.
- Smoothing: Instead of using Laplace Smoothing (add-1), we added a hyper-parameter k and implemented Add-k Smoothing, with k being tuned on a dev set (3.1.3).
- Low-frequency cut-off: Taking a parameter min_freq, we remove all rare items from the vocabulary and treat them as “UNK” (see the vocabulary sketch after this list).
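To make the first two points concrete, the sketch below shows one way the interpolated add-k estimates and the EM re-estimation of the λ weights could fit together. This is a minimal illustration under assumed names, not our actual implementation: the count tables `unigram_counts`, `bigram_counts`, `trigram_counts`, the placeholders `vocab_size` and `total_tokens`, and the function names are all hypothetical.

```python
from collections import Counter

# Illustrative count tables; in the real model these are filled from the
# training corpus after the low-frequency cut-off.
unigram_counts = Counter()   # (w,)      -> count
bigram_counts = Counter()    # (v, w)    -> count
trigram_counts = Counter()   # (u, v, w) -> count
vocab_size = 1               # |V|, including the "UNK" token
total_tokens = 1             # total number of training tokens

def add_k_prob(ngram_count, context_count, k):
    """Add-k smoothed conditional estimate (3.1.3)."""
    return (ngram_count + k) / (context_count + k * vocab_size)

def component_probs(u, v, w, k):
    """Unigram, bigram, and trigram estimates for the same target word w."""
    return [
        add_k_prob(unigram_counts[(w,)], total_tokens, k),
        add_k_prob(bigram_counts[(v, w)], unigram_counts[(v,)], k),
        add_k_prob(trigram_counts[(u, v, w)], bigram_counts[(u, v)], k),
    ]

def interpolated_prob(u, v, w, lambdas, k):
    """P(w | u, v) as a lambda-weighted mixture of the three estimates (3.1.1)."""
    return sum(l * p for l, p in zip(lambdas, component_probs(u, v, w, k)))

def em_tune_lambdas(dev_trigrams, k, n_iters=20):
    """Re-estimate the lambda weights on a held-out dev set with EM (3.1.2)."""
    lambdas = [1.0 / 3] * 3                   # start from a uniform mixture
    for _ in range(n_iters):
        expected = [0.0, 0.0, 0.0]
        for u, v, w in dev_trigrams:
            ps = component_probs(u, v, w, k)
            mix = sum(l * p for l, p in zip(lambdas, ps))
            # E-step: posterior responsibility of each component for this token
            for i in range(3):
                expected[i] += lambdas[i] * ps[i] / mix
        # M-step: normalize the expected counts into the new weights
        total = sum(expected)
        lambdas = [e / total for e in expected]
    return lambdas
```

Because each EM iteration cannot decrease the dev-set likelihood of the mixture, a small fixed number of iterations is typically sufficient in this kind of setup.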
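The low-frequency cut-off can be sketched as a simple preprocessing step. Again, `build_vocab`, `unkify`, and the example value `min_freq=2` below are assumed for illustration rather than taken from the starter code.

```python
from collections import Counter

def build_vocab(train_tokens, min_freq):
    """Keep tokens seen at least min_freq times; add the "UNK" symbol."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_freq} | {"UNK"}

def unkify(tokens, vocab):
    """Map out-of-vocabulary tokens to "UNK" before counting N-grams."""
    return [w if w in vocab else "UNK" for w in tokens]

# Example usage (hypothetical):
#   vocab = build_vocab(train_tokens, min_freq=2)
#   train = unkify(train_tokens, vocab)  # then count N-grams on `train`
```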