CSE256 Assignment 3: Language Modeling
Yi Rong <hi@rongyi.ai>
April 27, 2022
1 Introduction
In this report, we discuss the various ways of building probabilistic language models, specif-
ically N-grams. Using the given corpora from three different domains, we first evaluate
the reference Unigram implementation provided in the starter code in Section 2. Then we
propose our Trigram approach in Section 3, with the implementation details explained in
Section 3.1. We show that our approach outperforms the Unigram baseline in almost every
performance metric in Sections 3.2 and 3.3. We then explore the possibility of adapting
our language model from one corpus to another, demonstrating a significant improvement
in perplexity in Section 4. Finally, we conclude the report in Section 5.
2 Unigram Language Model
2.1 Analysis on In-Domain Text
Table 1: Unigram In-Domain Perplexity
Perplexity brown reuters gutenberg
Train 1513.80 1471.21 982.57
Dev 1589.39 1479.09 991.50
Test 1604.20 1500.69 1005.79
The resulting perplexity matrix is shown in Table 1. We can see in the table that the
perplexities of the models trained on the Brown and Reuters datasets are similar, while
that of Gutenberg is significantly lower. We notice that the brown and reuters corpora
have similar numbers of training samples and similar vocabulary sizes, but the gutenberg
training set is almost twice as large. Therefore we plot perplexity against the percentage
of training data used in the figure below.
[Figure: perplexity vs. percentage of training data used, for the brown, reuters, and gutenberg models]
Comparing horizontally, we can see that as the training data size increases, the model
perplexity decreases, as expected, since the model benefits from more training data.
Comparing vertically, when all models use 100% of their datasets, the model trained on
brown has the highest perplexity while the one trained on gutenberg has the lowest. The
difference comes from the original dataset sizes: brown has only 39802 training samples,
while gutenberg has 68767.
We can therefore reach a weak conclusion: using more training data usually leads to lower
perplexity, and this trend holds even across the three datasets we use.
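Concretely, the perplexity numbers reported above can be computed as the exponential of the average negative log-probability per token. The sketch below is illustrative only (the `perplexity` helper and the toy uniform model are ours, not the starter code):

```python
import math

def perplexity(logprob, corpus):
    """exp of the average negative log-probability per token."""
    total, n = 0.0, 0
    for sentence in corpus:
        for i, word in enumerate(sentence):
            total += logprob(word, sentence[:i])
            n += 1
    return math.exp(-total / n)

# Toy model: uniform distribution over a 4-word vocabulary
uniform = lambda word, context: math.log(1 / 4)
ppl = perplexity(uniform, [["a", "b", "c", "d"], ["a", "c"]])
# A uniform model over V words always has perplexity V, here 4.0
```

This also makes the trend above intuitive: more training data sharpens the estimated distribution, raising the average log-probability of held-out tokens and lowering perplexity.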
2.2 Analysis on Out-of-Domain Text
Table 2: Unigram Out-of-Domain Perplexity
Perplexity Train Dev Test
brown reuters gutenberg brown reuters gutenberg brown reuters gutenberg
brown 1513.8 6780.82 1758.06 1589.39 6675.63 1739.41 1604.2 6736.6 1762.01
reuters 3806.39 1471.21 4882.8 3808.87 1479.09 4833.88 3865.16 1500.69 4887.47
gutenberg 2616.57 12420.1 982.572 2604.28 12256.3 991.5 2626.05 12392.5 1005.79
The out-of-domain perplexity for the Unigram model is shown in Table 2. Across the train,
dev, and test perplexities, we can see a trend that the values on the top-left to
bottom-right diagonal are far lower than everywhere else. That is because the diagonal
represents the results of a model evaluated on the dataset it was trained on. Although the
model has not seen the test data, the data come from the same domain, and generalizing
in-domain is relatively easy.
The off-diagonal numbers represent the performance of a model trained on dataset B and
evaluated on dataset A. Since these datasets have distinct backgrounds and come from
different eras (modern American English, financial news, and classic literature), it is
extremely difficult for the model to generalize from one dataset to another. Therefore
those numbers are much higher than the diagonal ones.
3 Context-aware Language Model
3.1 N-Gram Implementation
For the Context-aware Language Model experiments, we implemented a generic N-gram
model with three optimizations:
Interpolation: the probability estimates from the N-gram down to the unigram are mixed
with weights (3.1.1), and the weights λ_1, ..., λ_N are dynamically tuned using the EM Algorithm (3.1.2).
Smoothing: instead of using Laplace Smoothing (add-1), we add a hyperparameter k
and implement Add-k Smoothing, with k tuned on the dev set (3.1.3).
Low-frequency cut-off: given a parameter min_freq, we remove all rare items from the
vocabulary and treat them as "UNK".
In our N-gram implementation, three of the hyper-parameters, the maximum N-gram length
N, the minimum word frequency min_freq, and the smoothing fractional counts (k_1, ..., k_N), are set
using Grid Search (3.2.2), and the linear interpolation weights (λ_1, ..., λ_N) are tuned with the
EM Algorithm.
We finally end up with a Trigram (N = 3) approach, with k = {0, 0.01, 0.001} and
min_freq = 4.
3.1.1 Interpolation
In our N-gram model, we sometimes do not have enough samples to compute the probability
of an n-gram, but we can instead estimate it using the (n-1)-gram probability. In other words,
there are times when using less context helps the model generalize better on less frequent
contexts. There are two common ways to do this: backoff and interpolation.
Taking the trigram as an example: in backoff, we use the trigram if the evidence is sufficient,
otherwise we fall back to the bigram, and then to the unigram. In interpolation, we always
compute the probability estimates for the trigram, bigram, and unigram, and then take their
weighted arithmetic mean as the final probability [3].
We chose to use simple Linear Interpolation, estimating the trigram probability
P(w_n | w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each
weighted by a λ_i:

P(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n | w_{n-2} w_{n-1})    (1)

The λ_i must sum to 1, making Eq. 1 equivalent to a weighted average:

λ_1 + λ_2 + λ_3 = 1

The interpolation process takes an optional parameter for the λ array, and uses the EM Algorithm
described below to further iterate and improve the λ values.
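Eq. 1 can be sketched as a small helper (the function name and toy values are illustrative, not our actual implementation):

```python
def interp_prob(p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation: l1*P(w) + l2*P(w|w-1) + l3*P(w|w-2,w-1)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1 (Eq. 1)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy component estimates for one word, mixed with weights (0.2, 0.3, 0.5):
p = interp_prob(0.1, 0.2, 0.5, lambdas=(0.2, 0.3, 0.5))
# 0.2*0.1 + 0.3*0.2 + 0.5*0.5 = 0.33
```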
3.1.2 EM Algorithm
The Expectation-Maximization (EM) Algorithm [2] is used to tune the optimal weight array
for N-gram probability computation. The process can be described as follows:
1. Initialize the λ weights randomly or from the given initial values, ensuring Σ_i λ_i = 1.
2. E-Step: calculate the posterior probabilities under the current model weights. This con-
structs a lower bound of the objective function at the current weights.
3. M-Step: update the model weights using the posterior probabilities from the E-step. This
finds the updated weights that maximize the new lower bound.
4. Repeat steps 2 (E-Step) and 3 (M-Step) until we converge to a maximum of the objective function.
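The steps above can be sketched as follows, assuming the unigram/bigram/trigram estimates for each held-out token are precomputed (a simplified illustration, not our exact code):

```python
def em_tune_lambdas(component_probs, lambdas, n_iters=50):
    """Tune interpolation weights on held-out data with EM.
    component_probs: one (p_uni, p_bi, p_tri) tuple per held-out token."""
    k = len(lambdas)
    for _ in range(n_iters):
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            if mix == 0.0:
                continue
            # E-step: posterior responsibility of each component for this token
            for i in range(k):
                totals[i] += lambdas[i] * probs[i] / mix
        # M-step: new weights proportional to total responsibility
        z = sum(totals)
        lambdas = [t / z for t in totals]
    return lambdas

# Toy held-out data where the trigram estimate is always the best:
held_out = [(0.1, 0.2, 0.7)] * 20
lam = em_tune_lambdas(held_out, [1 / 3, 1 / 3, 1 / 3])
# EM shifts almost all weight onto the trigram component
```

Each iteration provably does not decrease the held-out likelihood, which is why the loop can simply run until the weights stop changing.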
3.1.3 Additive Smoothing
To keep a language model from assigning zero probability to unseen events (while still making
sure the probabilities sum to one), we have to shave some probability mass off the more
frequent events and give it to the less frequent or never-seen ones [3]. One option is
Laplace Smoothing (a.k.a. add-one), as introduced in class, which changes the unigram
probability P(w_i) = c_i / N to:

P(w_i) = (c_i + 1) / (N + V)

In the case of the bigram, the conditional probability can be written as:

P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / Σ_w (C(w_{n-1} w) + 1) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

We instead opt for Add-k Smoothing [1], which adds a fractional count k. This
smoothing method is more flexible, as Laplace smoothing is just the special case k = 1,
and it is more configurable, enabling us to search for an optimal set of k values with Grid Search.
The conditional probability for the bigram is now:

P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + k) / (C(w_{n-1}) + kV)

In our N-gram implementation, the k_i for each i-gram is individually assigned and tuned.
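The add-k formula for the bigram case can be sketched as follows (a toy illustration; the helper name and counts are ours):

```python
from collections import Counter

def addk_bigram_prob(w_prev, w, bigram_counts, context_counts, k, V):
    """P(w | w_prev) = (C(w_prev, w) + k) / (C(w_prev) + k * V)"""
    return (bigram_counts[(w_prev, w)] + k) / (context_counts[w_prev] + k * V)

tokens = ["a", "b", "a", "b", "a", "c"]
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word appears as a context
V = len(set(tokens))                   # vocabulary size, here 3

p_seen = addk_bigram_prob("a", "b", bigram_counts, context_counts, k=0.01, V=V)
p_unseen = addk_bigram_prob("a", "a", bigram_counts, context_counts, k=0.01, V=V)
# p_seen = (2 + 0.01) / (3 + 0.03); p_unseen is small but nonzero
```

Note how the unseen bigram ("a", "a") still receives a small but nonzero probability, which is exactly what keeps the perplexity finite on held-out text.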
3.2 Analysis of In-Domain Text
3.2.1 Evaluation
Table 3: Trigram In-Domain Perplexity
Perplexity brown reuters gutenberg
Train 142.534 47.103 81.356
Dev 719.461 229.614 352.723
Test 728.000 225.850 340.193
Unigram Test 1604.20 1500.69 1005.79
The perplexity of our Trigram implementation, together with the test perplexity of the
Unigram model, is shown in Table 3. Our approach significantly outperforms the Unigram
baseline on every dataset in terms of test perplexity, by up to 6.6x.
3.2.2 Hyper-Parameter Grid Search
The grid search for optimal hyper-parameters is performed first on N and min_freq. After
determining the best N, we then search for the k array for Add-k smoothing. The
candidates for N are {2, 3} and for min_freq are {2, 3, 4}. k_1 is always zero, so we search k_i
from i = 2 up to i = N.
Due to the page limit for this report, we put the raw grid-search data in a table in
Appendix A.2. The best hyperparameters after the search are N = 3 and min_freq = 4, with
k = {0, 0.01, 0.001}
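The search loop can be sketched as below, assuming a hypothetical `train_and_eval(N, min_freq, ks)` helper that trains a model and returns its dev perplexity (the toy stand-in here is ours, purely for illustration):

```python
import itertools

def grid_search(train_and_eval, Ns, min_freqs, k_grid):
    """Exhaustively search N, min_freq, and the per-order k values."""
    best = (float("inf"), None)
    for N, mf in itertools.product(Ns, min_freqs):
        # k_1 is fixed at 0; search k_2..k_N over the candidate values
        for ks in itertools.product(k_grid, repeat=N - 1):
            ppl = train_and_eval(N, mf, (0,) + ks)
            best = min(best, (ppl, (N, mf, (0,) + ks)))
    return best

# Toy stand-in for the real train/eval loop, minimized at N=3, min_freq=4, ks=0:
fake_eval = lambda N, mf, ks: abs(N - 3) + abs(mf - 4) + sum(ks)
best_ppl, best_cfg = grid_search(fake_eval, [2, 3], [2, 3, 4], [0.0, 0.01])
```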
3.2.3 Examples of Sampled Sentences
Prefix: “I like”
brown: I like pear He climbed as are testimony score
reuters: I like Westminster merged 607 AG CSR exports is mln vs 110 UNION
staged Consultants Celanese Circle Journal McGroarty 420 95 tendering
closes Spie extension wind Frazier M0 rose broken Avery Bhd TWA said
gutenberg: I like intelligent looking have been building tides hail
sea Gresham upon greatness Letters clerks no fears churches
Although the generated sentences above are far from grammatically correct, we can
still see the stylistic differences resulting from the different backgrounds of the three
datasets. The first uses words close to spoken English, and the second has many numbers
and unit notations often found in financial reports. The last one, although not
comprehensible, has some literary vibe to it.
Prefix: none & Prefix: “Make sure your”: The generated sentences for these two cases
are quite long, and thus moved to Appendix A.1.
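Sampling such sentences from a trigram model can be sketched as follows (the `next_word_dist` interface and the `</s>` end token are assumptions for illustration, not our actual code):

```python
import random

def sample_sentence(next_word_dist, prefix, max_len=20, seed=0):
    """next_word_dist(context) -> {word: prob} is the assumed model interface."""
    rng = random.Random(seed)
    words = list(prefix)
    while len(words) < max_len:
        dist = next_word_dist(tuple(words[-2:]))  # trigram: last two words as context
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":  # hypothetical end-of-sentence token
            break
        words.append(word)
    return " ".join(words)

# Degenerate toy model that always predicts "x":
sent = sample_sentence(lambda ctx: {"x": 1.0}, ["I", "like"], max_len=5)
# → "I like x x x"
```

Because each word is drawn from a distribution conditioned on only the previous two words, long-range grammaticality is not enforced, which explains the incoherence of the samples above.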
3.3 Analysis of Out-of-Domain Text
3.3.1 Evaluation
Table 4: Trigram Out-of-Domain Perplexity
Perplexity Train Dev Test
brown reuters gutenberg brown reuters gutenberg brown reuters gutenberg
brown 142.534 4096.83 1185.61 719.462 3411.92 1080.43 728 3441.48 1075.77
reuters 2871.67 47.1033 2468.61 4288.43 229.614 4833.88 2514.96 225.85 4262.99
gutenberg 1708.57 9951.86 81.3565 1498.22 7796.27 352.723 1498.11 7886.72 340.194
We present the out-of-domain perplexity results for our Trigram model in Table 4, and
compare the test perplexity matrix against the Unigram results in Table 5.
Table 5: Trigram vs Unigram Out-of-Domain Test Perplexity
Perplexity Trigram Test Unigram Test
brown reuters gutenberg brown reuters gutenberg
brown 728 3441.48 1075.77 1604.2 6736.6 1762.01
reuters 2514.96 225.85 4262.99 3865.16 1500.69 4887.47
gutenberg 1498.11 7886.72 340.194 2626.05 12392.5 1005.79
As shown in Table 5, the diagonal holds the in-domain performance discussed in
Section 3.2. Excluding those numbers, we focus on the upper and lower triangular areas,
where the model was tested on a different dataset from the one it was trained on: our
Trigram model still manages to outperform the Unigram in all train-test dataset
combinations. The performance gain comes mainly from the several optimizations we added
to the N-gram. Interpolation gives our model more context to work with than the Unigram,
while retaining the flexibility to weight the estimates from the N-gram down to the
unigram. The EM Algorithm further increases the advantage by dynamically tuning the λ
array on the dev set. And Add-k smoothing allows more configurability and room for
optimization during the hyper-parameter Grid Search.
4 Adaptation
We use the Brown dataset as corpus A and the Reuters dataset as corpus B, and train our
model from Section 3 on the full training set of brown (A) plus N% of reuters (B)'s
training set. The results are shown in Table 6.
Table 6: Adaptation: brown + reuters
Perplexity Train Dataset
Test Dataset brown(b) reuters(r) b + 10% r b + 50% r b + 75% r b + r
brown 728 3441.48 749.37 830.99 871.49 907.07
reuters 2514.96 225.85 809.35 389.81 315.62 270.94
As shown in the table, the model trained on brown performs poorly on reuters; however, if
we add 10% of the reuters data to the training set, the perplexity drops significantly.
As we increase the percentage of reuters data added to the training set, the model's
performance on the reuters dataset approaches that of a model trained entirely on reuters.
However, even when we add the entire reuters training set, the performance does not reach
that of a model trained solely on reuters data, and models trained on hybrid training data
perform worse on the original dataset A.
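The corpus-mixing setup above can be sketched as follows (a minimal illustration; the helper and variable names are ours):

```python
import random

def mix_corpora(corpus_a, corpus_b, pct_b, seed=0):
    """All of corpus A plus a random pct_b% sample of corpus B's sentences."""
    rng = random.Random(seed)
    n_b = round(len(corpus_b) * pct_b / 100)
    return list(corpus_a) + rng.sample(list(corpus_b), n_b)

train_a = [["brown", "sentence"]] * 10
train_b = [["reuters", "sentence"]] * 20
mixed = mix_corpora(train_a, train_b, pct_b=10)  # 10 A sentences + 2 B sentences
```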
5 Conclusion
In this report, we implement a probabilistic N-gram language model, and manage
to outperform the Unigram baseline in all our experiments. In our experience, a Trigram
with Add-k smoothing and interpolation weights tuned with the EM Algorithm works best on
the three datasets we have.
References
[1] William Gale and Kenneth Church. 1994. What's wrong with adding one. Corpus-based
research into language: In honour of Jan Aarts (1994), 189–200.
[2] Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse
data. Pattern Recognition in Practice (1980).
[3] Dan Jurafsky and James H. Martin. 2019. Speech and Language Processing (3rd ed. draft).
A Appendix
Due to the page and space limits, some of the raw data are placed here in the Appendix.
A.1 Examples of generated sentences
Prefix: “Make sure your”
brown: Make sure your very low tension of them to an original unlike
stone asks checked tucked deluded matched Salvation of postwar the pool
speeding dictionary louder urban imposed him same congregation was cherish
Wednesday But bride Plan hours
reuters: Make sure your takes 99 Side securitisation Liberty Orange Heymann
protein et turmoil BAC to Congress in Paris stake KK for idea annualized
641 Princeville rally stock
gutenberg: Make sure your sin dawning Mink this conduct specimens argument
shone of the frenzy Growing Adventures stared tooke Mr Harvey soles in
man sitting grant Exmoor instant quadrant cavil Pindarus hatest and conceits
tones pictures baleen company expectations returneth spouted mockers
inexhaustible Ananias letters shoulder harshly Hartfield avoiding though
houghed Island rheumatic rot his mouth is too explain the Tyre rake appellation
candlestick mornings actors loue unsuitable trees 67 vase Sheshan Huzza
Morton organs disordered Madam and enormous Ho
Prefix: none
brown: The their for experience flowers yes Del chord formation young
keep theatrical whiskey Cultural kicked way with college girl maintaining
managers Kulturbund returning to grass patrons Germans the system of
self awareness not brings ago
reuters: He said in Denmark the European Community relations Gleason
1989 that growing and 192 203 preliminary contacts Seven Canamax sought
submit Autumn companies adverse Cattlemen matching measure seeds platforms
Algemene GRAIN Sheehy Association 616 strive Fats RMJ foster 332 rated
Surinam Ekofisk SOUTHWEST alone overhang TOK Garcia actually CRZY oilseeds
exclusively HUNGARY Also equities NON Lane availability accordance relocate
Packing proxy contest retroactive Sara unless WILLIAMS NOTES comparative
retaliate ISSUE Representatives dwt Guard special rainy REMARKS Massachusetts
poultry fish Walker ingot Scan HEATING Mercer widely SUBROTO Company
644 684 gold properties REPH Versatile NIL mln dlrs vs 25 private 491
gutenberg: the ark of man whose walketh father and she passeth warmed
model blasphemeth of glory and spake to Frederick build for judged in
reaping thou that knoweth quivered godly deep scented economy tares chalk
east guile is fickle hairs confused blankets announced moored Hare tambourine
conviction wish me knock he was to go Jehoram not go that number and
public enquiries proceeding lets progressive grinning scolded jury permanent
bason shewing they prophesy labourers dates times understanding Baron
Achilles nor generation obscured Norman permitted hands at the first
will set not Wentworth his favour such sheep humped depth engaged rearing
Western involve
A.2 Grid Search Data
Table 7: Grid Search Perplexity on brown (rows: N; columns: min_freq)
Perplexity min_freq = 2 min_freq = 3 min_freq = 4
N = 2 703.5361 678.4425 666.3012
N = 3 682.6731 653.8977 639.1617