CSE256 Assignment 3: Language Modeling
Yi Rong <hi@rongyi.ai>
April 27, 2022
1 Introduction
In this report, we discuss various ways of building probabilistic language models, specif-
ically N-grams. Using the given corpora from three different domains, we first evaluate
the reference Unigram implementation provided in the starter code in Section 2. We then
propose our Trigram approach in Section 3, with the implementation details explained in
Section 3.1. We show that our approach outperforms the Unigram baseline on almost every
performance metric in Sections 3.2 and 3.3. We then explore the possibility of adapting
our language model from one corpus to another and demonstrate a significant improvement
in perplexity in Section 4. Finally, we conclude our report in Section 5.
2 Unigram Language Model
2.1 Analysis on In-Domain Text
Table 1: Unigram In-Domain Perplexity
Perplexity brown reuters gutenberg
Train 1513.80 1471.21 982.57
Dev 1589.39 1479.09 991.50
Test 1604.20 1500.69 1005.79
The resulting perplexities are shown in Table 1. We can see in the table that the
perplexities of the models trained on the Brown and Reuters datasets are similar,
while that of the Gutenberg model is significantly lower. We notice that the brown and
reuters corpora have a similar number of training samples and similar vocabulary sizes,
but the gutenberg training set is almost twice as large. We therefore plot perplexity
against the percentage of training data used in the figure below.
[Figure: Perplexity vs. Training Data %, for brown, reuters, and gutenberg]
Comparing horizontally, we can see that as the training data size increases, model
perplexity decreases as expected, since the model benefits from more training data. Com-
paring vertically, when all models use 100% of their dataset, the model trained on
brown has the highest perplexity while the one trained on gutenberg has the lowest. The
difference comes from the size of the original datasets: brown has only 39802 training
samples, while gutenberg has 68767.
We can therefore reach a weak conclusion that using more training data usually leads
to lower perplexity, and this trend is observable even across the three datasets we use.
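For reference, perplexity here is the exponential of the negative average log-probability per token. Below is a minimal sketch of the evaluation loop, assuming a model object exposing a logprob_sentence method that returns the total natural-log probability of a tokenized sentence (the method name is illustrative, not the exact starter-code API):

    import math

    def corpus_perplexity(model, corpus):
        """Perplexity = exp(-average log probability per token) over a corpus.

        `corpus` is a list of tokenized sentences; `model.logprob_sentence`
        is assumed to return the total natural-log probability of a sentence,
        including the end-of-sentence token.
        """
        total_logprob = 0.0
        total_tokens = 0
        for sentence in corpus:
            total_logprob += model.logprob_sentence(sentence)
            total_tokens += len(sentence) + 1  # +1 for the end-of-sentence token
        return math.exp(-total_logprob / total_tokens)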
2.2 Analysis on Out-of-Domain Text
Table 2: Unigram Out-of-Domain Perplexity (rows: training corpus; columns: evaluation corpus, grouped by split)
                   Train                       Dev                         Test
Trained on   brown  reuters  gutenberg   brown  reuters  gutenberg   brown  reuters  gutenberg
brown 1513.8 6780.82 1758.06 1589.39 6675.63 1739.41 1604.2 6736.6 1762.01
reuters 3806.39 1471.21 4882.8 3808.87 1479.09 4833.88 3865.16 1500.69 4887.47
gutenberg 2616.57 12420.1 982.572 2604.28 12256.3 991.5 2626.05 12392.5 1005.79
The Out-of-Domain perplexity for the Unigram model is shown in Table 2. Across the train,
dev, and test perplexities, we can see that the values on the top-left to bottom-right
diagonal are far lower than everywhere else. That is because the diagonal represents the
results of each model evaluated on the dataset it was trained on. Although the model has
not seen the test data, the data come from the same domain, and generalizing in-domain is
relatively easier.
The numbers off the diagonal represent the performance on dataset A of a model trained on
dataset B. Since these datasets have distinct backgrounds and come from different periods
(modern-day American English, financial news, and classic literature), it is extremely
difficult for the model to generalize across two different datasets. Therefore those numbers
are much higher than the ones on the diagonal.
3 Context-aware Language Model
3.1 N-Gram Implementation
For the Context-aware Language Model experiments, we implemented a generic N-gram
model with three optimizations:
Interpolation: The probability estimates from the N-gram down to the unigram are mixed
and weighted (3.1.1), and the weights λ are dynamically tuned using the EM Algorithm
(3.1.2).
Smoothing: Instead of using Laplace Smoothing (add-1), we added a hyperparameter k
and implemented Add-k Smoothing, with k tuned on the dev set (3.1.3).
Low frequency cut-off: Taking a parameter min_freq, we remove all rare items from the
vocabulary and treat them as "UNK".
In our N-gram implementation, three of the hyper-parameters, namely the maximum N-gram length
$N$, the minimum word frequency min_freq, and the smoothing fractional counts $(k_1, \ldots, k_N)$,
are set using Grid Search (3.2.2), and the linear interpolation weights $(\lambda_1, \ldots, \lambda_N)$
are tuned with the EM Algorithm.
We finally settle on a Trigram ($N = 3$) approach, with $k_i = \{0, 0.01, 0.001\}$ and
min_freq = 4.
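To illustrate the low frequency cut-off and the counting stage, the sketch below shows one way to build the i-gram count tables for i = 1..N, mapping words that occur fewer than min_freq times to "UNK". The function name build_counts and the list-of-token-lists input format are our assumptions here, not the exact starter-code interface:

    from collections import Counter, defaultdict

    def build_counts(train_sents, N=3, min_freq=4):
        """Replace rare words with "UNK", then count every i-gram for i = 1..N."""
        word_freq = Counter(w for sent in train_sents for w in sent)
        vocab = {w for w, c in word_freq.items() if c >= min_freq}

        def normalize(sent):
            # Pad with start symbols so every position has N-1 words of context.
            return ["<s>"] * (N - 1) + [w if w in vocab else "UNK" for w in sent] + ["</s>"]

        counts = defaultdict(Counter)  # counts[i][gram] = frequency of that i-gram
        for sent in train_sents:
            tokens = normalize(sent)
            for i in range(1, N + 1):
                for j in range(len(tokens) - i + 1):
                    counts[i][tuple(tokens[j:j + i])] += 1
        return counts, vocab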
3.1.1 Interpolation
In our N-gram model, we sometimes do not have enough samples to compute the probability
of an n-gram, but we can instead estimate it using the (n-1)-gram probability. There may
therefore be times when we need to use less context, enabling the model to generalize better
on infrequent contexts. There are two common ways to do this: backoff and interpolation.
Taking trigrams as an example, in backoff we use the trigram if the evidence is sufficient,
otherwise we fall back to the bigram, and otherwise to the unigram. In interpolation, we
always compute all probability estimates for the trigram, bigram, and unigram, and then take
their weighted arithmetic mean to get the final probability [3].
We chose to use simple Linear Interpolation, estimating the trigram probability
$P(w_n \mid w_{n-2} w_{n-1})$ by mixing together the unigram, bigram, and trigram probabilities,
each weighted by a $\lambda$:

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1}) \qquad (1)$$
The $\lambda$ values must sum to 1, making Eq. 1 equivalent to a weighted average:

$$\sum_i \lambda_i = 1$$
The interpolation process takes an optional initial $\lambda$ array and uses the EM Algorithm
described below to iteratively improve the $\lambda$ values.
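A minimal sketch of Eq. 1 is shown below, assuming helper functions p_unigram, p_bigram, and p_trigram that return the (smoothed) component estimates; these names are illustrative rather than our exact implementation:

    def interpolated_trigram_prob(w, u, v, lambdas, p_unigram, p_bigram, p_trigram):
        """P_hat(w | u v) = l1 * P(w) + l2 * P(w | v) + l3 * P(w | u v),
        where (u, v) are the two preceding words and sum(lambdas) == 1."""
        l1, l2, l3 = lambdas
        return l1 * p_unigram(w) + l2 * p_bigram(w, v) + l3 * p_trigram(w, u, v)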
3.1.2 EM Algorithm
The Expectation-Maximization (EM) Algorithm [2] is used to tune the optimal weight array
for N-gram probability computation. The process can be described as follows:
1. Initialize the $\lambda$ weights randomly or from the given initial values, ensuring that $\sum_i \lambda_i = 1$.
2. E-Step: Calculate the posterior probabilities from current model weights. This con-
structs a lower bound of the objective function at the current weights.
3. M-Step: update model weights using posterior probabilities from E-step. This finds
the updated weights that maximize the new lower bound.
4. Repeat 2 (E-Step) and 3 (M-Step) until we converge to the maximum of the objective
function.
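In spirit, the loop looks like the sketch below, where p_funcs is assumed to be a list of component probability functions (unigram up to N-gram), each taking a token and its context, and dev_events a list of (token, context) pairs drawn from the dev set; both names are ours, not the exact implementation:

    def em_tune_lambdas(p_funcs, dev_events, init_lambdas, n_iters=20):
        """Tune the interpolation weights on held-out (token, context) pairs with EM."""
        lambdas = list(init_lambdas)
        for _ in range(n_iters):
            expected = [0.0] * len(lambdas)
            # E-step: posterior responsibility of each i-gram component per event.
            for w, context in dev_events:
                probs = [lam * p(w, context) for lam, p in zip(lambdas, p_funcs)]
                total = sum(probs)
                if total == 0.0:
                    continue
                for i, p_i in enumerate(probs):
                    expected[i] += p_i / total
            # M-step: re-normalize the expected counts into new weights.
            norm = sum(expected)
            if norm == 0.0:
                break
            lambdas = [e / norm for e in expected]
        return lambdas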
3.1.3 Additive Smoothing
To keep a language model from assigning zero probability to unseen events (while still making
sure the probabilities sum to one), we have to shave off some probability mass from the more
frequent events and give it to those seen less often or never [3]. We can choose to use
Laplace Smoothing (a.k.a. add-one), as introduced in class, by changing the unigram
probability $P(w_i) = \frac{c_i}{N}$ to:

$$P_{\mathrm{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$
In the case of bigrams, the conditional probability can be written as:

$$P_{\mathrm{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_w \left( C(w_{n-1} w) + 1 \right)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V} \qquad (2)$$
We instead opt for Add-k Smoothing [1], which adds a fractional count $k$. This method is
more flexible, as Laplace smoothing is just the special case $k = 1$, and it is more
configurable, enabling us to search for an optimal set of $k$ values with Grid Search.
The conditional probability for a bigram becomes:

$$P_{\mathrm{Add\text{-}k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}$$

In our N-gram implementation, the $k_i$ for each $i$-gram is individually assigned and tuned.
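Using the count tables from the sketch in Section 3.1, the add-k estimate for an i-gram can be computed roughly as follows (the function and argument names are ours):

    def addk_prob(counts, vocab_size, total_tokens, gram, k):
        """P_add-k(w | context) = (C(context, w) + k) / (C(context) + k * V).

        `gram` is a tuple (context..., w); for unigrams the denominator is the
        total token count. With k = 0 this reduces to the unsmoothed estimate.
        """
        i = len(gram)
        numer = counts[i][gram] + k
        if i == 1:
            denom = total_tokens + k * vocab_size
        else:
            denom = counts[i - 1][gram[:-1]] + k * vocab_size
        return numer / denom if denom > 0 else 0.0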
3.2 Analysis of In-Domain Text
3.2.1 Evaluation
Table 3: Trigram In-Domain Perplexity
Perplexity brown reuters gutenberg
Train 142.534 47.103 81.356
Dev 719.461 229.614 352.723
Test 728.000 225.850 340.193
Unigram Test 1604.20 1500.69 1005.79
The perplexity of our Trigram implementation, along with the test perplexity of the Unigram
baseline, is shown in Table 3. Our approach significantly outperforms the Unigram baseline
on every dataset, reducing test perplexity by up to 6.6x.
3.2.2 Hyper-Parameter Grid Search
The grid search for optimal hyper-parameters is performed first on N and min_freq. After
determining the best N, we then search for the $k_i$ array for Add-k smoothing. The
candidates for N are {2, 3} and for min_freq are {2, 3, 4}. $k_1$ is always zero, so we
search $k_i$ from i = 2 up to i = N.
Due to the page limit for this report, we put the raw grid search data in a table in
Appendix A.2. The best hyper-parameters after the search are N = 3 and min_freq = 4, with
k = {0, 0.01, 0.001}.
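The two-stage search can be sketched as below, assuming a train_and_eval(N, min_freq, ks) helper that trains the model with the given hyper-parameters and returns dev perplexity; the helper name and the candidate grid for the k values are illustrative assumptions:

    import itertools

    def grid_search(train_and_eval):
        """Stage 1: choose N and min_freq; stage 2: choose the add-k constants."""
        # Stage 1: search N and min_freq with default smoothing.
        best_N, best_mf = min(
            ((N, mf) for N in (2, 3) for mf in (2, 3, 4)),
            key=lambda cfg: train_and_eval(N=cfg[0], min_freq=cfg[1], ks=None),
        )
        # Stage 2: k_1 is fixed to 0; search k_2 .. k_N over a small candidate set.
        candidates = (0.001, 0.01, 0.1, 1.0)
        best_ks = min(
            ([0.0] + list(ks) for ks in itertools.product(candidates, repeat=best_N - 1)),
            key=lambda ks: train_and_eval(N=best_N, min_freq=best_mf, ks=ks),
        )
        return best_N, best_mf, best_ks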
3.2.3 Examples of Sampled Sentences
Prefix: “I like”
brown: I like pear He climbed as are testimony score
reuters: I like Westminster merged 607 AG CSR exports is mln vs 110 UNION
staged Consultants Celanese Circle Journal McGroarty 420 95 tendering
closes Spie extension wind Frazier M0 rose broken Avery Bhd TWA said
gutenberg: I like intelligent looking have been building tides hail
sea Gresham upon greatness Letters clerks no fears churches
Although the generated sentences above are far from grammatically correct, we can
still see the stylistic differences resulting from the different backgrounds of the three
datasets. The first uses words close to spoken English, the second contains many numbers
and unit notations often found in financial reports, and the last, although not
comprehensible, has a literary feel to it.
Prefix: none & Prefix: “Make sure your”: The generated sentences for these two cases
are quite long, and thus moved to Appendix A.1.
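Sampling works by walking the model forward from the prefix, drawing each next word from the interpolated trigram distribution until an end-of-sentence token or a length cap is reached. A minimal sketch, assuming a next_word_dist(context) helper that returns a word-to-probability dictionary for the given two-word context (a hypothetical name):

    import random

    def sample_sentence(next_word_dist, prefix=(), max_len=60):
        """Draw each next word from P(w | previous two words), starting from `prefix`."""
        tokens = ["<s>", "<s>", *prefix]
        while len(tokens) < max_len:
            dist = next_word_dist(tuple(tokens[-2:]))
            words, probs = zip(*dist.items())
            w = random.choices(words, weights=probs, k=1)[0]
            if w == "</s>":
                break
            tokens.append(w)
        return " ".join(tokens[2:])

A call like sample_sentence(next_word_dist, prefix=("I", "like")) would produce output in the style of the samples above.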
3.3 Analysis of Out-of-Domain Text
3.3.1 Evaluation
Table 4: Trigram Out-of-Domain Perplexity (rows: training corpus; columns: evaluation corpus, grouped by split)
                   Train                       Dev                         Test
Trained on   brown  reuters  gutenberg   brown  reuters  gutenberg   brown  reuters  gutenberg
brown 142.534 4096.83 1185.61 719.462 3411.92 1080.43 728 3441.48 1075.77
reuters 2871.67 47.1033 2468.61 229.614 4288.43 4833.88 2514.96 225.85 4262.99
gutenberg 1708.57 9951.86 81.3565 1498.22 7796.27 352.723 1498.11 7886.72 340.194
We present the Out-of-Domain perplexity results for our Trigram model in Table 4, and
compare the test perplexity matrix with the Unigram results in Table 5.
Table 5: Trigram vs Unigram Out-of-Domain Test Perplexity (rows: training corpus; columns: evaluation corpus)
                   Trigram Test                Unigram Test
Trained on   brown  reuters  gutenberg   brown  reuters  gutenberg
brown 728 3441.48 1075.77 1604.2 6736.6 1762.01
reuters 2514.96 225.85 4262.99 3865.16 1500.69 4887.47
gutenberg 1498.11 7886.72 340.194 2626.05 12392.5 1005.79
As shown in Table 5, the diagonal corresponds to the In-Domain performance discussed in
Section 3.2. Excluding those numbers, we focus on the upper and lower triangular areas,
where each model is tested on a dataset different from the one it was trained on: our
Trigram model still outperforms the Unigram in all train-test dataset combinations. The
performance gain comes mainly from the several optimizations we added to the N-gram model.
Interpolation gives our model more context to work with than the Unigram, while retaining
the flexibility to weight the N-gram down to unigram probabilities. The EM Algorithm further
increases this advantage by dynamically tuning the λ array on the dev set. And Add-k
smoothing provides more configurability and room for optimization during the hyper-parameter
Grid Search.
4 Adaptation
We use the Brown dataset as corpus A and the Reuters dataset as corpus B, and train our model
from Section 3 on the full training set of brown (A) plus N% of reuters (B)'s training set.
The results are shown in Table 6.
Table 6: Adaptation: brown + reuters
Perplexity Train Dataset
Test Dataset brown(b) reuters(r) b + 10% r b + 50% r b + 75% r b + r
brown 728 3441.48 749.37 830.99 871.49 907.07
reuters 2514.96 225.85 809.35 389.81 315.62 270.94
As shown in the table, the model trained only on brown performs poorly on reuters; however,
adding just 10% of the reuters training data makes the reuters perplexity drop significantly.
As we increase the percentage of reuters data added to the training set, the model's
performance on reuters approaches that of the model trained entirely on reuters. However,
even with the entire reuters training set added, the hybrid model does not reach the
perplexity of the model trained solely on reuters data, and models trained on hybrid
training data perform somewhat worse on the original corpus A.
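The adaptation experiment only changes what goes into the training set. A sketch of how the mixed training data could be assembled, assuming brown_train and reuters_train are lists of tokenized sentences; whether the N% subset is drawn randomly or taken in order is an implementation detail not specified in the report, and the sketch samples randomly:

    import random

    def mixed_training_set(brown_train, reuters_train, pct, seed=0):
        """Full corpus A (brown) plus pct% of corpus B (reuters)."""
        rng = random.Random(seed)
        n = int(len(reuters_train) * pct / 100)
        return list(brown_train) + rng.sample(reuters_train, n)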
5 Conclusion
In this report, we implement a probabilistic N-gram language model and manage to outperform
the Unigram baseline in all our experiments. In our experience, a Trigram model with Add-k
smoothing and interpolation weights tuned with the EM Algorithm works best on the three
datasets we have.
References
[1] William Gale and Kenneth Church. 1994. What’s wrong with adding one. Corpus-based
research into language: In honour of Jan Aarts (1994), 189–200.
[2] Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse
data. Pattern recognition in practice (1980).
[3] Dan Jurafsky and James H Martin. 2019. Speech and Language Processing (3rd (draft)
ed.).
A Appendix
Due to the page and space limit, some of the raw data are placed here in the Appendix.
A.1 Examples of generated sentences
Prefix: “Make sure your”
brown: Make sure your very low tension of them to an original unlike
stone asks checked tucked deluded matched Salvation of postwar the pool
speeding dictionary louder urban imposed him same congregation was cherish
Wednesday But bride Plan hours
reuters: Make sure your takes 99 Side securitisation Liberty Orange Heymann
protein et turmoil BAC to Congress in Paris stake KK for idea annualized
641 Princeville rally stock
gutenberg: Make sure your sin dawning Mink this conduct specimens argument
shone of the frenzy Growing Adventures stared tooke Mr Harvey soles in
man sitting grant Exmoor instant quadrant cavil Pindarus hatest and conceits
tones pictures baleen company expectations returneth spouted mockers
inexhaustible Ananias letters shoulder harshly Hartfield avoiding though
houghed Island rheumatic rot his mouth is too explain the Tyre rake appellation
candlestick mornings actors loue unsuitable trees 67 vase Sheshan Huzza
Morton organs disordered Madam and enormous Ho
Prefix: none
brown: The their for experience flowers yes Del chord formation young
keep theatrical whiskey Cultural kicked way with college girl maintaining
managers Kulturbund returning to grass patrons Germans the system of
self awareness not brings ago
reuters: He said in Denmark the European Community relations Gleason
1989 that growing and 192 203 preliminary contacts Seven Canamax sought
submit Autumn companies adverse Cattlemen matching measure seeds platforms
Algemene GRAIN Sheehy Association 616 strive Fats RMJ foster 332 rated
Surinam Ekofisk SOUTHWEST alone overhang TOK Garcia actually CRZY oilseeds
exclusively HUNGARY Also equities NON Lane availability accordance relocate
Packing proxy contest retroactive Sara unless WILLIAMS NOTES comparative
retaliate ISSUE Representatives dwt Guard special rainy REMARKS Massachusetts
poultry fish Walker ingot Scan HEATING Mercer widely SUBROTO Company
644 684 gold properties REPH Versatile NIL mln dlrs vs 25 private 491
NUI
gutenberg: the ark of man whose walketh father and she passeth warmed
model blasphemeth of glory and spake to Frederick build for judged in
reaping thou that knoweth quivered godly deep scented economy tares chalk
east guile is fickle hairs confused blankets announced moored Hare tambourine
conviction wish me knock he was to go Jehoram not go that number and
public enquiries proceeding lets progressive grinning scolded jury permanent
bason shewing they prophesy labourers dates times understanding Baron
Achilles nor generation obscured Norman permitted hands at the first
will set not Wentworth his favour such sheep humped depth engaged rearing
Western involve
A.2 Grid Search Data
Perplexity on brown   min_freq = 2   min_freq = 3   min_freq = 4
N = 2                 703.5361       678.4425       666.3012
N = 3                 682.6731       653.8977       639.1617