The numbers off the diagonal represent the performance on dataset A of a model trained on dataset B. Since these datasets have distinct backgrounds and come from different periods and domains: modern-day American English, financial news, and classic literature, it is very difficult for the model to generalize from one dataset to another. Those numbers are therefore lower than the ones on the diagonal.
3 Context-aware Language Model
3.1 N-Gram Implementation
For the Context-aware Language Model experiments, we implemented a generic N-gram model with three optimizations (a minimal sketch follows the list):
• Interpolation: The probability estimates from the N-gram down to the unigram are mixed and weighted (3.1.1), and the weights λ are tuned dynamically using the EM Algorithm (3.1.2).
• Smoothing: Instead of using Laplace Smoothing (add-1), we added a hyperparameter k and implemented Add-k Smoothing, with k tuned on a dev set (3.1.3).
• Low-frequency cut-off: Given a parameter min_freq, we remove all rare items from the vocabulary and treat them as “UNK”.
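The sketch below illustrates how these pieces can fit together: counting with rare words mapped to “UNK” and add-k smoothed conditional probabilities. It is a minimal illustration only; the class name, token markers, and method signatures are our own assumptions, not the exact implementation used in the experiments.

```python
from collections import Counter

class NGramLM:
    """Illustrative add-k smoothed N-gram model with an UNK cut-off."""

    def __init__(self, n=3, k=0.01, min_freq=4):
        self.n, self.k, self.min_freq = n, k, min_freq
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-gram contexts
        self.vocab = set()

    def fit(self, sentences):
        # Replace words rarer than min_freq with "UNK" before counting.
        word_freq = Counter(w for sent in sentences for w in sent)
        self.vocab = {w for w, c in word_freq.items() if c >= self.min_freq}
        self.vocab |= {"UNK", "<s>", "</s>"}
        for sent in sentences:
            tokens = ["<s>"] * (self.n - 1) + \
                     [w if w in self.vocab else "UNK" for w in sent] + ["</s>"]
            for i in range(len(tokens) - self.n + 1):
                gram = tuple(tokens[i:i + self.n])
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def prob(self, word, context):
        # Add-k smoothed conditional probability P(word | context).
        word = word if word in self.vocab else "UNK"
        context = tuple(w if w in self.vocab else "UNK" for w in context)
        num = self.ngram_counts[context + (word,)] + self.k
        den = self.context_counts[context] + self.k * len(self.vocab)
        return num / den
```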
In our N-gram implementation, three of the hyper-parameters: the maximum N-gram length $N$, the minimum word frequency min_freq, and the smoothing fractional counts $(k_1, \ldots, k_N)$, are set using Grid Search (3.2.2), and the linear interpolation weights $(\lambda_1, \ldots, \lambda_N)$ are tuned with the EM Algorithm.
We finally end up with a Trigram ($N = 3$) approach, with $k_i = \{0, 0.01, 0.001\}$ and min_freq $= 4$.
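As a rough sketch of how such a grid search might look, the snippet below scores each hyper-parameter combination by dev-set perplexity, reusing the NGramLM sketch above. The `perplexity` helper and the candidate value grids are assumptions for illustration, not the grids actually searched.

```python
import itertools

def grid_search(train, dev, ns=(1, 2, 3), ks=(0, 0.001, 0.01, 0.1),
                min_freqs=(1, 2, 4, 8)):
    """Pick (N, k, min_freq) minimizing dev-set perplexity (illustrative)."""
    best, best_ppl = None, float("inf")
    for n, k, mf in itertools.product(ns, ks, min_freqs):
        model = NGramLM(n=n, k=k, min_freq=mf)
        model.fit(train)
        ppl = perplexity(model, dev)  # assumed evaluation helper
        if ppl < best_ppl:
            best, best_ppl = (n, k, mf), ppl
    return best, best_ppl
```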
3.1.1 Interpolation
In our N-gram model we sometimes do not have enough samples to compute the probability of an n-gram, but we can instead estimate it from the (n−1)-gram probability. There are therefore times when we need to use less context, which lets the model generalize better on infrequent contexts. There are two common ways to do this: backoff and interpolation.
Taking the trigram as an example: in backoff, we use the trigram if the evidence is sufficient, otherwise we fall back to the bigram, and otherwise to the unigram. In interpolation, we always compute the probability estimates for the trigram, bigram, and unigram, and then take their weighted arithmetic mean as the final probability [3].
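The backoff idea can be summarized in a few lines; the sketch below assumes a hypothetical `count` helper that maps a tuple of words to its corpus count, and it omits discounting, so it only illustrates the fall-back order rather than a properly normalized model.

```python
def backoff_prob(w, u, v, count, total_tokens):
    """Trigram backoff sketch: use the trigram estimate if its context was
    observed, otherwise fall back to the bigram, then the unigram."""
    if count((u, v)) > 0 and count((u, v, w)) > 0:
        return count((u, v, w)) / count((u, v))
    if count((v,)) > 0 and count((v, w)) > 0:
        return count((v, w)) / count((v,))
    return count((w,)) / total_tokens
```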
We chose to use simple Linear Interpolation, estimating the trigram probability $P(w_n \mid w_{n-2} w_{n-1})$ by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$:
$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1}) \tag{1}$$
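Equation (1) translates directly into code. In the sketch below, `unigram_p`, `bigram_p`, and `trigram_p` are assumed probability helpers (for example, add-k smoothed estimates as above), and the λ values shown are placeholders rather than the EM-tuned weights.

```python
def interpolated_prob(w, u, v, unigram_p, bigram_p, trigram_p,
                      lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation per Equation (1): mix unigram, bigram, and
    trigram estimates with weights (lambda_1, lambda_2, lambda_3)
    that sum to 1. Placeholder weights, not the tuned ones."""
    l1, l2, l3 = lambdas
    return l1 * unigram_p(w) + l2 * bigram_p(w, v) + l3 * trigram_p(w, u, v)
```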