we choose this set of hyper-parameters and its corresponding accuracy as the result of
our Guided Feature Engineering attempt.
Independent Feature Engineering
Byte-Pair Encoding (BPE)
In our independent feature engineering study, we first replace the tokenizer with Byte-Pair
Encoding [2], expecting that, for a dataset focused on reviews, its sub-word tokenization
can help capture the similarity between words and the meaning of unseen words.
The data pre-processing pipeline remains similar to the baseline: a tokenizer (BPE) followed
by a vectorizer (TF-IDF). For the tokenizer, we use the BPE tokenizer from Huggingface,
with pre-tokenization set to Whitespace so that sentences in our training data are first split
into words. We then train the tokenizer over the entire training dataset plus the unlabeled
dataset, since no labels are needed at this stage and we would like to learn as much as
possible about the words used in the reviews.
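A minimal sketch of this tokenizer-training step is given below, assuming the Huggingface
tokenizers package; the variable names train_sentences and unlabeled_sentences and the
[UNK] special token are illustrative placeholders rather than our exact code.

# Sketch: train a BPE tokenizer with Whitespace pre-tokenization.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split sentences into words first

trainer = BpeTrainer(special_tokens=["[UNK]"])
# Train over the training set plus the unlabeled set (no labels needed here).
corpus = train_sentences + unlabeled_sentences  # lists of raw review strings
tokenizer.train_from_iterator(corpus, trainer=trainer)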
After training the tokenizer, we encode our training data into tokens by running the
BPE tokenizer over the training set. Those tokens are then fed into TF-IDF for further
weighting, similar to the approach in Section 3.
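The following sketch shows one way to wire the trained BPE tokenizer into TF-IDF, assuming
scikit-learn's TfidfVectorizer; the bpe_tokenize helper and the sentence lists are
illustrative, and the n-gram range corresponds to the hyper-parameter N discussed below.

# Sketch: TF-IDF over BPE tokens, continuing the previous sketch.
from sklearn.feature_extraction.text import TfidfVectorizer

def bpe_tokenize(text):
    # Reuse the trained BPE tokenizer defined above.
    return tokenizer.encode(text).tokens

vectorizer = TfidfVectorizer(
    tokenizer=bpe_tokenize,   # replace the default word tokenizer with BPE
    lowercase=False,          # BPE already operates at the sub-word level
    ngram_range=(1, 3),       # (1, N); N = 3 was best in our grid search
)
X_train = vectorizer.fit_transform(train_sentences)
X_dev = vectorizer.transform(dev_sentences)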
As in the previous section, we perform a grid search over the TF-IDF hyper-parameter N
and obtain the following result; a sketch of this search is shown after the table.
Table 2: Dev Accuracy of BPE + TF-IDF w.r.t. N

N          1       2       3       4       5
Accuracy   0.7598  0.7925  0.8034  0.8013  0.8013
We therefore take N = 3 as the result for this approach.
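A sketch of the grid search itself, continuing the previous sketches; train_and_eval is a
hypothetical helper that fits the downstream classifier and returns dev accuracy, and the
sentence and label lists are illustrative placeholders.

# Sketch: grid search over the TF-IDF n-gram upper bound N.
from sklearn.feature_extraction.text import TfidfVectorizer

best_n, best_acc = None, 0.0
for n in range(1, 6):
    vec = TfidfVectorizer(tokenizer=bpe_tokenize, lowercase=False, ngram_range=(1, n))
    acc = train_and_eval(vec.fit_transform(train_sentences), train_labels,
                         vec.transform(dev_sentences), dev_labels)
    if acc > best_acc:
        best_n, best_acc = n, acc
print(best_n, best_acc)  # N = 3 gives the best dev accuracy in our runs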
Word2Vec
The next thing we try is to replace the TF-IDF vectorization step with something else,
and Word2Vec [1] comes to our attention. The idea is that, in addition to training our
tokenizer over the dataset to produce better tokens from sentences, we also train our
vectorizer over the dataset so that it produces feature vectors better suited to this
review dataset.
Since BPE tokenization provides a reasonable improvement over the previous approaches,
we keep it as our tokenizer and append a Word2Vec step after it to turn the training
tokens into features. We use the gensim package for its implementation and leave every
hyper-parameter at its default value. The Word2Vec model is trained for 5 epochs on data
combined from the training set and the unlabeled set, so that the vectorizer learns the
review dataset better. With its default settings, gensim trains a CBOW model and produces
a key-value map from every word in the dataset to a fixed-length vector. We also handle
unknown words by returning an all-zero vector.
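A minimal sketch of this step, assuming gensim's Word2Vec class and building on the earlier
sketches (the trained BPE tokenizer and the sentence lists); the token_vector helper is
illustrative and only shows the word-to-vector lookup with the all-zero fallback described
above, not how per-review features are assembled downstream.

# Sketch: train Word2Vec on BPE tokens with gensim defaults.
import numpy as np
from gensim.models import Word2Vec

# Each review is first turned into its list of BPE tokens.
token_corpus = [tokenizer.encode(s).tokens
                for s in train_sentences + unlabeled_sentences]

# Default hyper-parameters (gensim's default training algorithm is CBOW), 5 epochs.
w2v = Word2Vec(sentences=token_corpus, epochs=5)

def token_vector(token):
    # Unknown tokens map to an all-zero vector of the same length.
    if token in w2v.wv:
        return w2v.wv[token]
    return np.zeros(w2v.vector_size)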