CSE256 Assignment 1: Text Classification
Yi Rong <hi@rongyi.ai>
April 11, 2022
In this report, we discuss several approaches to data pre-processing and feature engineering
for a text classification task. We start with an overview of the classification task,
the model used, and the given baseline implementation in Section 2. We then iterate on
top of that version, guided by the project documentation, using TF-IDF for token weighting to
achieve better accuracy, as detailed in Section 3. Next, we present our own approaches to
feature extraction and pre-processing, namely BPE [2] and Word2Vec [1], in Section 4.
We discuss the accuracy and other performance metrics of these approaches in
Section 5 and conclude the paper in Section 6.
Text Classification
We work on a simple text classification task: predicting whether a review is positive or
negative. The overall training and evaluation pipeline can be simplified as the following
visualization, which divides the process into four main parts: Data IO, Tokenizer, Vectorizer,
and Classifier. Among those, we mainly focus on improving the accuracy by tweaking the
Tokenizer and Vectorizer, and spend little effort on modifying the classifier or data reading.
In the baseline implementation provided along with the project documentation, a simple
whitespace tokenization is used to split the sentences, after which the tokens are fed into
a Bag of Words vectorizer to generate a feature vector of vocabulary occurrence frequencies
to represent the review.
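As a rough illustration of the baseline described above, the following pure-Python sketch splits reviews on whitespace and counts token occurrences against a fixed vocabulary. The helper names `build_vocab` and `bow_vector` and the two sample reviews are hypothetical; the provided baseline itself may be implemented differently.

```python
from collections import Counter

def build_vocab(reviews):
    """Map every whitespace-separated token in the corpus to a column index."""
    tokens = sorted({tok for review in reviews for tok in review.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(review, vocab):
    """Count how often each vocabulary token occurs in one review."""
    counts = Counter(review.split())
    # dicts preserve insertion order, so columns follow the sorted vocabulary
    return [counts[tok] for tok in vocab]

# Hypothetical sample reviews, not taken from the assignment data.
reviews = ["great food great service", "bad food"]
vocab = build_vocab(reviews)            # columns: bad, food, great, service
first = bow_vector(reviews[0], vocab)   # [0, 1, 2, 1]
```

Tokens that are not in the vocabulary are simply ignored, which is one reason sub-word tokenization can help with unseen words.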
Labels are encoded using scikit-learn’s default LabelEncoder, which simply produces
the categorical (binary) label. The labels are encoded this way across all the
experiments we perform, as we mainly focus on transforming the features.
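A minimal sketch of the label encoding step is shown below. The label strings are assumptions made for illustration; the actual dataset may use different names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; LabelEncoder sorts the classes before numbering them.
labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # integer class ids, one per review
```

Here `encoder.classes_` holds the sorted class names, and `encoder.inverse_transform` maps predictions back to the original strings.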
Running the baseline implementation would give us a dev accuracy of 0.7773.
Guided Feature Engineering
In the baseline implementation discussed above, we use Bag of Words to vectorize the tokens
by counting the number of occurrences of each word in the vocabulary, and use the occurrence
vector as features. In this attempt, we switch to scikit-learn’s TfidfVectorizer. It works
by running the previous CountVectorizer on the tokens to calculate the BoW representation,
and then applying a TfidfTransformer to return a weighted frequency vector.
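The two-step description above can be checked directly. The sketch below, using a few made-up documents, compares scikit-learn's `TfidfVectorizer` with a `CountVectorizer` followed by a `TfidfTransformer`, all with default parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical documents used only to compare the two code paths.
docs = [
    "the food was great",
    "the service was terrible",
    "great place and great food",
]

# Two steps: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both stages.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(one_step.toarray(), two_step.toarray())
```

With matching parameters the two matrices agree, so the choice between them is mostly a matter of convenience.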
With this approach, we already see an accuracy improvement over the baseline
using just the default hyperparameters. The TF-IDF vectorizer has a parameter N, the maximum
n-gram feature length (in scikit-learn, it is ngram_range). In the classifier, we
can change C to control how strongly the model is regularized, or equivalently how much we
trust the training data (in scikit-learn, C is the inverse of regularization strength, so a
smaller C means stronger regularization). To tune the model toward its highest
possible accuracy on the dev dataset, we set up a grid search:
Table 1: Dev Accuracy of BoW + TF-IDF w.r.t. N & C
C \ N      1         2         3         4         5
0.01     0.7467    0.7576    0.7663    0.7685    0.7685
0.1      0.7554    0.7620    0.7641    0.7620    0.7620
1.0      0.7751    0.7925    0.7904    0.7904    0.7904
10.0     0.7772    0.7838    0.7816    0.7794    0.7772
100.0    0.7445    0.7773    0.7817    0.7772    0.7795
[Figure: Accuracy (log scale) against C (inverse regularization strength) for C = 0.01 to 100, with one series per N = 1 to 5.]
Since setting the inverse regularization strength C to 1 and the
maximum n-gram feature length N to 2 yields the best dev accuracy in our experiments (0.7925),
we choose this set of hyper-parameters and its corresponding accuracy as the result of
our Guided Feature Engineering attempt.
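A grid search of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the toy texts and labels are hypothetical stand-ins for the assignment's train and dev splits, and mapping N to `ngram_range=(1, N)` is our reading of "maximum n-gram feature length" rather than a confirmed detail of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real experiments use the assignment's splits.
train_texts = [
    "great food and great service", "terrible food", "loved the place",
    "bad service and rude staff", "really great place", "rude and terrible",
]
train_labels = [1, 0, 1, 0, 1, 0]
dev_texts = ["great place", "terrible staff"]
dev_labels = [1, 0]

def grid_search(ns=(1, 2, 3, 4, 5), cs=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Return dev accuracy for every (N, C) combination."""
    scores = {}
    for n in ns:
        for c in cs:
            model = make_pipeline(
                TfidfVectorizer(ngram_range=(1, n)),     # unigrams up to n-grams
                LogisticRegression(C=c, max_iter=1000),  # C = inverse regularization
            )
            model.fit(train_texts, train_labels)
            scores[(n, c)] = model.score(dev_texts, dev_labels)
    return scores

scores = grid_search()
best_n, best_c = max(scores, key=scores.get)
```

Selecting hyper-parameters on the dev set, as done here, means the reported dev accuracy is somewhat optimistic as an estimate of performance on unseen data.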
Independent Feature Engineering
Byte-Pair Encoding (BPE)
In our independent feature engineering study, we first replace the tokenizer with Byte-Pair
Encoding [2], with the expectation that, on a dataset focused on reviews,
sub-word tokenization can help capture the similarity between related words and the meaning
of unseen words.
The data pre-processing still looks similar to the baseline: we have a tokenizer (BPE)
and a vectorizer (TF-IDF). For the tokenizer, we use the BPE Tokenizer from Huggingface,
with pre-tokenization set to Whitespace so that sentences in our training data are first split
into words. We then train our tokenizer over the entire training dataset plus the unlabeled
dataset, since no labels are needed at this stage and we would like to learn as much
about the words used in the reviews as possible.
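To make the idea behind BPE training concrete, here is a toy pure-Python sketch of the merge-learning loop. It is not the Huggingface implementation used in our experiments; the function names, the tie-breaking rule, and the sample words are assumptions chosen for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from pre-tokenized (whitespace-split) words."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[apply_merge(symbols, best)] += freq
        vocab = merged
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical training words, not taken from the review dataset.
merges = learn_bpe(["good", "good", "goods", "goodness", "bad"], num_merges=3)
tokens = encode("goody", merges)  # unseen word: ['good', 'y']
```

The unseen word "goody" still shares the sub-word "good" with the training vocabulary, which is the effect we hope to exploit on review text.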
After training the tokenizer, we encode our training data into tokens by running the
BPE Tokenizer over our training set. Those tokens are then fed into TF-IDF for
weighting, similar to the approach in Section 3.
Similar to what we do in the previous section, we also perform a grid search on the
hyper-parameter N for TF-IDF, and get the following result.
Table 2: Dev Accuracy of BPE + TF-IDF w.r.t. N
N           1         2         3         4         5
Accuracy  0.7598    0.7925    0.8034    0.8013    0.8013
Therefore we take N = 3 as our results for this approach.
The next thing we tried was replacing the TF-IDF vectorization process with something else,
and Word2Vec [1] came to our attention. The idea is that, in addition to training our tokenizer
over the dataset to produce better tokens out of sentences, we also train our vectorizer over
the dataset to produce feature vectors that better suit this review dataset.
Since our BPE tokenization provides a reasonable improvement over previous approaches,
we keep using it as our tokenizer, and append a Word2Vec implementation after it to turn
the training tokens into features. We use the gensim package for its implementation, and
leave every hyper-parameter at its default. The Word2Vec model is trained for 5 epochs on
data combined from the training set and the unlabeled set to help the vectorizer know the
review dataset better. With default settings, gensim uses the CBOW algorithm (skip-gram
requires sg=1) and produces a key-value map from every word in the dataset to a fixed-length
vector. We also add handling for unknown words by returning an all-zero vector.
For each review in the dataset, the vector of that entire sample is calculated to be the
average of all the word vectors.
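The averaging step can be sketched as below. A plain dict stands in for the trained Word2Vec key-value map, and the function name, the 2-dimensional vectors, and the handling of empty reviews are assumptions for illustration. Unknown tokens contribute an all-zero vector and still count toward the average, following the description above.

```python
import numpy as np

def review_vector(tokens, word_vectors, dim):
    """Average the vectors of all tokens; unknown tokens map to a zero vector."""
    if not tokens:
        # Assumption: an empty review is represented by the zero vector.
        return np.zeros(dim)
    zero = np.zeros(dim)
    vectors = [word_vectors.get(tok, zero) for tok in tokens]
    return np.mean(vectors, axis=0)

# Hypothetical 2-dimensional embeddings standing in for the trained model.
word_vectors = {
    "good": np.array([1.0, 0.0]),
    "food": np.array([0.0, 1.0]),
}

known = review_vector(["good", "food"], word_vectors, dim=2)     # [0.5, 0.5]
with_unk = review_vector(["good", "zzz"], word_vectors, dim=2)   # [0.5, 0.0]
```

Averaging gives every review a vector of the same length regardless of how many tokens it contains, at the cost of discarding word order.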
With this approach, we further improve the accuracy on the dev dataset to 0.8056,
making it the highest accuracy we achieved among our various approaches to feature
engineering.
Here we summarize the performance metrics we get from the aforementioned approaches for
feature engineering applied to the text classification task.
Table 3: Dev Accuracy of Our Approaches
Name            Tokenizer   Vectorizer             Classifier   Train Acc.   Dev Acc.
Baseline        space       BoW                    LR (C=1)     0.98210      0.77729
TF-IDF          space       BoW + TF-IDF (N=2)     LR (C=1)     0.91532      0.79257
BPE + TF-IDF    BPE         TF-IDF (N=3)           LR (C=1)     0.92994      0.80349
BPE + W2V       BPE         Word2Vec               LR (C=1)     0.80641      0.80568
[Figure: Dev Accuracy and Train Accuracy for Baseline, TF-IDF, BPE + TF-IDF, and BPE + W2V.]
TF-IDF (Assignment part 2.1) achieves better accuracy than the baseline because
the automatic weighting of tokens helps the regression algorithm prioritize
words that carry meaning over common words like “the”. The
BPE approach (Assignment part 2.2) performs better than TF-IDF alone because
it uses a tokenizer trained on the dataset that can do sub-word
tokenization rather than just splitting a sentence on whitespace. The final approach,
BPE + Word2Vec (Assignment part 2.2), achieves the highest dev accuracy. We attribute this
to the fact that both its tokenizer and its vectorizer are trained on the underlying dataset,
which lets it generate a better feature representation for LR-based classification than
the other approaches.
We also tried stopword removal and stemming, but the improvements in accuracy in
those experiments were not as significant as those from replacing the tokenizer and
vectorizer. Given the space limitation of this report, we showcase only the above two
Independent Feature Engineering approaches here.
In this report, we attempted to improve feature extraction for text classification, and
outperformed the bag-of-words baseline on dev accuracy with each of the three approaches
presented. In our experience, using BPE as the tokenizer and Word2Vec as the feature
vectorizer gives the best result on this particular dataset.
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[2] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation
of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).