we choose this set of hyper-parameters and its corresponding accuracy as the result of
our Guided Feature Engineering attempt.
Independent Feature Engineering
Byte-Pair Encoding (BPE)
In our independent feature engineering study, we first replace the tokenizer with Byte-Pair
Encoding [2], expecting that, for a dataset focused on reviews, its sub-word tokenization
can help capture the similarity between words and the meaning of unseen words.
The data pre-processing pipeline remains similar to the baseline: a tokenizer (BPE) followed
by a vectorizer (TF-IDF). For the tokenizer, we use the BPE tokenizer from Huggingface,
with pre-tokenization set to Whitespace so that sentences in our training data are first split
into words. We then train the tokenizer over the entire training dataset plus the unlabeled
dataset, since no labels are needed at this stage and we would like to learn as much as
possible about the words used in the reviews.
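A minimal sketch of this tokenizer-training step is given below, assuming the Huggingface
tokenizers package; the variable names train_sentences and unlabeled_sentences and the
[UNK] special token are illustrative placeholders rather than our exact code.

# Sketch: train a BPE tokenizer with Whitespace pre-tokenization.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split sentences into words first

trainer = BpeTrainer(special_tokens=["[UNK]"])
# Train over the training set plus the unlabeled set (no labels needed here).
corpus = train_sentences + unlabeled_sentences  # lists of raw review strings
tokenizer.train_from_iterator(corpus, trainer=trainer)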
After training the tokenizer, we encode our training data into tokens by running the
BPE tokenizer over the training set. Those tokens are then fed into TF-IDF for further
weighting, similar to the approach in Section 3.
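The following sketch shows one way to wire the trained BPE tokenizer into TF-IDF, assuming
scikit-learn's TfidfVectorizer; the bpe_tokenize helper and the sentence lists are
illustrative, and the n-gram range corresponds to the hyper-parameter N discussed below.

# Sketch: TF-IDF over BPE tokens, continuing the previous sketch.
from sklearn.feature_extraction.text import TfidfVectorizer

def bpe_tokenize(text):
    # Reuse the trained BPE tokenizer defined above.
    return tokenizer.encode(text).tokens

vectorizer = TfidfVectorizer(
    tokenizer=bpe_tokenize,   # replace the default word tokenizer with BPE
    lowercase=False,          # BPE already operates at the sub-word level
    ngram_range=(1, 3),       # (1, N); N = 3 was best in our grid search
)
X_train = vectorizer.fit_transform(train_sentences)
X_dev = vectorizer.transform(dev_sentences)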
As in the previous section, we perform a grid search over the TF-IDF hyper-parameter N
and obtain the following result; a sketch of this search is shown after the table.
Table 2: Dev Accuracy of BPE + TF-IDF w.r.t. N

N          1       2       3       4       5
Accuracy   0.7598  0.7925  0.8034  0.8013  0.8013
We therefore take N = 3 as the result for this approach.
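A sketch of the grid search itself, continuing the previous sketches; train_and_eval is a
hypothetical helper that fits the downstream classifier and returns dev accuracy, and the
sentence and label lists are illustrative placeholders.

# Sketch: grid search over the TF-IDF n-gram upper bound N.
from sklearn.feature_extraction.text import TfidfVectorizer

best_n, best_acc = None, 0.0
for n in range(1, 6):
    vec = TfidfVectorizer(tokenizer=bpe_tokenize, lowercase=False, ngram_range=(1, n))
    acc = train_and_eval(vec.fit_transform(train_sentences), train_labels,
                         vec.transform(dev_sentences), dev_labels)
    if acc > best_acc:
        best_n, best_acc = n, acc
print(best_n, best_acc)  # N = 3 gives the best dev accuracy in our runs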
Word2Vec
The next thing we try is to replace the TF-IDF vectorization step with something else,
and Word2Vec [1] comes to our attention. The idea is that, in addition to training our
tokenizer over the dataset to produce better tokens from sentences, we also train our
vectorizer over the dataset so that it produces feature vectors better suited to this
review dataset.
Since BPE tokenization provides a reasonable improvement over the previous approaches,
we keep it as our tokenizer and append a Word2Vec step after it to turn the training
tokens into features. We use the gensim package for its implementation and leave every
hyper-parameter at its default value. The Word2Vec model is trained for 5 epochs on data
combined from the training set and the unlabeled set, so that the vectorizer learns the
review dataset better. With its default settings, gensim trains a CBOW model and produces
a key-value map from every word in the dataset to a fixed-length vector. We also handle
unknown words by returning an all-zero vector.
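A minimal sketch of this step, assuming gensim's Word2Vec class and building on the earlier
sketches (the trained BPE tokenizer and the sentence lists); the token_vector helper is
illustrative and only shows the word-to-vector lookup with the all-zero fallback described
above, not how per-review features are assembled downstream.

# Sketch: train Word2Vec on BPE tokens with gensim defaults.
import numpy as np
from gensim.models import Word2Vec

# Each review is first turned into its list of BPE tokens.
token_corpus = [tokenizer.encode(s).tokens
                for s in train_sentences + unlabeled_sentences]

# Default hyper-parameters (gensim's default training algorithm is CBOW), 5 epochs.
w2v = Word2Vec(sentences=token_corpus, epochs=5)

def token_vector(token):
    # Unknown tokens map to an all-zero vector of the same length.
    if token in w2v.wv:
        return w2v.wv[token]
    return np.zeros(w2v.vector_size)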