How Tokenization and Fine-Tuning Optimize Sentiment Analysis: A Practical Case Study

By Alberto
  • 2 Apr, 2025

_Why Tokenization and Fine-Tuning Are Crucial

Automated processing of large volumes of text has become indispensable for improving operational efficiency across many sectors. Tokenization and fine-tuning are two fundamental techniques in machine learning that, when applied correctly, can significantly improve a model’s ability to interpret and analyze complex data. But why are these processes so important?

Efficiency and Accuracy

Machine learning models enable automated processing of vast amounts of text, reducing human effort and improving accuracy. Fine-tuning a model enhances its ability to understand context-specific nuances, while tokenization ensures that text is efficiently processed into a format suitable for machine learning algorithms. Without these optimizations, even powerful models can struggle with inconsistencies and ambiguity in raw text data.

Practical Applications

  • Sentiment analysis: Helps companies understand customer feedback from reviews or social media, leading to better decision-making.
  • Automatic classification: Filters and organizes textual content like emails, comments, or feedback into useful categories.
  • Personalized recommendations: Recommender systems provide users with tailored suggestions based on collected data.

_Essential Tools for Tokenization and Fine-Tuning

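The code examples in this article are written in Python and rely on the following libraries and tools:

  • Transformers: A library developed by Hugging Face for loading and fine-tuning pre-trained language models.
  • AutoTokenizer: A Transformers class that loads the tokenizer matching a given pre-trained model.
  • Datasets: The Hugging Face library used to download the Yelp Polarity dataset.
  • PyTorch: The deep learning framework the models run on during training and evaluation.
  • NumPy and scikit-learn: Used to turn model outputs into predictions and to compute accuracy.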

Tokenization and Fine-Tuning: A Deep Dive

In the previous article, we introduced sentiment analysis using the Yelp Polarity dataset. In this article, we’ll focus on the technical aspects of tokenization and fine-tuning, exploring how these techniques optimize the machine learning process to achieve higher accuracy.

Downloading and Preparing the Dataset

To begin, we use a subset of the Yelp Polarity dataset to speed up computation time. This dataset contains reviews with positive and negative labels, which we will use to train the model.

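from datasets import load_dataset

# Load Yelp Polarity and keep a small subset to shorten training and evaluation
dataset = load_dataset("yelp_polarity")
dataset['train'] = dataset['train'].select(range(1000))
dataset['test'] = dataset['test'].select(range(100))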

Tokenization: The Core of Preprocessing

Tokenization is a critical process that converts raw text into the numeric token IDs a machine learning model can consume. We use the tokenizer that ships with DistilBERT, a smaller, optimized version of the well-known BERT model (Bidirectional Encoder Representations from Transformers). DistilBERT is known for being efficient, reducing model size while maintaining performance similar to BERT.
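To make the idea concrete, here is a minimal sketch of the greedy longest-match subword splitting used by WordPiece, the scheme behind BERT-family tokenizers. The vocabulary below is a toy one invented for illustration, and `wordpiece_tokenize` and `tokenize` are hypothetical helper names, not part of the Transformers API. The real DistilBERT vocabulary is far larger and will split words differently.

```python
# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"the", "food", "was", "un", "##believ", "##ably", "good"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into subword pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


print(tokenize("The food was unbelievably good"))
```

With this toy vocabulary, "unbelievably" is broken into three pieces, which is how a subword tokenizer can represent words it has never seen as whole units.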

from transformers import AutoTokenizer
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
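The `padding="max_length"` and `truncation=True` arguments bring every review to the same length. The sketch below illustrates that logic in plain Python under simplifying assumptions: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, the padding ID is assumed to be 0, and special tokens such as [CLS] and [SEP] are ignored, whereas the real tokenizer handles them.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus an attention mask (1 = real token, 0 = padding)."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(ids)         # padding: how many slots are still empty
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# Made-up token IDs for a short and a long review.
short_review = [101, 2204, 102]
long_review = [101, 2023, 2173, 2001, 6659, 102]

print(pad_or_truncate(short_review, max_length=5))
print(pad_or_truncate(long_review, max_length=5))
```

The attention mask is what lets the model ignore padded positions, which is why the tokenized dataset keeps both `input_ids` and `attention_mask`.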

_Evaluating the Model Without Fine-Tuning

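The first step is to evaluate the pre-trained DistilBERT model without any fine-tuning. Loaded this way, the model's classification head is randomly initialized, so we should expect accuracy close to chance on a two-class problem. This provides a baseline for comparison with later improvements.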

from transformers import Trainer
from transformers import AutoModelForSequenceClassification
import numpy as np
from sklearn.metrics import accuracy_score

# Trainer moves the model to the GPU when one is available, so no explicit .to('cuda') is needed
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
trainer_no_finetune = Trainer(model=model)

predictions_no_finetune = trainer_no_finetune.predict(tokenized_datasets['test'])
preds_no_finetune = np.argmax(predictions_no_finetune.predictions, axis=1)

# label_ids holds the true labels as a NumPy array, aligned with the predictions
accuracy_no_finetune = accuracy_score(predictions_no_finetune.label_ids, preds_no_finetune)
print(f"Accuracy without fine-tuning: {accuracy_no_finetune}")

Output:

Accuracy without fine-tuning: 0.43
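The evaluation above boils down to two steps: pick the class with the highest score for each review, then count how often that matches the true label. The following plain-Python sketch mirrors what `np.argmax` and `accuracy_score` do, using made-up scores and labels. `argmax_rows` and `accuracy` are illustrative names, not library functions.

```python
def argmax_rows(logits):
    """For each row of class scores, return the index of the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(labels, preds):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for label, pred in zip(labels, preds) if label == pred)
    return correct / len(labels)


# Made-up scores for four reviews: column 0 = negative, column 1 = positive.
logits = [[0.2, -0.1], [-0.3, 0.4], [0.1, 0.0], [-0.2, 0.5]]
labels = [0, 1, 1, 0]

preds = argmax_rows(logits)
print(preds)                    # [0, 1, 0, 1]
print(accuracy(labels, preds))  # 0.5
```

A model whose scores carry no information about sentiment tends to land near 0.5 on a balanced two-class test set, and on only 100 reviews a result such as 0.43 is within the range one would expect from chance.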

_Fine-Tuning on Pre-Trained Models

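To improve accuracy without training anything ourselves, we load a DistilBERT checkpoint that has already been fine-tuned on a similar sentiment dataset, SST-2 (Stanford Sentiment Treebank), and evaluate it directly on the Yelp test subset.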

sst_model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sst_model = AutoModelForSequenceClassification.from_pretrained(sst_model_name)

# No training here: the SST-2 checkpoint is evaluated as-is on the Yelp test subset
trainer_sst_finetuned = Trainer(model=sst_model)
predictions_sst_finetune = trainer_sst_finetuned.predict(tokenized_datasets['test'])
accuracy_sst_finetune = accuracy_score(predictions_sst_finetune.label_ids, np.argmax(predictions_sst_finetune.predictions, axis=1))

print(f"Accuracy with fine-tuning on SST-2: {accuracy_sst_finetune}")

Output:

Accuracy with fine-tuning on SST-2: 0.85
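Reusing a model trained on another dataset only works if both sides agree on which index means "positive". The sketch below shows one way to check and, if needed, remap the indices. The two label maps are assumptions written out by hand for illustration, and `build_remap` is a hypothetical helper. In practice the maps should be read from the model's configuration and the dataset's features rather than typed in.

```python
def build_remap(model_id2label, dataset_label2id):
    """Map each model output index to the dataset index of the same class name."""
    return {
        model_idx: dataset_label2id[name.lower()]
        for model_idx, name in model_id2label.items()
    }


# Assumed label maps, for illustration only.
model_id2label = {0: "NEGATIVE", 1: "POSITIVE"}
dataset_label2id = {"negative": 0, "positive": 1}

remap = build_remap(model_id2label, dataset_label2id)
model_preds = [1, 0, 1]
dataset_preds = [remap[p] for p in model_preds]

print(remap)          # identity mapping when both sides use the same convention
print(dataset_preds)
```

If the two conventions were reversed, skipping this step would turn a good model into one that looks worse than chance.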

_Custom Fine-Tuning on the Yelp Polarity Dataset

Next, we perform custom fine-tuning on the Yelp Polarity dataset to maximize accuracy by tailoring the model to more relevant data.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer_finetune = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

trainer_finetune.train()

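Result (computed with the evaluation code in the next section, since train() itself reports only loss values):

Accuracy after custom (with Yelp-Polarity dataset) fine-tuning: 0.91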
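It can help to work out how much training this configuration implies. The sketch below computes the number of optimizer steps for 1,000 training reviews, a batch size of 16, and 3 epochs, and then a linearly decaying learning rate. It assumes a single device, no gradient accumulation, no warmup, and a linear schedule, which as far as I know are the Trainer defaults but are not stated in the article. `total_steps` and `linear_lr` are illustrative helpers.

```python
import math


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


def linear_lr(step, total, base_lr):
    """Learning rate that falls linearly from base_lr at step 0 to 0 at the final step."""
    return base_lr * max(0.0, (total - step) / total)


steps = total_steps(n_examples=1000, batch_size=16, epochs=3)
print(steps)  # 63 steps per epoch * 3 epochs = 189

for step in (0, steps // 2, steps):
    print(step, linear_lr(step, steps, base_lr=2e-5))
```

With fewer than 200 updates, this is a very short training run, which is part of why fine-tuning a pre-trained model is practical even on a small subset.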

_Comparing the Results

After fine-tuning, we can compare the model’s performance across the three scenarios:

predictions_finetune = trainer_finetune.predict(tokenized_datasets['test'])
accuracy_finetune = accuracy_score(predictions_finetune.label_ids, np.argmax(predictions_finetune.predictions, axis=1))

print(f"Accuracy without fine-tuning: {accuracy_no_finetune}")
print(f"Accuracy with fine-tuning on SST-2: {accuracy_sst_finetune}")
print(f"Accuracy with custom fine-tuning on Yelp: {accuracy_finetune}")

Final Output:

Accuracy without fine-tuning: 0.43
Accuracy with fine-tuning on SST-2: 0.85
Accuracy with custom fine-tuning on Yelp: 0.91

_Conclusion and Benefits

Through tokenization and fine-tuning, accuracy on our 100-review test subset rose from 0.43 for the untuned model to 0.85 with the SST-2 checkpoint and 0.91 after custom fine-tuning on Yelp. These processes apply not only to sentiment analysis but also to other areas like automatic classification and recommendation systems. In fields such as customer support or marketing, these models can be customized to meet specific needs, reducing data processing time and increasing the accuracy of analyses.

While accuracy is a useful metric, a thorough statistical evaluation would require additional metrics such as precision, recall, and F1-score. Nonetheless, this analysis demonstrates the substantial improvements achieved through fine-tuning.
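As a starting point for that fuller evaluation, here is a minimal sketch of precision, recall, and F1-score for the positive class, written in plain Python. The labels and predictions are made up for illustration and are not results from the models above. `precision_recall_f1` is a hypothetical helper; scikit-learn provides equivalent functions.

```python
def precision_recall_f1(labels, preds, positive=1):
    """Compute precision, recall, and F1 for one class from parallel label lists."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for label, pred in pairs if label == positive and pred == positive)
    fp = sum(1 for label, pred in pairs if label != positive and pred == positive)
    fn = sum(1 for label, pred in pairs if label == positive and pred != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up example: 1 = positive review, 0 = negative review.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
preds = [1, 0, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(labels, preds)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision answers "of the reviews flagged positive, how many really were?", while recall answers "of the truly positive reviews, how many did we find?". Reporting both exposes errors that a single accuracy figure can hide.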