IMDb Dataset Text Classification Guide

The document provides instructions for loading and preprocessing the IMDb dataset using HuggingFace's Datasets library and DistilBERT tokenizer. It then describes how to train a DistilBERT model for sentiment classification on the preprocessed IMDb data using either PyTorch or TensorFlow. Finally, it explains how to use the finetuned model for inference on new text examples.

Uploaded by

dilip
Copyright
© All Rights Reserved

Load IMDb dataset

Start by loading the IMDb dataset from the Datasets library:

from datasets import load_dataset

imdb = load_dataset("imdb")

Then take a look at an example:

imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV
are usually underfunded, under-appreciated and misunderstood. I tried to like this,
I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the
original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that
doesn't match the background, and painfully one-dimensional characters cannot be
overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who
think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While
US viewers might like emotion and character development, sci-fi is a genre that
does not take itself seriously (cf. Star Trek). It may treat important issues, yet
not as a serious philosophy. It's really difficult to care about the characters
here as they are not simply foolish, just missing a spark of life. Their actions
and reactions are wooden and predictable, often painful to watch. The makers of
Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\"
otherwise people would not continue watching. Roddenberry's ashes must be turning
in their orbit as this dull, cheap, poorly edited (watching it without advert
breaks really brings this home) trudging Trabant of a show lumbers into space.
Spoiler. So, kill off a main character. And then bring him back as another actor.
Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

text: the movie review text.

label: a value that is either 0 for a negative review or 1 for a positive review.

Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no
longer than DistilBERT’s maximum input length:

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
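To make the effect of truncation=True concrete, here is a minimal pure-Python sketch on toy token ids. It assumes DistilBERT's 512-token limit; the helper truncate_ids and the ids are hypothetical, and the sketch only imitates how the real tokenizer keeps its special tokens at both ends.

```python
# Toy illustration of truncation; not the tokenizer's actual implementation.
MAX_LENGTH = 512  # DistilBERT's maximum input length, in tokens


def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep the last token (a stand-in for [SEP]) and as many leading
    tokens as fit within max_length."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    return list(token_ids[: max_length - 1]) + [token_ids[-1]]


short_review = list(range(100))   # made-up ids for a short review
long_review = list(range(2000))   # made-up ids for a very long review

print(len(truncate_ids(short_review)))  # 100
print(len(truncate_ids(long_review)))   # 512
```

Short reviews pass through unchanged, so truncation only costs information on the minority of reviews that exceed the limit.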

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map
function. You can speed up map by setting batched=True to process multiple elements
of the dataset at once:

tokenized_imdb = imdb.map(preprocess_function, batched=True)
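With batched=True, the preprocessing function receives a dictionary of lists rather than a single example. The sketch below imitates that calling convention in plain Python. The helper batched_map and the whitespace "tokenizer" are hypothetical stand-ins, and the real Datasets implementation does considerably more, such as caching.

```python
# Hypothetical stand-in for a batched map, for illustration only.
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and merge the new columns."""
    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # A batch is a dict of lists, e.g. {"text": [...], "label": [...]}.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            output_rows.append(merged)
    return output_rows


def toy_preprocess(examples):
    # Whitespace splitting stands in for the DistilBERT tokenizer.
    return {"tokens": [text.lower().split() for text in examples["text"]]}


reviews = [
    {"text": "Great film", "label": 1},
    {"text": "Dull and far too long", "label": 0},
    {"text": "Loved it", "label": 1},
]
tokenized = batched_map(reviews, toy_preprocess, batch_size=2)
print(tokenized[1]["tokens"])  # ['dull', 'and', 'far', 'too', 'long']
```

Handing the tokenizer many texts per call is what makes the batched form faster than calling it once per review.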

Now create a batch of examples using DataCollatorWithPadding. It’s more efficient
to dynamically pad the sentences to the longest length in a batch during collation,
instead of padding the whole dataset to the maximum length.

PyTorch

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

TensorFlow

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
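Whichever framework you use, the saving from dynamic padding can be sketched in a few lines of plain Python. The token ids are made up, pad_batch is a hypothetical helper rather than the collator's implementation, and 512 is assumed as the model's maximum length.

```python
# Toy comparison of padding strategies; the token ids below are made up.
PAD_ID = 0


def pad_batch(batch, target_length=None):
    """Right-pad every sequence to target_length, or to the longest
    sequence in the batch when target_length is None."""
    if target_length is None:
        target_length = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (target_length - len(seq)) for seq in batch]


def count_padding(padded):
    """Count pad tokens (valid here because no real token uses PAD_ID)."""
    return sum(seq.count(PAD_ID) for seq in padded)


batch = [[101, 7, 8, 102], [101, 5, 102], [101, 4, 9, 6, 3, 102]]

dynamic = pad_batch(batch)                   # pads to the batch maximum, 6
fixed = pad_batch(batch, target_length=512)  # pads to the model maximum

print(len(dynamic[0]), count_padding(dynamic))  # 6 5
print(len(fixed[0]), count_padding(fixed))      # 512 1523
```

In this toy batch, dynamic padding adds 5 pad tokens where fixed-length padding adds 1523, all of which the model would otherwise have to process.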

Evaluate
Including a metric during training is often helpful for evaluating your model’s
performance. You can quickly load an evaluation method with the 🤗 Evaluate library.
For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn
more about how to load and compute a metric):

import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to
calculate the accuracy:

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you’ll return to it when you
set up your training.
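To see what compute_metrics calculates, the same arithmetic can be sketched without NumPy or Evaluate. The logits and labels are invented, and argmax_rows and accuracy_score are hypothetical helper names, not library functions.

```python
# Plain-Python sketch of the same computation; logits and labels are made up.
def argmax_rows(logits):
    """Index of the largest value in each row (like np.argmax(..., axis=1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_score(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]

predictions = argmax_rows(logits)
print(predictions)                          # [0, 1, 1, 0]
print(accuracy_score(predictions, labels))  # {'accuracy': 0.75}
```

The third example is the only miss: the model scores class 1 highest while the label is 0, so accuracy is 3 out of 4.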

Train
Before you start training your model, create a map of the expected ids to their
labels with id2label and label2id:

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
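A quick sanity check, sketched here on made-up predictions, is to confirm that the two dictionaries are exact inverses before handing them to the model, since a mismatch would silently mislabel outputs.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# The two mappings should be exact inverses of each other.
assert {name: idx for idx, name in id2label.items()} == label2id

# Translate a made-up list of predicted class ids into readable labels.
predicted_ids = [1, 0, 1]
readable = [id2label[i] for i in predicted_ids]
print(readable)  # ['POSITIVE', 'NEGATIVE', 'POSITIVE']
```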

PyTorch
If you aren’t familiar with finetuning a model with the Trainer, take a look at the
basic tutorial here!

You’re ready to start training your model now! Load DistilBERT with
AutoModelForSequenceClassification along with the number of expected labels, and
the label mappings:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required
parameter is output_dir, which specifies where to save your model. You’ll push this
model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging
Face to upload your model). At the end of each epoch, the Trainer will evaluate the
accuracy and save the training checkpoint.

Pass the training arguments to Trainer along with the model, dataset, tokenizer,
data collator, and compute_metrics function.

Call train() to finetune your model.

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # called eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Trainer applies dynamic padding by default when you pass tokenizer to it. In this
case, you don’t need to specify a data collator explicitly.
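For a sense of scale, the number of optimizer updates these settings imply can be estimated with simple arithmetic. The sketch assumes IMDb's 25,000-example training split, a single device, no gradient accumulation, and that the final partial batch is kept. Other setups would give different counts.

```python
import math

# Assumptions: 25,000 training examples, one device,
# no gradient accumulation, last partial batch kept.
train_examples = 25_000
per_device_train_batch_size = 16
num_train_epochs = 2

steps_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 1563
print(total_steps)      # 3126
```

With evaluation and saving both set to "epoch", this run would evaluate and checkpoint twice, once at the end of each epoch.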

Once training is completed, share your model to the Hub with the push_to_hub()
method so everyone can use your model:

trainer.push_to_hub()

TensorFlow

If you aren’t familiar with finetuning a model with Keras, take a look at the basic
tutorial here!
To finetune a model in TensorFlow, start by setting up an optimizer function,
learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
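The shape of the resulting learning rate schedule can be sketched in plain Python. This assumes create_optimizer's default behaviour of a linear decay to zero, no warmup (as configured above), and IMDb's 25,000 training examples. The function linear_decay is an illustrative stand-in, not the library's schedule object.

```python
# Sketch of a linear decay to zero with no warmup.
init_lr = 2e-5
batch_size = 16
num_epochs = 5
batches_per_epoch = 25_000 // batch_size            # 1562
total_train_steps = batches_per_epoch * num_epochs  # 7810


def linear_decay(step):
    """Learning rate after `step` optimizer updates."""
    remaining = max(0.0, 1.0 - step / total_train_steps)
    return init_lr * remaining


print(total_train_steps)                     # 7810
print(linear_decay(0))                       # 2e-05
print(linear_decay(total_train_steps // 2))  # 1e-05
print(linear_decay(total_train_steps))       # 0.0
```

Because the schedule is sized for num_epochs, training for fewer epochs leaves the learning rate above zero. Under these assumptions, stopping after 3 of the 5 scheduled epochs would end near 8e-6.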
Then you can load DistilBERT with TFAutoModelForSequenceClassification along with
the number of expected labels, and the label mappings:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
Configure the model for training with compile:

import tensorflow as tf

# No loss argument: Transformers models use a built-in task loss by default.
model.compile(optimizer=optimizer)

The last two things to set up before you start training are computing the accuracy
from the predictions and providing a way to push your model to the Hub. Both are
done with Keras callbacks.

Pass your compute_metrics function to KerasMetricCallback:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics, eval_dataset=tf_validation_set
)
Specify where to push your model and tokenizer in the PushToHubCallback:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)
Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you’re ready to start training your model! Call fit with your training and
validation datasets, the number of epochs, and your callbacks to finetune the
model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
Once training is completed, your model is automatically uploaded to the Hub so
everyone can use it!

For a more in-depth example of how to finetune a model for text classification,
take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference
Great, now that you’ve finetuned a model, you can use it for inference!

Grab some text you’d like to run inference on:

text = (
    "This was a masterpiece. Not completely faithful to the books, but "
    "enthralling from beginning to end. Might be my favorite of the three."
)
The simplest way to try out your finetuned model for inference is to use it in a
pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass
your text to it:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]
You can also manually replicate the results of the pipeline if you’d like:

PyTorch
Tokenize the text and return PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")
Pass your inputs to the model and return the logits:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
'POSITIVE'
TensorFlow
Tokenize the text and return TensorFlow tensors:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
Pass your inputs to the model and return the logits:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
Get the class with the highest probability, and use the model’s id2label mapping to
convert it to a text label:

import tensorflow as tf

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
'POSITIVE'
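The manual steps above return only a label, whereas the pipeline also reports a score. For a two-label classifier like this one, that score is the softmax probability of the predicted class. The sketch below reproduces the calculation in plain Python on made-up logits; the values and the helper name softmax are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.2, 4.1]  # made-up logits for a single review

probabilities = softmax(logits)
predicted_class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
result = {
    "label": id2label[predicted_class_id],
    "score": probabilities[predicted_class_id],
}
print(result["label"])  # POSITIVE
```

Since softmax preserves the ordering of the logits, taking the argmax of the raw logits gives the same label. The softmax is only needed to obtain the score.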

Common questions


Finetuning a sentiment analysis model with TensorFlow involves three setup steps: creating an optimizer with a learning rate schedule, converting the datasets to the tf.data.Dataset format, and compiling the model. You choose hyperparameters such as batch size and number of epochs, use KerasMetricCallback to compute accuracy on the validation set, and use PushToHubCallback to share the model. Training itself is a single call to model.fit with the training and validation datasets and the callbacks.

A finetuned sentiment analysis model can be used for inference in two ways: through the pipeline function from the transformers library, or manually by tokenizing the input text, running the model to obtain logits, and taking the argmax to get the predicted class. The pipeline handles tokenization and prediction for you. The manual approach exposes each step, so you can inspect or customize it.

When preprocessing text with the DistilBERT tokenizer, sequences must be truncated so they do not exceed the model's maximum input length. This is done by setting truncation=True in the tokenizer call. In addition, DataCollatorWithPadding pads each batch dynamically to the length of its longest sequence during collation, rather than padding the entire dataset to the maximum length, which reduces wasted computation.

Hyperparameters such as learning rate, batch size, number of training epochs, and weight decay strongly influence how a DistilBERT model converges and generalizes on sentiment analysis. The learning rate (2e-5 in this guide) sets the step size during optimization, the batch size affects the speed and stability of training, and the number of epochs determines how long training continues. Weight decay (0.01 here) acts as regularization to limit overfitting. All of these are specified in TrainingArguments.

Uploading a trained transformer model to the Hugging Face Hub makes it easy to share with the community and simplifies deployment and integration into applications. With the Trainer, you call the push_to_hub() method, which uploads the model and tokenizer saved in the output directory to the Hub. Others can then load the model for inference or further finetuning.

Mapping expected ids to labels makes the model's predictions interpretable. It is done with two dictionaries, id2label and label2id, which associate each label with a numeric identifier ('NEGATIVE' with 0 and 'POSITIVE' with 1). Both are passed to the model when it is loaded, so its configuration can translate predicted class ids into text labels.

Dynamic padding improves training efficiency by padding sequences only to the length of the longest sequence in a batch, rather than padding all sequences to a fixed maximum length. The model processes fewer padding tokens, which reduces computation and memory usage, especially when sequence lengths vary widely. Each batch is only as wide as its longest sequence.

The batch size used to train and evaluate a DistilBERT model affects both memory consumption and training dynamics. Smaller batches use less memory and give more gradient updates per epoch, although each update is noisier. Larger batches can shorten each epoch on capable hardware but need more memory and often a retuned learning rate. This guide uses a batch size of 16 for both training and evaluation.

The accuracy of a DistilBERT model during training can be evaluated by defining a compute_metrics function that compares model predictions to the true labels. The function uses NumPy to take the argmax of the prediction logits, then passes the predictions and references to the accuracy metric from the Evaluate library to compute the score.

DistilBERT is a distilled version of BERT. Its authors report that it is about 40% smaller and 60% faster than BERT while retaining about 97% of its language understanding performance. This efficiency makes DistilBERT suitable for applications such as sentiment analysis where compute or latency is constrained. Full-size BERT models may offer slightly better accuracy on tasks that need finer contextual understanding, but they require more computation and memory. DistilBERT therefore trades a small amount of accuracy for a large gain in efficiency.
