Fine-tune a BERT model with a Colab TPU.

George Soloupis
7 min read · May 28, 2021

Written by George Soloupis, ML GDE.

This is a tutorial on how to fine-tune a BERT model that was pre-trained specifically on the Greek language to perform the downstream task of text classification, using Colab's lightning-fast TPU (v2-8).

First, some info about the BERT model.

One of the biggest challenges in NLP is the lack of enough training data. Overall, there is an enormous amount of text data available, but if we want to create task-specific datasets, we need to split that pile into many diverse fields. And when we do this, we end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, in order to perform well, deep learning based NLP models require much larger amounts of data: they see major improvements when trained on millions, or billions, of annotated training examples. To help bridge this gap, researchers have developed various techniques for training general-purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training). These general-purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g. for problems like question answering and sentiment analysis. This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch.

BERT is a recent addition to these techniques for NLP pre-training. It caused a stir in the deep learning community because it presented state-of-the-art results in a wide variety of NLP tasks, like text classification. It is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like text classification). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

BERT was built upon recent work in pre-training contextual representations. Contextual models generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I made a bank deposit”, BERT represents “bank” using both its left and right context (“I made a ... deposit”), starting from the very bottom of a deep neural network, so it is deeply bidirectional. BERT uses a simple approach for this: we mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words.

How does it work?

BERT relies on a Transformer (the attention mechanism that learns contextual relationships between words in a text). A basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task. Since BERT’s goal is to generate a language representation model, it only needs the encoder part. The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network. But before processing can start, BERT needs the input to be massaged and decorated with some extra metadata:

  • Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  • Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences.
  • Positional embeddings: A positional embedding is added to each token to indicate its position in the sentence.

Essentially, the Transformer encoder stacks layers that map sequences to sequences, so the output is also a sequence of vectors with a 1:1 correspondence between input and output tokens at the same index.
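To make the input format concrete, here is a hedged sketch of how the [CLS] and [SEP] tokens and the segment ids show up when you run a pair of sentences through the Hugging Face tokenizer of the Greek model used later in this post (the example sentences are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

# Encode a pair of placeholder sentences A and B.
encoded = tokenizer("Sentence A goes here.", "Sentence B goes here.")

# [CLS] is prepended and a [SEP] closes each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Segment ids: 0 for the tokens of sentence A, 1 for the tokens of sentence B.
print(encoded["token_type_ids"])
# The positional embeddings are added inside the model itself, not by the tokenizer.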

Large variety of models

You can find implementations in Google's original repository and on the Hugging Face website, which offers 8 architectures with over 30 pretrained models, some in more than 100 languages. Here we are going to use the Greek version of the model that was publicly released by AUEB (Athens University of Economics and Business) at this GitHub repository, and we will do a code walkthrough of the most important parts of this detailed Colab notebook.

Colab walkthrough

First, we have to connect to the TPU worker. The following code connects to the TPU worker and changes TensorFlow’s default device to the CPU device on the TPU worker. It also defines a TPU distribution strategy that you will use to distribute model training onto the 8 separate TPU cores available on this one TPU worker. See TensorFlow’s TPU guide for more information.
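A sketch of that setup, based on TensorFlow's standard TPU initialization (the exact notebook code may differ slightly), looks like this:

import tensorflow as tf

# Connect to the TPU worker and initialize it; this also enters the TPU device scope.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver.connect()

# Distribution strategy that replicates training across the 8 cores of the TPU worker.
strategy = tf.distribute.TPUStrategy(resolver)
print("Number of TPU cores:", strategy.num_replicas_in_sync)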

Second, to go fast on a TPU, increase the batch size. The rule of thumb is to use batches of 128 elements per core (e.g. a batch size of 128*8=1024 for a TPU with 8 cores). At this size, the 128x128 hardware matrix multipliers of the TPU are most likely to be kept busy, although you start seeing interesting speedups from a batch size of 8 per core. In the sample below, the batch size is scaled with the core count. A note on mixed precision: on TPU, bfloat16/float32 mixed precision is used automatically in TPU computations, and enabling it in Keras also stores the relevant variables in bfloat16 format (a memory optimization). On GPU, specifically the V100, mixed precision must be enabled explicitly for the hardware Tensor Cores to be used, and XLA compilation must be enabled for this to work; on TPU, XLA compilation is the default.
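That scaling typically looks like the lines below (the per-core value is an assumption for illustration, not necessarily the notebook's exact number):

BATCH_SIZE_PER_REPLICA = 16  # assumed per-core batch size, adjust to your model and memory
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync  # 16 * 8 = 128 on a TPU v2-8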

Third, we load the model and the tokenizer from Hugging Face with the simple lines of code below.

# Install transformers and download the Greek-specific model and tokenizer files
# This is done only for testing some inputs
# For fine-tuning the BERT model we have to load the model again when we build it inside strategy.scope()
!pip install transformers
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = TFAutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

Fourth, some simple steps follow, such as printing the model summary, loading the dataset and labels, and testing the model with a simple input.
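A hedged sketch of what such a quick test could look like, using the tokenizer and model loaded above (not the notebook's exact code; the sample sentence is a placeholder):

# Run one Greek sentence through the tokenizer and the raw encoder.
sample = tokenizer("Αυτό είναι ένα δοκιμαστικό κείμενο.", return_tensors="tf")
outputs = model(sample)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for a bert-base encoder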

Fifth, we preprocess the text dataset, since the tokenizer will not be included in the model,

and we prepare all the inputs:

input_x = bert_encode(text_list, bert_preprocess_model)
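The bert_encode helper itself is defined in the notebook, where its second argument is a preprocessing model. A minimal sketch of what such a function can do, passing the Hugging Face tokenizer instead (the maximum length is an assumption), is:

import numpy as np

def bert_encode(texts, tokenizer, max_len=128):
    # Tokenize a list of strings into fixed-length arrays that the model accepts:
    # input ids, attention mask and token type ids.
    encoded = tokenizer(
        list(texts),
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    return [
        np.asarray(encoded["input_ids"]),
        np.asarray(encoded["attention_mask"]),
        np.asarray(encoded["token_type_ids"]),
    ]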

Sixth, we load the model again inside the build function, as this places the hub.KerasLayer inside strategy.scope().
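Below is a hedged sketch of such a build function, using the Hugging Face model directly as a Keras layer (the notebook may wrap a hub.KerasLayer instead; the classification head, number of classes and sequence length are assumptions):

import tensorflow as tf
from transformers import TFAutoModel

NUM_CLASSES = 4  # assumption: number of target labels in the classification task
MAX_LEN = 128    # assumption: must match the preprocessing step

def build_model():
    # Loading the pretrained encoder here means that, when build_model() is called
    # under strategy.scope(), its weights are created on the TPU.
    encoder = TFAutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
    token_type_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="token_type_ids")

    outputs = encoder(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # representation of the [CLS] token
    predictions = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(cls_embedding)

    return tf.keras.Model(
        inputs=[input_ids, attention_mask, token_type_ids],
        outputs=predictions,
    )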

Seventh, build and compile the model under strategy.scope().

# Creating the model in the TPUStrategy scope places the model on the TPU
with strategy.scope():
    model = build_model()
    model.compile(tf.keras.optimizers.Adam(lr=1e-5),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'],
                  steps_per_execution=32)

model.summary()

In the above code, starting with TensorFlow 2.4, model.compile() accepts a new steps_per_execution parameter. This parameter instructs Keras to send multiple batches to the TPU at once. In addition to lowering communication overheads, it gives the XLA compiler the opportunity to optimize TPU hardware utilization across multiple batches. With this option, it is no longer necessary to push batch sizes to very high values to optimize TPU performance. As long as you use batch sizes of at least 8 per core (>=64 for a TPU v3-8), performance should be acceptable.

Eighth, we train the model:

train_history = model.fit(
    input_x, train_labels,
    validation_split=0.2,
    epochs=5,
    batch_size=32,
    verbose=1)

and we get the result: an outstanding 58 ms/step during training, much faster than a Tesla T4 GPU, which needs about 700 ms/step.

Model saving/loading on TPUs

When saving and loading TPU models to/from the local disk, the experimental_io_device option must be used.

Saving:

save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model.save('./model', options=save_locally)  # saving in TensorFlow's "SavedModel" format

Loading:

with strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    model = tf.keras.models.load_model('./model', options=load_locally)  # loading in TensorFlow's "SavedModel" format

The strategy scope instructs TensorFlow to instantiate all the variables of the model in the memory of the TPU. The TPUClusterResolver.connect() call automatically enters the TPU device scope, which instructs TensorFlow to run TensorFlow operations on the TPU. Now, if you call model.save('./model') while you are connected to a TPU, TensorFlow will try to run the save operations on the TPU, and since the TPU is a network-connected accelerator that has no access to your local disk, the operation will fail. Notice that saving to GCS will work, though: the TPU does have access to GCS.

If you want to save a TPU model to your local disk, you need to run the saving operation on your local machine, and that is what the experimental_io_device='/job:localhost' flag does.
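For comparison, saving straight to a Cloud Storage bucket needs no special option, since the TPU worker can reach GCS directly (the bucket name below is a placeholder):

# The TPU worker has access to GCS, so no experimental_io_device is needed here.
model.save('gs://my-bucket/greek-bert-classifier')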

Check a detailed article on Kaggle here, and find full details and acknowledgements in my GitHub repository.

That brings us to the end of the tutorial. We did a step-by-step walkthrough of the Colab notebook, pointing out the details of fine-tuning a BERT model trained on a Greek text corpus for text classification with the use of a TPU.

Special thanks to Sayak Paul for his help.


George Soloupis

I am a pharmacist turned Android developer and machine learning engineer. Right now I am a senior Android developer at Invisalign, and an ML & Android GDE.