Preparing Data for BERT Training

Overview

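BERT is an encoder-only transformer whose pretraining combines two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The training loss is the sum of the two task losses, so every training example needs labels for both: the original tokens at the masked positions, and a flag saying whether the second sentence really follows the first. Both kinds of label are derived from raw text during data preparation; no manual annotation is required.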

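Tokenizing, pairing, and masking a corpus the size of WikiText-103 is computationally expensive, so the example below does it once and saves the result to disk. Mistakes at this stage, such as wrong labels or badly truncated sequences, carry through to the model and waste training compute.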

Key Insights

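15% Masking: BERT selects 15% of input tokens for prediction during pretraining. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (Devlin et al., 2018).
NSP Task: BERT predicts whether the second sentence of a pair follows the first in the original document. Half of the training pairs are consecutive; the other half pair the first sentence with a random sentence from the corpus. This teaches the model relationships between sentences.
Parquet Format: The Hugging Face datasets library can write and read datasets in Parquet, a columnar storage format, which makes storing and reloading the prepared data efficient.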
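
The masking rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the function name mask_tokens, its parameters, and the use of -100 as the "no prediction here" label (the default ignore index of PyTorch's cross-entropy loss) are all assumptions made for the example.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence (hypothetical helper).

    labels holds the original token at each selected position and -100
    everywhere else, so unselected positions can be ignored by the loss.
    """
    rng = rng or random.Random()
    # Special tokens such as [CLS] and [SEP] are never selected.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    num_to_mask = max(1, round(len(candidates) * mask_prob)) if candidates else 0
    chosen = rng.sample(candidates, num_to_mask)

    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in chosen:
        labels[i] = token_ids[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token (simplification: the draw
            # may occasionally return a special token or the original one)
            masked[i] = rng.randrange(vocab_size)
        # remaining 10%: leave the token unchanged
    return masked, labels
```

A call such as mask_tokens(ids, mask_id=103, vocab_size=30522, special_ids={101, 102}) uses placeholder IDs; the real values depend on the tokenizer's vocabulary.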

Working Example

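import tokenizers
from datasets import load_dataset, Dataset

def create_docs(path, name, tokenizer):
    """Load a wikitext split and group its tokenized lines into documents."""
    dataset = load_dataset(path, name, split="train")
    docs = [[]]  # start with an open document so the first text line has a home
    for line in dataset["text"]:
        line = line.strip()
        if not line or line.startswith("="):
            docs.append([])  # blank line or heading: start a new document
        else:
            tokens = tokenizer.encode(line).ids
            docs[-1].append(tokens)
    docs = [doc for doc in docs if doc]  # remove empty documents
    return docs

def sample_generator(docs, tokenizer):
    """Yield one training example (a dict of columns) per sentence pair."""
    # generate_samples() and create_sample() are not defined in this excerpt.
    # They are assumed to pair sentences for NSP and to apply MLM masking,
    # with create_sample() returning a dict.
    for sentence_a, sentence_b, is_random_next in generate_samples(docs):
        yield create_sample(sentence_a, sentence_b, is_random_next, tokenizer)

# load the tokenizer
tokenizer = tokenizers.Tokenizer.from_file("wikitext-103_wordpiece.json")
docs = create_docs("wikitext", "wikitext-103-raw-v1", tokenizer)

# gen_kwargs are passed to the generator function as keyword arguments,
# so its parameter names must match the dictionary keys
dataset = Dataset.from_generator(
    sample_generator,
    gen_kwargs={"docs": docs, "tokenizer": tokenizer},
)

dataset.to_parquet("wikitext-103_train_data.parquet")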
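
The example above calls generate_samples without defining it. One possible shape for that helper is sketched below, assuming docs is the list returned by create_docs (a list of documents, each a non-empty list of tokenized sentences). The body is a hypothetical simplification: it pairs single sentences, whereas a fuller implementation might pack several sentences into each segment.

```python
import random

def generate_samples(docs, rng=None):
    """Yield (sentence_a, sentence_b, is_random_next) tuples for NSP.

    Hypothetical sketch: roughly half of the pairs are consecutive
    sentences, the rest take sentence_b from a different document.
    """
    rng = rng or random.Random()
    for doc_idx, doc in enumerate(docs):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if len(docs) > 1 and rng.random() < 0.5:
                # Negative pair: pick any document except the current one.
                other_idx = rng.randrange(len(docs) - 1)
                if other_idx >= doc_idx:
                    other_idx += 1
                sentence_b = rng.choice(docs[other_idx])
                is_random_next = True
            else:
                # Positive pair: the sentence that actually follows.
                sentence_b = doc[i + 1]
                is_random_next = False
            yield sentence_a, sentence_b, is_random_next
```

Drawing the negative sentence from a different document keeps a "random" pair from accidentally being the true next sentence.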

Practical Applications

Google Search: BERT is used in Google Search to better understand user queries and deliver more relevant results.
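Pitfall: Mishandling sequence lengths during data preparation. Sentence pairs longer than the model's maximum length must be truncated, and shorter ones padded. Excess padding wastes compute on padding tokens, and padding that is not excluded through the attention mask can hurt model performance.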
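
A minimal sketch of handling sequence length for a sentence pair follows. The function build_input and its parameters are hypothetical, and the column names (input_ids, token_type_ids, attention_mask) follow a common convention rather than anything stated in the article.

```python
def build_input(tokens_a, tokens_b, cls_id, sep_id, pad_id, max_len):
    """Build [CLS] A [SEP] B [SEP], truncated and padded to max_len."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP]")
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)

    # Three positions are reserved for the special tokens.
    budget = max_len - 3
    while len(tokens_a) + len(tokens_b) > budget:
        # Trim the longer sentence first so both keep some content.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    input_ids = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }
```

Every returned list has exactly max_len entries, so examples can be stacked into a batch without further checks.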

References

https://machinelearningmastery.com/preparing-data-for-bert-training/
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
