Preparing Data for BERT Training

These articles are AI-generated summaries. Please check the original sources for full details.

Overview

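BERT, an encoder-only transformer model, needs specific data preparation for its pretraining phase, which combines Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Pretraining optimizes a combined loss over both tasks, so every training sample must carry labels for each: the original tokens at masked positions and a flag saying whether the second sentence really follows the first. These labels are derived from raw text during data preparation rather than annotated by hand.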

Preparing data for BERT is computationally expensive, so it is worth doing once and saving the result before training starts. It is also crucial for performance: mistakes in this step degrade the trained model and waste training resources.

Key Insights

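  • 15% Masking: BERT selects 15% of input tokens for prediction during pretraining to learn contextual representations; of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged (Devlin et al., 2018).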
  • NSP Task: BERT predicts whether the second of two given sentences directly follows the first in the original document, which trains it to model relationships between sentences.
  • Parquet Format: The Hugging Face datasets library can write and read datasets in Parquet, a columnar storage format, so prepared samples can be stored once and loaded efficiently.
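
The masking rule in the first bullet can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the function name mask_tokens, its parameters, and the use of -100 as the "ignore" label (a common convention for cross-entropy losses in PyTorch) are assumptions made for the example.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids.

    labels holds the original id at positions selected for prediction
    and -100 everywhere else (assumed "ignore" value, see lead-in).
    """
    rng = rng or random.Random()
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens such as [CLS]/[SEP] are never selected.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token id
        # remaining 10%: leave the original token in place
    return masked, labels
```

Note that the random replacement here draws from the whole vocabulary, so it can occasionally pick a special token or the original token itself; a production pipeline may want to exclude those.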

Working Example

import tokenizers
from datasets import load_dataset, Dataset

def create_docs(path, name, tokenizer):
    """Load a wikitext dataset and return documents as lists of tokenized lines"""
    dataset = load_dataset(path, name, split="train")
    docs = [[]]  # start with an open document so the first text line has somewhere to go
    for line in dataset["text"]:
        line = line.strip()
        if not line or line.startswith("="):
            docs.append([])  # blank line or "= Heading =" marks a document boundary
        else:
            tokens = tokenizer.encode(line).ids  # token ids for this line
            docs[-1].append(tokens)
    docs = [doc for doc in docs if doc]  # remove empty documents
    return docs

# load the tokenizer
tokenizer = tokenizers.Tokenizer.from_file("wikitext-103_wordpiece.json")
docs = create_docs("wikitext", "wikitext-103-raw-v1", tokenizer)

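# NOTE: generate_samples() and create_sample() are not defined in this summary
# and must be supplied (see the original source). generate_samples(docs) is
# assumed to yield (sentence_a, sentence_b, is_random_next) tuples.
def sample_generator(docs, tokenizer):
    # the parameter names must match the keys of gen_kwargs below
    for sentence_a, sentence_b, is_random_next in generate_samples(docs):
        yield create_sample(sentence_a, sentence_b, is_random_next, tokenizer)

dataset = Dataset.from_generator(
    sample_generator,
    gen_kwargs={"docs": docs, "tokenizer": tokenizer},
)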

dataset.to_parquet("wikitext-103_train_data.parquet")
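
The example above calls generate_samples without showing it. As a rough idea of what such a helper might do, the sketch below pairs each sentence with either its true successor or a random sentence from another document, using the 50/50 split described by Devlin et al. The name generate_nsp_pairs and its signature are hypothetical and chosen for illustration; the original helper may differ.

```python
import random

def generate_nsp_pairs(docs, rng=None):
    """Yield (sentence_a, sentence_b, is_random_next) tuples.

    docs is a list of documents, each a non-empty list of sentences,
    each sentence a list of token ids (the shape create_docs returns).
    """
    rng = rng or random.Random()
    if len(docs) < 2:
        raise ValueError("need at least two documents to draw random sentences")
    for doc_idx, doc in enumerate(docs):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # Negative example: pick a sentence from a different document.
                other_idx = rng.randrange(len(docs) - 1)
                if other_idx >= doc_idx:
                    other_idx += 1  # skip over the current document
                sentence_b = rng.choice(docs[other_idx])
                is_random_next = True
            else:
                # Positive example: the sentence that actually follows.
                sentence_b = doc[i + 1]
                is_random_next = False
            yield sentence_a, sentence_b, is_random_next
```

Drawing the negative from a different document keeps it from accidentally being the true next sentence.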

Practical Applications

  • Google Search: BERT is used in Google Search to better understand user queries and deliver more relevant results.
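  • Pitfall: Every sample must be truncated or padded to a fixed maximum sequence length (512 tokens in the original BERT). Handling this incorrectly either drops content or fills batches with padding, which hurts model performance and increases computational cost.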
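
To make the sequence-length pitfall concrete, here is a minimal sketch of assembling one sentence pair into a fixed-length input. The function name build_input, its parameters, and the trim-the-longer-sentence strategy are assumptions for illustration; the special-token ids depend on the tokenizer and are passed in explicitly.

```python
def build_input(sentence_a, sentence_b, cls_id, sep_id, pad_id, max_len=512):
    """Assemble [CLS] A [SEP] B [SEP], truncated and padded to max_len."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP] tokens")
    a, b = list(sentence_a), list(sentence_b)
    budget = max_len - 3  # tokens available for the two sentences
    # Trim the longer sentence one token at a time until the pair fits.
    while len(a) + len(b) > budget:
        longer = a if len(a) >= len(b) else b
        longer.pop()
    input_ids = [cls_id] + a + [sep_id] + b + [sep_id]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }
```

The attention mask lets the model ignore padded positions, so padding costs compute but does not change what the model attends to.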
