Preparing Data for BERT Training
These articles are AI-generated summaries. Please check the original sources for full details.
Overview
BERT, an encoder-only transformer model, requires specific data preparation for pretraining, which combines two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The pretraining loss is the sum of the losses for both tasks, so every training sample needs labels for each. These labels are derived from the raw text itself rather than annotated by hand.
Preparing data for BERT is computationally expensive and directly affects model quality; mistakes in preparation can degrade the trained model and waste training resources.
Key Insights
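- 15% Masking: BERT selects 15% of input tokens as prediction targets during pretraining to learn contextual representations. Of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged (Devlin et al., 2018).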
- NSP Task: BERT predicts whether the second of two given sentences directly follows the first in the original document, which helps it model relationships between sentences.
- Parquet Format: The Hugging Face datasets library supports efficient data storage and retrieval using Parquet, a columnar file format.
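The masking rule from the first insight can be sketched in plain Python. This is a minimal illustration, not the original article's code: the function name mask_tokens, its parameters, and the use of -100 as the "ignore this position" label (a common PyTorch convention) are all assumptions made for the example.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for MLM.

    labels holds the original token at selected positions and -100 elsewhere
    (-100 is an assumed ignore value, not something the source specifies).
    """
    rng = rng or random.Random()
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    # Special tokens such as [CLS] and [SEP] are never selected.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_select = max(1, round(len(candidates) * mask_prob)) if candidates else 0
    for i in rng.sample(candidates, n_select):
        labels[i] = token_ids[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token (may occasionally be a special token)
            masked[i] = rng.randrange(vocab_size)
        # remaining 10%: keep the original token
    return masked, labels
```

The model only receives a loss signal at the selected positions, which is why the labels at all other positions carry the ignore value.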
Working Example
import tokenizers
from datasets import load_dataset, Dataset

def create_docs(path, name, tokenizer):
    """Load the wikitext dataset and group its tokenized lines into documents."""
    dataset = load_dataset(path, name, split="train")
    docs = [[]]  # start with one open document so docs[-1] always exists
    for line in dataset["text"]:
        line = line.strip()
        if not line or line.startswith("="):
            docs.append([])  # blank line or heading: start a new document
        else:
            tokens = tokenizer.encode(line).ids
            docs[-1].append(tokens)
    docs = [doc for doc in docs if doc]  # remove empty documents
    return docs

def generate_dataset(docs, tokenizer):
    """Yield one training sample per sentence pair.

    generate_samples and create_sample are not defined in this summary;
    see the original article. This assumes generate_samples takes the
    full list of documents.
    """
    for sentence_a, sentence_b, is_random_next in generate_samples(docs):
        yield create_sample(sentence_a, sentence_b, is_random_next, tokenizer)

# load the tokenizer
tokenizer = tokenizers.Tokenizer.from_file("wikitext-103_wordpiece.json")
docs = create_docs("wikitext", "wikitext-103-raw-v1", tokenizer)
# the generator function must accept every key passed in gen_kwargs
dataset = Dataset.from_generator(
    generate_dataset,
    gen_kwargs={"docs": docs, "tokenizer": tokenizer},
)
dataset.to_parquet("wikitext-103_train_data.parquet")
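The example above calls generate_samples, which this summary does not show. As a rough idea of what NSP pair generation can look like, here is a hypothetical sketch. The name generate_nsp_pairs is invented for illustration, and it assumes docs is a list of non-empty documents, each a list of tokenized sentences. The 50/50 split between true-next and random pairs follows Devlin et al. (2018); everything else is a simplification.

```python
import random

def generate_nsp_pairs(docs, rng=None):
    """Yield (sentence_a, sentence_b, is_random_next) from tokenized documents."""
    rng = rng or random.Random()
    for doc_idx, doc in enumerate(docs):
        for i in range(len(doc) - 1):
            if len(docs) > 1 and rng.random() < 0.5:
                # Negative pair: draw sentence_b from a different document.
                other_idx = rng.randrange(len(docs) - 1)
                if other_idx >= doc_idx:
                    other_idx += 1  # skip over the current document
                yield doc[i], rng.choice(docs[other_idx]), True
            else:
                # Positive pair: sentence_b is the actual next sentence.
                yield doc[i], doc[i + 1], False
```

Drawing the negative sentence from a different document keeps it from accidentally being the true next sentence.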
Practical Applications
- Google Search: BERT is used in Google Search to better understand user queries and deliver more relevant results.
- Pitfall: Incorrectly handling sequence lengths during data preparation can lead to padding issues, impacting model performance and increasing computational cost.
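To make the sequence-length pitfall concrete, the sketch below truncates a sentence pair to a fixed length and pads the remainder. It is an illustration under stated assumptions: build_pair_input and its parameters are hypothetical names, and it always trims from the end of the longer sentence, which is simpler than the truncation used in the original BERT code.

```python
def build_pair_input(tokens_a, tokens_b, cls_id, sep_id, pad_id, max_len):
    """Build fixed-length inputs laid out as [CLS] A [SEP] B [SEP] plus padding."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP] tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_len - 3  # positions left after the three special tokens
    while len(a) + len(b) > budget:
        # Trim the longer sentence so neither is cut disproportionately.
        longer = a if len(a) >= len(b) else b
        longer.pop()
    input_ids = [cls_id] + a + [sep_id] + b + [sep_id]
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }
```

With the tokenizers library, the special-token ids could be looked up with tokenizer.token_to_id("[CLS]") and similar calls, assuming the trained vocabulary contains those tokens.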
References:
- https://machinelearningmastery.com/preparing-data-for-bert-training/
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.