Applied NLP and the LLM Ecosystem

Five years ago, an NLP chapter in a data science book would have started with tokenization, walked through TF-IDF, introduced word2vec, and maybe ended with a fine-tuned BERT classifier. You would have spent most of your time on feature engineering and model selection — the same workflow as any other supervised learning problem.

That chapter is obsolete.

Today, the most impactful NLP work you will do involves no training at all. A well-constructed prompt to GPT-4 or Claude, combined with the right context, will often outperform a custom-trained model on the majority of text tasks you encounter in production. For short inputs, the cost per query is typically fractions of a cent. The development time is hours, not weeks. And the quality ceiling keeps rising every few months as foundation model providers ship new versions.

This does not mean that prompting is always the answer. It means that prompting is the default — the starting point that you deviate from only when you have a concrete reason. Those reasons exist, they are common, and recognizing them is the central skill this chapter teaches.

The Three Strategies

Every NLP task you face in production maps to one of three strategies. The right choice depends on your data, your constraints, and the specific failure mode you are trying to avoid.

Strategy 1: Prompt Engineering. Send your input to a foundation model with instructions. No training, no custom data pipeline, no infrastructure beyond an API key. This is the right choice when the task is well-defined, the required knowledge is within the model’s training data, and you can tolerate the latency and cost of API calls.

Strategy 2: Retrieval-Augmented Generation (RAG). Retrieve relevant documents from your own data, inject them into the prompt, and let the model generate answers grounded in your context. This is the right choice when the model lacks domain knowledge, when accuracy on your specific documents matters, or when the underlying information changes frequently.

Strategy 3: Parameter-Efficient Fine-Tuning. Adapt the model’s weights on your data using LoRA or QLoRA. This is the right choice when you need a specific output format the model cannot learn from prompting alone, when you need to reduce inference cost by using a smaller model, or when your task requires behavior that prompting cannot reliably produce.

These are not competing philosophies. They are tools in a decision tree:

[Figure: NLP Strategy Decision Tree]
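As a rough illustration, the criteria from the three strategy descriptions can be written as a small function. This is a minimal sketch under stated assumptions, not the chapter's decision tree itself: the names `TaskProfile` and `choose_strategy`, the boolean fields, and the order in which the checks run are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    PROMPT = "prompt engineering"
    RAG = "retrieval-augmented generation"
    FINE_TUNE = "parameter-efficient fine-tuning"


@dataclass
class TaskProfile:
    """Hypothetical summary of a task; field names are illustrative only."""
    knowledge_in_training_data: bool            # the model already knows the domain
    information_changes_frequently: bool        # the source documents are updated often
    prompting_reliably_produces_behavior: bool  # format and behavior are achievable by prompt
    needs_smaller_model_for_cost: bool          # inference cost forces a smaller model


def choose_strategy(task: TaskProfile) -> Strategy:
    """Start from prompting and deviate only for a concrete reason.

    Assumption: knowledge gaps are checked before behavior gaps, because
    retrieval needs no training run. A real project may order these differently.
    """
    if not task.knowledge_in_training_data or task.information_changes_frequently:
        return Strategy.RAG
    if not task.prompting_reliably_produces_behavior or task.needs_smaller_model_for_cost:
        return Strategy.FINE_TUNE
    return Strategy.PROMPT


# A well-defined task on general knowledge stays with the default.
print(choose_strategy(TaskProfile(True, False, True, False)).value)
# A question-answering task over internal documents moves to retrieval.
print(choose_strategy(TaskProfile(False, True, True, False)).value)
```

Returning a single strategy mirrors the "one of three" framing above; in practice the strategies are often combined, for example a fine-tuned model that also uses retrieval.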

The Cost / Quality / Latency Triangle

Every NLP system operates inside a triangle with three vertices: cost, quality, and latency. You can optimize for two at the expense of the third.

High quality + low latency → high cost. A large model (GPT-4, Claude Opus) produces excellent output quickly, but at $10–75 per million tokens.
High quality + low cost → high latency. A fine-tuned small model can match large-model quality on a narrow task, but the delay here is time to deployment rather than per-request response time: it requires weeks of development time and evaluation infrastructure.
Low latency + low cost → lower quality. A small model with no retrieval is fast and cheap, but produces generic output that lacks domain specificity.

The practical implication: before you write any code, answer three questions. (1) What quality threshold is acceptable? If 85% accuracy is sufficient and 95% is not worth 10x the cost, prompting a mid-tier model may be the right choice. (2) What is the per-query cost budget? If you are processing 10 million documents, even $0.01 per query is $100,000. (3) What is the latency requirement? If users expect sub-second responses, you cannot afford a round trip to a large model with a 2,000-token prompt.
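The cost arithmetic in question (2) is worth writing down before committing to a model. The sketch below reproduces the $100,000 figure and then derives a per-query cost from a per-million-token price. The function names are hypothetical, the single blended token price is a simplification (real providers usually price input and output tokens separately), and the 2,000-token query size is an assumption chosen to match the example above.

```python
def batch_cost_usd(num_queries: int, cost_per_query_usd: float) -> float:
    """Total cost of running every query once."""
    return num_queries * cost_per_query_usd


def cost_per_query_usd(tokens_per_query: int, price_per_million_tokens_usd: float) -> float:
    """Per-query cost under a single blended token price (a simplifying assumption)."""
    return tokens_per_query * price_per_million_tokens_usd / 1_000_000


# The example from the text: 10 million documents at $0.01 each.
print(f"${batch_cost_usd(10_000_000, 0.01):,.0f}")

# The $10 and $75 per-million-token endpoints quoted earlier,
# applied to an assumed 2,000 tokens per query.
for price in (10.0, 75.0):
    per_query = cost_per_query_usd(2_000, price)
    total = batch_cost_usd(10_000_000, per_query)
    print(f"${price:.0f}/M tokens: ${per_query:.2f} per query, ${total:,.0f} for the batch")
```

At these assumed sizes, the per-query cost ranges from $0.02 to $0.15, so the same 10-million-document batch would cost between $200,000 and $1,500,000 on a large model.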

These constraints are not abstract. They determine your architecture. A customer support chatbot with a 200ms latency SLA and 50,000 queries per day rules out GPT-4 on latency and cost grounds simultaneously. A weekly research report generator with no latency constraint and a $500/month budget can afford the best model available.
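One way to make these constraints operational is to check each candidate model against them before building anything. The following is a minimal sketch, assuming you have measured a model's latency and per-query cost yourself. The class and function names are hypothetical, and the model numbers (1,500 ms, $0.05 per query) and the chatbot's $5,000 monthly budget are placeholders invented for illustration, not measurements or figures from this chapter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelProfile:
    name: str
    p95_latency_ms: float      # measured on your own workload
    cost_per_query_usd: float  # measured on your own workload


@dataclass
class Requirements:
    queries_per_day: int
    monthly_budget_usd: float
    latency_sla_ms: Optional[float] = None  # None means no latency constraint


def violations(model: ModelProfile, req: Requirements, days_per_month: int = 30) -> List[str]:
    """Return the constraints the model breaks; an empty list means it fits."""
    problems = []
    if req.latency_sla_ms is not None and model.p95_latency_ms > req.latency_sla_ms:
        problems.append("latency")
    monthly_cost = model.cost_per_query_usd * req.queries_per_day * days_per_month
    if monthly_cost > req.monthly_budget_usd:
        problems.append("cost")
    return problems


# Placeholder numbers for a large hosted model (illustrative, not measured).
large_model = ModelProfile("large-model", p95_latency_ms=1500.0, cost_per_query_usd=0.05)

# The chatbot from the text: 200 ms SLA, 50,000 queries per day (budget is assumed).
chatbot = Requirements(queries_per_day=50_000, monthly_budget_usd=5_000.0, latency_sla_ms=200.0)
# The report generator: no latency constraint, $500 per month.
# One query per day deliberately overestimates a weekly job.
report = Requirements(queries_per_day=1, monthly_budget_usd=500.0)

print(violations(large_model, chatbot))
print(violations(large_model, report))
```

With these placeholder values the chatbot fails on both latency and cost while the report generator passes, which matches the reasoning in the paragraph above. The conclusion depends entirely on the numbers you measure.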

What This Chapter Covers

This chapter is structured in two parts, each spanning two sections:

Sections 7.1–7.2: The Paradigm Shift and RAG Pipelines. The move from train-your-own to prompt-and-orchestrate, when prompting is enough, and when you need retrieval. A complete RAG pipeline built from first principles: chunking, embedding, vector search, context injection, and generation. Evaluation methods for retrieval quality and answer faithfulness. The failure modes that kill RAG systems in production.

Sections 7.3–7.4: Evaluating LLMs and Parameter-Efficient Fine-Tuning. Why n-gram overlap metrics such as BLEU and ROUGE are close to useless for modern generative output, where many differently worded answers can be equally correct, and what to use instead. LLM-as-a-judge evaluation pipelines, BERTScore, and human evaluation protocols. Then the decision to fine-tune: LoRA mathematics, QLoRA for constrained hardware, and a complete fine-tuning pipeline with the PEFT library. A decision framework for when to fine-tune versus when to keep prompting.

By the end of this chapter, you will have a practical decision framework for every NLP task you encounter, the engineering skills to build a production RAG pipeline, the evaluation methodology to measure whether your system actually works, and the fine-tuning capability to close the gap when prompting is not enough. This is the chapter where the tools from Chapters 5 and 6 converge — you will use everything you have learned about evaluation rigor, hardware constraints, and engineering discipline in a domain where the temptation to skip all of it is strongest.