From Text to Tables: Feature Engineering with LLMs for Tabular Data

This technical guide shows how to use the llama-3.3-70b-versatile model to extract structured JSON features from raw customer support tickets. With Pydantic schemas, engineers can programmatically convert free text into numeric signals, such as urgency scores, for downstream scikit-learn classifiers.

Why This Matters

While LLMs are primarily associated with generative tasks, their ability to act as structured feature extractors bridges the gap between unstructured natural language and classical tabular machine learning. In production environments, raw text descriptions often contain latent signals, such as customer frustration or urgency, that are difficult to capture with standard NLP techniques like TF-IDF or simple embeddings.

A practical implementation has to balance feature richness against operational overhead. Real-world applications must account for API latency and cost: calling an LLM for every row in a large dataset can become prohibitively slow and expensive without batching, caching, and retry logic with backoff to manage rate limits.
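
The caching and retry-with-backoff ideas mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the article's implementation: RateLimitError is a hypothetical stand-in for whatever exception the client library raises on HTTP 429, flaky_extract is a fake extractor used only for the demo, and the delay values are assumptions.

```python
import random
import re
import time
from typing import Callable, Dict


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 exception."""


def with_backoff(fn: Callable[[str], dict], max_retries: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep):
    """Wrap fn so rate-limit errors are retried with exponential backoff."""
    def wrapper(text: str) -> dict:
        for attempt in range(max_retries):
            try:
                return fn(text)
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # out of retries: let the caller see the error
                # Delay doubles each attempt (1s, 2s, 4s, ...) plus small jitter.
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        raise ValueError("max_retries must be at least 1")
    return wrapper


def with_cache(fn: Callable[[str], dict]):
    """Wrap fn so tickets that differ only in case or whitespace share a result."""
    cache: Dict[str, dict] = {}

    def wrapper(text: str) -> dict:
        key = re.sub(r"\s+", " ", text).strip().lower()
        if key not in cache:
            cache[key] = fn(text)
        return cache[key]
    return wrapper


# Demo with a fake extractor that is rate-limited on its first two calls.
calls = {"n": 0}


def flaky_extract(text: str) -> dict:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError()
    return {"urgency_score": 3, "is_frustrated": 0}


delays = []  # records requested sleeps instead of actually sleeping
extract = with_cache(with_backoff(flaky_extract, sleep=delays.append))
first = extract("My router is  DOWN")
second = extract("my router is down")  # served from the cache, no new call
```

In real use, the wrapper would catch the client library's own rate-limit exception and keep the default time.sleep; the injected sleep function here only keeps the demo fast.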

Key Insights

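Pydantic’s BaseModel can be used to define a schema that steers LLMs toward consistent, structured JSON objects for data pipelines; because the schema is passed as a prompt instruction, the response should still be validated against it.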
The OpenAI Python client is compatible with Groq’s API, allowing engineers to switch between providers by modifying the base_url to https://api.groq.com/openai/v1.
Llama-3.3-70b-versatile models can accurately map qualitative text to quantitative features like a 1-5 urgency_score or binary is_frustrated flags.
Integrating LLM-extracted features with existing numeric data (e.g., account age) creates a hybrid dataset that provides a more holistic view for Random Forest classifiers.
Free-tier API constraints, such as a limit of 30 requests per minute (RPM), necessitate the use of time.sleep() or asynchronous batching to prevent HTTP 429 (rate limit) errors during preprocessing.
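
As a rough sketch of the time.sleep() approach from the last point: at 30 requests per minute, calls need to be spaced at least 60 / 30 = 2 seconds apart. The helper name throttled_map and its parameters are hypothetical, and the clock and sleep arguments exist only so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable, List


def throttled_map(
    fn: Callable[[str], dict],
    texts: Iterable[str],
    requests_per_minute: int = 30,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Apply fn to each text, spacing calls to stay under the rate limit."""
    min_interval = 60.0 / requests_per_minute  # 30 RPM -> 2.0 seconds
    results: List[dict] = []
    last_call = None
    for text in texts:
        if last_call is not None:
            # Sleep only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fn(text))
    return results
```

Fixed spacing is the simplest option; it trades throughput for predictability and does not replace retry handling, since a provider may still return a rate-limit error.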

Working Examples

Defining a Pydantic schema and a function to extract structured features from text using the llama-3.3-70b-versatile model via the Groq API:

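import json
import os

from openai import OpenAI
from pydantic import BaseModel, Field

# Groq exposes an OpenAI-compatible endpoint, so the OpenAI client works once
# base_url is changed. Reading the key from GROQ_API_KEY is an assumption.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

class TicketFeatures(BaseModel):
    urgency_score: int = Field(description="Urgency of the ticket on a scale of 1 to 5")
    is_frustrated: int = Field(description="1 if the user expresses frustration, 0 otherwise")

def extract_features(text: str) -> dict:
    # Embed the JSON schema in the system prompt so the model sees the target fields.
    schema_instructions = json.dumps(TicketFeatures.model_json_schema())
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": f"You are an extraction assistant. Output ONLY valid JSON matching this schema: {schema_instructions}"},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},  # request a JSON object reply
        temperature=0.0  # minimize sampling variation across runs
    )
    return json.loads(response.choices[0].message.content)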
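
Because the schema reaches the model only as prompt text, a reply can be valid JSON and still fall outside the intended ranges. One possible safeguard, sketched here as an assumption rather than as the article's method, is to validate each reply with a stricter model. StrictTicketFeatures and parse_features are hypothetical names, and the ge/le bounds are added for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class StrictTicketFeatures(BaseModel):
    # Range constraints are an added assumption, mirroring the field descriptions.
    urgency_score: int = Field(ge=1, le=5, description="Urgency on a scale of 1 to 5")
    is_frustrated: int = Field(ge=0, le=1, description="1 if frustrated, 0 otherwise")


def parse_features(raw_json: str) -> Optional[dict]:
    """Return validated features, or None if the reply is malformed or out of range."""
    try:
        return StrictTicketFeatures.model_validate_json(raw_json).model_dump()
    except ValidationError:
        return None
```

Returning None keeps a preprocessing loop running so failed rows can be retried or imputed later; raising the error instead would be the stricter choice.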

Practical Applications

Use Case: Automated Support Routing - A system uses LLM-extracted urgency scores to prioritize hardware replacement tickets over general billing inquiries. Pitfall: Processing high-volume streams without caching leads to redundant API costs for identical or near-duplicate ticket descriptions.
Use Case: Hybrid Churn Prediction - Combining account metadata with sentiment features extracted from customer emails to improve Random Forest classifier accuracy. Pitfall: Failing to implement retries with backoff results in incomplete datasets when API rate limits are hit during the feature engineering phase.
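
To make the hybrid churn use case concrete, the sketch below joins LLM-extracted features with an account-age column and fits a scikit-learn Random Forest. All values are made up purely for illustration, the column choice is an assumption, and six rows are far too few to say anything about accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative, made-up rows: features as extract_features might return them.
llm_features = [
    {"urgency_score": 5, "is_frustrated": 1},
    {"urgency_score": 1, "is_frustrated": 0},
    {"urgency_score": 4, "is_frustrated": 1},
    {"urgency_score": 2, "is_frustrated": 0},
    {"urgency_score": 5, "is_frustrated": 1},
    {"urgency_score": 1, "is_frustrated": 0},
]
account_age_days = [30, 900, 60, 700, 15, 1200]  # existing numeric metadata
churned = [1, 0, 1, 0, 1, 0]                     # hypothetical labels

# Build the hybrid matrix: two LLM-derived columns plus one numeric column.
X = np.array([
    [f["urgency_score"], f["is_frustrated"], age]
    for f, age in zip(llm_features, account_age_days)
])
y = np.array(churned)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Score a new ticket: urgency 4, frustrated, 45-day-old account.
prediction = model.predict(np.array([[4, 1, 45]]))
```

With real data, the same matrix would normally be split into training and test sets before any claim about improved accuracy is made.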

References:

https://machinelearningmastery.com/from-text-to-tables-feature-engineering-with-llms-for-tabular-data/
