Building Scalable ML Data Pipelines for Image and Structured Data with Daft

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing

Michal Sutter demonstrates Daft, a high-performance Python-native data engine designed for complex analytical pipelines. The tutorial processes the MNIST dataset, transforming raw JSON pixel arrays into structured 28x28 images with User-Defined Functions (UDFs).

Why This Matters

Traditional Python data tools often struggle with modern ML workloads that mix structured tabular data with unstructured image tensors. Engineers frequently hit bottlenecks when switching between Pandas for metadata and specialized libraries for image processing, which leads to fragmented code that is hard to scale. Daft addresses this with a unified, lazy execution engine that handles both data types within a single, memory-efficient framework. The aim is to avoid the performance limits of standard Python libraries while keeping a familiar API for engineers.

Key Insights

Installing daft, pyarrow, and numpy into a clean environment such as Google Colab keeps package versions consistent and the pipeline reproducible.
Row-wise UDFs transform 1D pixel arrays of 784 values into 28x28 matrices with NumPy's reshape, producing model-ready inputs.
Batch UDFs with batch_size=512 handle feature extraction for up to 512 images per call, which amortizes per-call overhead in Python-native execution.
Integration of statistical features like pixel_mean and pixel_std (float32) allows for immediate data validation and enrichment before model training.
Persistence to Parquet format via df.write_parquet() ensures that processed features are stored in a compressed, industry-standard format for production reuse.
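The pixel_mean and pixel_std features mentioned above can be sketched outside Daft with plain NumPy. This is a minimal illustration on a synthetic image, not the tutorial's own code: the helper name pixel_stats, the sample data, and the range check are assumptions made for the example.

```python
import numpy as np

def pixel_stats(pixels):
    """Return (pixel_mean, pixel_std) as float32 for one flat pixel array."""
    arr = np.asarray(pixels, dtype=np.float32)
    return np.float32(arr.mean()), np.float32(arr.std())

# Synthetic 784-pixel "image" (hypothetical data): half black, half white.
sample = [0] * 392 + [255] * 392
pixel_mean, pixel_std = pixel_stats(sample)

# A simple validation rule: the mean of 8-bit pixels must lie in [0, 255].
is_valid = 0.0 <= pixel_mean <= 255.0
print(pixel_mean, pixel_std, is_valid)
```

Statistics like these are cheap to compute and make it easy to spot blank or corrupted images before training.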

Working Examples

Reshaping raw pixel data into 28x28 images using Daft’s row-wise column application.

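import daft
import numpy as np
from daft import col

def to_28x28(pixels):
    # Convert one row's flat pixel list into a float32 array.
    arr = np.array(pixels, dtype=np.float32)
    # Return None for rows that do not hold exactly 28 * 28 = 784 pixels.
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

# df is assumed to be the MNIST DataFrame loaded earlier, with flat pixel lists in "image".
df2 = (
    df
    .with_column(
        "img_28x28",
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
)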
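The reshaping function can be checked on its own, without a Daft DataFrame. The sketch below repeats to_28x28 from the example and calls it on two hypothetical inputs, one well-formed and one malformed, to show the None fallback.

```python
import numpy as np

def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

# Hypothetical inputs: a full 784-value row and a truncated row.
good = to_28x28(list(range(784)))
bad = to_28x28([0, 1, 2])

print(good.shape, good.dtype)  # 28x28 float32 matrix
print(good[1, 0])              # reshape is row-major, so this is flat index 28
print(bad)                     # malformed rows become None
```

Returning None rather than raising lets the pipeline keep running and leaves malformed rows to be handled downstream.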

Implementing a batch UDF for optimized feature engineering with a specified batch size of 512.

import daft
import numpy as np

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    out = []
    # Each call receives a Daft Series holding up to 512 images.
    for img in images_28x28.to_pylist():
        if img is None:
            # Propagate rows that failed reshaping.
            out.append(None)
            continue
        img = np.asarray(img, dtype=np.float32)
        # Row and column intensity profiles, scaled by the maximum pixel value.
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        # Intensity-weighted centroid; 1e-6 guards against division by zero.
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        stats = np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)
        vec = np.concatenate([row_sums, col_sums, stats])
        out.append(vec.astype(np.float32).tolist())
    return out
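The layout of the resulting feature vector is easier to see on a single image. The sketch below lifts the per-image arithmetic from featurize into a standalone helper and applies it to a synthetic image with one lit pixel. The name featurize_one and the test image are assumptions for illustration. The vector has 28 row sums, 28 column sums, and 4 summary values, for 60 entries in total.

```python
import numpy as np

def featurize_one(img):
    """Per-image feature math from featurize, without the Daft decorator."""
    img = np.asarray(img, dtype=np.float32)
    row_sums = img.sum(axis=1) / 255.0
    col_sums = img.sum(axis=0) / 255.0
    total = img.sum() + 1e-6
    ys, xs = np.indices(img.shape)
    cy = float((ys * img).sum() / total) / 28.0
    cx = float((xs * img).sum() / total) / 28.0
    tail = np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)
    return np.concatenate([row_sums, col_sums, tail]).astype(np.float32)

# Hypothetical image: all black except one white pixel at row 10, column 20.
img = np.zeros((28, 28), dtype=np.float32)
img[10, 20] = 255.0
vec = featurize_one(img)

print(len(vec))          # 28 row sums + 28 column sums + 4 summary values
print(vec[10], vec[48])  # row 10 profile and column 20 profile (index 28 + 20)
print(vec[56], vec[57])  # normalized centroid: cy near 10/28, cx near 20/28
```

Indices 0 to 27 hold the row profile, 28 to 55 the column profile, and 56 to 59 the centroid and the scaled mean and standard deviation.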

Practical Applications

Use Case: MNIST Classification Pipeline: Transforming raw JSON image data into engineered features for Scikit-learn Logistic Regression models.
Pitfall: Row-wise vs. Batch UDFs: Using row-wise operations for heavy feature extraction can lead to significant performance degradation compared to Daft’s vectorized batch UDFs.
Use Case: Feature Persistence: Storing large-scale image metadata and engineered vectors in Parquet to avoid re-computing complex transformations in downstream training loops.
Pitfall: Schema Mismatch: Failing to specify return_dtype in Daft UDFs can cause execution failures when the engine attempts to optimize the lazy physical plan.
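The row-wise versus batch pitfall can be illustrated with plain NumPy. The sketch below does not use Daft's UDF machinery. It only contrasts one Python-level call per image with a single vectorized call over a stacked batch, using randomly generated stand-in data, and checks that both give the same row profiles.

```python
import numpy as np

# Hypothetical batch of 512 random 28x28 images with 8-bit pixel values.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(512, 28, 28)).astype(np.float32)

# Row-wise style: one Python-level call per image.
row_wise = np.stack([img.sum(axis=1) / 255.0 for img in batch])

# Batch style: a single vectorized call over the whole stack.
batched = batch.sum(axis=2) / 255.0

same = np.allclose(row_wise, batched)
print(row_wise.shape, batched.shape, same)
```

Both paths produce a 512x28 array. The batched form does the work in one NumPy call, which is the kind of overhead reduction that batch UDFs make possible when the function body is itself vectorized.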

References:

https://www.marktechpost.com/2026/03/05/a-coding-guide-to-build-a-scalable-end-to-end-machine-learning-data-pipeline-using-daft-for-high-performance-structured-and-image-data-processing/
