Building Scalable ML Data Pipelines for Image and Structured Data with Daft
These articles are AI-generated summaries. Please check the original sources for full details.
A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing
Michal Sutter demonstrates Daft, a high-performance Python-native data engine designed for complex analytical pipelines. The system processes the MNIST dataset by transforming raw JSON-based pixel arrays into 28x28 structured images using scalable User-Defined Functions (UDFs).
Why This Matters
Traditional Python data tools often struggle with modern ML workloads that mix structured tabular data with unstructured image tensors. Engineers frequently hit bottlenecks when switching between Pandas for metadata and specialized libraries for image processing, which leads to fragmented code that is hard to scale. Daft addresses this with a unified, lazy execution engine that handles both data types in a single, memory-efficient framework. This reduces the performance limitations of combining standard Python libraries while keeping an API that feels familiar to engineers.
Key Insights
- A clean environment such as Google Colab, with daft, pyarrow, and numpy installed, keeps library versions consistent and the setup reproducible.
- Row-wise UDFs enable the transformation of 1D pixel arrays into 28x28 matrices using np.reshape for model-ready inputs.
- Batch UDFs with batch_size=512 handle up to 512 images per call during feature extraction, which reduces per-row call overhead in Python-native execution.
- Integration of statistical features like pixel_mean and pixel_std (float32) allows for immediate data validation and enrichment before model training.
- Persistence to Parquet format via df.write_parquet() ensures that processed features are stored in a compressed, industry-standard format for production reuse.
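The pixel_mean and pixel_std insight can be illustrated outside Daft with plain NumPy. The sketch below is only an assumption about how such statistics might be computed and sanity-checked; the helper names pixel_stats and looks_valid and the 0 to 255 range rule are hypothetical and do not come from the original tutorial.

```python
import numpy as np

def pixel_stats(pixels):
    """Return (pixel_mean, pixel_std) as float32, or (None, None) for bad input."""
    arr = np.asarray(pixels, dtype=np.float32)
    if arr.size != 784:  # MNIST images are 28 * 28 = 784 pixels
        return None, None
    return np.float32(arr.mean()), np.float32(arr.std())

def looks_valid(mean, std):
    """Hypothetical validation rule: stats must exist and lie within 0..255."""
    return mean is not None and 0.0 <= mean <= 255.0 and 0.0 <= std <= 255.0

# Synthetic image: half black pixels, half white pixels.
flat = [0] * 392 + [255] * 392
mean, std = pixel_stats(flat)
print(mean, std, looks_valid(mean, std))
```

In a Daft pipeline these two values would typically become separate float32 columns, so that rows with missing or out-of-range statistics can be filtered before training.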
Working Examples
Reshaping raw pixel data into 28x28 images using Daft’s row-wise column application.
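import daft
import numpy as np
from daft import col

def to_28x28(pixels):
    # Convert the flat pixel list to float32; MNIST images have 28 * 28 = 784 values.
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None  # malformed rows become nulls instead of raising
    return arr.reshape(28, 28)

# df holds the raw MNIST records; loading it is not shown in this summary.
df2 = (
    df
    .with_column(
        "img_28x28",
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
)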
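The reshape helper can be checked on its own, without a Daft DataFrame. The following sketch repeats the function and calls it on one well-formed and one malformed pixel list; the input values are made up for illustration.

```python
import numpy as np

def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

good = to_28x28(list(range(784)))  # well-formed: exactly 784 values
bad = to_28x28([0, 1, 2])          # malformed: too few values

print(good.shape)  # (28, 28)
print(good[1, 0])  # 28.0, because reshape fills rows first
print(bad)         # None
```

Returning None for malformed rows lets the pipeline carry them forward as nulls, which later steps can skip or filter.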
Implementing a batch UDF for optimized feature engineering with a specified batch size of 512.
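import daft
import numpy as np

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    out = []
    for img in images_28x28.to_pylist():
        if img is None:
            out.append(None)  # keep nulls aligned with their input rows
            continue
        img = np.asarray(img, dtype=np.float32)
        # Row and column intensity profiles, scaled by the maximum pixel value.
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        # Intensity-weighted centroid; the epsilon guards against all-black images.
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        stats = np.array(
            [cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32
        )
        # 28 row sums + 28 column sums + 4 statistics = 60 features per image.
        vec = np.concatenate([row_sums, col_sums, stats])
        out.append(vec.astype(np.float32).tolist())
    return out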
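To make the layout of the feature vector concrete, the per-image arithmetic can be run in plain NumPy on a synthetic image. This is a sketch, not the tutorial's code: featurize_one is a hypothetical helper that mirrors the loop body of the batch UDF, and the single bright pixel is an invented test input.

```python
import numpy as np

def featurize_one(img):
    """Plain-NumPy copy of the per-image feature math (hypothetical helper)."""
    img = np.asarray(img, dtype=np.float32)
    row_sums = img.sum(axis=1) / 255.0
    col_sums = img.sum(axis=0) / 255.0
    total = img.sum() + 1e-6
    ys, xs = np.indices(img.shape)
    cy = float((ys * img).sum() / total) / 28.0
    cx = float((xs * img).sum() / total) / 28.0
    stats = np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0], dtype=np.float32)
    return np.concatenate([row_sums, col_sums, stats]).astype(np.float32)

# One fully bright pixel at row 7, column 14 on a black background.
img = np.zeros((28, 28), dtype=np.float32)
img[7, 14] = 255.0
vec = featurize_one(img)

print(len(vec))          # 60 = 28 row sums + 28 column sums + 4 statistics
print(vec[56], vec[57])  # normalized centroid (cy, cx), about 0.25 and 0.5
```

Indices 0 to 27 hold the row sums, 28 to 55 the column sums, and 56 to 59 the centroid, mean, and standard deviation, all scaled to a small numeric range.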
Practical Applications
- Use Case: MNIST Classification Pipeline: Transforming raw JSON image data into engineered features for Scikit-learn Logistic Regression models.
- Pitfall: Row-wise vs. Batch UDFs: Using row-wise operations for heavy feature extraction can lead to significant performance degradation compared to Daft’s vectorized batch UDFs.
- Use Case: Feature Persistence: Storing large-scale image metadata and engineered vectors in Parquet to avoid re-computing complex transformations in downstream training loops.
- Pitfall: Schema Mismatch: Failing to specify return_dtype in Daft UDFs can cause execution failures when the engine attempts to optimize the lazy physical plan.
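The row-wise versus batch pitfall comes down to how many Python-level calls are made. The sketch below uses plain NumPy rather than Daft, with a randomly generated stand-in batch of 512 images, to show that one call over a stacked batch yields the same per-image means as 512 separate calls. It makes no timing claims.

```python
import numpy as np

# Stand-in for one batch of 512 MNIST-sized images (random values, not real data).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(512, 28, 28)).astype(np.float64)

# Row-wise style: one Python-level call per image.
loop_means = np.array([img.mean() for img in batch])

# Batch style: a single NumPy call over the whole stacked batch.
batch_means = batch.mean(axis=(1, 2))

print(loop_means.shape, batch_means.shape)
print(np.allclose(loop_means, batch_means))
```

The batch form moves the loop into compiled NumPy code, which is the kind of work a batch UDF can exploit when its body is written against whole arrays rather than individual rows.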