Agoda Unifies Data Pipelines with Apache Spark to Achieve 95.6% Uptime
These articles are AI-generated summaries. Please check the original sources for full details.
Agoda recently consolidated multiple independent financial data pipelines into a centralized Apache Spark-based platform, improving data consistency and achieving 95.6% uptime. The Financial Unified Data Pipeline (FINUDP) processes millions of daily booking transactions, providing hourly updates to downstream teams.
The move addresses a common enterprise issue: siloed data pipelines leading to inconsistent metrics and potential financial reporting errors. Without a unified system, discrepancies can impact critical business decisions and regulatory compliance, costing organizations significant time and resources to reconcile.
Key Insights
- As of 2023, 64% of organizations cited poor data quality as their biggest data challenge.
- Data contracts define the schema and quality expectations agreed between data producers and consumers (Gartner).
- Apache Spark is used by companies like Netflix and Databricks for large-scale data processing.
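To make the data contract idea above more concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The contract layout, the `FieldRule` class, and the booking field names are all hypothetical and are not taken from Agoda's FINUDP; a real contract would usually live in a shared, versioned definition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldRule:
    """One field of a hypothetical data contract: name, type, quality rule."""
    name: str
    dtype: type
    check: Callable[[Any], bool]


# Hypothetical contract for a booking record; field names are illustrative.
BOOKING_CONTRACT = [
    FieldRule("booking_id", str, lambda v: len(v) > 0),
    FieldRule("amount", float, lambda v: v >= 0),
    FieldRule("currency", str, lambda v: len(v) == 3),
]


def contract_violations(record: Dict[str, Any], contract: List[FieldRule]) -> List[str]:
    """Return a readable list of contract violations for one record."""
    problems = []
    for rule in contract:
        if rule.name not in record:
            problems.append(f"missing field: {rule.name}")
        elif not isinstance(record[rule.name], rule.dtype):
            problems.append(f"wrong type for {rule.name}")
        elif not rule.check(record[rule.name]):
            problems.append(f"quality rule failed for {rule.name}")
    return problems


good = {"booking_id": "B-1", "amount": 120.5, "currency": "USD"}
bad = {"booking_id": "B-2", "amount": -5.0}

print(contract_violations(good, BOOKING_CONTRACT))
print(contract_violations(bad, BOOKING_CONTRACT))
```

The producer and the consumer can both run the same check, so a record that breaks the agreement is reported with a reason instead of silently flowing downstream.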
Working Example
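# Example of a basic data validation check in PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def validate_data(df, column_name, min_value, max_value):
    """
    Keeps only the rows whose value in the specified column falls within
    [min_value, max_value]. Out-of-range and null values are filtered out.
    """
    return df.filter((col(column_name) >= min_value) & (col(column_name) <= max_value))


# Illustrative data; in practice 'sales_df' would be an existing Spark DataFrame with an 'amount' column
spark = SparkSession.builder.appName("validation-example").getOrCreate()
sales_df = spark.createDataFrame([(1, 250.0), (2, -10.0), (3, 5000.0)], ["id", "amount"])

validated_df = validate_data(sales_df, "amount", 0, 1000)
validated_df.show()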
Practical Applications
- Financial Institutions: Implementing a unified data pipeline for accurate regulatory reporting and risk management.
- Pitfall: Over-reliance on automated validations without data contracts can lead to undetected schema drift and data quality issues.
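The schema drift mentioned in the pitfall above can be caught with a simple comparison between the agreed schema and the one actually observed. The following is a rough sketch under assumptions: the `schema_drift` helper and both example schemas are hypothetical, and the schemas are plain `{column: type}` dictionaries. With Spark, the observed mapping could be built from `dict(df.dtypes)`.

```python
from typing import Dict, List


def schema_drift(expected: Dict[str, str], observed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column: type} mappings and report the differences."""
    shared = set(expected) & set(observed)
    return {
        # Columns the contract promises but the data no longer has
        "missing": sorted(set(expected) - set(observed)),
        # Columns that appeared without being agreed on
        "unexpected": sorted(set(observed) - set(expected)),
        # Columns present on both sides whose type differs
        "type_changed": sorted(n for n in shared if expected[n] != observed[n]),
    }


# Hypothetical schemas for illustration only.
expected = {"booking_id": "string", "amount": "double", "currency": "string"}
observed = {"booking_id": "string", "amount": "string", "fx_rate": "double"}

report = schema_drift(expected, observed)
print(report)

if any(report.values()):
    print("schema drift detected; block the load or alert the owning team")
```

A range check on `amount` alone would not notice that `currency` disappeared or that `amount` arrived as a string, which is why a schema comparison of this kind complements value-level validation.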