Engineering a Search Engine for 3 Million Polish Businesses: Data Pipeline Lessons
These articles are AI-generated summaries. Please check the original sources for full details.
I built a search engine for 3 million Polish businesses — here’s what I learned
Paweł Sobkowiak developed nipgo.pl to unify fragmented corporate data. The system aggregates records from two separate public registries covering over 3 million businesses.
Why This Matters
The project highlights the gap between idealized API documentation and the reality of legacy government data. Two challenges stand out: JSON structures that differ between companies registered in different eras (e.g., pre- vs post-2015 PKD codes), and strict WAF rate limits on the VAT whitelist. Both suggest that data cleaning and pipeline resilience matter more than UI development when working with large-scale public datasets.
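To make the dual-format problem concrete, here is a minimal sketch of a normalizer that accepts either shape and returns one flat list of codes. The summary does not show the real payloads, so both shapes are assumptions: the legacy form is modeled as nested lists of code strings, the newer form as an object (or list of objects) with a hypothetical "code" key.

```python
from typing import Any


def normalize_pkd(raw: Any) -> list[str]:
    """Flatten PKD codes from either assumed registry shape into one list.

    Assumed (hypothetical) shapes, not taken from the real API:
      legacy:  [["62.01.Z", "62.02.Z"], ["63.11.Z"]]
      newer:   {"code": "62.01.Z", "description": "..."} or a list of such
    """
    codes: list[str] = []

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            code = node.get("code")  # hypothetical key name
            if isinstance(code, str):
                codes.append(code)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            codes.append(node)
        # Anything else (None, numbers) is ignored rather than raising.

    walk(raw)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(codes))


legacy = [["62.01.Z", "62.02.Z"], ["62.01.Z"]]
modern = [{"code": "62.01.Z", "description": "Software development"}]
print(normalize_pkd(legacy))
print(normalize_pkd(modern))
```

Normalizing at ingestion time means the search index and the UI only ever see one shape, which keeps the era-specific branching in a single place.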
Key Insights
- Pagination Performance: OFFSET-based pagination causes timeouts on datasets of 2.6M records, necessitating a migration to keyset (cursor-based) pagination.
- GDPR Compliance Constraints: Since 2023, the KRS API has returned masked (asterisked) names for natural persons, so retrieving full identities requires authenticated PDF scraping.
- WAF Rate Limiting: The Ministry of Finance VAT whitelist employs an Imperva WAF limiting requests to ~1,400/day per IP, rendering batch endpoints ineffective.
- Schema Evolution: Industry classification (PKD codes) evolved from nested arrays (pre-2015) to flat objects, requiring logic to handle dual formats.
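The pagination point can be illustrated with a small sketch. OFFSET makes the database walk past every skipped row, so deep pages get slower, while keyset pagination seeks directly to the row after the last key seen. The example uses Python's built-in sqlite3 so it runs anywhere; the original project reportedly uses Supabase, and the table and column names here (companies, id, name) are hypothetical.

```python
import sqlite3
from typing import Iterator


def iter_keyset(
    conn: sqlite3.Connection, page_size: int = 1000
) -> Iterator[list[tuple[int, str]]]:
    """Yield pages ordered by id, resuming after the last id seen."""
    last_id = 0  # cursor: assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, name FROM companies "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # advance the cursor to the last row returned


# Tiny in-memory table standing in for the real registry data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO companies (id, name) VALUES (?, ?)",
    [(i, f"Company {i}") for i in range(1, 11)],
)

pages = list(iter_keyset(conn, page_size=4))
print([len(page) for page in pages])
```

The cursor column needs to be unique and indexed. If the sort key is not unique (for example a registration date), a tiebreaker such as the primary key would have to be added to both the ORDER BY and the WHERE comparison.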
Practical Applications
- Use case: B2B verification systems using Next.js and Supabase to aggregate multiple government APIs into one search interface.
- Pitfall: Prioritizing frontend development over data pipeline validation, resulting in a polished UI displaying messy or inconsistent registry data.
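For a verification system that depends on the VAT whitelist, the roughly 1,400 requests/day per IP ceiling mentioned under Key Insights becomes a planning constraint. The sketch below is one way to reason about it, not the author's implementation: a simple per-day counter plus a back-of-the-envelope estimate. The class and function names are hypothetical, and the estimate assumes one request per record.

```python
import math
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DailyBudget:
    """Counts requests per calendar day and refuses once the limit is hit."""

    limit: int = 1400  # approximate per-IP limit quoted in the summary
    used: int = 0
    day: date = field(default_factory=date.today)

    def try_acquire(self, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if today != self.day:
            # New day: reset the counter.
            self.day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def days_to_refresh(records: int, per_day: int = 1400, ips: int = 1) -> int:
    """Days needed to look up every record once, one request per record."""
    return math.ceil(records / (per_day * ips))


print(days_to_refresh(3_000_000))  # 2143 days from a single IP
```

Under these assumptions a full refresh of 3 million records from one IP would take years, which is why lookups on demand with caching are a more realistic fit than bulk crawling.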