Engineering a Search Engine for 3 Million Polish Businesses: Data Pipeline Lessons

I built a search engine for 3 million Polish businesses — here’s what I learned

Paweł Sobkowiak developed nipgo.pl to unify fragmented corporate data. The system aggregates records from two separate public registries covering over 3 million businesses.

Why This Matters

The project highlights the gap between idealized API documentation and the reality of legacy government data. Two challenges stand out: JSON structures that differ for companies registered in different eras (e.g., pre- vs. post-2015 PKD codes) and strict WAF rate limits on the VAT whitelist. Both suggest that, for large-scale public datasets, data cleaning and pipeline resilience matter more than UI development.
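To make the dual-format problem concrete, here is a minimal Python sketch of a normalizer that accepts either shape and emits one uniform list. The two input shapes are hypothetical stand-ins, not the registry's actual payloads: the legacy form is assumed to be nested arrays such as `[["62.01.Z", true]]`, the newer form flat objects such as `{"code": "62.01.Z", "primary": true}`. The function name `normalize_pkd` and the field names `code` and `primary` are likewise illustrative.

```python
from typing import Any


def normalize_pkd(raw: Any) -> list[dict[str, Any]]:
    """Return PKD entries as a uniform list of {"code", "primary"} dicts.

    Both accepted input shapes are HYPOTHETICAL illustrations of a
    nested-array (legacy) and a flat-object (newer) format.
    """
    if raw is None:
        return []
    # A single flat object is treated as a one-element list.
    if isinstance(raw, dict):
        raw = [raw]

    normalized = []
    for item in raw:
        if isinstance(item, dict):
            # Assumed newer shape: {"code": "...", "primary": bool}
            code = item.get("code")
            primary = item.get("primary", False)
        elif isinstance(item, (list, tuple)) and item:
            # Assumed legacy shape: ["...", bool] with the flag optional
            code = item[0]
            primary = item[1] if len(item) > 1 else False
        else:
            # Fail loudly so unknown shapes surface in the pipeline, not the UI.
            raise ValueError(f"Unrecognized PKD entry: {item!r}")
        if not code:
            raise ValueError(f"PKD entry without a code: {item!r}")
        normalized.append(
            {"code": str(code).strip().upper(), "primary": bool(primary)}
        )
    return normalized
```

Raising on unknown shapes, rather than silently skipping them, is a deliberate choice in this sketch: it keeps malformed records from reaching the search index unnoticed.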

Key Insights

Pagination Performance: OFFSET-based pagination causes timeouts on datasets of 2.6M records, necessitating a migration to keyset (cursor-based) pagination.
GDPR Compliance Constraints: Since 2023, the KRS API has returned asterisked (masked) names for natural persons, so retrieving full identities requires authenticated PDF scraping.
WAF Rate Limiting: The Ministry of Finance VAT whitelist employs an Imperva WAF limiting requests to ~1,400/day per IP, rendering batch endpoints ineffective.
Schema Evolution: Industry classification (PKD codes) evolved from nested arrays (pre-2015) to flat objects, requiring logic to handle dual formats.
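The pagination point can be sketched as follows. With OFFSET, the database still has to walk past every skipped row, so deep pages get progressively slower; keyset pagination instead filters on the last key seen and lets the index jump straight to the next page. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the project's actual database, and the `companies` table, its columns, and the helper names are assumptions.

```python
import sqlite3


def build_demo_db(n: int = 10) -> sqlite3.Connection:
    """Create a small in-memory table standing in for the real registry data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(i, f"Company {i}") for i in range(1, n + 1)],
    )
    return conn


def fetch_page(conn: sqlite3.Connection, after_id: int = 0, page_size: int = 100):
    """Return rows with id > after_id and the cursor for the next page.

    The cursor is None when the page came back short, meaning no rows remain.
    If the total is an exact multiple of page_size, one extra empty page is
    fetched before the cursor becomes None.
    """
    rows = conn.execute(
        "SELECT id, name FROM companies WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


def iter_all(conn: sqlite3.Connection, page_size: int = 100):
    """Walk the whole table page by page without ever using OFFSET."""
    cursor = 0
    while cursor is not None:
        rows, cursor = fetch_page(conn, cursor, page_size)
        yield from rows
```

The trade-off is that keyset pagination needs a unique, indexed sort key and cannot jump to an arbitrary page number, which is usually acceptable for batch processing and "load more" interfaces.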

Practical Applications

Use case: B2B verification systems using Next.js and Supabase to aggregate multiple government APIs into one search interface.
Pitfall: Prioritizing frontend development over data pipeline validation, which results in a polished UI that displays messy or inconsistent registry data.
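For a verification system like the one in the use case, the roughly 1,400 requests/day per IP limit on the VAT whitelist acts as a hard budget. The Python sketch below does the back-of-the-envelope arithmetic and shows one simple way to track a daily allowance. It assumes one request per business and a counter that resets on the calendar day; the names `DAILY_LIMIT`, `days_to_cover`, and `DailyBudget` are illustrative, and how the WAF actually windows its limit is not stated in the source.

```python
import math
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

DAILY_LIMIT = 1_400  # approximate per-IP figure quoted in the article


def days_to_cover(total_records: int, per_day: int = DAILY_LIMIT) -> int:
    """Days needed to query every record once (assumes one request each)."""
    return math.ceil(total_records / per_day)


@dataclass
class DailyBudget:
    """Counts requests per calendar day and refuses work once the limit is hit."""

    limit: int = DAILY_LIMIT
    day: date = field(default_factory=date.today)
    used: int = 0

    def try_spend(self, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if today != self.day:
            # New day: reset the counter (assumed reset rule, not confirmed).
            self.day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Under these assumptions, `days_to_cover(3_000_000)` returns 2143, which is close to six years from a single IP. That is why a full crawl of the whitelist is impractical and lookups are better reserved for on-demand checks.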
