Engineering a Search Engine for 3 Million Polish Businesses: Data Pipeline Lessons

These articles are AI-generated summaries. Please check the original sources for full details.

I built a search engine for 3 million Polish businesses — here’s what I learned

Paweł Sobkowiak developed nipgo.pl to unify fragmented corporate data. The system aggregates records from two separate public registries covering over 3 million businesses.

Why This Matters

The project highlights the gap between idealized API documentation and the reality of legacy government data. Two problems stand out: JSON structures that differ for companies registered in different eras (for example, pre- and post-2015 PKD codes), and strict WAF rate limits on the VAT whitelist. Both suggest that data cleaning and pipeline resilience matter more than UI development when working with large-scale public datasets.

Key Insights

  • Pagination Performance: OFFSET-based pagination causes timeouts on datasets of 2.6M records, necessitating a migration to keyset (cursor-based) pagination.
  • GDPR Compliance Constraints: Since 2023, the KRS API has returned asterisked (masked) names for natural persons, so retrieving full identities requires authenticated PDF scraping.
  • WAF Rate Limiting: The Ministry of Finance VAT whitelist employs an Imperva WAF limiting requests to ~1,400/day per IP, rendering batch endpoints ineffective.
  • Schema Evolution: Industry classification (PKD codes) evolved from nested arrays (pre-2015) to flat objects, requiring logic to handle dual formats.
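
The pagination point above can be illustrated with a minimal sketch. OFFSET makes the database walk past every skipped row, so deep pages get progressively slower; keyset pagination instead filters on the last key seen and lets the primary-key index seek straight to the next page. The table name `companies`, its columns, and the helper names below are assumptions made for illustration, and an in-memory SQLite database stands in for whatever database the real project uses.

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=3):
    """Return one page of rows with id greater than the cursor, plus the next cursor."""
    rows = conn.execute(
        # Keyset pagination: filter on the last seen key instead of using OFFSET.
        "SELECT id, name FROM companies WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A full page means there may be more rows; a short page means we are done.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


def iter_all(conn, page_size=3):
    """Walk the whole table page by page using the cursor."""
    cursor = 0
    while cursor is not None:
        rows, cursor = fetch_page(conn, cursor, page_size)
        yield from rows


def demo_connection():
    """Build a tiny hypothetical table so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(i, f"Company {i}") for i in range(1, 8)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for row in iter_all(conn):
        print(row)
```

The cursor here is a plain integer id. If results must be sorted by a non-unique column, the usual approach is to add a unique tiebreaker column to both the ORDER BY clause and the cursor.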

Practical Applications

  • Use case: B2B verification systems using Next.js and Supabase to aggregate multiple government APIs into one search interface.
  • Pitfall: Prioritizing frontend development over data pipeline validation, resulting in a polished UI displaying messy or inconsistent registry data.
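
One way to guard against this pitfall is to normalize registry records before they reach the UI. The sketch below handles the two PKD layouts mentioned under Key Insights. The summary does not show the actual registry payloads, so the shapes here are assumptions: the older layout is taken to be arrays nested inside arrays, the newer one a flat object, and the field names `code` and `description` are hypothetical.

```python
def _clean(entry):
    """Keep only the fields we care about and trim stray whitespace."""
    return {
        "code": str(entry.get("code", "")).strip(),
        "description": str(entry.get("description", "")).strip(),
    }


def normalize_pkd(raw):
    """Flatten either assumed PKD layout into a list of {code, description} dicts."""
    if raw is None:
        return []
    if isinstance(raw, dict):
        # Assumed newer layout: a single flat object.
        return [_clean(raw)]
    if isinstance(raw, list):
        # Assumed older layout: arrays that may contain further arrays, so recurse.
        result = []
        for item in raw:
            result.extend(normalize_pkd(item))
        return result
    # Fail loudly on anything else rather than passing messy data to the UI.
    raise ValueError(f"Unexpected PKD payload type: {type(raw).__name__}")


if __name__ == "__main__":
    # Hypothetical sample payloads, not real registry responses.
    old_style = [[{"code": "62.01.Z", "description": "Software"}],
                 [{"code": " 63.11.Z ", "description": "Hosting"}]]
    new_style = {"code": "62.01.Z", "description": "Software"}
    print(normalize_pkd(old_style))
    print(normalize_pkd(new_style))
```

Raising on unknown shapes is a deliberate choice in this sketch: a failed import is easier to notice than a polished page showing malformed data.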
