Building a Low-Cost Pipeline for U.S. Congress Trading Data

These articles are AI-generated summaries. Please check the original sources for full details.

I built two Apify actors that scrape U.S. Congress trading data — directly from government sources, no QuiverQuant

Engineer Fatih İlhan developed a custom scraping pipeline using Apify actors to extract U.S. Senate and House Periodic Transaction Reports. The system replaces commercial APIs at approximately 1/10th the cost, operating for less than $1 per day.

Why This Matters

The STOCK Act of 2012 makes congressional trades public, but in practice retrieving them means getting past Akamai bot protection on the Senate’s Django-based efdsearch.senate.gov and parsing chaotic House PDF disclosures. Commercial aggregators often return inconsistent data shapes or paywall transaction-level detail. A pipeline that reads directly from government sources supports idempotent synchronization and parses roughly 95% of records cleanly, without third-party reliability issues or high subscription costs.

Key Insights

  • As of 2026, the Senate's Django application gates searches behind a session-based CSRF check, so requests need a pinned residential proxy to keep state between the agreement POST and data retrieval.
  • House PTR PDFs need a marker-anchored parsing strategy, because text extraction is chaotic and transaction types, dates, and amounts come out with no whitespace separators.
  • The House of Representatives publishes daily-updated ZIP files containing XML indices and individual transaction PDFs at disclosures-clerk.house.gov.
  • Deterministic deduplication using SHA-256 hashes of natural keys (politician, date, asset, and amount) prevents duplicate entries across independent actor runs.
  • Axios default redirect handling can drop critical Set-Cookie headers from 302 responses, requiring manual redirect chain walking for session maintenance.
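
The redirect point above can be illustrated with a minimal sketch of manual redirect walking. This is not the author's code: `walkRedirects` and the `requestOnce` callback are hypothetical names, and the callback is assumed to perform exactly one request without following redirects. With Axios, such a callback could presumably be built on a request using `maxRedirects: 0` and a `validateStatus` that accepts 3xx codes; that wiring is left out here. The sketch ignores cookie attributes (domain, path, expiry) and does not change the HTTP method between hops.

```javascript
// Sketch: follow redirects one hop at a time so cookies set on intermediate
// 3xx responses are kept. `requestOnce(url, cookieHeader)` is a hypothetical
// transport callback resolving to { status, location, setCookie, body }.
async function walkRedirects(startUrl, requestOnce, maxHops = 5) {
  const jar = new Map(); // cookie name -> value
  let url = startUrl;

  for (let hop = 0; hop <= maxHops; hop += 1) {
    const cookieHeader = [...jar].map(([name, value]) => `${name}=${value}`).join('; ');
    const res = await requestOnce(url, cookieHeader);

    // Keep only the name=value part of each Set-Cookie line (attributes ignored).
    for (const line of res.setCookie ?? []) {
      const pair = line.split(';')[0];
      const eq = pair.indexOf('=');
      if (eq > 0) jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }

    const isRedirect = res.status >= 300 && res.status < 400 && Boolean(res.location);
    if (!isRedirect) return { url, status: res.status, cookies: jar, body: res.body };

    // Location may be relative, so resolve it against the current URL.
    url = new URL(res.location, url).toString();
  }
  throw new Error(`More than ${maxHops} redirects starting at ${startUrl}`);
}
```

Because the transport is injected, the same loop can be exercised against a fake callback in tests and against a real HTTP client in production.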

Working Examples

Pinning to a single residential exit IP to maintain the Django prohibition_agreement state.

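// Reuse one session ID for the whole run so every request exits through the same IP.
// proxyConfig is assumed to be an Apify ProxyConfiguration (e.g. from Actor.createProxyConfiguration()).
const sessionId = `senate_${Date.now()}`;
const proxyUrl = await proxyConfig.newUrl(sessionId);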

Marker-anchored regex used to identify row anchors and glued-together transaction data in House PDFs.

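// Row anchor: an optional "(TICKER)" followed by a two-letter bracketed code such as "[ST]".
const MARKER_RE = /(?:\(([A-Z][A-Z0-9.\-]{0,5})\)\s*)?\[([A-Z]{2})\]/;
// Transaction type, two M/D/YYYY dates, and a "$min - $max" range; whitespace between fields is optional.
const TX_RE = /(S\s*\(partial\)|P|S|E)\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*\$([\d,]+)\s*-\s*\$([\d,]+)/;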
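
To show how the two patterns might work together, here is a small sketch that applies them to one glued-together row. The sample string, the `parseRow` helper, and the output field names (`firstDate`, `secondDate`, `amountMin`, and so on) are illustrative assumptions, not the author's schema or real disclosure data.

```javascript
const MARKER_RE = /(?:\(([A-Z][A-Z0-9.\-]{0,5})\)\s*)?\[([A-Z]{2})\]/;
const TX_RE = /(S\s*\(partial\)|P|S|E)\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*\$([\d,]+)\s*-\s*\$([\d,]+)/;

// "15,000" -> 15000
const toNumber = (s) => Number(s.replace(/,/g, ''));

// Hypothetical helper: find the row marker first, then read the transaction
// fields from the text that follows it. Returns null if either pattern fails.
function parseRow(text) {
  const marker = MARKER_RE.exec(text);
  if (!marker) return null;

  const rest = text.slice(marker.index + marker[0].length);
  const tx = TX_RE.exec(rest);
  if (!tx) return null;

  return {
    asset: text.slice(0, marker.index).trim(), // text before the marker
    ticker: marker[1] ?? null,                 // absent when the row has no "(TICKER)"
    assetCode: marker[2],
    type: tx[1].replace(/\s+/g, ' '),
    firstDate: tx[2],
    secondDate: tx[3],
    amountMin: toNumber(tx[4]),
    amountMax: toNumber(tx[5]),
  };
}

// Made-up row with no whitespace between type, dates, and amount.
const sample = 'Apple Inc. (AAPL) [ST]P01/15/202401/20/2024$1,001 - $15,000';
console.log(parseRow(sample));
```

Anchoring on the marker before matching the transaction fields keeps letters in the asset name from being mistaken for a transaction type.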

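Consuming the unified JSON schema via the Apify API client for JavaScript (apify-client).

import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: process.env.APIFY_TOKEN }); // token source is an assumption
const { items } = await client.dataset('senate-dataset-id').listItems({ limit: 200 });
const recentBuys = items.filter(t => t.type === 'buy');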

Practical Applications

  • Use case: Bypassing session-locked government portals by pinning residential proxy IPs. Pitfall: Using standard rotating datacenter proxies causes session expiration and 403 errors.
  • Use case: PDF data extraction for machine-generated documents with poor text ordering. Pitfall: Relying on standard whitespace splitters when font-glyph hacks merge data columns into single strings.
  • Use case: Idempotent database synchronization using SHA-256 content hashes. Pitfall: Using auto-incrementing primary keys, which produce duplicate records when the same source document is scraped twice.
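
The hash-based deduplication described above could look roughly like the following sketch. The natural-key fields (politician, date, asset, amount) come from the summary; the normalization rules, the separator character, and the `tradeKey` and `upsertAll` names are assumptions made for illustration, with a `Map` standing in for the database.

```javascript
const { createHash } = require('node:crypto');

// Build a deterministic ID from the natural key. Values are trimmed,
// lowercased, and whitespace-collapsed (an assumed normalization) so that
// cosmetic differences between runs do not change the hash.
function tradeKey({ politician, date, asset, amount }) {
  const parts = [politician, date, asset, amount].map((value) =>
    String(value).trim().toLowerCase().replace(/\s+/g, ' '));
  // A separator that cannot appear in normal text avoids ambiguous joins.
  return createHash('sha256').update(parts.join('\u001f')).digest('hex');
}

// Idempotent write: re-inserting the same trade overwrites the same key.
function upsertAll(store, records) {
  for (const record of records) store.set(tradeKey(record), record);
  return store;
}

// Made-up record, written twice as if the same document were scraped in two runs.
const trade = { politician: 'Jane Doe', date: '2024-01-15', asset: 'AAPL', amount: '$1,001 - $15,000' };
const store = new Map();
upsertAll(store, [trade]);
upsertAll(store, [{ ...trade, politician: ' jane  doe ' }]);
console.log(store.size); // 1
```

Using the hash as the primary key means a second scrape of the same document updates existing rows instead of appending new ones.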
