Scraping SAM.gov and USASpending for Federal Contracts via Python
These articles are AI-generated summaries. Please check the original sources for full details.
Scraping SAM.gov + USASpending for Federal Contracts (No API Key, in Python)
SAM.gov and USASpending.gov manage over $700 billion in annual federal contracts but operate as disconnected systems with poor search interfaces. This Python-based scraper on Apify merges data from both sources for $0.02 per contract, even without a SAM.gov API key.
Why This Matters
The two systems use different data models: SAM.gov lists open solicitations, while USASpending records award history. Many scrapers skip attachment data such as Statements of Work; this implementation extracts direct download URLs from the resourceLinks field so the full solicitation package is available for proposal preparation. For relevance scoring, calling an LLM adds latency and cost, whereas local TF-IDF with synonym expansion runs quickly and offline while still approximating semantic matching.
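As a minimal sketch of the attachment extraction described above: it assumes each SAM.gov opportunity record is a dict whose resourceLinks value is a list of URL strings, or missing/None when the notice has no attachments. The function name and the sample record are hypothetical and are not taken from the original scraper.

```python
from typing import Any


def extract_attachment_urls(opportunity: dict[str, Any]) -> list[str]:
    """Return de-duplicated attachment URLs from one opportunity record.

    Assumption: `resourceLinks` is a list of URL strings, or missing/None.
    """
    links = opportunity.get("resourceLinks") or []
    seen: set[str] = set()
    urls: list[str] = []
    for link in links:
        # Skip anything that is not an http(s) URL string or was already seen.
        if isinstance(link, str) and link.startswith("http") and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


# Hypothetical record for illustration only; not real SAM.gov data.
sample = {
    "noticeId": "example-notice-id",
    "title": "Cloud Migration Support Services",
    "resourceLinks": [
        "https://example.com/files/sow.pdf",
        "https://example.com/files/section-l.pdf",
        "https://example.com/files/sow.pdf",
    ],
}
print(extract_attachment_urls(sample))
```

The returned list can then be handed to whatever downloader the pipeline uses.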
Key Insights
- USASpending API (api.usaspending.gov) requires no authentication and provides POST endpoints for awarded contract searches without rate limit drama.
- SAM.gov API keys can take up to 10 business days to acquire, necessitating a hybrid system that functions on USASpending data alone if a key is absent.
- The resourceLinks field in the SAM.gov API response contains essential RFP documents, such as Section L instructions and evaluation criteria, often overlooked by standard scrapers.
- SAM.gov rate limits are tighter than documented, triggering 429 errors at 20 requests per minute despite a documented limit of 60.
- Agency name normalization is critical; for instance, the ‘VA’ can appear as ‘Department of Veterans Affairs’ or ‘Veterans Affairs, Department of’ in different records.
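The agency normalization point can be sketched as follows. This is one possible approach, not the article's implementation: the alias table and the agency_key function are hypothetical, and a real table would be built from the name variants actually observed in both data sources.

```python
import re

# Hypothetical alias table mapping short forms to a canonical lowercase key.
AGENCY_ALIASES = {
    "va": "department of veterans affairs",
    "dod": "department of defense",
}


def agency_key(raw: str) -> str:
    """Map agency name variants to one lowercase key for grouping and dedup."""
    # Collapse whitespace and ignore case.
    name = re.sub(r"\s+", " ", raw).strip().lower()
    # "veterans affairs, department of" -> "department of veterans affairs"
    inverted = re.fullmatch(r"(.+), department of", name)
    if inverted:
        name = f"department of {inverted.group(1)}"
    # Expand known abbreviations; unknown names pass through unchanged.
    return AGENCY_ALIASES.get(name, name)


for variant in ("VA", "Department of Veterans Affairs", "Veterans Affairs, Department of"):
    print(variant, "->", agency_key(variant))
```

Grouping on the key rather than the raw string keeps records from the two sources under the same agency.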
Working Examples
Core execution flow for merging USASpending and SAM.gov data sources.
async def run(self):
    # Source 1: USASpending.gov (no key needed, always works)
    usaspending_opps = await self._fetch_usaspending_opportunities()
    self.opportunities.extend(usaspending_opps)

    # Source 2: SAM.gov (optional, richer data if key provided)
    sam_opps = await self._fetch_sam_opportunities()
    self.opportunities.extend(sam_opps)

    # Deduplicate, rank, filter, push
    self._deduplicate()
    self._score_relevance()
    await self._push_to_dataset()
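The first source in the flow above is an unauthenticated POST search. The sketch below shows roughly what such a request could look like. The endpoint path, filter keys, and field names reflect the public USASpending v2 award search as best understood here and should be checked against the official API documentation. The function names are hypothetical, and blocking urllib stands in for the async client used in the original code so the example stays self-contained.

```python
import json
import urllib.request

# Assumed endpoint; verify against the USASpending API documentation.
USASPENDING_SEARCH_URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"


def build_award_search_payload(keywords, start_date, end_date, page=1, limit=50):
    """Build the JSON body for a keyword search over awarded contracts."""
    return {
        "filters": {
            "keywords": list(keywords),
            # Assumption: codes A-D select contract award types.
            "award_type_codes": ["A", "B", "C", "D"],
            "time_period": [{"start_date": start_date, "end_date": end_date}],
        },
        "fields": [
            "Award ID",
            "Recipient Name",
            "Award Amount",
            "Awarding Agency",
            "Start Date",
            "End Date",
        ],
        "page": page,
        "limit": limit,
    }


def fetch_awards(payload, timeout=30.0):
    """POST the payload and return the list under "results" (network call)."""
    request = urllib.request.Request(
        USASPENDING_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response).get("results", [])


payload = build_award_search_payload(["cloud migration"], "2024-01-01", "2024-12-31")
print(json.dumps(payload, indent=2))
```

Calling fetch_awards(payload) would perform the actual request; it is left uncalled here so the example runs without network access.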
Semantic ranking using TF-IDF with synonym expansion for offline relevance scoring.
import math

def score(self, title: str, description: str) -> float:
    contract_text = f"{title} {title} {description}"  # title weighted 2x
    contract_tokens = _expand_synonyms(_tokenize(contract_text))
    contract_tf = _compute_tf(contract_tokens)

    # Cosine similarity between the TF-IDF vectors of the business
    # profile and the contract text
    dot_product = norm_a = norm_b = 0.0
    for word in set(self._business_tf.keys()) | set(contract_tf.keys()):
        idf = self._idf.get(word, 0.0)
        a = self._business_tf.get(word, 0.0) * idf
        b = contract_tf.get(word, 0.0) * idf
        dot_product += a * b
        norm_a += a * a
        norm_b += b * b

    # An empty vector has no direction, so treat it as no match
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return round(dot_product / (math.sqrt(norm_a) * math.sqrt(norm_b)), 3)
Exponential backoff logic to handle SAM.gov 429 rate limit responses.
import asyncio

MAX_RETRIES = 3
RETRY_DELAYS = [1, 2, 4]  # seconds; doubles after each failed attempt

async def _request_with_retry(self, method, url, **kwargs):
    for attempt in range(MAX_RETRIES):
        if method == 'GET':
            response = await self.http_client.get(url, **kwargs)
        else:
            response = await self.http_client.post(url, **kwargs)

        # Retry on rate limiting (429) and server errors (5xx)
        if response.status_code == 429 or response.status_code >= 500:
            if attempt < MAX_RETRIES - 1:
                await asyncio.sleep(RETRY_DELAYS[attempt])
                continue

        # Success, a non-retryable error, or retries exhausted
        return response
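In the snippet above, MAX_RETRIES is 3, so only the first two entries of RETRY_DELAYS are ever slept on. The variant below is a sketch rather than the author's code: it treats the delay list itself as the retry budget (one initial attempt plus one retry per delay) and exercises the logic against a hypothetical stub client, so it runs without network access. FakeResponse, FlakyClient, and get_with_retry are illustrative names.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    status_code: int


class FlakyClient:
    """Hypothetical stand-in: answers 429 for the first `failures` calls, then 200."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def get(self, url: str) -> FakeResponse:
        self.calls += 1
        return FakeResponse(429 if self.calls <= self.failures else 200)


async def get_with_retry(client, url, delays=(1, 2, 4)):
    """Try once, then retry after each delay while the status is 429 or 5xx."""
    response = await client.get(url)
    for delay in delays:
        if response.status_code != 429 and response.status_code < 500:
            break  # success or a non-retryable client error
        await asyncio.sleep(delay)
        response = await client.get(url)
    return response


async def demo():
    client = FlakyClient(failures=2)
    # Zero-second delays keep the demonstration instant.
    response = await get_with_retry(client, "https://example.invalid/opportunities", delays=(0, 0, 0))
    return response.status_code, client.calls


print(asyncio.run(demo()))  # → (200, 3)
```

If every attempt fails, the last response is returned to the caller unchanged, matching the behavior of the original method.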
Practical Applications
- Business Development Monitoring: Automatically pipeline new contract opportunities into CRMs via webhooks to eliminate manual spreadsheet management.
- Document Archival: Use extracted resourceLinks to automate the download of SOW and RFP documents using tools like wget for offline analysis.
- Agency Trend Prediction: Analyze historical USASpending data to identify budget allocation shifts six months before OMB report publication.
- Pitfall: Relying on keyword-only search often misses relevant bids like ‘FedRAMP-Authorized Infrastructure’ when searching for ‘cloud migration’ due to strict string matching.