Automating Hidden JSON API Discovery for Robust Web Scraping

These articles are AI-generated summaries. Please check the original sources for full details.

I Built a Script That Finds Hidden APIs on Any Website (Here’s the Code)

Alex Spinov developed a Node.js discovery script that identifies the internal JSON endpoints modern websites use. By probing common paths such as /api/v1 and /_next/data, the tool avoids both the fragility of CSS selectors and the overhead of headless browsers.

Why This Matters

Traditional web scraping relies on DOM structures that frequently break during site redesigns, and scrapers built on them are easily blocked by anti-bot systems. Switching to internal JSON APIs provides a more stable, structured data source that eliminates the need for complex HTML parsing. In one real-world implementation, this shift cut a scraper's codebase from 120 lines to 15 and made it run 10x faster.

Key Insights

  • 87.5% reduction in code size (120 lines to 15 lines) by switching from DOM parsing to JSON APIs
  • 10x performance gain achieved by bypassing headless browser requirements for data extraction
  • Discovery of hidden endpoints like /_next/data and /wp-json/wp/v2/posts for structured content access
  • Reading robots.txt to reveal API paths that are disallowed for crawlers yet remain technically public
  • Framework-specific patterns such as appending .json to URLs for automatic JSON responses
  • Network tab analysis in DevTools to identify XHR/Fetch calls used by the frontend
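Two of the insights above, reading robots.txt and appending .json to a URL, can be sketched as small pure helpers. This is a minimal sketch, not the article's code: the function names extractRobotsPaths and withJsonSuffix are hypothetical, and whether a given site actually answers a .json URL depends on its framework.

```javascript
// Hypothetical helper: collect the paths named in Allow/Disallow rules.
// These rules target crawlers, so the listed paths are hints, not a guarantee
// that an endpoint exists or returns JSON.
function extractRobotsPaths(robotsTxt) {
  const paths = new Set();
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop trailing comments
    const match = /^(allow|disallow)\s*:\s*(\S+)/i.exec(line);
    if (match) paths.add(match[2]);
  }
  return [...paths];
}

// Hypothetical helper: build the ".json" variant of a page URL.
// Returns null for the site root or for URLs that already end in .json.
function withJsonSuffix(pageUrl) {
  const url = new URL(pageUrl);
  const trimmed = url.pathname.replace(/\/+$/, ""); // remove trailing slashes
  if (trimmed === "" || trimmed.endsWith(".json")) return null;
  url.pathname = `${trimmed}.json`;
  return url.toString();
}
```

The paths returned by extractRobotsPaths can be fed into a probing loop as extra candidates, and withJsonSuffix yields one more URL to try for each page that is already being scraped.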

Working Examples

Node.js script to probe common API paths and return content type and status.

// Requires Node.js 18+ for the built-in global fetch.
async function findAPIs(domain) {
  // Paths commonly exposed by frameworks, CMSs, and site metadata files
  const commonPaths = [
    "/api/v1",
    "/api/v2",
    "/api/graphql",
    "/_next/data",
    "/wp-json/wp/v2/posts",
    "/feed.json",
    "/sitemap.xml",
    "/.well-known/openid-configuration",
    "/robots.txt",
    "/manifest.json"
  ];
  const results = [];
  for (const path of commonPaths) {
    try {
      const res = await fetch(`https://${domain}${path}`);
      // Record only successful (2xx) responses
      if (res.ok) {
        const contentType = res.headers.get("content-type") || "";
        results.push({
          path,
          status: res.status,
          type: contentType.split(";")[0], // drop charset and other parameters
          size: res.headers.get("content-length") || "unknown"
        });
      }
    } catch (e) {
      // Skip requests that fail at the network level
    }
  }
  return results;
}

// Usage
findAPIs("dev.to").then(apis => {
  console.log(`Found ${apis.length} endpoints:\n`);
  apis.forEach(api => {
    console.log(` ${api.path} → ${api.type} (${api.size} bytes)`);
  });
});
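The probe list mixes JSON endpoints with files such as /sitemap.xml and /robots.txt, so the results may need narrowing before use. The sketch below assumes the result shape produced by findAPIs ({ path, status, type, size }); the names isJsonContentType and jsonEndpoints are hypothetical and not part of the original script.

```javascript
// Hypothetical helper: true for application/json and structured "+json"
// media types such as application/manifest+json.
function isJsonContentType(type) {
  const mime = (type || "").split(";")[0].trim().toLowerCase();
  return mime === "application/json" || mime.endsWith("+json");
}

// Hypothetical helper: keep only the probe results that reported JSON.
function jsonEndpoints(results) {
  return results.filter(result => isJsonContentType(result.type));
}

// Usage with findAPIs from the script above:
// findAPIs("dev.to").then(apis => console.log(jsonEndpoints(apis)));
```

Filtering on the content type rather than the path keeps the check independent of how a site names its routes, although a server can still mislabel its responses.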

Practical Applications

  • Data Aggregation: Use /api/articles on dev.to or /pypi/{pkg}/json on pypi.org for clean, structured metadata without HTML overhead.
  • Package Management: Query registry.npmjs.org for complete package metadata.
  • Pitfall: Relying on CSS selectors leads to breakage during frontend updates; use internal XHR/Fetch endpoints for long-term stability.
  • Pitfall: Ignoring robots.txt means missing the public API paths it often lists.
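The package metadata endpoints mentioned above can be queried with a few lines of Node.js. This is a rough sketch under stated assumptions: the helper names (pypiUrl, npmUrl, fetchJson, summarizePypi, summarizeNpm) are hypothetical, only unscoped package names are handled, and the fields read from each response (info.name, info.version, info.summary for PyPI; name, dist-tags.latest, description for npm) are the ones these registries return at the time of writing.

```javascript
// Requires Node.js 18+ for the built-in global fetch.

// Hypothetical URL builders; unscoped package names only.
const pypiUrl = pkg => `https://pypi.org/pypi/${encodeURIComponent(pkg)}/json`;
const npmUrl = pkg => `https://registry.npmjs.org/${encodeURIComponent(pkg)}`;

// Fetch a URL and parse the body as JSON, failing loudly on non-2xx responses.
async function fetchJson(url) {
  const res = await fetch(url, { headers: { accept: "application/json" } });
  if (!res.ok) throw new Error(`${url} responded with ${res.status}`);
  return res.json();
}

// Reduce a PyPI response to the fields a scraper usually wants.
function summarizePypi(body) {
  return {
    name: body.info.name,
    version: body.info.version,
    summary: body.info.summary
  };
}

// Reduce an npm registry response to the same shape.
function summarizeNpm(body) {
  return {
    name: body.name,
    version: body["dist-tags"].latest,
    summary: body.description
  };
}

// Usage (performs network requests):
// fetchJson(pypiUrl("requests")).then(summarizePypi).then(console.log);
// fetchJson(npmUrl("express")).then(summarizeNpm).then(console.log);
```

Mapping both registries to the same small object keeps downstream code identical regardless of where the metadata came from.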
