Engineering a Unified Korean Entertainment Database Across 10 Fragmented Sources

Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers

Engineer Cara Jung built a unified database to centralize Korean entertainment data that is currently fragmented by language barriers and closed ecosystems. The system integrates 10 distinct sources, ranging from NAVER’s undocumented, JavaScript-rendered search results to the official KOBIS REST API.

Why This Matters

While Western entertainment data is well structured on platforms like IMDb and Spotify, Korean data remains locked behind language barriers and undocumented endpoints. Essential metrics such as Nielsen Korea viewership and verified NAVER ratings are not available through standard APIs, so developers must rely on headless browser automation and custom parsers to make this data usable by AI agents and global applications.

Key Insights

NAVER and JustWatch require Playwright with headless Chromium, because their content is rendered by JavaScript and, in places, lives inside Shadow DOM elements.
Nielsen Korea viewership ratings are extracted from NAVER’s interactive SVG charts by parsing SVG text elements and x-axis ticks.
The Korean Film Council (KOBIS) provides the only official government REST API for authoritative box office data.
Cross-source identity management uses TMDB IDs as primary keys to link disparate IDs from MDL, Naver, and JustWatch.
Section aliasing solves Wikipedia’s non-standard naming conventions for ‘Plot’ and ‘Ratings’ headers across different articles.
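
The section-aliasing idea in the last insight can be sketched as a lookup from heading variants to one canonical key. This is a minimal illustration, not the author's implementation: the alias lists, SECTION_ALIASES, canonical_section, and pick_sections are all hypothetical names and values chosen for the example.

```python
import re
from typing import Optional

# Hypothetical alias table: canonical key -> heading variants (assumed values).
SECTION_ALIASES = {
    "plot": ["plot", "synopsis", "premise", "story"],
    "ratings": ["ratings", "viewership", "viewership ratings", "reception"],
}


def _normalize(heading: str) -> str:
    """Lowercase a heading and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", heading.lower()).strip()


def canonical_section(heading: str) -> Optional[str]:
    """Return the canonical key for a heading, or None if it has no alias."""
    norm = _normalize(heading)
    for canonical, aliases in SECTION_ALIASES.items():
        if norm in aliases:
            return canonical
    return None


def pick_sections(sections: dict) -> dict:
    """Map raw {heading: text} pairs to canonical keys, keeping the first match."""
    result = {}
    for heading, text in sections.items():
        key = canonical_section(heading)
        if key and key not in result:
            result[key] = text
    return result
```

With this table, an article that titles its summary "Premise" and one that titles it "Plot" both end up under the same "plot" key, so downstream code never has to know which heading a given article used.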

Working Examples

Headless browser setup using Playwright to handle JavaScript-rendered Korean content.

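from playwright.sync_api import sync_playwright

def _get_page_html(url: str, wait_selector: str = "body") -> str:
    """Render a JavaScript-heavy page in headless Chromium and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # A Korean locale and a desktop user agent make the request look like a regular browser.
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
            locale="ko-KR",
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_selector(wait_selector)
        # Give late-loading scripts two more seconds to finish rendering.
        page.wait_for_timeout(2000)
        html = page.content()
        browser.close()
        return html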

Logic for extracting Nielsen ratings from NAVER’s interactive SVG charts.

from bs4 import BeautifulSoup

def _parse_episode_chart(soup: BeautifulSoup) -> list[dict]:
    # Rating values are rendered as <text> labels inside the chart's SVG.
    rating_texts = soup.select("g.bb-texts-rank text.bb-text")
    ratings = []
    for t in rating_texts:
        val = t.get_text(strip=True)
        try:
            f = float(val)
        except ValueError:
            continue  # skip labels that are not numeric
        if f > 0:
            ratings.append(f)
    # Each x-axis tick carries two <tspan>s: the episode label, then the air date.
    x_ticks = soup.select("g.bb-axis-x g.tick")
    ep_labels = []
    for tick in x_ticks:
        tspans = tick.select("tspan")
        if len(tspans) >= 2:
            # _parse_episode_num is a helper that is not shown in this excerpt.
            ep_num = _parse_episode_num(tspans[0].get_text(strip=True))
            date_text = tspans[1].get_text(strip=True)
            if ep_num and date_text:
                ep_labels.append({"episode": ep_num, "date": date_text})
    # Pair labels and ratings by position; any unmatched tail is dropped.
    return [
        {"episode": ep["episode"], "air_date": ep["date"], "rating": rating}
        for ep, rating in zip(ep_labels, ratings)
    ]

Query to identify discrepancies between Korean audience sentiment and Western critical reception.

SELECT title_english, naver_audience_rating, rt_tomatometer
FROM tv_shows
WHERE naver_audience_rating > 8.5
AND rt_tomatometer < 60;
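
To show how the query fits with TMDB IDs as primary keys and source-specific column names, here is a small self-contained sketch using an in-memory SQLite database. The schema is an assumption: only tv_shows, title_english, naver_audience_rating, and rt_tomatometer come from the query above, while tmdb_id, mdl_id, naver_id, justwatch_id, find_divergent, and the three sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the TMDB ID is the primary key, and every other source
# contributes its own ID column and its own clearly named rating column.
conn.execute(
    """
    CREATE TABLE tv_shows (
        tmdb_id INTEGER PRIMARY KEY,
        title_english TEXT NOT NULL,
        mdl_id TEXT,
        naver_id TEXT,
        justwatch_id TEXT,
        naver_audience_rating REAL,  -- NAVER scale, 0 to 10
        rt_tomatometer INTEGER       -- Rotten Tomatoes scale, 0 to 100
    )
    """
)

# Made-up sample rows, not real shows or real ratings.
rows = [
    (1, "Show A", "mdl-a", "nv-a", "jw-a", 9.1, 45),
    (2, "Show B", "mdl-b", "nv-b", "jw-b", 8.9, 92),
    (3, "Show C", "mdl-c", "nv-c", None, 7.4, 50),
]
conn.executemany("INSERT INTO tv_shows VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def find_divergent(conn, min_naver=8.5, max_rt=60):
    """Shows rated highly on NAVER but poorly on the Tomatometer."""
    cur = conn.execute(
        "SELECT title_english, naver_audience_rating, rt_tomatometer "
        "FROM tv_shows "
        "WHERE naver_audience_rating > ? AND rt_tomatometer < ?",
        (min_naver, max_rt),
    )
    return cur.fetchall()
```

Only "Show A" satisfies both conditions in this sample. Keeping each rating in its own named column means the two thresholds can use different scales without any ambiguity about which source a number came from.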

Practical Applications

Use case: Querying cross-regional sentiment by comparing NAVER verified buyer ratings against TMDB international scores. Pitfall: Using generic ‘rating’ fields instead of source-specific naming, leading to ambiguous data interpretations.
Use case: Real-time streaming availability tracking via JustWatch redirect parameter parsing. Pitfall: Relying on TMDB’s streaming data, which often lags actual availability by several weeks.
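
The redirect-parameter approach in the second use case can be sketched with the standard library alone. This assumes the outbound link carries the real provider URL, percent-encoded, in a query parameter; the parameter name "r", the example link, and both function names are assumptions for illustration rather than a documented JustWatch format.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_redirect_target(link: str, param: str = "r") -> Optional[str]:
    """Return the decoded URL stored in the given query parameter, if present."""
    query = parse_qs(urlparse(link).query)
    values = query.get(param)
    return values[0] if values else None


def provider_host(link: str) -> Optional[str]:
    """Return the hostname of the redirect target, e.g. to identify the provider."""
    target = extract_redirect_target(link)
    return urlparse(target).hostname if target else None


# Hypothetical outbound link with the target URL percent-encoded in "r".
sample = "https://click.example.com/a?x=1&r=https%3A%2F%2Fwww.netflix.com%2Ftitle%2F123"
```

parse_qs decodes the percent-encoding, so the target comes back as a plain URL that can be stored directly or reduced to its hostname to tag the streaming provider.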

References:

https://dev.to/carasjung/building-a-unified-korean-entertainment-database-from-10-apis-and-web-scrapers-3n91
