Engineering a Unified Korean Entertainment Database Across 10 Fragmented Sources


These articles are AI-generated summaries. Please check the original sources for full details.

Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers

Engineer Cara Jung developed a unified database to centralize Korean entertainment data currently fragmented across language barriers and closed ecosystems. The system integrates 10 distinct sources, including NAVER’s undocumented JavaScript-rendered search results and official KOBIS REST APIs.

Why This Matters

While Western entertainment data is well-structured in platforms like IMDb and Spotify, Korean data remains locked behind language barriers and undocumented endpoints. Essential metrics such as Nielsen Korea viewership and verified NAVER ratings are not exposed through standard APIs, so developers must rely on headless browser automation and custom parsers to make the data available to AI agents and global applications.

Key Insights

  • Playwright driving headless Chromium is required for NAVER and JustWatch, because their content is rendered by JavaScript and partly lives inside Shadow DOM elements.
  • Nielsen Korea viewership ratings are extracted from NAVER’s interactive SVG charts by parsing SVG text elements and x-axis ticks.
  • The Korean Film Council (KOBIS) provides the only official government REST API for authoritative box office data.
  • Cross-source identity management uses TMDB IDs as primary keys to link the separate IDs from MyDramaList (MDL), NAVER, and JustWatch.
  • Section aliasing solves Wikipedia’s non-standard naming conventions for ‘Plot’ and ‘Ratings’ headers across different articles.
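The section-aliasing idea in the last point can be sketched as a small lookup table. This is a minimal illustration, not the project's actual code: the alias sets, the `SECTION_ALIASES` name, and the `canonical_section` function are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical alias table: canonical section name -> header variants.
# The variants listed here are illustrative guesses, not taken from the project.
SECTION_ALIASES = {
    "plot": {"plot", "synopsis", "premise", "plot summary"},
    "ratings": {"ratings", "viewership", "television ratings"},
}


def canonical_section(header: str) -> Optional[str]:
    """Map a raw article header to a canonical section name, or None if unknown."""
    # Drop bracketed residue such as "[edit]" and normalize case and whitespace.
    cleaned = re.sub(r"\[.*?\]", "", header).strip().lower()
    for canonical, variants in SECTION_ALIASES.items():
        if cleaned in variants:
            return canonical
    return None
```

With a table like this, a parser can ask for the "plot" section once and still find it in articles that title it "Synopsis" or "Premise".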

Working Examples

Headless browser setup using Playwright to handle JavaScript-rendered Korean content.

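import time

from playwright.sync_api import sync_playwright


def _get_page_html(url: str, wait_selector: str = "body") -> str:
    """Render a JavaScript-heavy page in headless Chromium and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(
                user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
                locale="ko-KR",  # request Korean-language content
            )
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            page.wait_for_selector(wait_selector)
            time.sleep(2)  # fixed pause so late client-side rendering can finish
            return page.content()
        finally:
            # Close the browser even if navigation or the selector wait fails.
            browser.close()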

Logic for extracting Nielsen ratings from NAVER’s interactive SVG charts.

from bs4 import BeautifulSoup


def _parse_episode_chart(soup: BeautifulSoup) -> list[dict]:
    """Pair per-episode ratings with the episode labels on the chart's x-axis."""
    # Rating values are rendered as SVG text elements.
    ratings = []
    for t in soup.select("g.bb-texts-rank text.bb-text"):
        try:
            value = float(t.get_text(strip=True))
        except ValueError:
            continue  # skip non-numeric labels
        if value > 0:
            ratings.append(value)

    # Each x-axis tick carries two tspans: the episode label and the air date.
    ep_labels = []
    for tick in soup.select("g.bb-axis-x g.tick"):
        tspans = tick.select("tspan")
        if len(tspans) >= 2:
            # _parse_episode_num is a helper that is not shown in this excerpt.
            ep_num = _parse_episode_num(tspans[0].get_text(strip=True))
            date_text = tspans[1].get_text(strip=True)
            if ep_num and date_text:
                ep_labels.append({"episode": ep_num, "date": date_text})

    # Ratings and labels are matched by position; zip stops at the shorter list.
    return [
        {"episode": ep["episode"], "air_date": ep["date"], "rating": rating}
        for ep, rating in zip(ep_labels, ratings)
    ]

Query to identify discrepancies between Korean audience sentiment and Western critical reception.

SELECT title_english, naver_audience_rating, rt_tomatometer
FROM tv_shows
WHERE naver_audience_rating > 8.5
AND rt_tomatometer < 60;
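To try the query without the real database, it can be run against a throwaway SQLite table. This is a sketch under assumptions: only `title_english`, `naver_audience_rating`, and `rt_tomatometer` come from the query above, while the `tmdb_id` column, the rating scales, and every row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tv_shows (
        tmdb_id INTEGER PRIMARY KEY,      -- assumed: TMDB ID as the cross-source key
        title_english TEXT NOT NULL,
        naver_audience_rating REAL,       -- assumed 0-10 scale
        rt_tomatometer INTEGER            -- assumed 0-100 scale
    )
    """
)

# Made-up rows for illustration only; none of these are real titles or scores.
conn.executemany(
    "INSERT INTO tv_shows VALUES (?, ?, ?, ?)",
    [
        (1, "Example Show A", 9.1, 55),
        (2, "Example Show B", 9.3, 92),
        (3, "Example Show C", 7.0, 40),
        (4, "Example Show D", 8.9, None),  # no Tomatometer score available
    ],
)

DIVERGENCE_QUERY = """
    SELECT title_english, naver_audience_rating, rt_tomatometer
    FROM tv_shows
    WHERE naver_audience_rating > 8.5
    AND rt_tomatometer < 60
"""
rows = conn.execute(DIVERGENCE_QUERY).fetchall()
print(rows)
```

Only "Example Show A" is returned. "Example Show D" drops out because a comparison against NULL is never true in SQL, so shows missing a Western score are silently excluded rather than treated as low-rated.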

Practical Applications

  • Use case: Querying cross-regional sentiment by comparing NAVER verified buyer ratings against TMDB international scores. Pitfall: Using generic ‘rating’ fields instead of source-specific naming, leading to ambiguous data interpretations.
  • Use case: Real-time streaming availability tracking via JustWatch redirect parameter parsing. Pitfall: Relying on TMDB’s streaming data, which often lags actual availability by several weeks.
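The redirect-parameter parsing mentioned in the second use case can be sketched with the standard library. The parameter name `r`, the function names, and the sample URL are assumptions for illustration; the actual link format should be confirmed against live pages.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_redirect_target(click_url: str, param: str = "r") -> Optional[str]:
    """Return the decoded destination URL carried in a redirect link's query string.

    The default parameter name "r" is an assumption, not a documented contract.
    """
    query = parse_qs(urlparse(click_url).query)
    values = query.get(param)
    return values[0] if values else None


def provider_host(click_url: str) -> Optional[str]:
    """Return the hostname of the streaming service the link redirects to."""
    target = extract_redirect_target(click_url)
    return urlparse(target).hostname if target else None


# Hypothetical click-tracking link that wraps the real provider URL.
sample = "https://click.example.com/a?cx=abc&r=https%3A%2F%2Fwww.netflix.com%2Ftitle%2F123"
print(extract_redirect_target(sample))
print(provider_host(sample))
```

Reading the destination from the query string avoids following the redirect itself, which keeps the scraper from issuing an extra request per offer.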
