
Web scraping best practices for article extraction

Article Fetcher · Engineer
April 5, 2026 · 6 min read

Extracting clean, readable article content from the web is deceptively hard. A page that looks simple in a browser is actually a dense forest of navigation bars, ads, sidebars, related links, and tracking scripts — with the actual article buried somewhere in the middle. Whether you’re building a read-it-later app, a translation pipeline, or a research tool, getting reliable article extraction right requires attention to a handful of key practices.

This guide covers the techniques and principles that separate fragile scrapers from robust extraction systems.

Start with the Right Tool for the Job

Not all scraping requires a full browser. For article extraction specifically, there are three tiers of complexity:

HTTP client + parser. For the majority of news sites, blogs, and publications, a simple HTTP request followed by HTML parsing is enough. Tools like Node.js’s fetch or Python’s requests library, combined with a parser like Cheerio or BeautifulSoup, handle these cases well. This approach is fast, lightweight, and easy to deploy.
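To make the first tier concrete, here is a minimal sketch in Python using only the standard library's `html.parser` (the same idea applies to `fetch` plus Cheerio in Node, or requests plus BeautifulSoup). The sample HTML string stands in for a fetched page.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element in a page."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

# A stand-in for HTML returned by a plain HTTP request
html = "<html><body><nav>Menu</nav><p>First paragraph.</p><p>Second.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)  # ['First paragraph.', 'Second.']
```

A real pipeline would fetch the page first and hand the body to the parser; the point is that no browser is involved.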

Readability algorithms. Mozilla’s Readability library and its ports (such as @mozilla/readability for Node.js or readability-lxml for Python) are purpose-built for article extraction. They analyze the DOM structure, score content blocks by the density of text versus markup, and return the article title, byline, and clean content. This should be your default starting point — it handles 80-90% of articles correctly out of the box.
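The text-versus-markup scoring idea can be illustrated with a toy density metric. This is only a sketch of the heuristic: real implementations like Readability also weigh link density, class names, element depth, and more.

```python
import re

def text_density(html_block: str) -> float:
    """Ratio of visible text length to total markup length.
    Higher values suggest article content; lower values suggest
    navigation or other link-heavy boilerplate."""
    text = re.sub(r"<[^>]+>", "", html_block)
    return len(text.strip()) / max(len(html_block), 1)

nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">About</a></li></ul>'
body = "<p>Long-form article text with full sentences and few tags.</p>"
assert text_density(body) > text_density(nav)
```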

Headless browsers. Some modern sites render content entirely via JavaScript. Single-page applications, sites using heavy client-side rendering, or pages behind cookie consent walls may return empty or skeleton HTML to a plain HTTP request. For these, tools like Puppeteer or Playwright can render the page fully before extraction. Use this as a fallback, not a default — headless browsers are slower, use more memory, and are harder to run at scale.

Respect the Site and Its Rules

Sustainable scraping means being a good citizen of the web.

Check robots.txt first. Before scraping any domain, fetch and parse its robots.txt file. Respect Disallow directives and crawl-delay settings. This isn’t just about ethics — ignoring robots.txt can get your IP blocked and, in some jurisdictions, create legal exposure.
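Python ships a robots.txt parser in the standard library, `urllib.robotparser`. The sketch below feeds it rules directly so it runs offline; in practice you would call `set_url()` and `read()` against the live file. The bot name is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyArticleBot/1.0", "https://example.com/articles/1"))  # True
print(rp.can_fetch("MyArticleBot/1.0", "https://example.com/private/x"))   # False
print(rp.crawl_delay("MyArticleBot/1.0"))  # 2
```

Check `can_fetch()` before every request and honor `crawl_delay()` when it is set.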

Rate limit your requests. Even if a site doesn’t specify a crawl delay, hammering a server with rapid requests is poor practice. Introduce delays between requests — one to two seconds per request is a reasonable baseline. For batch processing, use a queue with configurable concurrency rather than firing off all requests simultaneously.
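A minimal per-domain rate limiter can be sketched in a few lines; production systems usually reach for a proper job queue, but the invariant is the same: never let two requests to one domain land closer together than the configured interval.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests to the same domain."""
    def __init__(self, min_interval=1.5):
        self.min_interval = min_interval
        self._last = {}  # domain -> timestamp of last request

    def wait(self, domain):
        """Block until it is safe to hit this domain again."""
        elapsed = time.monotonic() - self._last.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last[domain] = time.monotonic()

limiter = RateLimiter(min_interval=0.1)  # short interval just for the demo
start = time.monotonic()
limiter.wait("example.com")
limiter.wait("example.com")  # second call sleeps until the interval has passed
assert time.monotonic() - start >= 0.1
```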

Set a meaningful User-Agent. Identify your scraper with a descriptive User-Agent string that includes contact information or a URL. This lets site operators reach you if there’s a problem, and it distinguishes your traffic from malicious bots. Avoid spoofing browser User-Agent strings unless you have a specific technical reason (some sites serve different content to non-browser agents).

Handle errors gracefully. Expect and handle HTTP 429 (Too Many Requests), 403 (Forbidden), and 5xx errors. Implement exponential backoff for retries. If a site consistently blocks you, respect that signal.
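Exponential backoff is simple to get right: double the delay each attempt, cap it, and add jitter so a fleet of retrying workers doesn't hit the server in lockstep. The base and cap values below are illustrative defaults.

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based):
    base * 2^attempt, capped, plus up to 25% random jitter."""
    delay = min(base * (2 ** attempt), cap)
    return delay + random.uniform(0, delay * 0.25)

# Retry loop sketch: sleep backoff_delay(attempt) after each 429/5xx response
assert 8.0 <= backoff_delay(3) <= 10.0    # 2^3 = 8s, plus jitter
assert backoff_delay(10) <= 75.0          # capped at 60s, plus jitter
```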

Deal with HTML Structure Variability

The web is not standardized. Every site structures its HTML differently, and even a single site may use different templates for different article types.

Don’t rely on specific CSS selectors. A scraper built around .article-body > p will break the moment the site redesigns. Readability-style algorithms are more resilient because they work from general heuristics (text density, element scoring) rather than specific selectors.

Handle encoding correctly. Not every page is UTF-8. Check the Content-Type header and the HTML <meta charset> tag. Libraries like jsdom handle this automatically, but if you’re parsing raw bytes, incorrect encoding will produce garbled text — especially for non-Latin scripts.
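If you are working with raw bytes, the header-then-meta lookup can be sketched like this. It checks the `Content-Type` header first, then sniffs the start of the body for a `<meta charset>` declaration, falling back to UTF-8.

```python
import re

def detect_charset(content_type, body):
    """Guess a page's charset: Content-Type header first,
    then the <meta charset> tag, defaulting to UTF-8."""
    m = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    if m:
        return m.group(1).lower()
    # Charset declarations are required to appear early in the document
    head = body[:2048].decode("ascii", errors="ignore")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return "utf-8"

assert detect_charset("text/html; charset=ISO-8859-1", b"") == "iso-8859-1"
assert detect_charset("text/html", b'<meta charset="shift_jis">') == "shift_jis"
assert detect_charset("text/html", b"<html></html>") == "utf-8"
```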

Strip boilerplate aggressively. Navigation, footers, related article links, social sharing buttons, and comment sections are noise. Readability handles most of this, but you may need additional post-processing. A good test: if the extracted text makes sense read aloud with no context about the site layout, your extraction is clean.

Preserve meaningful structure. While you want to strip noise, don’t flatten everything to plain text. Headings, lists, blockquotes, and emphasis carry meaning. Extract both a clean text version and an HTML version that preserves semantic markup. This gives downstream consumers flexibility.

Extract Metadata, Not Just Content

A well-extracted article is more than its body text. Capture:

  • Title — from <title>, Open Graph tags, or the Readability result
  • Author/byline — from byline elements, <meta name="author">, or structured data
  • Publication date — from <time> elements, article:published_time meta tags, or JSON-LD
  • Excerpt/description — from meta description or Open Graph description
  • Site name — from og:site_name or the domain itself
  • Canonical URL — from <link rel="canonical"> to avoid duplicate content from URL variations

Structured data (JSON-LD, Microdata) is increasingly common and is often the most reliable source for metadata. Check for <script type="application/ld+json"> blocks before falling back to meta tags.
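Pulling JSON-LD out of a page needs nothing beyond the standard library: find the `<script type="application/ld+json">` blocks and parse their contents as JSON. One sketch, with the sample article markup made up for the demo:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect and parse every <script type="application/ld+json"> block."""
    def __init__(self):
        super().__init__()
        self._capturing = False
        self._buffer = ""
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._capturing = True
            self._buffer = ""

    def handle_data(self, data):
        if self._capturing:
            self._buffer += data

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            try:
                self.blocks.append(json.loads(self._buffer))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is common in the wild; skip it

html = ('<script type="application/ld+json">'
        '{"@type": "NewsArticle", "headline": "Example", '
        '"datePublished": "2026-04-05"}</script>')
extractor = JsonLdExtractor()
extractor.feed(html)
print(extractor.blocks[0]["headline"])  # Example
```

Look for `@type` values like `Article` or `NewsArticle` and prefer their fields over scattered meta tags.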

Handle Edge Cases

Real-world article extraction means dealing with the messy edges:

Paywalled content. Many sites serve truncated content to non-subscribers. Your extractor should detect this — if the extracted content is suspiciously short or ends with a “subscribe to read more” pattern, flag it rather than silently returning a partial article.
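A detection heuristic can be as simple as checking the tail of the extracted text for subscription prompts and flagging very short bodies. The marker list and word threshold below are illustrative, not exhaustive; tune both against your own corpus.

```python
TRUNCATION_MARKERS = (
    "subscribe to read",
    "to continue reading",
    "create a free account",
)

def looks_truncated(text, min_words=150):
    """Flag articles that end in a paywall prompt or are suspiciously short."""
    tail = text[-200:].lower()
    if any(marker in tail for marker in TRUNCATION_MARKERS):
        return True
    return len(text.split()) < min_words

assert looks_truncated("Short teaser. Subscribe to read the full story.")
assert not looks_truncated("word " * 200)  # long body, no paywall prompt
```

Flagged articles can then be routed to review or retried with a different strategy instead of being stored as if complete.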

Multi-page articles. Some publications split articles across multiple pages. Look for “next page” links or pagination patterns. Decide whether to follow these automatically or report the additional URLs for separate processing.

Non-article pages. Not every URL points to an article. Your pipeline should handle landing pages, category pages, and error pages gracefully — detect them and report them rather than forcing extraction on content that isn’t an article.

Character encoding edge cases. Watch for smart quotes, em dashes, and other typographic characters that may be encoded differently across sites. Normalize Unicode where appropriate.
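Python's `unicodedata` module handles the normalization step. NFKC folds compatibility characters (ligatures, full-width forms) into their plain equivalents; whether to also flatten smart quotes to ASCII is a product decision, so the sketch below only shows the safe baseline plus one explicit substitution.

```python
import unicodedata

def normalize_text(text):
    """NFKC-normalize text and unify one common typographic variant."""
    text = unicodedata.normalize("NFKC", text)
    return text.replace("\u00a0", " ")  # non-breaking space -> regular space

# The "fi" ligature (U+FB01) becomes the two plain letters under NFKC
assert normalize_text("\ufb01le") == "file"
assert normalize_text("a\u00a0b") == "a b"
```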

Build for Observability

Article extraction at any scale needs monitoring:

  • Log extraction quality signals. Track the ratio of extracted content length to raw HTML length. A very low ratio might indicate extraction failure; a very high ratio might mean boilerplate wasn’t stripped.
  • Sample and review. Periodically spot-check extracted articles against their source pages. Automated extraction will drift as sites change their templates.
  • Track failure rates by domain. If a particular site’s articles consistently fail extraction, you likely need a site-specific adapter or a different extraction strategy for that domain.
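The first signal on that list is a one-liner worth logging on every extraction. The 0.05 threshold in the demo is an arbitrary example; calibrate the alert bounds against pages you know extracted correctly.

```python
def content_ratio(extracted_text, raw_html):
    """Extraction-quality signal: extracted text length / raw HTML length.
    Very low ratios suggest a failed extraction; very high ratios suggest
    boilerplate was not stripped."""
    return len(extracted_text) / max(len(raw_html), 1)

# A page that is almost entirely navigation markup
raw = "<html>" + "<div class='nav'>x</div>" * 50 + "<p>The article body.</p></html>"
ratio = content_ratio("The article body.", raw)
assert ratio < 0.05  # suspiciously low -> flag this extraction for review
```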

Conclusion

Reliable article extraction is a practice, not a one-time implementation. Start with proven tools like Mozilla’s Readability, respect the sites you scrape, handle the inevitable edge cases, and build enough observability to catch problems before your users do. The web changes constantly — your extraction pipeline should be built to adapt with it.