2026-05-16 · 14 min read

Building a web data pipeline for LLM training in 2026

A practical guide to collecting, cleaning, and shipping training data at scale — what works, what fails, and what to outsource.

AI · RAG · training data

The state of training-data pipelines

If you are reading this, you already know that a model is only as good as the data you feed it. What has shifted in 2026 is the bar for what counts as good. Two years ago you could throw a hundred million pages of mixed-quality web content at a model and call it pretraining. Today, that approach loses to a carefully curated five-million-document corpus built on the same budget. Quality has eaten quantity in pretraining, just as it has everywhere else in machine learning.

The hard part for most teams is not deciding to care about quality. It is building the pipeline that turns a list of source URLs into a clean, deduplicated, license-clear corpus that an embedding model or a fine-tuning run can actually consume. That pipeline has at least four stages, each with its own failure modes, and most teams underestimate at least two of them.

This article walks through the full pipeline as we see customers build it in 2026, names the parts that are easy to get wrong, and points out where buying beats building. We will name competitors directly where helpful, because pretending alternatives do not exist makes the writing less useful — and frankly, the data infrastructure space in 2026 is rich enough that nobody needs to be defensive about it.

What you are actually building

The training-data pipeline has four stages: discovery, fetching, cleaning, and provenance. Some teams add a fifth — deduplication and filtering — and treat it as a peer to the others. Either way, each stage has its own dominant failure modes.

Discovery is how you decide which URLs to fetch. The cheap mistake here is treating discovery as a free byproduct of fetching — letting a generic crawler pick its own targets and hoping the link graph delivers what you want. It rarely does. Modern training pipelines start with curated source lists: documentation portals, knowledge bases, public datasets, government open-data sites, and a handful of high-quality news and reference sources. Generic crawl-the-web pipelines produce corpora that are heavy on SEO content and light on the deep technical material that actually moves model quality.

Fetching is where most teams have been burned. A page that loads fine in your browser may return an empty body to a default Python request, because the site detected a missing JavaScript runtime or an obvious datacenter IP. The honest answer here is that fetching at scale requires more infrastructure than it looks like from a five-line prototype. Browser automation, stealth profiles, proxy rotation, retry logic, and rate-limit awareness all matter once you are pulling more than a few thousand pages.

Cleaning is the stage that decides whether your tokens are signal or noise. A typical web page is 80 to 95 percent boilerplate — navigation, headers, footers, cookie banners, ads, related-content widgets, comment sections, share buttons. If those tokens land in your training set, the model spends capacity learning to predict cookie banner copy. That is not what you paid for. Readability-style extraction — identifying the main content block and discarding everything around it — is the difference between a corpus that helps and a corpus that quietly drags performance down.

Provenance is what your legal team will ask you about three weeks before you ship. For every document in your training set, you should be able to answer: what URL did this come from, when was it fetched, what was the robots.txt at the time, was there an opt-out signal we missed, what license did the source declare? Pipelines that skip provenance early end up rebuilding it expensively later.

Where the off-the-shelf options stand

You have real alternatives to building this yourself, and several of them are very good. The shape of the decision in 2026 looks like this.

Firecrawl is the team that popularized clean markdown output for AI workflows. They earned that position with quality output and a developer experience that respects how AI engineers actually work. If you need clean markdown from a known list of URLs and your volume is modest, they are an excellent choice and the learning curve is short.

Apify runs the largest community-maintained marketplace of scrapers in the industry. The breadth is impressive — actors for almost every site you can think of — and their async + webhook flow is mature. The trade-off is variance: actors are maintained by different authors, so reliability and response shape are not uniform across the catalog. For training-data pipelines that touch many verticals, this is workable but requires a maintenance layer on your side to handle catalog drift.

Bright Data leads the enterprise tier. The proxy network is the largest commercial offering in the market, and their compliance posture covers most procurement reviews out of the box. For teams that need single-tenant deployments, regional residency, or audited usage records, they are often the right pick. The pricing model rewards consistent volume more than experimental usage.

ScraperAPI built a reputation on simplicity. The API surface is minimal, the docs are clean, the onboarding takes minutes. For teams that just need raw HTML and want to do their own cleaning and structure extraction downstream, that simplicity is a real virtue.

DataSonar — that is us — takes the same simplicity ScraperAPI is known for and adds two things on top that matter for training-data pipelines specifically. One: every endpoint returns LLM-ready markdown by default, with the structured signals (JSON-LD, OpenGraph, Twitter Card, microdata) returned alongside in the same call. Two: a built-in intelligence layer means you can pull the SSL, DNS, WHOIS, and tech-stack metadata for every source URL in the same pipeline, which is useful for provenance documentation and for filtering corpora by source category.

None of these is the wrong choice in isolation. The right choice depends on what your downstream stages look like, what your legal team needs to see, and how much pipeline maintenance you want to own.

Discovery: getting the URL list right

The single highest-leverage decision in a training-data pipeline is which URLs you choose to fetch. Spending an extra week on discovery saves months downstream.

Start with curated lists. For technical pretraining, the obvious sources are documentation sites for major open-source projects, technical books with permissive licenses, government technical archives, and high-quality educational publishers that offer text APIs. For general-purpose corpora, Common Crawl and the Pile-style aggregations remain the volume baseline, but a curated supplement of the top thousand reference and educational sites lifts model quality on benchmarks that matter.

Sitemaps are an underused resource. Most reasonable sites publish a /sitemap.xml with every public URL, often split into nested index sitemaps for large catalogs. Fetching the sitemap is one HTTP request that can replace days of crawler tuning. The DataSonar sitemap endpoint flattens nested index sitemaps into a single response, which is convenient when you are building a URL list for an LLM training run that spans hundreds of sites.
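
If you are assembling the list yourself rather than using a managed endpoint, flattening a nested sitemap is a short script. A minimal sketch, assuming the site follows the sitemaps.org schema (retries and error handling elided):

```python
# Minimal sketch: flatten a (possibly nested) sitemap into a URL list.
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def flatten_sitemap(url: str, depth: int = 0, max_depth: int = 3) -> list[str]:
    if depth > max_depth:
        return []
    root = ET.fromstring(requests.get(url, timeout=30).content)
    if root.tag.endswith("sitemapindex"):
        # Index sitemap: recurse into each child sitemap.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls += flatten_sitemap(loc.text.strip(), depth + 1, max_depth)
        return urls
    # Leaf sitemap: collect the page URLs.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

urls = flatten_sitemap("https://example.com/sitemap.xml")
```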

For sites without sitemaps, crawl conservatively. A breadth-first crawl with a budget of a few thousand pages and a depth limit of three or four levels gets you most of what is worth having. Past that the marginal page tends to be archival, tag pages, paginated lists, or boilerplate. Stopping early saves bandwidth and improves the average quality of your corpus.
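
A minimal sketch of that kind of budgeted crawl, staying on one host, with robots.txt checks and politeness delays deliberately elided:

```python
# Minimal sketch of a budgeted breadth-first crawl, same-host only.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed: str, max_pages: int = 3000, max_depth: int = 4) -> dict[str, str]:
    host = urlparse(seed).netloc
    queue = deque([(seed, 0)])
    seen, pages = {seed}, {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        pages[url] = resp.text
        if depth >= max_depth:
            continue  # depth limit: stop expanding, keep draining the queue
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```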

One more discovery principle worth naming: write down what you are not going to fetch. Maintain an exclude list of social media sites, link aggregators, low-quality SEO farms, and known machine-generated content sites. The exclusions matter as much as the inclusions for corpus quality.
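
The filter itself is small; the work is maintaining the policy. A trivial sketch, with placeholder domains standing in for your own list:

```python
# Minimal sketch: filter candidate URLs against a domain exclude list.
# The domains shown are placeholders, not a recommendation.
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"pinterest.com", "reddit.com", "example-seo-farm.net"}

def allowed(url: str) -> bool:
    host = urlparse(url).netloc.lower().removeprefix("www.")
    # Exclude the domain itself and any subdomain of it.
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)

candidate_urls = ["https://docs.python.org/3/", "https://www.pinterest.com/pin/1"]
kept = [u for u in candidate_urls if allowed(u)]
```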

Fetching: the part that bites you

Fetching looks easy in a notebook. A line of Python pulls a page, you parse it with BeautifulSoup, you move on. At scale, the failures stack up fast.

The first failure mode is JavaScript-rendered content. A growing share of pages return an empty body or a thin skeleton to a default fetch and rely on client-side JavaScript to render the actual content. If your fetcher is not running a browser, those pages contribute nothing useful. The mitigation is a stealth headless browser that waits for the page to render before extraction. Running one yourself is workable for a few hundred URLs per minute; past that, the operational overhead is significant.
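
A minimal sketch of the render-then-extract pattern using Playwright's headless Chromium; stealth hardening and proxy configuration, which matter in practice, are omitted here:

```python
# Minimal sketch: fetch a JavaScript-rendered page with headless Chromium.
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str, timeout_ms: int = 30000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so client-side rendering finishes.
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        html = page.content()
        browser.close()
        return html
```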

The second failure mode is anti-bot defenses. Sites that detect crawlers — Amazon, Zillow, LinkedIn, many travel and finance sites — return captcha walls, soft 403s, or shadowbanned responses that look fine but contain reduced or fake content. Detecting shadowbans requires comparing the response to a known-good fingerprint, which most pipelines never bother to do. The result is a corpus that quietly contains thousands of bot-wall pages mistaken for legitimate content.
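
One lightweight version of that fingerprint check, with illustrative markers and an illustrative length threshold rather than recommended values:

```python
# Minimal sketch of a response-fingerprint check. The markers and the
# 0.5 length ratio are illustrative thresholds, not recommendations.
BOT_WALL_MARKERS = ("verify you are human", "access denied", "captcha")

def looks_shadowbanned(html: str, baseline_len: int, expected_marker: str) -> bool:
    lowered = html.lower()
    if any(m in lowered for m in BOT_WALL_MARKERS):
        return True
    # A known-good fetch of this page type contained `expected_marker`
    # and was roughly `baseline_len` bytes; big deviations are suspicious.
    if expected_marker.lower() not in lowered:
        return True
    return len(html) < 0.5 * baseline_len
```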

The third failure mode is rate limiting and IP reputation. Datacenter IPs are flagged faster than residential IPs. A pipeline that runs from a single cloud region against a target site will hit rate limits within minutes. Rotation across IPs helps; residential proxies help more; respecting the site's own rate-limit headers helps most.
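
A minimal sketch of per-host pacing that also honors a numeric Retry-After header (the 2-second default interval is illustrative; Retry-After can also be an HTTP date, which this sketch does not handle):

```python
# Minimal sketch: per-host pacing plus honoring a numeric Retry-After.
import time
import requests

last_hit: dict[str, float] = {}

def polite_get(url: str, host: str, min_interval: float = 2.0) -> requests.Response:
    # Enforce a minimum gap between requests to the same host.
    wait = min_interval - (time.monotonic() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    resp = requests.get(url, timeout=30)
    last_hit[host] = time.monotonic()
    if resp.status_code == 429 and "Retry-After" in resp.headers:
        # The site told us exactly how long to back off; believe it.
        time.sleep(float(resp.headers["Retry-After"]))
        return polite_get(url, host, min_interval)
    return resp
```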

The fourth failure mode is retries. Naive retry-on-failure logic creates cascading failures when a target site goes briefly degraded. Exponential backoff with jitter, per-host concurrency limits, and dead-letter queueing for repeated failures are table stakes for a fetch pipeline that runs unattended overnight.
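
A minimal async sketch of those pieces together: full-jitter exponential backoff, a per-host concurrency cap, and a dead-letter list. In a real pipeline you would hold one semaphore per host; a single semaphore stands in here:

```python
# Minimal sketch: backoff with jitter, a concurrency cap, and dead-lettering.
import asyncio
import random

import aiohttp

HOST_LIMIT = asyncio.Semaphore(4)  # in practice, one semaphore per host
dead_letter: list[str] = []

async def fetch_with_backoff(session: aiohttp.ClientSession, url: str,
                             max_attempts: int = 5) -> str | None:
    for attempt in range(max_attempts):
        try:
            async with HOST_LIMIT:
                async with session.get(
                    url, timeout=aiohttp.ClientTimeout(total=30)
                ) as resp:
                    resp.raise_for_status()
                    return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            # Full-jitter exponential backoff: up to 1, 2, 4, 8... seconds.
            await asyncio.sleep(random.uniform(0, 2 ** attempt))
    dead_letter.append(url)  # repeated failure: park it for manual review
    return None
```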

Most teams underestimate how much of their engineering time goes into fetcher reliability. By month six of a serious pipeline, the fetcher is the most complex service in the stack and the one most likely to wake somebody up at 2 a.m. If you have an option to outsource it, take it.

Cleaning: the difference between a good corpus and a great one

Once the bytes are in hand, the cleaning stage decides what fraction of them survive into the training set. The basic operations every pipeline needs:

  • Main-content extraction. Identify the article body and discard navigation, ads, footers, related-content widgets, share buttons, comment threads, cookie banners. Readability-style algorithms work well here.
  • Markdown normalization. Convert the extracted content to clean markdown. Preserve headings, lists, links, and code blocks. Strip presentational attributes and inline styles.
  • Language detection and filtering. Drop pages outside your target languages early, before tokenization budget burns on them.
  • Quality filtering. Filter on length, readability score, ratio of code to prose, presence of meaningful punctuation. The exact filters depend on your training objective, but every serious pipeline has them.
  • Deduplication. Hash-based exact dedup catches the obvious cases. MinHash or shingling catches near-duplicates (a minimal sketch follows this list). Both matter: exact dedup alone leaves 10 to 30 percent duplicate content in a typical web corpus.
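
The near-duplicate step is the least obvious of these, so here is a hand-rolled sketch: shingle the text, MinHash the shingles, and compare signatures. Shingle size, permutation count, and the 0.85 threshold are all illustrative:

```python
# Minimal near-duplicate sketch: 5-word shingles hashed with salted MD5,
# MinHash signatures compared by estimated Jaccard similarity.
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set[str], num_perm: int = 64) -> list[int]:
    # Each salt approximates one random permutation; keep the minimum hash.
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh
        )
        for seed in range(num_perm)
    ]

def est_jaccard(a: list[int], b: list[int]) -> float:
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog by the river bank"
is_near_dup = est_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b))) > 0.85
```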

One observation that surprises teams new to this: the cleaning stage is where most of the bias in a corpus enters. If your cleaning filter prefers long-form English prose with consistent paragraph structure, you will end up with a corpus that overrepresents formal publications and underrepresents technical documentation, code, scientific writing, and non-English content. Audit your filters with the same care you audit your sources.

Provenance: the stage everyone skips and regrets

Every document in your training set should carry, at minimum, the source URL, the fetch timestamp, the license declaration the source made (if any), the robots.txt state at the time of fetch, and any opt-out signals (the noai meta tag, X-Robots-Tag, machine-readable rights statements). This is not optional for commercial training in 2026. Regulators, publishers, and increasingly customers expect to see the audit trail.

Build the provenance schema first, before you fetch a single page. Retrofitting it after the corpus is built is painful and often impossible — the source pages will have changed, robots.txt may have been updated, opt-out signals may have been added or removed.
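
One possible shape for that record, built from the fields listed above (the content hash field is our addition, useful for joining provenance to dedup results):

```python
# One possible provenance record. Capture it at fetch time; it cannot be
# reliably reconstructed later.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    source_url: str
    fetched_at: str               # ISO 8601 UTC timestamp
    robots_txt: str               # verbatim robots.txt body at fetch time
    license_declared: str | None  # e.g. a rel="license" link or header, if any
    opt_out_signals: list[str] = field(default_factory=list)  # noai, X-Robots-Tag, ...
    content_sha256: str = ""      # hash of the cleaned markdown (our addition)
```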

How the pipelines we see at customers actually look

The shapes vary, but the most successful patterns share a few traits.

First, they decouple discovery from fetching. The URL list lives in a versioned manifest (a git repo, a database table, an S3 file) that is reviewed by humans before any new sources go into production. This is the single highest-leverage hygiene practice we see.

Second, they outsource fetching. The teams that try to run their own browser farm at scale eventually conclude the work is not differentiated. Whether they buy from Firecrawl, Apify, ScraperAPI, Bright Data, DataSonar, or some combination depends on volume, vertical mix, and procurement preference. The point is that they buy.

Third, they keep cleaning in-house. Cleaning is where their proprietary judgment lives: the filters, the bias audits, the dedup heuristics. Outsourcing the bytes is fine; outsourcing the editorial decisions usually is not.

Fourth, they treat provenance as a first-class concern with a dedicated engineer or compliance partner. Not a JIRA ticket to address later.

What a single pipeline call looks like with DataSonar

For the fetching and cleaning stages combined, a single call to /v1/scrape with format: "markdown" returns a clean markdown body and a metadata block. The same call, with format: "html", gives you the raw page if your cleaning stage prefers to do its own extraction. /v1/extract/structured returns JSON-LD, OpenGraph, and microdata for structured retrieval downstream. /v1/crawl walks a full site and webhooks back a complete result set so your ingestion pipeline does not hold connections open. /v1/intel/page attaches the tech-stack, SSL, and contact metadata to each source for your provenance schema.
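
As a sketch, a single fetch-and-clean call might look like this. The base URL, auth header, and response keys are assumptions for illustration; the endpoint path and the format parameter are the ones described above:

```python
# Hypothetical sketch of the /v1/scrape call described above. The base URL,
# Authorization scheme, and response keys are assumptions; check the docs.
import requests

resp = requests.post(
    "https://api.datasonar.example/v1/scrape",         # placeholder base URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # assumed auth scheme
    json={"url": "https://docs.example.com/guide", "format": "markdown"},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
# Expected per the description above: a markdown body plus a metadata block.
markdown, metadata = body["markdown"], body["metadata"]
```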

None of this is unique to DataSonar in the abstract — every vendor in the comparison above can produce similar outputs in similar shapes. The reason we built it the way we did is so that all of it lives behind one API key, one billing line, and one schema, which removes a meaningful integration tax for teams that would otherwise stitch together three vendors.

Where to go from here

If you are early in your pipeline build, the highest-leverage move is to write down your source policy and your provenance schema before you fetch anything. If you are mid-build and the fetcher is the part eating your engineering hours, the highest-leverage move is to decide whether you want to keep owning that complexity.

If DataSonar fits the shape of what you are building, the free tier covers 1,000 requests per month, every endpoint, no credit card. Plenty to evaluate on a real subset of your corpus before committing to anything larger.

Common questions

What format works best for LLM training data?
Clean markdown is the dominant format for LLM ingestion in 2026. It preserves headings, lists, and links — all signals that improve model understanding — while stripping presentational HTML that wastes tokens. JSON-LD complements markdown for structured retrieval tasks. Raw HTML works only when you have a downstream cleaning step you control.
How much data do I need for fine-tuning versus pretraining?
Fine-tuning typically needs hundreds to tens of thousands of high-quality examples — quality dominates quantity. Continued pretraining benefits from millions to billions of tokens of in-domain text. The shift in 2026 is that small high-quality datasets often outperform large noisy ones, so a 50,000-document curated corpus can rival a 5-million-document scrape.
Do I need a residential proxy for LLM training data collection?
For most public web content, no. Documentation sites, blogs, news, knowledge bases, and public APIs respond fine to well-behaved scrapers. Residential proxies become necessary when your corpus includes sites with aggressive anti-bot defenses — Amazon, Zillow, LinkedIn, certain travel and finance sites. Plan your source policy first, then choose proxies based on the gaps.
How do I deduplicate scraped pages effectively?
Hash the clean markdown output — not the HTML — to detect exact duplicates. For near-duplicates, use shingling or MinHash to spot pages that share most paragraphs. URL-level deduplication is necessary but not sufficient; the same content often appears at multiple URLs after canonicalization.
Should I crawl or use a sitemap to discover URLs?
Start with the sitemap when one exists — it is a curated list the site owner already maintains. Fall back to crawling only when sitemaps are incomplete or missing. Most sites publish a sitemap at /sitemap.xml; many split into nested index sitemaps for large catalogs.
How do I respect robots.txt for training data?
Read it on every host you crawl, before the first non-robots request. Honor the User-agent: * directives, plus any bot-specific blocks (User-agent: GPTBot, User-agent: ClaudeBot, and so on) that apply to you. Many sites now use machine-readable opt-out signals like the noai meta tag; treat those as binding for training-data use.
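
The standard library covers the robots.txt part; page-level opt-out signals still have to be read from the fetched HTML itself. A minimal sketch:

```python
# Minimal sketch using the standard library's robots.txt parser. It checks
# User-agent rules only; noai meta tags must be read from the page itself.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyTrainingBot", "https://example.com/docs/intro"):
    ...  # allowed under robots.txt; still check page-level opt-outs
```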
What is the legal landscape for training-data scraping in 2026?
It remains jurisdiction-specific and use-case dependent. Publicly accessible content is generally lawful to fetch, but redistribution, model training, and commercial use have separate legal questions in different jurisdictions. Work with counsel before commercial training runs, especially for international datasets, and document your source policy.
How fast can I scrape without getting blocked?
It depends on the site. For typical content sites, one request every two to four seconds is polite and rarely triggers rate limits. Documentation and government sites tolerate higher concurrency. For sensitive sites, slow down further — one request every six to ten seconds. A managed API handles this for you across thousands of sites.

Start pulling clean data in minutes.

1,000 requests free every month. No credit card required.