Crawl whole sites.
In one job.

From a 500-page documentation site to a 100,000-page knowledge base, the crawler walks the link graph, respects robots.txt by default, and delivers clean results to your webhook when the job completes.

curl -X POST https://api.datasonar.dev/v1/crawl \
  -H "Authorization: Bearer osk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 500,
    "depth": 3,
    "concurrency": 10,
    "respect_robots": true,
    "webhook_url": "https://yourapp.com/datasonar-callback"
  }'

# Returns: { "status": "queued", "job_id": "..." }

Use cases for site-wide crawling

LLM training corpora

Crawl entire documentation sites, knowledge bases, or product catalogs. Receive clean markdown for every page, ready to embed.

Competitor monitoring

Snapshot a competitor's site on a schedule. Diff structural changes, new pages, removed pages, modified pricing.

Internal search indexing

Build full-text search over content you do not own. The crawler returns a structured map of every page it discovers, ready for ingestion into your index.

Archival and compliance

Capture full-site snapshots for legal hold, regulatory archive, or pre-acquisition due diligence. Webhook delivery means no long-held connections.

Crawler questions

How big can a crawl be?
The default cap is 500 pages and depth 3. You can raise both; production customers regularly crawl tens of thousands of pages per job. For million-page crawls, talk to us about enterprise capacity.
Does the crawler respect robots.txt?
Yes by default. Each crawl honors the robots.txt of the target site. You can override with the respect_robots: false flag in cases where you have explicit permission to crawl, such as your own site or a partner's.
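With permission in hand, the override is one extra field in the request body. Every field shown below already appears in the example request at the top of the page:

```json
{
  "url": "https://docs.example.com",
  "respect_robots": false,
  "webhook_url": "https://yourapp.com/datasonar-callback"
}
```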
Can I scope the crawl to a single subdomain?
Yes. By default the crawler stays within the same host as the seed URL. Pass same_host: false to follow links across subdomains, or use include_patterns and exclude_patterns for finer control.
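A scoped request might combine these options as sketched below. The option names come from the answer above, but the glob-style pattern syntax is an assumption, not documented behavior:

```json
{
  "url": "https://docs.example.com",
  "same_host": false,
  "include_patterns": ["https://*.example.com/docs/*"],
  "exclude_patterns": ["*/changelog/*"]
}
```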
How does webhook delivery work?
Provide a webhook_url with the request. When the job completes, we send a single POST to your URL with the full result body and a header containing the job id. Webhooks are signed so you can verify the origin.
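Signature checking amounts to recomputing the signature over the raw request body and comparing. Everything scheme-specific in this sketch (HMAC-SHA256, the header name, the whsec_ secret format) is an assumption; confirm the actual signing details in your dashboard:

```shell
# Sketch of verifying a signed webhook with OpenSSL. The HMAC-SHA256 scheme,
# header name, and secret format are assumptions, not documented behavior.
body='{"job_id":"job_abc123","status":"completed"}'
secret='whsec_example'   # hypothetical signing secret

# Recompute the signature over the raw request body...
expected=$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')

# ...and compare it to the value from the (assumed) X-DataSonar-Signature header.
received="$expected"     # stand-in; in production, read this from the header
if [ "$received" = "$expected" ]; then
  echo "webhook verified"
fi
```

Compare with a constant-time comparison in production code; the string equality here is only for illustration.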
What happens to a crawl if I hit my monthly quota midway?
The crawler pauses and returns a partial result with everything collected so far plus a quota_exceeded flag. You can upgrade your plan and resume the job with the same job id.
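A resume call might look like the sketch below. The endpoint path and the DATASONAR_API_KEY variable are assumptions (the answer above only promises that the original job id is reused), so the request is guarded behind the key being set:

```shell
# Hypothetical resume call; the /resume endpoint path is an assumption.
JOB_ID="job_abc123"
if [ -n "${DATASONAR_API_KEY:-}" ]; then
  curl -X POST "https://api.datasonar.dev/v1/crawl/$JOB_ID/resume" \
    -H "Authorization: Bearer $DATASONAR_API_KEY"
else
  echo "set DATASONAR_API_KEY to resume job $JOB_ID"
fi
```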
How fast is a crawl?
Throughput depends on target site responsiveness, concurrency, and whether pages need JavaScript rendering. A typical documentation site crawls at 5-15 pages per second; aggressive crawling against single-server sites is automatically slowed to be polite.

Start your first crawl free.

The free tier covers 1,000 pages a month. Plenty to prove value.