
Clean article extraction

Readability extraction: strips navigation, ads, and sidebars.

POST /v1/extract/clean

Returns the main article body of a page along with the title, word count, and estimated reading time. Built for LLM ingestion and content indexing.

Parameters

Name     Type     Required  Default  Description
url      string   yes       -        URL of the article.
stealth  boolean  no        true     Apply stealth countermeasures.
timeout  integer  no        30       Per-request timeout.
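
As a rough illustration of the parameters above, the request body could be assembled like this. `build_payload` is a hypothetical helper, not part of any official client; its defaults mirror the documented ones, and `timeout` is passed through unchanged because its unit is not stated here.

```python
import json


def build_payload(url, stealth=True, timeout=30):
    """Assemble the JSON body for POST /v1/extract/clean (hypothetical helper)."""
    if not isinstance(url, str) or not url:
        raise ValueError("url is required and must be a non-empty string")
    if not isinstance(stealth, bool):
        raise TypeError("stealth must be a boolean")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int):
        raise TypeError("timeout must be an integer")
    return json.dumps({"url": url, "stealth": stealth, "timeout": timeout})
```

Sending the defaults explicitly is a design choice for readability; omitting `stealth` and `timeout` should be equivalent, given the documented defaults.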

Request

curl -X POST https://api.datasonar.dev/v1/extract/clean \
  -H "Authorization: Bearer osk_..." \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Web_scraping"}'
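
The same call can be sketched in Python with only the standard library. The helper names `make_request` and `extract_clean`, and the `client_timeout` argument, are illustrative and not part of the API. The `Content-Type: application/json` header is an assumption, since the page does not say which content type the endpoint expects.

```python
import json
import urllib.request

API_URL = "https://api.datasonar.dev/v1/extract/clean"


def make_request(article_url, api_key):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"url": article_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the endpoint expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def extract_clean(article_url, api_key, client_timeout=60):
    """Send the request and return the decoded JSON response as a dict.

    client_timeout is the local socket timeout in seconds; it is separate
    from the API's own `timeout` parameter.
    """
    req = make_request(article_url, api_key)
    with urllib.request.urlopen(req, timeout=client_timeout) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the headers and body easy to inspect or log before any network call is made.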

Response

{
  "status": "success",
  "title": "Web scraping - Wikipedia",
  "content_html": "<div>...</div>",
  "content_text": "Method of extracting data from websites...",
  "word_count": 3873,
  "reading_time_min": 16
}
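
A minimal sketch of consuming this response, for example before passing the text to an indexer. `CleanArticle` and `parse_response` are hypothetical names. The error response shape is not documented on this page, so treating any status other than "success" as a failure is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class CleanArticle:
    title: str
    content_html: str
    content_text: str
    word_count: int
    reading_time_min: int


def parse_response(raw):
    """Turn the JSON response text into a CleanArticle."""
    data = json.loads(raw)
    # Assumption: anything other than "success" signals a failed extraction.
    if data.get("status") != "success":
        raise ValueError(f"extraction failed: status={data.get('status')!r}")
    return CleanArticle(
        title=data["title"],
        content_html=data["content_html"],
        content_text=data["content_text"],
        word_count=data["word_count"],
        reading_time_min=data["reading_time_min"],
    )
```

For plain-text pipelines, `content_text` is usually the field to keep; `content_html` preserves the article's markup for cases where it matters.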
