Feed your model the cleanest data on the web.

DataSonar is built for AI teams. Pull clean markdown from any URL, build training corpora at scale, give your agents a single tool for the public web — all from one API key.

Markdown that LLMs actually want

Built-in Readability extraction strips navigation, ads, sidebars, footers, and cookie banners before the page leaves our system. What lands in your prompt is the body content — nothing else. Token budgets stop being a fight.
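
To see what that saves on a page you care about, tokenize the raw HTML next to the markdown the API returns. A quick sketch (it reuses the scrape call and content field from the example further down the page; actual savings vary by page):

import httpx
import tiktoken

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})
url = "https://example.com/article"

raw_html = httpx.get(url).text
markdown = ds.post("https://api.datasonar.dev/v1/scrape",
                   json={"url": url, "format": "markdown"}).json()["content"]

# Count both versions with a common chat-model tokenizer
enc = tiktoken.get_encoding("cl100k_base")
print("raw HTML tokens:", len(enc.encode(raw_html)))
print("markdown tokens:", len(enc.encode(markdown)))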

Structured signals alongside the text

Every scrape also returns OpenGraph, Twitter Card, JSON-LD, and Microdata when available. Use the prose for embeddings and the structured layer for filtering, faceting, or knowledge graph construction.
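
In practice that means you can facet before you embed. A sketch of the idea (the metadata key and field names below are illustrative, not the documented response schema):

import httpx

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})
page = ds.post("https://api.datasonar.dev/v1/scrape",
               json={"url": "https://example.com/article", "format": "markdown"}).json()

meta = page.get("metadata", {})       # OpenGraph / Twitter Card / JSON-LD, when present
corpus = []
if meta.get("og:type") == "article":  # filter on the structured layer...
    corpus.append({
        "text": page["content"],      # ...and keep the prose for embeddings
        "published": meta.get("article:published_time"),
    })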

Tool-use ready

Plug the scrape endpoint into any agent framework — Anthropic tool use, OpenAI function calling, LangChain, LlamaIndex. Predictable JSON, fast response times, and a hard timeout so agents never hang.
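
The wrapper is a few lines. Here is a sketch in the shape OpenAI function calling expects; the same fetch_page function drops into Anthropic tool use or a LangChain tool with only the schema wrapper changing:

import httpx

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})

def fetch_page(url: str) -> str:
    """Fetch a public web page and return its main content as markdown."""
    resp = ds.post("https://api.datasonar.dev/v1/scrape",
                   json={"url": url, "format": "markdown"},
                   timeout=30)
    return resp.json()["content"]

# Tool schema for OpenAI function calling; pass it in the `tools` list of a chat call
fetch_page_tool = {
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch any public URL and return the page body as clean markdown.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "Absolute URL to fetch"}},
            "required": ["url"],
        },
    },
}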

Crawl at training scale

Full-site crawls with budget, depth, and concurrency controls. Stream results to a webhook. Build a 100,000-page training corpus in an afternoon, not a sprint.
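
Kicking off a crawl is one request. A sketch follows; the endpoint path and field names are illustrative, while the controls themselves (budget, depth, concurrency, webhook delivery) are the ones described above:

import httpx

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})

# Assumed path and field names; check the crawl docs for the exact schema
job = ds.post("https://api.datasonar.dev/v1/crawl",
              json={
                  "url": "https://docs.example.com",
                  "max_pages": 100_000,                      # budget
                  "max_depth": 5,                            # link depth from the start URL
                  "concurrency": 20,                         # parallel fetches
                  "webhook_url": "https://your.app/ingest",  # batches POSTed here as they finish
                  "format": "markdown",
              }).json()
print(job)  # job handle to poll, or just wait for webhook deliveries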

Drop into any AI stack.

RAG ingestion, agent tool-use, training corpus build — same API, three patterns.

import httpx
from openai import OpenAI

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})
openai = OpenAI()

# Pull clean markdown ready for embedding
page = ds.post("https://api.datasonar.dev/v1/scrape",
               json={"url": "https://example.com/article", "format": "markdown"}).json()

# Hand straight to an embedding model
embedding = openai.embeddings.create(
    model="text-embedding-3-large",
    input=page["content"],
).data[0].embedding

Used by AI teams for

Retrieval-augmented generation

Index documentation, knowledge bases, and competitor sites with markdown that embeds cleanly and retrieves predictably.

Pretraining and continued pretraining

Build domain-specific corpora — finance, legal, medical, technical — without writing custom scrapers per site.

Agentic workflows

Give your agent a single tool for the entire public web. One key, one schema, no fragile per-site adapters.

Evaluation and grounding

Fact-check model outputs against live sources. Pull the same URL on demand to verify or rebut.
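
A grounding check can be as small as one scrape plus one judge call. A sketch (the prompt, model choice, and example claim are illustrative):

import httpx
from openai import OpenAI

ds = httpx.Client(headers={"Authorization": "Bearer osk_..."})
openai = OpenAI()

claim = "Example Corp shipped its first GPU in 2019."
source = ds.post("https://api.datasonar.dev/v1/scrape",
                 json={"url": "https://example.com/press/gpu-launch", "format": "markdown"}
                 ).json()["content"]

# Ask a judge model whether the live source supports the claim
verdict = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Does the source support the claim? Answer SUPPORTED, REFUTED, "
                          f"or NOT ENOUGH INFO.\n\nClaim: {claim}\n\nSource:\n{source}"}],
).choices[0].message.content
print(verdict)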

AI engineering questions, answered

How is DataSonar's markdown different from raw HTML or BeautifulSoup output?
We run a Readability-style extraction that identifies the main content block and discards everything around it. The resulting markdown is typically 80 to 95 percent smaller than the original HTML while preserving headings, links, and prose structure. For LLM ingestion this means lower token costs and higher signal-to-noise in your embeddings.
Can I use DataSonar as a tool in agent frameworks?
Yes. The scrape, batch scrape, and intelligence endpoints all return predictable JSON with stable schemas. Wrapping them as a tool for Anthropic's tool-use, OpenAI's function calling, LangChain, or LlamaIndex is a one-function exercise. We publish example wrappers in the docs.
Does DataSonar respect robots.txt?
Our crawler endpoint respects robots.txt by default and exposes a flag to override for cases where you have explicit permission. Individual scrape calls fetch a single page as a user agent would, which is the same pattern used by every browser, search engine bot, and link previewer.
What about copyright and fair use for training data?
Training data legality varies by jurisdiction and use case. We give you the tools to fetch public web content; the licensing decisions remain with you. For commercial training runs we recommend working with your legal team on a source policy and using our crawl endpoint's exclude rules to honor it at scale.
Do you support streaming responses?
Individual scrape responses are JSON, returned in one shot when the page is fully rendered. For very large jobs we recommend the async pattern with webhook delivery — your server gets a clean POST when each batch completes, no long-held connections.
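
On the receiving side, the webhook handler is a few lines. A sketch with FastAPI (the payload shape is illustrative, not the documented schema):

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/datasonar/webhook")
async def receive_batch(request: Request):
    batch = await request.json()
    for page in batch.get("pages", []):  # assumed key for completed pages
        print(page.get("url"))           # hand off to your ingestion pipeline here
    return {"ok": True}
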
Is the output deterministic enough for embedding-based search?
Yes, for the content itself. The markdown extraction is deterministic given the same HTML — the same URL fetched twice returns the same markdown body unless the source page itself changes. Dynamic content like timestamps will naturally vary.
Can I scrape PDFs and other non-HTML formats?
We focus on HTML and JavaScript-rendered pages today. PDFs are on the roadmap — talk to us if you have specific needs.
How does this compare to Firecrawl for LLM workflows?
Firecrawl popularized this category and produces excellent markdown output. DataSonar matches the markdown quality and adds two things on top: a domain intelligence layer that returns DNS, WHOIS, SSL, and tech-stack data alongside the content, and a growing catalog of vertical actors for sites like Amazon and Zillow where generic scraping returns inconsistent data.

Ship your first RAG pipeline today.

1,000 requests free. No credit card.