Scraping Zillow in 2026: what works, what fails, what to do about it
An honest look at the realities of Zillow extraction in 2026 — what really works, where pipelines break, and how to build something that holds up in production.
Why this is harder than it looks
Zillow tracks over 110 million properties in the United States. For real estate analytics teams, off-market deal-flow tools, proptech startups, and individual investors, that dataset is irreplaceable — and there is no sanctioned API path to it for new developers. Which means everyone who needs the data is scraping it. Which means Zillow has spent the last several years building one of the most aggressive bot defenses on the public web.
This post is about what realistically works in 2026, where pipelines break, and how to build something that holds up. We will be specific about the trade-offs and honest about what each approach costs in real money and engineering time.
The official API path closed
Zillow used to operate a public Property API for licensed Real Estate Professionals and approved partners. Access has been progressively restricted since 2021. For most new use cases — proptech analytics, investor tooling, lead-gen for real estate services — the API path is no longer practical. Some legacy partner integrations remain, but Zillow has signaled clearly that data redistribution at scale is not the future of their partner program.
The practical result is that everyone serious about property data at scale is now scraping, licensing a third-party dataset, or partnering directly with MLS systems for the segments of the market where that is feasible. Each path has trade-offs we will get into below.
What you are up against
Zillow's bot defense is a stack, not a single product. Several things happen on a typical request, and any of them can decide to challenge or block you.
Behavioral fingerprinting. The primary defender analyzes browser fingerprints, mouse movements, scroll velocity, and timing patterns. Reputation databases catalog known automation tools and known bad IPs. A single suspicious signal triggers a soft challenge; multiple signals trigger a block.
Network-layer checks. Front-line CDN protection adds TLS fingerprinting, HTTP header order checks, and basic rate-limiting. This alone catches naive scrapers that use default HTTP clients — their TLS fingerprints are very different from real browsers.
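One way to see the TLS-fingerprint gap, as a hedged sketch: compare a default Python HTTP client against one that impersonates a browser's TLS stack (curl_cffi is one such library). The URL and the impersonation profile name here are illustrative assumptions, not a recipe.

```python
# Illustrating the TLS-fingerprint gap described above. A default HTTP client
# announces a non-browser TLS signature and is typically rejected at the CDN;
# a client that impersonates a browser's TLS stack gets further.
import requests
from curl_cffi import requests as curl_requests

url = "https://www.zillow.com/homedetails/example/"  # hypothetical URL

# Default client: python-requests TLS fingerprint, usually blocked outright.
plain = requests.get(url, timeout=30)
print("plain client:", plain.status_code)

# Browser-impersonating client: mimics Chrome's TLS and HTTP/2 signature.
spoofed = curl_requests.get(url, impersonate="chrome", timeout=30)
print("impersonating client:", spoofed.status_code)
```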
JavaScript fingerprinting. Once you are past the network layer, the page runs JavaScript that checks for headless-browser indicators: missing plugins, missing fonts, inconsistent timezone or language settings, canvas-rendering inconsistencies. Each check is fast; the combination is hard to fully defeat.
Client-side rendering requirements. The full property page is rendered client-side from a JavaScript application. If your scraper does not execute JavaScript, you get a thin shell with no useful content. A full browser is required for traditional DOM scraping — and a full browser is exactly where the fingerprinting has the most signal to work with.
IP reputation. Datacenter IP ranges are flagged aggressively. Traffic from major cloud providers is treated as suspect by default and often blocked at the network layer before the higher-level checks even run.
The cumulative effect is that a naive scraper gets a 403 within seconds. A more sophisticated scraper using a headless browser on a datacenter IP gets a few requests through before triggering captcha challenges. A scraper using a real browser through a residential proxy with reasonable timing gets a much higher success rate but is still occasionally challenged.
The three working strategies
Most production teams pick one of three approaches, sometimes in combination.
Strategy 1: Headless browser plus residential proxy. Drive a real browser through a residential proxy pool, with realistic mouse-movement simulation and reasonable per-IP rate limits. This is the highest-fidelity approach and the most expensive to operate. The proxy cost dominates — a serious Zillow pipeline can spend $0.50 to $2.00 per 1,000 successful requests on proxies alone, before any browser infrastructure cost. Success rates with a good setup are typically 90 to 98 percent.
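A minimal Strategy 1 sketch with Playwright follows. The proxy host and credentials are placeholders, and the pacing values are illustrative; tune timing and concurrency to your own proxy pool.

```python
# Strategy 1 sketch: a real Chromium instance routed through a residential
# proxy, with modest randomized pacing. Proxy details are placeholders.
import random
import time
from playwright.sync_api import sync_playwright

PROXY = {
    "server": "http://proxy.example-residential.net:8000",  # placeholder
    "username": "PROXY_USER",
    "password": "PROXY_PASS",
}

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=60_000)
        # Small randomized scroll and pause to avoid perfectly uniform timing.
        page.mouse.wheel(0, random.randint(400, 1200))
        time.sleep(random.uniform(1.5, 4.0))
        html = page.content()
        browser.close()
        return html
```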
Strategy 2: Direct payload extraction. Instead of rendering the page in a browser and scraping the DOM, fetch the HTML and parse the embedded data payload, which contains the full property data as JSON. This is more efficient per request and more stable across UI redesigns. The catch is that you still need to get past the bot defenses to fetch the HTML in the first place. Once you have the HTML, parsing is straightforward.
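As a hedged sketch of the parsing half of Strategy 2: Zillow pages have historically shipped their data as a JSON blob inside a script tag (a Next.js-style "__NEXT_DATA__" element). The tag id and the structure of the parsed object are assumptions that drift with frontend releases, so treat them as configuration rather than constants.

```python
# Strategy 2 sketch: parse the embedded JSON payload out of fetched HTML
# rather than scraping the rendered DOM. The script-tag id ("__NEXT_DATA__")
# is an assumption and changes with Zillow's frontend releases.
import json
from bs4 import BeautifulSoup

def extract_payload(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("embedded payload not found; page may be a challenge shell")
    return json.loads(tag.get_text())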
Strategy 3: A managed scraping API. Outsource the anti-bot dance entirely. Several vendors handle the browser orchestration, proxy rotation, retry logic, and fingerprinting on their side. Your code becomes a single API call. The cost shifts from infrastructure and engineering time to per-request pricing. For most teams below 100,000 properties per month, this is the cheapest overall path once you account for engineering time.
Where DataSonar fits
The DataSonar Zillow actor at POST /v1/actors/zillow uses Strategy 2 (payload extraction) as the primary path, with the network-level work needed to reach the HTML in the first place. The actor returns address, city, state, zipcode, price, Zestimate, bedrooms, bathrooms, living area in square feet, and year built in a clean schema.
When the bot defense returns its captcha wall, the actor detects the specific signature and returns an explicit error: "Zillow Anti-Bot Wall detected. Please ensure you are passing a high-quality residential proxy via the 'proxy' parameter." This is deliberate. Silently retrying or returning a partial result is worse than telling the caller what happened, because pipelines that get partial silent results end up with corrupted databases.
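A sketch of what calling the actor looks like in practice. The endpoint path and the 'proxy' parameter come from the description above; the base URL, auth header, and response shape are illustrative assumptions.

```python
# Calling the Zillow actor and handling the bot-wall error explicitly.
# Base URL, auth header, and response shape are assumptions for illustration.
import requests

API_BASE = "https://api.datasonar.example"  # placeholder base URL
API_KEY = "YOUR_API_KEY"

def extract_property(url: str, proxy_url: str) -> dict:
    resp = requests.post(
        f"{API_BASE}/v1/actors/zillow",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url, "proxy": proxy_url},
        timeout=120,
    )
    body = resp.json()
    if isinstance(body, dict) and "Anti-Bot Wall" in str(body.get("error", "")):
        # Explicit failure: send to the retry queue instead of storing partial data.
        raise RuntimeError(f"bot wall hit for {url}; retry later on a fresh proxy IP")
    resp.raise_for_status()
    return body  # address, city, state, zipcode, price, Zestimate, beds, baths, ...
```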
The honest trade-off: in our internal benchmarks, the actor extracts cleanly from Zillow URLs at a high success rate when paired with a residential proxy. Without a proxy, success rates fall significantly as challenges accumulate. We do not provide proxies as part of the base actor; production users supply their own. This is a deliberate choice for pricing transparency and customer flexibility — you can bring your existing proxy contract instead of paying us a margin on top.
How to think about proxy spend
For teams new to scraping at this level, proxy spend is the cost that surprises everyone. Residential proxy providers charge per gigabyte transferred or per request — typically in the range of $5 to $15 per gigabyte for residential traffic. A single Zillow property page is a few hundred kilobytes of HTML plus assets. With the scraper configured to fetch only the HTML and skip images and CSS, a property page comes to roughly 300 KB.
The math: at $10 per gigabyte and 300 KB per page, each Zillow extraction costs roughly $0.003 in proxy traffic. At 100,000 properties per month, that is $300 in proxy spend, plus failed-request overhead which can double or triple that depending on configuration. The takeaway: residential proxies are not the dominant cost at low to medium volumes; engineering time is. At high volumes (millions of properties), proxy spend becomes meaningful and warrants negotiation with the provider.
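The same arithmetic, parameterized so you can plug in your own provider rate, page weight, and failure overhead:

```python
# Proxy-spend arithmetic from above, parameterized for your own numbers.
GB = 1_000_000_000  # bytes per gigabyte (decimal, as providers bill)

def monthly_proxy_cost(pages_per_month: int,
                       bytes_per_page: int = 300_000,
                       usd_per_gb: float = 10.0,
                       failure_overhead: float = 2.0) -> float:
    """failure_overhead = 2.0 means failed/retried requests double the traffic."""
    traffic_gb = pages_per_month * bytes_per_page * failure_overhead / GB
    return traffic_gb * usd_per_gb

# 100,000 pages at $10/GB and 300 KB/page: $300 before retries, $600 with 2x overhead.
print(monthly_proxy_cost(100_000, failure_overhead=1.0))  # 300.0
print(monthly_proxy_cost(100_000, failure_overhead=2.0))  # 600.0
```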
The map-pagination problem
One Zillow detail that surprises new pipelines: there is no traditional URL-based pagination for search results. Zillow uses a map-based interface where the visible results are bounded by the current map viewport. To enumerate all properties in a city, you have to programmatically pan and zoom the map, capturing the results from each viewport.
The straightforward workaround is to walk a grid of small map viewports across the geography you care about. For Manhattan, a 1-kilometer grid works out to roughly 60 viewports and covers every property. For a full state, the grid count grows quickly and proxy spend rises accordingly. Some teams instead start from a list of known addresses (from county assessor records, MLS feeds, or public data sources) and look up each property individually, which sidesteps the map-pagination problem entirely.
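A rough sketch of the grid walk: tile a bounding box into fixed-size cells and enumerate each cell as a viewport. The degree-per-meter conversion is the simple spherical approximation, and the actual search call is left abstract.

```python
# Tile a bounding box into viewport-sized cells for a map-grid walk.
import math

def viewport_grid(north: float, south: float, east: float, west: float,
                  cell_m: float = 1000.0):
    """Yield (n, s, e, w) viewport bounds covering the box in cell_m-sized tiles."""
    lat_step = cell_m / 111_320.0  # meters per degree of latitude
    mid_lat = math.radians((north + south) / 2)
    lon_step = cell_m / (111_320.0 * math.cos(mid_lat))
    lat = south
    while lat < north:
        lon = west
        while lon < east:
            yield (min(lat + lat_step, north), lat, min(lon + lon_step, east), lon)
            lon += lon_step
        lat += lat_step

# Example: a bounding box roughly around Manhattan, 1 km cells. The box includes
# water, so the tile count runs higher than the land-area estimate.
tiles = list(viewport_grid(40.88, 40.70, -73.91, -74.02))
print(len(tiles), "viewports")
```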
A realistic production pipeline
What the most successful Zillow pipelines look like in 2026:
- Seed URL list. Start from a curated list of property URLs, either by walking the map grid or by joining against public address data. Maintain the list in a versioned manifest so reruns are deterministic.
- Extract via API. Send each URL to a managed scraping API or actor. Supply your own residential proxy if your vendor expects it; many vendors bundle proxies but charge for the bundle.
- Handle the wall explicitly. When the response indicates a bot wall, route the URL into a retry queue with backoff (a sketch follows this list). Do not retry immediately on the same IP — that wastes your proxy spend on requests that will fail.
- Validate the extraction. Cross-check the returned price and address against expected ranges. Outliers are a common signal that the response was a soft challenge, not a real listing.
- Refresh on a schedule. Property data changes — prices, status, agent, photos. Most production users refresh active listings weekly and full inventory monthly.
- Respect robots.txt. Zillow's robots.txt evolves over time. Read it at the start of each crawl and avoid paths it disallows.
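A sketch of the "handle the wall explicitly" step: a retry queue with exponential backoff and a per-URL attempt cap. The extract_property() function is the hypothetical actor call from the earlier sketch; swap in whatever extraction path you use.

```python
# Retry queue with exponential backoff for URLs that hit the bot wall.
import heapq
import time

MAX_ATTEMPTS = 4
BASE_DELAY_S = 300  # 5 minutes before the first retry

def run_queue(urls, extract_property, proxy_url):
    # Priority queue of (not_before_timestamp, attempt, url).
    queue = [(0.0, 0, u) for u in urls]
    heapq.heapify(queue)
    results, failures = {}, []
    while queue:
        not_before, attempt, url = heapq.heappop(queue)
        wait = not_before - time.time()
        if wait > 0:
            time.sleep(wait)
        try:
            results[url] = extract_property(url, proxy_url)
        except RuntimeError:
            # Bot wall: back off exponentially instead of hammering the same IP.
            if attempt + 1 >= MAX_ATTEMPTS:
                failures.append(url)
            else:
                delay = BASE_DELAY_S * (2 ** attempt)
                heapq.heappush(queue, (time.time() + delay, attempt + 1, url))
    return results, failures
```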
What about Redfin, Realtor.com, and Trulia?
The other major listing sites have their own bot defenses. Some are somewhat more permissive than Zillow; others ship their own detection stacks. Trulia is owned by Zillow and shares much of the same infrastructure, which means the techniques that work on Zillow tend to work on Trulia and vice versa.
For comprehensive market data, most production teams scrape multiple sites and reconcile the records. The same property often appears on Zillow, Redfin, and Realtor.com with slightly different metadata; cross-referencing improves data quality significantly.
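A minimal reconciliation sketch, assuming each site's records expose an address and zipcode field: key records on a normalized (street, zip) pair and merge fields, preferring whichever source has a value. Real pipelines use proper address normalization or fuzzy matching; the record shape here is an assumption.

```python
# Merge property records from multiple sites on a normalized address key.
import re

def address_key(record: dict) -> tuple:
    street = re.sub(r"[^a-z0-9 ]", "", record["address"].lower()).strip()
    street = re.sub(r"\s+", " ", street)
    return (street, str(record["zipcode"]))

def reconcile(*sources: list) -> dict:
    merged = {}
    for records in sources:
        for rec in records:
            existing = merged.setdefault(address_key(rec), {})
            for field, value in rec.items():
                # Keep the first non-empty value seen for each field.
                if existing.get(field) in (None, "") and value not in (None, ""):
                    existing[field] = value
    return merged
```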
The honest bottom line
Scraping Zillow at production scale is hard, expensive in proxies, and operationally non-trivial. The teams that do it successfully have either invested heavily in their own browser-plus-proxy infrastructure or outsourced the anti-bot problem to a managed API and brought their own residential proxy contract. The teams that fail are the ones that underestimate the difficulty, deploy a naive scraper on a single cloud region, and spend a month watching success rates collapse from 90 percent to 5 percent as their IP range gets reputation-flagged.
If you are evaluating options, run a real benchmark — 100 randomly chosen property URLs, run through three different vendors or in-house approaches, measuring success rate and data accuracy. Vendors who are confident will let you do this on a free or trial tier. DataSonar's free tier covers 1,000 actor calls a month, enough to run a meaningful benchmark on Zillow and three other actors before committing to anything.