Building a Resilient Web Scraper in Python

Anyone who has scraped more than a few hundred pages knows the real challenge isn't parsing HTML — it's staying connected. Targets rate-limit you, ban IPs, throw 429s and 503s, and silently serve CAPTCHA pages with a 200 status. A scraper that ignores this works in a demo and dies in production.

This is a practical pattern for a scraper that survives: rotating proxies, sticky sessions when you need them, exponential backoff with jitter, and ban detection that goes beyond status codes. All in plain requests — no heavy framework.

Disclosure: I work on Proxya, a proxy provider, and use it in the examples. The code is provider-agnostic — any proxy that speaks the standard user:pass@host:port format works identically (Bright Data, Oxylabs, IPRoyal, etc.). Swap the gateway and you're done.

1. Why a single IP fails

Make 50 fast requests from one IP and most protected sites will start returning 429 (Too Many Requests) or quietly degrade your responses. The fix is to spread requests across many IPs so no single address looks abusive. That's what a rotating residential pool does — each request can exit from a different real IP.

There are two modes you'll actually use:

Rotating — a new IP per request. Best for broad crawls where requests are independent.
Sticky session — the same IP for a sequence of requests. Required when a site ties a session/cart/login to an IP.

Most providers expose both by tweaking the proxy username. Here's a small helper that builds the right credential string:

from dataclasses import dataclass

GATEWAY = "gw.proxya.co:8000"

@dataclass
class ProxyConfig:
    username: str
    password: str

    def url(self, *, session: str | None = None, country: str | None = None) -> str:
        user = self.username
        if country:                 # geo-target, e.g. "us", "de", "gb"
            user += f"-country-{country}"
        if session:                 # reuse the same IP across requests
            user += f"-session-{session}"
        return f"http://{user}:{self.password}@{GATEWAY}"

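Rotating = call url() with no session. Sticky = pass a stable session id. The -country- and -session- username suffixes are one gateway's convention; other providers may use different parameter names, so adjust url() to match your provider's docs.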
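One edge case the helper above does not cover: a username or password containing characters such as @, : or / produces an ambiguous proxy URL. Here is a minimal sketch of a percent-encoding variant. The function name build_proxy_url and the gateway gw.example.test:8000 are placeholders invented for this illustration, and the suffix syntax is simply carried over from the helper above.

```python
from __future__ import annotations

from urllib.parse import quote


def build_proxy_url(
    username: str,
    password: str,
    *,
    gateway: str = "gw.example.test:8000",  # placeholder host:port, not a real gateway
    session: str | None = None,
    country: str | None = None,
) -> str:
    """Build an http:// proxy URL with percent-encoded credentials (hypothetical helper)."""
    user = username
    if country:
        user += f"-country-{country}"
    if session:
        user += f"-session-{session}"
    # safe="" makes quote() encode "/" as well, which it would otherwise leave alone
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{gateway}"


if __name__ == "__main__":
    print(build_proxy_url("alice", "p@ss:word", session="job-42"))
```

As far as I know, requests percent-decodes the credentials it reads from a proxy URL, so the gateway should receive the original values. It is still worth confirming with one test request against your own provider.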

2. Detecting a ban (status codes lie)

The most common scraping bug: trusting response.status_code. Plenty of anti-bot systems return 200 OK with a CAPTCHA or "access denied" body. So define what "blocked" means for your target:

BLOCK_SIGNALS = ("captcha", "access denied", "unusual traffic", "are you a robot")

def looks_blocked(response) -> bool:
    if response.status_code in (403, 429, 503):
        return True
    body = response.text[:2000].lower()
    return any(sig in body for sig in BLOCK_SIGNALS)

Tune BLOCK_SIGNALS to the strings the site actually serves on a block — inspect a real blocked response once and you'll see them.
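When tuning the signals, it helps to know which one fired. The sketch below is a variant of looks_blocked that returns a short reason string instead of a bool; block_reason and BLOCK_STATUSES are names made up for this example, and the SimpleNamespace objects only stand in for real responses.

```python
from __future__ import annotations

from types import SimpleNamespace

BLOCK_STATUSES = (403, 429, 503)
BLOCK_SIGNALS = ("captcha", "access denied", "unusual traffic", "are you a robot")


def block_reason(response) -> str | None:
    """Return a short reason if the response looks blocked, else None."""
    if response.status_code in BLOCK_STATUSES:
        return f"status:{response.status_code}"
    body = response.text[:2000].lower()
    for sig in BLOCK_SIGNALS:
        if sig in body:
            return f"body:{sig}"
    return None


if __name__ == "__main__":
    # Stand-ins for requests.Response; only .status_code and .text are read.
    samples = [
        SimpleNamespace(status_code=200, text="<html><h1>Product list</h1></html>"),
        SimpleNamespace(status_code=200, text="<html>Please complete the CAPTCHA</html>"),
        SimpleNamespace(status_code=429, text=""),
    ]
    for r in samples:
        print(r.status_code, block_reason(r))
```

Logging the reason also makes false positives visible: a legitimate page that merely mentions "captcha" near the top of its HTML would be flagged by a plain substring check.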

3. Retries with exponential backoff + jitter

When you hit a block, don't hammer the target. Back off exponentially, and add jitter so concurrent workers don't retry in lockstep (the "thundering herd" problem). Each retry also rotates to a fresh IP:

import random
import time
import requests

def fetch(url: str, cfg: ProxyConfig, *, max_retries: int = 4, country: str | None = None) -> requests.Response:
    last_exc = None
    for attempt in range(max_retries):
        proxy = cfg.url(country=country)  # no session id, so each attempt gets a fresh IP
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
                headers={"User-Agent": _random_ua()},
            )
            if not looks_blocked(resp):
                return resp
        except requests.RequestException as exc:
            last_exc = exc  # network error, timeout, proxy hiccup

        if attempt < max_retries - 1:  # don't sleep after the final attempt
            # exponential backoff: 1s, 2s, 4s ... + up to 1s of jitter
            time.sleep((2 ** attempt) + random.random())

    raise RuntimeError(f"Failed after {max_retries} attempts: {url}") from last_exc
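Two refinements to the delay calculation are common: capping the exponential term so a long retry chain cannot sleep for minutes, and honoring a Retry-After header when the server sends one. The sketch below isolates the delay logic so it can be tested without a network. The name backoff_delay and its parameters are assumptions for this example, and it only handles the integer-seconds form of Retry-After, not the HTTP-date form.

```python
from __future__ import annotations

import random


def backoff_delay(
    attempt: int,
    *,
    base: float = 1.0,
    cap: float = 60.0,
    retry_after: str | None = None,
    rng: random.Random | None = None,
) -> float:
    """Seconds to wait before retrying after 0-based attempt number `attempt`."""
    rng = rng or random.Random()
    # Same shape as the loop above (1s, 2s, 4s ...), capped, plus up to 1s of jitter
    delay = min(cap, base * (2 ** attempt)) + rng.random()
    if retry_after is not None:
        try:
            # If the server asks for a longer wait than our schedule, respect it
            delay = max(delay, float(int(retry_after)))
        except ValueError:
            pass  # HTTP-date form or garbage: fall back to the computed delay
    return delay


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt, cap=10.0), 2))
```

Inside a retry loop, the header value would come from something like resp.headers.get("Retry-After") whenever a response was actually received.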

4. Rotate User-Agents too

Rotating IPs while sending an identical User-Agent on every request is a giveaway. Pair IP rotation with header rotation:

_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36",
]

def _random_ua() -> str:
    return random.choice(_UAS)

In 2026, fingerprinting goes deeper than the UA string (TLS/JA3, header order, browser behavior). For JS-heavy or heavily protected targets you'll graduate to a real browser via Playwright — but for most JSON/HTML endpoints, rotating IP + UA + sane pacing gets you a long way.

5. A sticky-session example

Some flows break if your IP changes mid-sequence — add to cart, paginate behind a session cookie, anything stateful. Pin one IP for the whole sequence by reusing a session id:

def scrape_paginated(base_url: str, cfg: ProxyConfig, pages: int = 5) -> list[str]:
    session_id = f"job-{random.randint(1000, 9999)}"  # one IP for this run
    proxy = cfg.url(session=session_id)
    s = requests.Session()
    s.proxies.update({"http": proxy, "https": proxy})
    s.headers["User-Agent"] = _random_ua()  # one UA per session: same IP, same "browser"

    results = []
    for page in range(1, pages + 1):
        resp = s.get(base_url, params={"page": page}, timeout=30)
        if looks_blocked(resp):
            break  # this IP is burned; return what we collected so far
        results.append(resp.text)
        time.sleep(1 + random.random())  # be polite
    return results
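scrape_paginated stops at the first block and returns partial results. If the whole sequence is needed, one option is to start over on a new sticky session, since any state tied to the old IP is gone anyway. The sketch below shows that control flow with the network call injected as a function so it can run offline. run_sticky_sequence and the fetch_page callback are hypothetical names, and whether restarting from page 1 is correct depends on the target.

```python
from __future__ import annotations

import uuid
from typing import Callable


def run_sticky_sequence(
    fetch_page: Callable[[str, int], str | None],
    pages: int,
    *,
    max_sessions: int = 3,
) -> list[str]:
    """Fetch pages 1..pages on one session id; on a block, retry the whole run on a new one.

    fetch_page(session_id, page) should return the page body, or None if blocked.
    """
    for _ in range(max_sessions):
        session_id = f"job-{uuid.uuid4().hex[:12]}"  # new id, so the gateway assigns a new IP
        results = []
        for page in range(1, pages + 1):
            body = fetch_page(session_id, page)
            if body is None:
                break  # blocked mid-sequence: abandon this session
            results.append(body)
        else:
            return results  # inner loop finished without a block
    raise RuntimeError(f"Blocked on {max_sessions} sessions in a row")


if __name__ == "__main__":
    first_session: list[str] = []

    def fake_fetch(session_id: str, page: int) -> str | None:
        # Simulate a block on page 2 of the first session only
        if not first_session:
            first_session.append(session_id)
        if session_id == first_session[0] and page == 2:
            return None
        return f"page {page}"

    print(run_sticky_sequence(fake_fetch, 3))
```

In real use, fetch_page would wrap the s.get call and looks_blocked check from scrape_paginated, plus the polite sleep between pages that this sketch omits.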

6. Putting it together

if __name__ == "__main__":
    cfg = ProxyConfig(username="USERNAME", password="PASSWORD")

    # Independent crawl — rotate IPs, retry on blocks
    for url in ["https://example.com/a", "https://example.com/b"]:
        try:
            html = fetch(url, cfg, country="us").text
            print(f"OK {url}: {len(html)} bytes")
        except RuntimeError as e:
            print(f"GAVE UP {url}: {e}")
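The loop above fetches one URL at a time. Since the backoff section already assumes concurrent workers, here is one possible way to fan independent URLs out over a small thread pool while keeping failures per URL. The function crawl and its fetch_one parameter are assumptions for this sketch, with a fake fetcher standing in for the network.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def crawl(
    urls: Iterable[str],
    fetch_one: Callable[[str], str],
    *,
    workers: int = 4,
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Run fetch_one over urls in a thread pool; return (successes, failures) keyed by URL."""
    ok: dict[str, str] = {}
    failed: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                ok[url] = fut.result()
            except Exception as exc:  # one bad URL should not abort the crawl
                failed[url] = exc
    return ok, failed


if __name__ == "__main__":
    def fake_fetch(url: str) -> str:
        if url.endswith("/b"):
            raise RuntimeError("gave up")
        return "<html>ok</html>"

    ok, failed = crawl(["https://example.com/a", "https://example.com/b"], fake_fetch)
    print(sorted(ok), sorted(failed))
```

With the pieces from this post, fetch_one could be something like lambda u: fetch(u, cfg, country="us").text. Keep workers modest: more threads means more requests per second against the target, which is exactly what triggers blocks.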

Takeaways

Don't trust status codes alone — check the body for block signals.
Backoff with jitter beats fixed-delay retries and avoids self-inflicted bans.
Rotate IP and User-Agent together; changing one while the other stays fixed is itself a detectable pattern.
Sticky sessions for stateful flows, rotating for independent requests.
Keep the proxy layer abstracted (one url() helper) so switching providers is a one-line change.

The proxy gateway in these snippets is from Proxya's residential proxies, but the patterns are universal — point them at whatever pool you already pay for. The resilience logic is what keeps a scraper alive, not the brand on the IPs.

Questions or a pattern you'd add? Drop a comment.
