Building a Resilient Web Scraper in Python: Rotating Proxies, Retries, and Backoff

Anyone who has scraped more than a few hundred pages knows the real challenge isn't parsing HTML — it's staying connected. Targets rate-limit you, ban IPs, throw 429s and 503s, and silently serve CAPTCHA pages with a 200 status. A scraper that ignores this works in a demo and dies in production.
This is a practical pattern for a scraper that survives: rotating proxies, sticky sessions when you need them, exponential backoff with jitter, and ban detection that goes beyond status codes. All in plain requests — no heavy framework.
Disclosure: I work on Proxya, a proxy provider, and use it in the examples. The code is provider-agnostic: any proxy that accepts the standard user:pass@host:port format works the same way (Bright Data, Oxylabs, IPRoyal, etc.). Swap the gateway, adjust the username suffixes to your provider's syntax, and you're done.
1. Why a single IP fails
Make 50 fast requests from one IP and most protected sites will start returning 429 (Too Many Requests) or quietly degrade your responses. The fix is to spread requests across many IPs so no single address looks abusive. That's what a rotating residential pool does — each request can exit from a different real IP.
There are two modes you'll actually use:
Rotating — a new IP per request. Best for broad crawls where requests are independent.
Sticky session — the same IP for a sequence of requests. Required when a site ties a session/cart/login to an IP.
Most providers expose both by tweaking the proxy username. Here's a small helper that builds the right credential string:
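    from dataclasses import dataclass
    from urllib.parse import quote

    GATEWAY = "gw.proxya.co:8000"


    @dataclass
    class ProxyConfig:
        username: str
        password: str

        def url(self, *, session: str | None = None, country: str | None = None) -> str:
            """Build a proxy URL; username suffixes select geo-targeting and stickiness."""
            user = self.username
            if country:  # geo-target, e.g. "us", "de", "gb"
                user += f"-country-{country}"
            if session:  # reuse the same IP across requests
                user += f"-session-{session}"
            # Percent-encode credentials so characters like "@" or ":" don't break the URL
            return f"http://{quote(user, safe='')}:{quote(self.password, safe='')}@{GATEWAY}"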
Rotating = call url() with no session. Sticky = pass a stable session id.
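To make the two modes concrete, here is a small sketch of the proxy URLs the helper builds. It repeats the helper so it runs on its own. The credentials are placeholders, and the `-country-` / `-session-` suffixes are the ones used in this article; other providers spell them differently, so treat the exact syntax as an assumption to check against your provider's docs.

```python
from __future__ import annotations

from dataclasses import dataclass

GATEWAY = "gw.proxya.co:8000"  # gateway used in this article; swap in your provider's


@dataclass
class ProxyConfig:
    username: str
    password: str

    def url(self, *, session: str | None = None, country: str | None = None) -> str:
        user = self.username
        if country:
            user += f"-country-{country}"
        if session:
            user += f"-session-{session}"
        return f"http://{user}:{self.password}@{GATEWAY}"


cfg = ProxyConfig(username="USERNAME", password="PASSWORD")

rotating = cfg.url(country="us")                  # no session: new IP per request
sticky = cfg.url(country="us", session="cart42")  # stable session id: same IP

print(rotating)  # http://USERNAME-country-us:PASSWORD@gw.proxya.co:8000
print(sticky)    # http://USERNAME-country-us-session-cart42:PASSWORD@gw.proxya.co:8000
```

The only difference between the two URLs is the session suffix, which is why switching modes never touches the request code.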
2. Detecting a ban (status codes lie)
A very common scraping bug is trusting response.status_code alone. Plenty of anti-bot systems return 200 OK with a CAPTCHA or "access denied" body, so define what "blocked" means for your target:
    BLOCK_SIGNALS = ("captcha", "access denied", "unusual traffic", "are you a robot")


    def looks_blocked(response) -> bool:
        # Hard block: the status code already says so
        if response.status_code in (403, 429, 503):
            return True
        # Soft block: a 200 whose body is really a CAPTCHA or denial page
        body = response.text[:2000].lower()
        return any(sig in body for sig in BLOCK_SIGNALS)
Tune BLOCK_SIGNALS to the strings the site actually serves on a block — inspect a real blocked response once and you'll see them.
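A quick way to sanity-check your signals is to run the detector against canned responses before pointing it at a live site. The sketch below repeats `looks_blocked` so it runs standalone and uses `SimpleNamespace` objects as stand-ins for `requests.Response`; the `fake` helper and the sample bodies are made up for illustration.

```python
from types import SimpleNamespace

BLOCK_SIGNALS = ("captcha", "access denied", "unusual traffic", "are you a robot")


def looks_blocked(response) -> bool:
    if response.status_code in (403, 429, 503):
        return True
    body = response.text[:2000].lower()
    return any(sig in body for sig in BLOCK_SIGNALS)


def fake(status: int, text: str) -> SimpleNamespace:
    """Hypothetical stand-in exposing only the two attributes the detector reads."""
    return SimpleNamespace(status_code=status, text=text)


samples = {
    "rate limited": fake(429, ""),
    "soft block": fake(200, "<html><title>Please complete the CAPTCHA</title></html>"),
    "normal page": fake(200, "<html><h1>Product list</h1></html>"),
}

verdicts = {name: looks_blocked(resp) for name, resp in samples.items()}
for name, blocked in verdicts.items():
    print(f"{name}: blocked={blocked}")
```

Plain substring matching can also flag legitimate pages that happen to mention a CAPTCHA (a login form embedding a widget, say), so keep the signals as specific to the block page as you can.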
3. Retries with exponential backoff + jitter
When you hit a block, don't hammer the target. Back off exponentially, and add jitter so concurrent workers don't retry in lockstep (the "thundering herd" problem). Each retry also rotates to a fresh IP:
    import random
    import time

    import requests


    def fetch(url: str, cfg: ProxyConfig, *, max_retries: int = 4, country: str | None = None) -> requests.Response:
        last_exc = None
        for attempt in range(max_retries):
            proxy = cfg.url(country=country)  # fresh IP each attempt
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=30,
                    headers={"User-Agent": _random_ua()},
                )
                if not looks_blocked(resp):
                    return resp
            except requests.RequestException as exc:
                last_exc = exc  # network error, timeout, proxy hiccup
            if attempt < max_retries - 1:  # no point sleeping after the final attempt
                # exponential backoff: 1s, 2s, 4s ... + up to 1s of jitter
                time.sleep((2 ** attempt) + random.random())
        raise RuntimeError(f"Failed after {max_retries} attempts: {url}") from last_exc
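The delay in `fetch` keeps doubling for as long as `max_retries` allows. One common variation, sketched here and not part of the code above, caps the delay and draws the whole sleep from a random range (often called "full jitter"). The name `backoff_delay` and the `base` / `cap` defaults are illustrative assumptions.

```python
import random


def backoff_delay(attempt: int, *, base: float = 1.0, cap: float = 30.0) -> float:
    """Return a sleep time in seconds for a zero-based retry attempt."""
    # The ceiling doubles each attempt but never exceeds the cap
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick anywhere between 0 and the ceiling
    return random.uniform(0, ceiling)


for attempt in range(6):
    ceiling = min(30.0, 1.0 * 2 ** attempt)
    print(f"attempt {attempt}: up to {ceiling:.0f}s, drew {backoff_delay(attempt):.2f}s")
```

The cap matters once you raise the retry count: without it, the tenth retry would wait around 17 minutes.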
4. Rotate User-Agents too
Rotating IPs while sending an identical User-Agent on every request is a giveaway. Pair IP rotation with header rotation:
    # Chrome reports its version as MAJOR.0.0.0, so the strings below follow that shape
    _UAS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    ]


    def _random_ua() -> str:
        return random.choice(_UAS)
In 2026, fingerprinting goes deeper than the UA string (TLS/JA3, header order, browser behavior). For JS-heavy or heavily protected targets you'll graduate to a real browser via Playwright — but for most JSON/HTML endpoints, rotating IP + UA + sane pacing gets you a long way.
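Short of a real browser, one cheap step is to rotate whole header sets instead of just the UA string, so the accompanying headers stay plausible for the browser being claimed. The profiles below are illustrative assumptions, not captures from real browsers; copy the actual values from your own browser's dev tools. `random_headers` is a hypothetical helper name.

```python
import random

# Each profile keeps the User-Agent and its companion headers together
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                      "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
    },
]


def random_headers() -> dict[str, str]:
    """Return a copy of one profile so callers can modify it safely."""
    return dict(random.choice(HEADER_PROFILES))


headers = random_headers()
print(headers["User-Agent"])
```

Passing `headers=random_headers()` to `requests.get` replaces the single `User-Agent` entry used earlier.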
5. A sticky-session example
Some flows break if your IP changes mid-sequence — add to cart, paginate behind a session cookie, anything stateful. Pin one IP for the whole sequence by reusing a session id:
    def scrape_paginated(base_url: str, cfg: ProxyConfig, pages: int = 5):
        session_id = f"job-{random.randint(1000, 9999)}"  # one IP for this run
        proxy = cfg.url(session=session_id)
        s = requests.Session()
        s.proxies.update({"http": proxy, "https": proxy})
        # One User-Agent for the whole session: a UA that changes on a fixed IP is itself a tell
        s.headers["User-Agent"] = _random_ua()
        results = []
        for page in range(1, pages + 1):
            resp = s.get(f"{base_url}?page={page}", timeout=30)
            if looks_blocked(resp):
                break  # stop here and return the pages collected so far
            results.append(resp.text)
            time.sleep(1 + random.random())  # be polite
        return results
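`random.randint(1000, 9999)` yields only 9,000 possible ids, so two jobs running at once can draw the same one and end up sharing an IP. A sketch of a longer random id follows; `new_session_id` is a hypothetical helper, and whether your provider accepts ids of this length and character set is an assumption to verify.

```python
import uuid


def new_session_id(prefix: str = "job") -> str:
    """Return an id like 'job-3f9c1a2b7d4e' that is very unlikely to repeat."""
    # 12 hex characters from a random UUID give 16**12 possible values
    return f"{prefix}-{uuid.uuid4().hex[:12]}"


first = new_session_id()
second = new_session_id()
print(first, second)
```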
6. Putting it together
    if __name__ == "__main__":
        cfg = ProxyConfig(username="USERNAME", password="PASSWORD")
        # Independent crawl: rotate IPs, retry on blocks
        for url in ["https://example.com/a", "https://example.com/b"]:
            try:
                html = fetch(url, cfg, country="us").text
                print(f"OK {url}: {len(html)} chars")  # .text is a str, so this counts characters
            except RuntimeError as e:
                print(f"GAVE UP {url}: {e}")
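For more than a handful of URLs you would likely run the fetches concurrently, which is where the jitter earns its keep. Here is a minimal sketch with a thread pool. `crawl` is a hypothetical helper that accepts any fetch callable (for example `lambda u: fetch(u, cfg, country="us").text`), and `stub_fetch` only stands in for the network so the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def crawl(urls: list[str], fetch_one: Callable[[str], str], *, max_workers: int = 4):
    """Fetch URLs concurrently; return (results, failures), both keyed by URL."""
    results: dict[str, str] = {}
    failures: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:  # one bad URL should not stop the crawl
                failures[url] = exc
    return results, failures


def stub_fetch(url: str) -> str:
    """Stand-in for the real fetch: fails for one URL, succeeds otherwise."""
    if url.endswith("/bad"):
        raise RuntimeError(f"Failed after 4 attempts: {url}")
    return f"<html>{url}</html>"


ok, failed = crawl(
    ["https://example.com/a", "https://example.com/b", "https://example.com/bad"],
    stub_fetch,
)
print("fetched:", sorted(ok))
print("gave up:", sorted(failed))
```

Keep `max_workers` modest: concurrency multiplies your request rate, and the pacing advice above still applies per target.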
Takeaways
Don't trust status codes alone — check the body for block signals.
Backoff with jitter beats fixed-delay retries and avoids self-inflicted bans.
Rotate IP and User-Agent together; rotating only one leaves a detectable pattern.
Sticky sessions for stateful flows, rotating for independent requests.
Keep the proxy layer abstracted (one url() helper) so switching providers is a one-line change.
The proxy gateway in these snippets is from Proxya's residential proxies, but the patterns are universal — point them at whatever pool you already pay for. The resilience logic is what keeps a scraper alive, not the brand on the IPs.
Questions or a pattern you'd add? Drop a comment.
