Web Crawler

Interview Time: 60-90 min | Difficulty: Hard
Key Focus: URL frontier, politeness policy, distributed crawling


Step 1: Functional & Non-Functional Requirements

Functional Requirements

  • Crawl websites starting from seed URLs
  • Discover and follow links recursively to other pages
  • Extract page content (title, metadata, text)
  • Respect robots.txt and politeness policies
  • Handle sitemap.xml for efficient crawling
  • Deduplicate URLs (avoid revisiting same page)
  • Store crawled content with metadata (timestamp, status code)
  • Support re-crawling of existing pages (incremental updates)
  • Parse and normalize URLs (handle query params, fragments)
  • Handle redirects (301/302/307) correctly
  • Extract structured data from pages (schema.org, microdata)

Non-Functional Requirements

| Requirement | Target | Notes |
|---|---|---|
| Scale | Crawl 1B+ pages/day across a frontier of 10B+ URLs | Distributed across 100s of crawlers |
| Latency | ~2-3 s network + parse time per page | Crawlers pipeline many concurrent fetches; fleet target ~10K pages/sec |
| Freshness | Re-crawl every 7-30 days depending on update frequency | News pages <1 day, evergreen content <30 days |
| Politeness | Max 1 request/sec per domain; respect robots.txt | Avoid DoS-ing target servers |
| Robustness | 99.9% uptime for the crawler cluster | Failures must not lose queued URLs |
| Deduplication | Near-zero duplicate crawls | Bloom filter + distributed cache (probabilistic, so not literally 100%) |
| Compliance | Honor robots.txt; send an identifying User-Agent | Legal: CFAA, GDPR, data retention policies |
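
A quick sanity check of these targets (the 500-machine fleet size is an assumed illustration; the other numbers come from the table above):

SECONDS_PER_DAY = 86_400

pages_per_day = 1_000_000_000                 # Scale row: 1B+ pages/day
fetch_seconds = 2.5                           # Latency row: ~2-3 s per page

fleet_rate = pages_per_day / SECONDS_PER_DAY  # ≈ 11,600 pages/sec fleet-wide
in_flight = fleet_rate * fetch_seconds        # ≈ 29,000 concurrent fetches

# With an assumed 500-machine fleet: ~23 pages/sec per machine,
# i.e. ~60 concurrent connections each at 2.5 s per page.
per_machine_rate = fleet_rate / 500
print(f"{fleet_rate:,.0f} pages/sec fleet-wide, {per_machine_rate:.1f}/sec per machine")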

Step 2: API Design, Data Model & High-Level Design

Core API Endpoints (Crawler Coordinator)

POST /crawler/seeds
  {urls: ["http://example.com", ...]}
  → {job_id, status: QUEUED}

GET /crawler/jobs/{job_id}/status
  → {total_urls: 1B, crawled: 500M, in_progress: 100K, failed: 10M, eta_hours: 24}

POST /crawler/stop
  {job_id}
  → {status: STOPPED}

GET /crawler/stats
  → {pages_per_second, urls_in_queue, cache_hit_ratio, politeness_delays_ms}

POST /storage/get-page
  {url: string}
  → {status_code, content_hash, headers, body_size, crawled_at}

Entity Data Model

CRAWL_JOBS
├─ job_id (PK)
├─ created_at, started_at, completed_at
├─ total_urls_discovered, crawled_count, failed_count
├─ status (QUEUED|RUNNING|COMPLETED|FAILED|PAUSED)

URL_FRONTIER
├─ url (PK)
├─ domain_hash (for sharding)
├─ priority (higher = crawl sooner, based on importance)
├─ crawled_at (NULL if not yet crawled)
├─ next_crawl_time (for re-crawling)
├─ visit_count

CRAWLED_PAGES
├─ url (PK)
├─ content_hash (SHA-256, for deduplication)
├─ status_code (200, 404, 403, etc.)
├─ headers (JSON: {Content-Type, Last-Modified, Expires})
├─ body (compressed: gzip or zstd)
├─ extracted_links (array of URLs found in page)
├─ crawled_at, updated_at
├─ error_message (if failed)

CRAWL_METADATA
├─ url (PK)
├─ robots_txt_rules (cached parsing of /robots.txt)
├─ robots_txt_cached_at
├─ last_modified (from HTTP header)
├─ etag (for conditional requests)
├─ crawl_delay_ms (politeness delay per domain)
├─ disallow_patterns (array of patterns from robots.txt)

URL_DEDUP_CACHE (Bloom filter or similar probabilistic set)
├─ fingerprint (hash of canonical URL)
├─ seen: boolean (probabilistic set membership)
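
The last_modified and etag fields above exist to make re-crawls cheap via HTTP conditional requests. A minimal sketch using the requests library (the User-Agent string and function name are illustrative):

import requests

def fetch_if_changed(url: str, etag: str | None, last_modified: str | None):
    """Conditional GET: send the cached validators; a 304 response means
    the page is unchanged and the body was not re-transferred."""
    headers = {"User-Agent": "ExampleCrawler/1.0"}   # assumed UA string
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None          # unchanged; keep stored copy, bump crawled_at
    return resp              # changed; re-parse and update CRAWL_METADATA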

High-Level Architecture

graph TB
    Seed["Seed URLs"]
    Coordinator["URL Coordinator<br/>(Priority Queue)"]
    CrawlQueue["Frontier Queue<br/>(Redis)"]

    LB["Load Balancer"]
    Crawler1["Crawler 1"]
    Crawler2["Crawler 2"]
    CrawlerN["Crawler N"]

    DomainCache["Domain Cache<br/>(robots.txt)"]
    DedupFilter["Dedup Filter<br/>(Bloom)"]

    HTTPClient["HTTP Client<br/>(timeout · retry)"]
    Parser["HTML Parser<br/>(extract links)"]

    PageDB["Page Storage<br/>(HBase)"]
    PageIndex["URL Index"]
    ContentIndex["Content Index<br/>(Elasticsearch)"]
    Monitoring["Monitoring<br/>(Crawl Rate)"]

    Seed --> Coordinator
    Coordinator --> CrawlQueue
    CrawlQueue --> LB
    LB --> Crawler1
    LB --> Crawler2
    LB --> CrawlerN

    Crawler1 --> DomainCache
    Crawler1 --> DedupFilter
    Crawler1 --> HTTPClient
    HTTPClient --> Parser
    Parser --> PageDB
    Parser --> PageIndex
    Parser --> ContentIndex
    Crawler1 --> Monitoring

Step 3: Concurrency, Consistency & Scalability

🔴 Problem: Duplicate URL Crawling

Scenario: The same URL is discovered by multiple crawlers (or on multiple pages) simultaneously. Without deduplication, we crawl it N times, wasting bandwidth and storage.

Solutions:

| Approach | Implementation | Pros | Cons |
|---|---|---|---|
| Bloom filter | Hash URL → check before crawl | Fast O(1); memory-efficient (~10 bits/URL × 10B URLs ≈ 12 GB) | False positives (~1%) mean some URLs are wrongly treated as seen and skipped |
| Distributed cache | Redis SET of URL hashes, TTL 7 days | Zero false positives; shared across crawlers | Network latency; harder to scale to 10B URLs |
| Canonicalization | Normalize URL (sort params); honor rel=canonical | Catches URL variations | Complex logic, edge cases |
| Content hash | Deduplicate by SHA-256(content), not URL | Catches copy-pasted content across URLs | Late dedup (page already fetched), so bandwidth is still spent |

Recommended: Bloom Filter + Distributed Cache Hybrid

from hashlib import sha256

class WebCrawler:
    def crawl(self, url):
        url = canonicalize_url(url)
        url_hash = sha256(url.encode()).hexdigest()

        # Quick local check: Bloom filter. A false positive (~1%) means
        # we wrongly skip a URL that was never actually crawled.
        if url in self.local_bloom_filter:
            return

        # Distributed check: atomic set-if-not-exists in Redis (7-day TTL).
        # Only the crawler that creates the key wins the race and fetches.
        if redis.set(f"url:{url_hash}", 1, nx=True, ex=86400 * 7):
            self.local_bloom_filter.add(url)
            self.fetch_and_parse(url)
        # else: another crawler already crawled or is crawling this URL

    def fetch_and_parse(self, url):
        # Honor robots.txt before fetching
        if not self.is_robots_allowed(url):
            return
        response = http_get(url, timeout=10)   # pseudocode HTTP helper
        content = parse_html(response)
        self.store_page(content)

        # Feed newly discovered links back into the frontier
        for link in extract_links(content):
            canonical = canonicalize_url(link)
            if not redis.exists(f"url:{sha256(canonical.encode()).hexdigest()}"):
                self.add_to_frontier(canonical)

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivially different forms dedupe to one key."""
    # Remove fragment: example.com/page#section → example.com/page
    url = url.split('#')[0]

    parsed = urlparse(url)

    # Sort query params: ?b=2&a=1 → ?a=1&b=2 (preserves repeated keys)
    query = urlencode(sorted(parse_qsl(parsed.query, keep_blank_values=True)))

    # Lowercase scheme and host only; paths can be case-sensitive
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()

    # Prefer HTTPS (assumes the site serves both)
    if scheme == 'http':
        scheme = 'https'

    return urlunparse((scheme, netloc, parsed.path, parsed.params, query, ''))

Bloom Filter Sizing:

  • URLs to track: 10B
  • Target false-positive rate: 1% → ~9.6 bits/URL
  • Memory: 10B × 9.6 bits ÷ 8 ≈ 12 GB (still fits on a single machine)
  • Cost of a false positive: the URL is treated as already seen and skipped (~1% of lookups)
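
The sizing above follows the standard Bloom filter formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2; a small sketch to reproduce the numbers:

import math

def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n items at the given FP rate."""
    m_bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    k_hashes = (m_bits / n_items) * math.log(2)
    return int(m_bits), round(k_hashes)

m, k = bloom_parameters(10_000_000_000, 0.01)
print(f"{m / 8 / 2**30:.1f} GiB, {k} hash functions")   # ≈ 11.2 GiB, 7 hashes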


🟡 Problem: Respecting Politeness Policies (DoS Prevention)

Scenario: A crawler hammers one server with 10K requests/sec → the target blocks us (429s, IP bans) and we risk legal trouble.

Solution: Per-domain rate limiting + robots.txt parsing

class PolitenessController:
    def should_crawl(self, url: str) -> bool:
        domain = extract_domain(url)

        # robots.txt rules, cached for 7 days
        rules = self.get_robots_txt(domain)

        # Check User-Agent-specific allow/disallow rules
        if not rules.is_allowed_for_useragent(url, USER_AGENT):
            return False

        # Enforce Crawl-Delay (default 1 second)
        crawl_delay_ms = rules.crawl_delay_ms or 1000

        # Last crawl time for this domain (stored as epoch millis)
        last_crawl = redis.get(f"last_crawl:{domain}")
        if last_crawl is None:
            redis.set(f"last_crawl:{domain}", now_ms())
            return True

        elapsed_ms = now_ms() - int(last_crawl)
        if elapsed_ms < crawl_delay_ms:
            # Too soon; requeue with jitter instead of busy-waiting
            requeue_time = int(last_crawl) + crawl_delay_ms + jitter(50)
            priority_queue.push(url, priority=requeue_time)
            return False

        # OK to crawl. Note: this get-then-set is racy across crawlers;
        # see the atomic SET NX PX variant below.
        redis.set(f"last_crawl:{domain}", now_ms())
        return True

    def get_robots_txt(self, domain: str) -> RobotsTxt:
        cached = redis.get(f"robots:{domain}")
        if cached:
            return deserialize_rules(cached)  # Redis returns bytes

        # Fetch and parse robots.txt
        response = http_get(f"https://{domain}/robots.txt", timeout=5)
        rules = parse_robots_txt(response.text)
        redis.set(f"robots:{domain}", serialize_rules(rules), ex=604800)  # 7 days
        return rules

Politeness Thresholds:

  • Standard: 1 request/sec per domain
  • News sites: 5-10 requests/sec
  • CDNs: no crawl-delay needed (responses are cached)
  • Overloaded server: on 503, back off exponentially (1s → 2s → 4s → 8s)
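
The get-then-set timestamp check above is racy when several crawlers hit the same domain at once. A minimal sketch of an atomic alternative, assuming redis-py (the politeness: key name is illustrative):

import redis

r = redis.Redis()   # assumed shared cluster endpoint

def try_acquire_domain_slot(domain: str, crawl_delay_ms: int) -> bool:
    """SET NX PX creates a key that expires after the crawl delay.
    Exactly one crawler can create it per window, so whoever gets
    True may fetch now; everyone else requeues the URL."""
    return bool(r.set(f"politeness:{domain}", 1, nx=True, px=crawl_delay_ms))

# Usage: on a 503 from the target, widen the delay exponentially
# (1s → 2s → 4s → 8s), matching the backoff listed above.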


💾 Problem: Efficient Re-Crawling & Change Detection

Scenario: Need to update stale content. Which pages to re-crawl first?

Solution: Priority queue based on update frequency + age

import math
import time

class CrawlPriority:
    def calculate_priority(self, url: str) -> float:
        """
        Higher = more urgent = crawl sooner (matches the frontier schema).
        Combines staleness with PageRank-like importance.
        """
        metadata = self.get_metadata(url)

        # Age in seconds since the last crawl
        age_sec = time.time() - metadata.crawled_at

        # Estimated change frequency (changes per second)
        update_freq = self.estimate_update_frequency(url)

        # Expected seconds between changes; guard against "never changes"
        decay_period_sec = 1 / max(update_freq, 1 / (365 * 86400))
        staleness_score = age_sec / decay_period_sec

        # PageRank-like importance (pseudocode helpers)
        importance = self.inbound_link_count(url) * self.quality_score(url)

        return staleness_score + math.log(importance + 1)

    def estimate_update_frequency(self, url: str) -> float:
        """Returns estimated changes per second from recent crawl history."""
        last_5_crawls = self.get_last_n_crawls(url, 5)
        if len(last_5_crawls) < 2:
            return 1 / (7 * 86400)  # Default: assume weekly changes

        change_intervals = []
        for prev, curr in zip(last_5_crawls, last_5_crawls[1:]):
            if prev.content_hash != curr.content_hash:
                change_intervals.append(curr.crawled_at - prev.crawled_at)

        if not change_intervals:
            return 1 / (365 * 86400)  # Rarely changes: treat as yearly

        return 1 / (sum(change_intervals) / len(change_intervals))
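
Since the frontier is a Redis sorted set (see the tech stack below), the computed priority maps directly onto a score. A minimal sketch, assuming redis-py and a hypothetical frontier:zset key; scores are negated so ZPOPMIN yields the most urgent URL first:

import redis

r = redis.Redis()                 # assumed shared cluster endpoint
FRONTIER = "frontier:zset"        # hypothetical key name

def enqueue(url: str, priority: float) -> None:
    # Higher priority = crawl sooner; negate so ZPOPMIN pops urgent first
    r.zadd(FRONTIER, {url: -priority})

def dequeue() -> str | None:
    popped = r.zpopmin(FRONTIER, count=1)   # [(member, score)] or []
    return popped[0][0].decode() if popped else None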

Step 4: Persistence Layer, Caching & Monitoring

Database Design (Cassandra-style CQL; HBase analogous)

-- Crawled pages storage (write-heavy, time-versioned)
CREATE TABLE crawled_pages (
    url TEXT PRIMARY KEY,
    status_code INT,
    content BLOB,                      -- compressed at the table level
    headers MAP<TEXT, TEXT>,
    content_hash BLOB,
    extracted_links SET<TEXT>,
    crawled_at TIMESTAMP,
    updated_at TIMESTAMP
) WITH compression = {'class': 'ZstdCompressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'};

-- Partition by domain; cluster by next_crawl_time for time-range scans
CREATE TABLE url_frontier (
    domain TEXT,
    next_crawl_time TIMESTAMP,
    url TEXT,
    priority FLOAT,
    crawl_count INT,
    PRIMARY KEY (domain, next_crawl_time, url)
);
-- The clustering key already supports "next URLs due for this domain"
-- queries, so no secondary index is needed.

-- Link graph: outbound edges from_url → to_url (with anchor text)
CREATE TABLE link_graph (
    from_url TEXT,
    to_url TEXT,
    anchor_text TEXT,
    PRIMARY KEY (from_url, to_url)
);

Caching Strategy

TIER 1: In-Memory Bloom Filter (each crawler)
├─ Local Bloom filter of 100M URLs
├─ False positive rate: 1%
└─ Refreshed every 1 hour

TIER 2: Distributed Cache (Redis Cluster)
├─ url_set:{hash} → 1 (TTL: 7 days)
├─ domain:robots_txt:{domain} → rules (TTL: 7 days)
├─ domain:last_crawl:{domain} → timestamp
└─ url_metadata:{url} → {status, crawled_at}

Invalidation:
- robots.txt changes → invalidate domain:robots_txt
- Page re-crawled → update url_metadata
- Content changed → signal other crawlers

Monitoring & Alerts

Key Metrics:

  1. Crawl Rate — Pages/sec (target: 10K/sec)
  2. Average Age — Days since last crawl (target: <14 days)
  3. 404 Rate — Broken links %
  4. 5xx Errors — Server failures (target: <1%)
  5. Robots Blocks — robots.txt denials (should be <5%)
Prometheus alert rules (sketch):

- alert: CrawlRateLow
  expr: rate(crawled_pages_total[5m]) < 10000
  annotations: "Crawl rate below {{$value}} pages/sec"

- alert: FreshnessDegraded
  expr: avg(page_age_seconds) > 30 * 86400
  annotations: "Average page age >30 days"

- alert: HighRobotsBlockRate
  expr: rate(robots_blocked_total[5m]) / rate(requests_total[5m]) > 0.10
  annotations: "Robots block rate: {{$value | humanizePercentage}}"

- alert: TimeoutRate
  expr: rate(http_timeout_total[5m]) / rate(http_requests_total[5m]) > 0.01
  annotations: "HTTP timeout rate >1%"

- alert: DuplicateUrlsHigh
  expr: rate(duplicate_urls_skipped_total[5m]) / rate(discovered_total[5m]) > 0.05
  annotations: "Duplicate rate: {{$value | humanizePercentage}}"

⚡ Quick Reference Cheat Sheet

When to Use What

| Need | Technology | Why |
|---|---|---|
| Deduplication | Bloom filter + Redis SET | O(1); memory-efficient for 10B+ URLs |
| Rate limiting | Token bucket + Redis | Atomic, distributed per-domain tracking |
| Large storage | HBase/Cassandra | Write-heavy, high throughput, built-in compression |
| Priority queue | Redis sorted set | Time-based scheduling, staleness-aware ordering |
| HTML parsing | Beautiful Soup (Python) | Extracts links, text, metadata reliably |
| URL normalization | Custom parser + regex | Handles params, fragments, redirects |

Critical Design Decisions

  • Bloom filter pre-check → Redis atomic claim: fast local skips, with Redis SETNX as the cross-crawler source of truth
  • Per-domain rate limit: Respect robots.txt, avoid legal issues + IP bans
  • Content hashing: Detect duplicate content even if URLs differ
  • Staleness-based priority: Crawl changing content frequently (news hourly, static yearly)
  • Compression (zstd): reduce HTML storage ~10x (≈90%), making 10B pages practical to store
  • Distributed crawlers: Scale horizontally, no single bottleneck

Tech Stack Summary

Frontier Queue: Redis Sorted Set
Deduplication: Bloom Filter + Redis SET
Rate Limiting: Token bucket (Redis atomic ops)
Storage: HBase/Cassandra
Indexing: Elasticsearch (inverted links)
Parsing: Beautiful Soup (Python)
Monitoring: Prometheus + Grafana

🎯 Interview Summary (5 Minutes)

  1. Deduplication: Use a Bloom filter (O(1), ~10 bits/URL at 1% FPR) + distributed Redis SET (atomic check)
  2. Politeness: Per-domain rate limiter (1 req/sec) + parse robots.txt to avoid IP bans
  3. Frontier priority: Rank URLs by staleness (age) + importance (PageRank) to crawl relevant/fresh first
  4. Large-scale storage: HBase/Cassandra for 10B pages; compress HTML with ZSTD (90% reduction)
  5. Change detection: Track content hash + estimated update frequency to intelligently re-crawl
  6. Distributed crawling: Multiple workers respect robots.txt delay; Redis coordinates dedup + rate limiting
  7. Canonical URLs: Sort params, remove fragments/sessions to identify true duplicates

Glossary & Abbreviations