Design a Flash Sale System

Interview Time: 60 min | Difficulty: Medium
Key Focus: High concurrency inventory management, fairness, surge handling


Step 1: Functional & Non-Functional Requirements

Functional Requirements

  • Merchants create flash sales (limited time, limited quantity, steep discount)
  • Timer countdown visible to users
  • Users add items to cart at sale price
  • Checkout with order guarantee (reserved quantity)
  • Real-time inventory updates across all users
  • Fair allocation (lottery if more users than stock)
  • Queue system for surge traffic (users wait fairly)
  • Order confirmation and cancellation
  • Admin dashboard (sales metrics, inventory, user queue)
  • Prevent common exploits (bot traffic, scalping)

Non-Functional Requirements

| Requirement  | Target                                | Notes                        |
|--------------|---------------------------------------|------------------------------|
| Scale        | 1M concurrent users on a single sale  | 100K purchases/sec           |
| Latency      | Add to cart <100ms, checkout <500ms   | User experience critical     |
| Availability | 99.9% during sale                     | High visibility, high stakes |
| Consistency  | Strong for inventory                  | No overselling               |
| Fairness     | Equal access, no bots                 | Lottery if oversold          |
| Queue        | Manage 10M waiting users              | Fair ordering                |

Step 2: API Design, Data Model & High-Level Design

Core API Endpoints

# Sales
GET /flash-sales/{sale_id}
  → {sale_id, product_name, stock, remaining_time, sale_price, original_price}

# Cart
POST /carts/{user_id}/items
  {sale_id, quantity_requested}
  → {quantity_reserved, expires_at: (now + 5 min)}

GET /carts/{user_id}
  → {items: [{sale_id, quantity}], total, expires_in_seconds}

# Checkout
POST /orders
  {user_id, cart_items}
  → {order_id, status: success|failed (out of stock)}

# Queue (surge traffic)
GET /flash-sales/{sale_id}/queue-position?user_id={user_id}
  → {position, wait_time_seconds, estimated_start_time}

POST /flash-sales/{sale_id}/join-queue
  {user_id, device_id}  -- device prevents multi-account fraud
  → {queue_token, position}

# Admin
GET /admin/sales/{sale_id}/metrics
  → {stock_remaining, orders_count, queue_length, conversion_rate}

Entity Data Model

FLASH_SALES
├─ sale_id (ULID, PK)
├─ product_id (FK)
├─ product_name, description
├─ original_price (DECIMAL)
├─ sale_price (DECIMAL, lower)
├─ start_time (TIMESTAMP)
├─ end_time (TIMESTAMP)
├─ total_stock (int, immutable)
├─ stock_remaining (int, decrements atomically)
├─ status (scheduled, live, ended)
├─ created_at

FLASH_SALE_ORDERS
├─ order_id (ULID, PK)
├─ sale_id (FK)
├─ user_id (FK)
├─ quantity_requested, quantity_allocated
├─ status (pending, confirmed, cancelled)
├─ created_at
├─ expires_at (5 min from creation, auto-cancel if unpaid)

CARTS (temporary, session-like)
├─ user_id (PK)
├─ sale_id (PK + composite)
├─ quantity_reserved
├─ expires_at

QUEUE_ENTRIES
├─ queue_id (ULID, PK, sortable by time)
├─ sale_id (FK)
├─ user_id (FK)
├─ device_id (FK, fraud detection)
├─ joined_at (TIMESTAMP, determines order)
├─ status (waiting, active, served, expired)
├─ expires_at (auto-kick idle users after 1 hour)

High-Level Architecture

graph TB
    USER1["👤 User 1"] 
    USER2["👤 User 2"]
    USERM["👤 User M<br/>(millions)"]

    LB["Load Balancer<br/>(distribute across<br/>regions)"]

    QUEUE_SVC["Queue Service<br/>(join queue,<br/>get position)"]

    INVENTORY["Inventory Service<br/>(atomic stock<br/>decrement)"]

    CART["Cart Service<br/>(hold reservation<br/>5 min)"]

    CHECKOUT["Checkout Service<br/>(finalize order)"]

    REDIS["Redis<br/>(queue, inventory,<br/>reservations)"]

    DB["PostgreSQL<br/>(orders, audit log)"]

    FRAUD["Anti-Fraud<br/>(device ID,<br/>buy limits)"]

    USER1 --> LB
    USER2 --> LB
    USERM --> LB

    LB --> QUEUE_SVC
    QUEUE_SVC --> REDIS

    LB --> CART
    CART --> REDIS
    CART --> INVENTORY

    LB --> CHECKOUT
    CHECKOUT --> INVENTORY
    CHECKOUT --> DB

    CHECKOUT --> FRAUD

    INVENTORY --> REDIS

Step 3: Concurrency, Consistency & Scalability

🔴 Problem: Inventory Oversell During Surge

Scenario: 100K users add item to cart, all see "5 in stock". Everyone checks out. 100K orders created but only 5 items available!

Solution: Atomic Reservation with Rollback

Cart Reservation (atomic):

1. User requests: Add 1 to cart
   POST /add-to-cart {sale_id, quantity: 1}

2. Atomic inventory decrement:
   Lua script (atomic in Redis):

   remaining = GET stock:{sale_id}

   IF remaining >= quantity_requested:
     remaining -= quantity_requested
     SET stock:{sale_id} remaining

     -- Create reserved slot
     SET reserved:{user_id}:{sale_id} {
       quantity: 1,
       reserved_at: now,
       expires_at: now + 5_min
     }
     RETURN {success: true, remaining: remaining}
   ELSE:
     RETURN {success: false, remaining: remaining}

Result:
  - "5 in stock" means exactly 5 can reserve it
  - 6th user gets rejected immediately
  - No overselling in checkout phase

3. Cleanup (expiry):
   Background job every 1 min:
   FOR each expired reserved:{user_id}:{sale_id}:
     quantity = GET reserved:{user_id}:{sale_id}.quantity
     IF expired:
       DELETE reserved entry
        INCRBY stock:{sale_id} quantity  -- return to pool

Benefit:
  - Immediate feedback (user knows right away whether they can buy)
  - Accurate inventory (no double-selling)
  - Self-healing (abandoned carts auto-return stock)
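In production the check-decrement-reserve sequence runs as a single Redis Lua script; as a rough illustration, here is a single-process Python sketch of the same logic. The in-memory dicts stand in for Redis keys, and the names `reserve` and `expire_reservations` are illustrative, not from any library:

```python
import time

# In-memory stand-ins for the Redis keys described above. In Redis, the
# whole reserve step is one Lua script, which makes it atomic.
stock = {}         # "stock:{sale_id}"             -> remaining count
reservations = {}  # (user_id, sale_id)            -> {quantity, expires_at}

RESERVATION_TTL = 5 * 60  # 5-minute hold, as in the design above

def reserve(user_id: str, sale_id: str, quantity: int, now: float) -> dict:
    """Atomically check stock, decrement, and create a timed hold."""
    remaining = stock.get(sale_id, 0)
    if remaining < quantity:
        return {"success": False, "remaining": remaining}
    stock[sale_id] = remaining - quantity
    reservations[(user_id, sale_id)] = {
        "quantity": quantity,
        "expires_at": now + RESERVATION_TTL,
    }
    return {"success": True, "remaining": stock[sale_id]}

def expire_reservations(now: float) -> None:
    """Background cleanup: return abandoned holds to the pool."""
    for key, r in list(reservations.items()):
        if r["expires_at"] <= now:
            _, sale_id = key
            stock[sale_id] = stock.get(sale_id, 0) + r["quantity"]
            del reservations[key]

# With 5 in stock, exactly 5 users can reserve; the 6th is rejected.
stock["sale1"] = 5
t0 = time.time()
results = [reserve(f"u{i}", "sale1", 1, t0) for i in range(6)]
# After the holds expire, cleanup returns the units to the pool.
expire_reservations(t0 + RESERVATION_TTL + 1)
```

Note that the self-healing property falls out of the expiry sweep: abandoned carts need no explicit cancellation to release their stock.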

🟡 Problem: Queue Fairness Under Surge Traffic

Scenario: 10M users want to buy 100 items. How do you fairly allocate? Lottery fairness vs first-come fairness?

Solution: Fair Queuing + Lottery

Queue Architecture:

1. Join Queue (timestamp-based)
   User arrives at sale page:
   POST /queue {user_id, device_id}

   Redis ZADD queue:{sale_id} 
   <current_timestamp> 
   <user_id>
   --> Sorted set, ordered by join time

2. Batched Serve (every 100ms)
   Background job:
   ZRANGE queue:{sale_id} 0 999  -- next 1000 in line

   For these 1000:
     TRY to reserve from inventory
     Success: Move to "active" state, show product page
     Fail: Keep in queue, next batch

   ZREM queue:{sale_id} <served_users>

3. Fair Distribution (if oversold at end)
   At sale end:
   remaining_stock = 10
   waiting_users = 2000

   Allocate using weighted lottery:
   eligible = users still in queue (not expired)

   FOR i in range(remaining_stock):
     winner = random_pick from eligible
     allocated[winner] = 1
     eligible.remove(winner)

   Others get notification: "Not allocated, sold out"

Benefit:
  - First-come advantage (earlier = earlier served)
  - Fair lottery at end (no sniping)
  - Queue always moving (updates every 100ms)
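The batched-serve loop and the end-of-sale lottery can be sketched as follows; this is a simplified single-process model where a Python list ordered by join time stands in for the Redis sorted set (`serve_batch` and `end_of_sale_lottery` are illustrative names):

```python
import random

def serve_batch(queue, stock_remaining, batch_size=1000):
    """Serve the next batch in join order (stand-in for ZRANGE 0 N-1).
    Users who get stock leave the queue; the rest wait for the next pass."""
    served, still_waiting = [], []
    for user_id in queue[:batch_size]:
        if stock_remaining > 0:
            stock_remaining -= 1
            served.append(user_id)
        else:
            still_waiting.append(user_id)
    return served, still_waiting + queue[batch_size:], stock_remaining

def end_of_sale_lottery(eligible, remaining_stock, seed=None):
    """Randomly allocate leftover units among users still in the queue."""
    rng = random.Random(seed)
    winners = rng.sample(eligible, min(remaining_stock, len(eligible)))
    return {user_id: 1 for user_id in winners}

# 20 users joined (list already ordered by join timestamp); 5 units in stock.
queue = [f"u{i}" for i in range(20)]
served, queue, stock = serve_batch(queue, stock_remaining=5, batch_size=10)
# Two units returned by expired reservations go to a lottery among waiters.
lottery = end_of_sale_lottery(queue, remaining_stock=2, seed=7)
```

The first five users in line get stock (first-come advantage); leftover units at sale end are allocated randomly among everyone still waiting, so late arrivals cannot snipe them.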

🔷 Problem: Cache Stampede on Same Product

Scenario: Inventory falls to 5 items. Cache expires. All 1M concurrent requests hit DB simultaneously. DB crashes!

Solution: Probabilistic Early Expire + Lock

Cache Stampede Prevention:

1. Probabilistic refresh (before expiry):
   Key: "stock:{sale_id}"
   Value: {count: 50, fetched_at: now}
   TTL: 5 mins

   Get stock:
   value = REDIS.GET stock:{sale_id}
   age_seconds = now - value.fetched_at

   TTL_remaining = 5 min - age_seconds

   IF age_seconds > 3 min:  -- 60% through TTL
     probability = (age_seconds - 3_min) / (5_min - 3_min)
                 = (age_min - 3) / 2

     IF random() < probability:
       -- Refresh early (avoid thundering herd)
       new_value = DB.query()  -- single DB hit
       REDIS.SET stock:{sale_id} new_value EX 5_min

   RETURN value

2. Locking (prevent concurrent refreshes):
   LOCK = REDIS.SET refresh:{sale_id} 
          "locked" 
          NX  -- only if not exists
          EX 10  -- auto-release

   Only 1 thread gets the lock, refreshes DB
   Others see cached value (slightly stale OK)

Benefit:
  - Smooth refresh curve (not cliff expiry)
  - Single DB refresh per key
  - No thundering herd
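Both pieces of the stampede defense fit in a few lines of Python. This is a sketch under simplifying assumptions: a plain dict stands in for Redis, and `should_refresh_early` / `try_lock` are illustrative names (`try_lock` mimics `SET key NX EX ttl`):

```python
import random

TTL = 300            # cache TTL: 5 minutes
REFRESH_AFTER = 180  # start early refresh 60% of the way through the TTL

def should_refresh_early(fetched_at, now, rng=random.random):
    """Refresh probability ramps linearly from 0 (at 3 min) to 1 (at 5 min),
    so some request refreshes the key before the hard expiry cliff."""
    age = now - fetched_at
    if age <= REFRESH_AFTER:
        return False
    probability = (age - REFRESH_AFTER) / (TTL - REFRESH_AFTER)
    return rng() < probability

def try_lock(locks, key, now, ttl=10):
    """Stand-in for SET key "locked" NX EX ttl: only one caller wins
    the refresh; the lock auto-releases after ttl seconds."""
    expiry = locks.get(key)
    if expiry is not None and expiry > now:
        return False
    locks[key] = now + ttl
    return True
```

Requests that lose either the probability roll or the lock simply serve the slightly stale cached value, which is acceptable for a display counter.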

Step 4: Persistence Layer, Caching & Monitoring

Database Design

CREATE TABLE flash_sales (
  sale_id BIGSERIAL PRIMARY KEY,
  product_id BIGINT NOT NULL,
  product_name VARCHAR(255),
  original_price DECIMAL(10,2),
  sale_price DECIMAL(10,2),
  start_time TIMESTAMP NOT NULL,
  end_time TIMESTAMP NOT NULL,
  total_stock INT NOT NULL,
  status VARCHAR(50),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_sales_time 
  ON flash_sales(start_time DESC);

CREATE TABLE flash_sale_orders (
  order_id BIGSERIAL PRIMARY KEY,
  sale_id BIGINT NOT NULL REFERENCES flash_sales(sale_id),
  user_id BIGINT NOT NULL,
  quantity_requested INT,
  quantity_allocated INT,
  status VARCHAR(50),
  created_at TIMESTAMP DEFAULT NOW(),
  expires_at TIMESTAMP  -- order holds for 5 min
);

CREATE INDEX idx_orders_sale_time 
  ON flash_sale_orders(sale_id, created_at DESC);
CREATE INDEX idx_orders_user 
  ON flash_sale_orders(user_id);

Caching Strategy

Redis:

1. Stock (inventory counter)
   Key: "stock:{sale_id}"
   Value: {count: 100, fetched_at: now}
   TTL: 5 min (with probabilistic early refresh)
   Purpose: Avoid DB queries, atomic decrements

2. Queue (fair ordering)
   Key: "queue:{sale_id}"
   Value: ZSET {user_id: timestamp, ...}
   TTL: sale duration + 1 hour
   Purpose: Track join order, fair allocation

3. Reservations (cart holds)
   Key: "reserved:{user_id}:{sale_id}"
   Value: {quantity, reserved_at, expires_at}
   TTL: 5 minutes (auto-cleanup)
   Purpose: Hold inventory while user shops

4. Active users (surge metrics)
   Key: "active_users:{sale_id}"
   Value: Set of user_ids currently browsing
   TTL: 1 minute
   Purpose: Real-time metrics for admins

Monitoring

- alert: StockHistoryDivergence
  expr: cache_stock != db_stock
  annotations: "Inventory mismatch → data corruption or race condition!"

- alert: OversellDetected
  expr: allocated_quantity > total_stock
  annotations: "Allocated more than total stock → critical bug!"

- alert: QueueBacklogTooLarge
  expr: queue_length > 10_000_000
  annotations: "Queue too large (10M+) → consider splitting the sale"

- alert: CheckoutLatencyHigh
  expr: checkout_latency_p95 > 5000
  annotations: "Checkout latency > 5s → infrastructure bottleneck"

- alert: ReservationExpiredRate
  expr: expired_reservations / total_reservations > 0.3
  annotations: "30%+ reservation expiry → users abandoning carts"

- alert: BotTrafficDetected
  expr: same_device_multiple_orders > 100
  annotations: "Bot detected → same device with many orders"

Key Metrics:
  • Inventory accuracy (allocated + available = total)
  • Queue move rate (positions served per second)
  • Checkout conversion (orders completed / reservations)
  • Fairness Gini coefficient (measures allocation inequality)
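The fairness Gini coefficient mentioned above can be computed directly from per-user allocation counts. A minimal sketch (the `gini` helper is illustrative, not part of any library):

```python
def gini(allocations):
    """Gini coefficient of per-user allocations: 0.0 means perfectly
    equal; values near 1.0 mean a few users (bots/scalpers) got
    nearly everything."""
    values = sorted(allocations)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over the sorted values.
    cum = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * cum) / (n * total) - (n + 1) / n
```

For example, `gini([1, 1, 1, 1])` is 0.0 (everyone got one unit), while `gini([0, 0, 0, 4])` is 0.75 (one user took all four), so a rising Gini during a sale is a useful bot-activity signal.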


⚡ Quick Reference Cheat Sheet

Critical Decisions

  1. Atomic reservation – Lua script atomically decrement + create hold
  2. 5-min expiry – Auto-release unreserved inventory
  3. Probabilistic refresh – Avoid cache stampede, early expire
  4. Fair queue – Timestamp-ordered, batched serve
  5. Lottery at end – Random allocation if oversold
  6. Device-based fraud – Limit orders per device

Tech Stack

API: Go/Rust (ultra-low latency)
Inventory: Redis (atomic operations)
Queue: Redis Sorted Set (fair ordering)
Database: PostgreSQL (audit, order history)
Anti-fraud: ML model + device fingerprinting

🎯 Interview Summary (5 Minutes)

  1. Atomic reservation → Lua script: check stock, decrement, create hold
  2. Reservation expiry → Auto-cleanup returns unsold inventory
  3. Probabilistic refresh → Avoid cache expiry thundering herd
  4. Fair queueing → Redis ZSET by join timestamp
  5. Lottery → Random pick from eligible if oversold
  6. Device fraud → Limit orders per device_id
  7. Monitoring → Check allocated <= total_stock, catch bugs early

Glossary & Abbreviations