Web scraping 1M pages/day: architecture on multiple VPSs

May 08, 2026 · 7 min read

Valebyte Team
To scrape 1 million pages per day, you need a distributed architecture of five VPSs (4 vCPU, 8 GB RAM each), using Scrapy-Redis for queue coordination, headless Chrome via Browserless Docker containers for JavaScript rendering, and a combined Postgres + ClickHouse store for metadata and raw data. This configuration sustains a stable rate of about 12-15 pages per second, minimizes the risk of blocking through residential proxy rotation, and scales by simply adding new worker nodes.

What server resources are required for scraping 1M pages per day?

When planning large-scale scraping, the first task is sizing CPU and RAM, because headless browsers consume 10-20 times more resources than plain HTTP requests. To process 1 million pages in 24 hours, the system must sustain an average of about 700 pages per minute. If the target sites are heavy with client-side code (React, Vue, Angular), headless Chrome on the VPS becomes mandatory, which imposes strict RAM requirements. The optimal strategy is to separate roles between servers: one server is dedicated to management (Redis, Postgres, RabbitMQ/Celery), while the other four act as compute nodes.

| Node Type | VPS Configuration | Qty | Role in Architecture | Approx. Price (per unit) |
|---|---|---|---|---|
| Master Node | 4 vCPU, 8 GB RAM, 160 GB NVMe | 1 | Redis (queues), Postgres (metadata), ClickHouse | $12-15/mo |
| Worker Node | 4 vCPU, 8 GB RAM, 80 GB NVMe | 4 | Scrapy, Celery workers, Browserless (Chrome) | $10-12/mo |
| **Total** | 20 vCPU, 40 GB RAM | 5 | Scraping 1M pages/day | ~$60/mo |

For complex projects, such as scraping Wildberries, OZON, or Avito on a VPS, RAM headroom is critical: each Chrome tab consumes between 150 and 400 MB. On a server with 8 GB RAM, you can comfortably run 15-20 parallel browser instances without hitting swap.
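The sizing above can be sanity-checked in a few lines. This is a minimal sketch; the 400 MB worst-case tab size and the 2 GB reserved for the OS and Scrapy are assumptions taken from the figures in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_rate(pages_per_day: int):
    """Target throughput as (pages/sec, pages/min)."""
    per_sec = pages_per_day / SECONDS_PER_DAY
    return per_sec, per_sec * 60

def max_chrome_tabs(ram_gb: int, tab_mb: int = 400, reserve_gb: int = 2) -> int:
    """Parallel Chrome tabs that fit in RAM, sizing each tab at the
    worst-case tab_mb and reserving reserve_gb for the OS and Scrapy."""
    return (ram_gb - reserve_gb) * 1024 // tab_mb

per_sec, per_min = required_rate(1_000_000)
print(f"{per_sec:.1f} pages/s ({per_min:.0f} pages/min), "
      f"{max_chrome_tabs(8)} tabs on 8 GB")
```

For 1 million pages this works out to roughly 11.6 pages/s (about 694 pages/min) and 15 tabs per 8 GB worker, which matches the 700 pages/min and 15-20 instance figures above.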

Distributed Architecture: Scrapy Distributed and Redis

The central element of the system is the Scrapy distributed approach, where the standard Scrapy scheduler is replaced by a Redis queue. This lets multiple independent VPSs connect to a single URL list and pick up tasks as resources free up. Redis here works not just as a cache but as a high-performance message broker, storing both the current queue (LPUSH/RPOP) and the set of already visited pages for deduplication (DupeFilter). The advantage of this scheme is fault tolerance: if one worker crashes from memory exhaustion or an IP block, its tasks remain in Redis and are picked up by other nodes. The configuration in the Scrapy project's `settings.py` looks like this:

# Using the Scrapy-Redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Master server connection settings
REDIS_HOST = '10.0.0.1' # Internal IP of the Master VPS
REDIS_PORT = 6339

# The queue is not cleared after stopping (allows resuming scraping)
SCHEDULER_PERSIST = True
Using Celery in conjunction with Scrapy is justified when scraping is triggered by external events or requires complex post-processing (e.g., uploading images to S3 or calling neural networks). Celery workers on distributed VPS can accept tasks to scrape specific entities and launch the `crawler.crawl()` process, returning the result to the shared database.
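The worker-side launch logic can be sketched as follows. This assumes the common pattern of running each crawl as an isolated `scrapy crawl` subprocess (Twisted's reactor cannot be restarted inside a long-lived worker process); the spider name and the `start_url` argument are illustrative assumptions about the project layout, and in a real deployment `run_crawl` would sit inside a Celery `@app.task` function.

```python
import subprocess

def build_crawl_command(spider: str, start_url: str) -> list:
    """Argv for an isolated `scrapy crawl` run for one entity."""
    return ["scrapy", "crawl", spider, "-a", f"start_url={start_url}"]

def run_crawl(spider: str, start_url: str) -> int:
    # Inside a Celery task, a non-zero return code here would trigger
    # self.retry(); results reach the shared database via the pipelines.
    proc = subprocess.run(build_crawl_command(spider, start_url))
    return proc.returncode

print(build_crawl_command("products", "https://example.com/catalog"))
```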


Headless Chrome on VPS: Optimization via Browserless

Running "bare" Puppeteer or Playwright on each server is a recipe for memory leaks and zombie-process headaches. For stable large-scale scraping, it is recommended to use Browserless, an open-source solution that packages Chrome into a Docker container and exposes an API for managing a pool of tabs. A Puppeteer cluster inside Browserless uses the CPU efficiently: instead of launching a new browser process for every request, already open tabs (contexts) are reused, which cuts response time by 300-500 ms. Key Browserless tuning parameters on a VPS:
  • MAX_CONCURRENT_SESSIONS: Limit the number of tabs according to RAM (e.g., 15 for 8GB).
  • PREBOOT_CHROME: Keep the browser running to avoid wasting time on "cold starts."
  • CHROME_REFRESH_SECONDS: Force a process restart once an hour to clear accumulated memory leaks.
Command to run a worker via Docker:

docker run -d \
  --name browserless \
  -e "MAX_CONCURRENT_SESSIONS=15" \
  -e "MAX_QUEUE_LENGTH=30" \
  -p 3000:3000 \
  --restart always \
  browserless/chrome:latest
This approach allows the Scrapy script to connect to the browser via the WebSocket protocol (`ws://10.0.0.x:3000`), offloading the rendering load to a dedicated service. If you need to collect data for analytics, such as analyzing trends in messengers, this setup will also be useful for other tasks, like a 24/7 Telegram bot on VPS that notifies you of important changes on websites.
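A minimal sketch of handing rendering off to Browserless from Python, assuming the `playwright` package is installed; the worker IP is illustrative, and the optional `token` parameter corresponds to Browserless's access-token feature.

```python
def browserless_ws(host: str, port: int = 3000, token: str = "") -> str:
    """Build the WebSocket endpoint a worker connects to."""
    url = f"ws://{host}:{port}"
    return f"{url}?token={token}" if token else url

def render(url: str, ws: str) -> str:
    # Imported inside the function so the helper above stays usable
    # on machines where Playwright is not installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        # connect_over_cdp attaches to the already-running Chrome pool
        # instead of spawning a local browser process.
        browser = p.chromium.connect_over_cdp(ws)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

print(browserless_ws("10.0.0.2"))
```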

Proxy Pool and Anti-Fraud Bypass Logic

At a volume of 1 million pages per day, standard data-center proxies stop working: security systems (Cloudflare, Akamai, DataDome) quickly flag hosting-provider subnets. Successful scraping at this scale requires a multi-level IP rotation system. Recommended proxy management structure:
  1. **Residential Proxies**: used only for pages behind a login or complex checkout flows. Since you pay per gigabyte of traffic, use them sparingly.
  2. **Mobile Proxies (4G/5G)**: ideal for bypassing the strictest limits thanks to their high trust score.
  3. **ISP Proxies**: static IPs from home providers, combining data-center speed with the anonymity of residential addresses.
The retry logic should be intelligent. If Scrapy receives a 403 or 429 error, the task is not simply returned to the queue but marked with a `proxy_fail` flag, and on retry the system switches to a more "elite" proxy type. To track such errors, it is useful to deploy self-hosted Sentry and see in real time which URLs and proxy types are failing.
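The escalation rule described above can be sketched in a few lines. The tier names and the three-attempt cap are illustrative assumptions; the `proxy_fail` flag matches the one mentioned in the text.

```python
PROXY_TIERS = ["datacenter", "isp", "residential", "mobile"]
BLOCK_CODES = {403, 429}

def next_tier(current: str) -> str:
    """Escalate one level; stay on the top tier once reached."""
    i = PROXY_TIERS.index(current)
    return PROXY_TIERS[min(i + 1, len(PROXY_TIERS) - 1)]

def handle_response(status: int, task: dict):
    """Return an updated task to re-queue, or None if done."""
    if status not in BLOCK_CODES:
        return None  # success, or a non-proxy error handled elsewhere
    if task.get("attempts", 0) >= 3:
        return None  # give up; log to Sentry with the proxy_fail flag
    return {
        **task,
        "attempts": task.get("attempts", 0) + 1,
        "proxy_fail": True,
        "proxy_tier": next_tier(task["proxy_tier"]),
    }

task = {"url": "https://example.com/p/1", "proxy_tier": "datacenter"}
print(handle_response(429, task))
```

A 429 on a data-center IP re-queues the URL on the ISP tier with `proxy_fail` set, so the scheduler never retries a blocked URL through the same proxy class.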

Data Storage: A Hybrid of Postgres and ClickHouse

Writing a million rows per day to a classic relational database creates a performance bottleneck (I/O wait). Efficient data management at this scale uses a separation of concerns:
  • **Postgres (OLTP)**: stores task states, queues, user data, and metadata, where transaction support and indexed ID lookups matter. If you work with vector representations of data (e.g., for finding similar products), consider a vector DB on a VPS alongside Postgres.
  • **ClickHouse (OLAP)**: stores "raw" scraping results (HTML, prices, texts). ClickHouse compresses data 10-20 times more efficiently than Postgres and runs analytical queries across millions of records in milliseconds.
Example ClickHouse table schema for storing results:

CREATE TABLE scraped_data (
    url String,
    domain String,
    price Float64,
    content String,
    scraped_at DateTime DEFAULT now(),
    status_code UInt16
) ENGINE = MergeTree()
ORDER BY (domain, scraped_at);
Writes to ClickHouse should happen in batches of 5,000-10,000 records, which fits naturally into Scrapy's pipeline logic.
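The batching can be sketched as a Scrapy pipeline. The `execute` call mirrors the clickhouse-driver API, and the item field names follow the table schema above; the client is injected so the buffering logic itself is testable without a live ClickHouse.

```python
class ClickHousePipeline:
    """Buffers scraped items and flushes them to ClickHouse in batches."""

    def __init__(self, client, batch_size: int = 5000):
        self.client = client          # e.g. clickhouse_driver.Client(...)
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider=None):
        self.buffer.append((item["url"], item["domain"], item["price"],
                            item["content"], item["status_code"]))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            self.client.execute(
                "INSERT INTO scraped_data "
                "(url, domain, price, content, status_code) VALUES",
                self.buffer,
            )
            self.buffer = []

    def close_spider(self, spider=None):
        self.flush()  # write the final partial batch on shutdown
```

Flushing on `close_spider` matters: without it, the last partial batch of a run would silently stay in memory and be lost.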

Worker Configuration and Automation via Docker Compose

Managing 5 servers manually would take too much time. The optimal automation stack includes Docker Compose for local execution and Ansible for deployment across the entire VPS cluster. Each worker should contain a Scrapy instance and a local Browserless instance (if the sites require JS). Example `docker-compose.yml` for a worker node:

version: '3.8'
services:
  worker:
    build: .
    command: scrapy crawl general_spider
    environment:
      - REDIS_URL=redis://10.0.0.1:6339/0
      - BROWSERLESS_URL=http://browserless:3000
    depends_on:
      - browserless

  browserless:
    image: browserless/chrome:latest
    ports:
      - "3000:3000"
    environment:
      - MAX_CONCURRENT_SESSIONS=10
    restart: always
To automate task scheduling or integration with other services (e.g., sending reports to a CRM), you can use Self-hosted n8n. This allows you to visually configure a workflow: "Scraping complete -> ClickHouse aggregation -> Telegram report."

Performance Monitoring and Benchmarks

When scraping 1M pages, it is critical to monitor the "health" of each VPS. Key metrics to monitor:
  1. CPU Steal Time: If this metric rises, it means your VPS is sharing a physical core with an overly active neighbor, which slows down scraping.
  2. Memory Usage: Chrome tends to "eat" all available memory. Set limits in Docker (mem_limit).
  3. Network Bandwidth: 1 million pages represent 50 to 200 GB of traffic per day. Ensure your VPS provider offers a sufficient pipe (from 100 Mbps to 1 Gbps).
Benchmarks show that when using Scrapy without a browser (pure HTTP requests), a single 4 vCPU node can process up to 200 pages per second. However, with headless Chrome, performance drops to 3-5 pages per second per CPU core. This is why a distributed architecture across multiple VPS is the only economically viable solution.
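The traffic figures above can be checked directly. The 100 KB average page size used here is an assumption that lands inside the 50-200 GB/day range quoted; note that the 24-hour average bandwidth is far below the recommended 100 Mbps-1 Gbps pipe, which exists to absorb bursts and retries.

```python
def daily_traffic_gb(pages: int, avg_page_kb: float) -> float:
    """Total transfer per day in GB."""
    return pages * avg_page_kb / 1024 / 1024

def avg_bandwidth_mbps(pages: int, avg_page_kb: float) -> float:
    """Average sustained bandwidth in Mbit/s over 24 hours."""
    return pages * avg_page_kb * 1024 * 8 / 86_400 / 1_000_000

print(f"{daily_traffic_gb(1_000_000, 100):.0f} GB/day, "
      f"{avg_bandwidth_mbps(1_000_000, 100):.1f} Mbit/s average")
```

At 100 KB per page this gives roughly 95 GB/day at under 10 Mbit/s average, so bandwidth is rarely the bottleneck; CPU for rendering is.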

Conclusions

For stable scraping of 1M pages per day, deploy a cluster of 5 VPSs with 4 vCPU and 8 GB RAM each, using Scrapy-Redis for task distribution and Browserless Docker containers to manage headless Chrome. Store operational data in Postgres and dump scraping results in batches into ClickHouse to ensure maximum write speed and data compression.

