By the end of this chapter you will:
Vertical scaling (scale up): move from a 2 CPU / 4 GB instance to 8 CPU / 32 GB. This buys you maybe 3–5×. It is cheap to do, but it is capped by physics and leaves you with a single point of failure.
Horizontal scaling (scale out): run 10 copies of the same app behind a load balancer. Each request goes to whichever instance is least busy. This is the path that scales to billions of requests.
You will end up doing both. But horizontal is the strategic answer — design for it from day one. Most of this chapter is "what does it take to be horizontally scalable".
A stateless app is one where any request can be handled by any instance, and each instance can be killed and replaced without losing data.
| Anti-pattern | Why it kills horizontal scaling |
|---|---|
| In-memory session store (Express MemoryStore) | User logged in via instance A, request lands on B → 401 |
| Local disk caching | Cache hits on A, misses on B; restart = wipe |
| Cron job inside the app | 10 instances = job runs 10 times |
| In-memory rate limiter | Each instance has its own count |
| Sticky sessions | Locks a user to one instance — that instance dies, user logs out |
The fix for all of them: push the state to shared infrastructure.
| Was | Should be |
|---|---|
| In-memory sessions | Stateless tokens such as JWTs, or Redis-backed sessions |
| Local file cache | Redis, Memcached, or CDN |
| In-process cron | A scheduler service (BullMQ, EventBridge, K8s CronJob) running on one node |
| In-memory rate limit | Redis-backed (@nestjs/throttler with Redis storage) |
Once your app is stateless, scaling is just: "run more containers".
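To make the difference concrete, here is a minimal sketch of the session anti-pattern and its fix. The `SessionStore` interface and both classes are hypothetical names invented for this example, and the `Map`-backed store only stands in for Redis so the code runs on its own.

```typescript
// Hypothetical session-store interface; a real deployment would back this with Redis.
interface SessionStore {
  get(sessionId: string): string | undefined;
  set(sessionId: string, userId: string): void;
}

// In-memory stand-in. One shared instance plays the role of Redis;
// one instance per app server plays the role of a local MemoryStore.
class MapSessionStore implements SessionStore {
  private readonly sessions = new Map<string, string>();
  get(sessionId: string): string | undefined {
    return this.sessions.get(sessionId);
  }
  set(sessionId: string, userId: string): void {
    this.sessions.set(sessionId, userId);
  }
}

// One API instance. It holds no session state itself, only a reference to a store.
class ApiInstance {
  constructor(private readonly store: SessionStore) {}

  login(sessionId: string, userId: string): void {
    this.store.set(sessionId, userId);
  }

  // Returns the HTTP status this instance would answer with.
  handleRequest(sessionId: string): number {
    return this.store.get(sessionId) !== undefined ? 200 : 401;
  }
}

// Anti-pattern: each instance owns its sessions.
const localA = new ApiInstance(new MapSessionStore());
const localB = new ApiInstance(new MapSessionStore());
localA.login("s1", "alice");
const statusWithLocalState = localB.handleRequest("s1"); // B never saw the login

// Fix: both instances point at the same external store.
const sharedStore = new MapSessionStore();
const sharedA = new ApiInstance(sharedStore);
const sharedB = new ApiInstance(sharedStore);
sharedA.login("s1", "alice");
const statusWithSharedState = sharedB.handleRequest("s1"); // any instance can serve it
```

With per-instance stores the second request is rejected; with the shared store either instance can be killed and replaced without logging anyone out.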
Internet
│
▼
[CDN / WAF — Cloudflare / CloudFront]
│
▼
[Load balancer — ALB / Nginx]
│
├──► API container (instance 1)
├──► API container (instance 2)
└──► API container (instance N) ← auto-scaling group
│
├──► Postgres primary (writes)
├──► Postgres read replica(s)
├──► Redis (cache, queue, rate limit)
├──► S3 (files)
└──► SQS / RabbitMQ (background jobs)
[Worker container] (instance 1..M) ──► reads from queue
Run the containers on ECS Fargate or a comparable container service. Auto-scale on CPU > 70% for 5 minutes. Each instance is identical and disposable.
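The scaling rule itself is simple enough to sketch. The function below is only an illustration of the "CPU above 70% for 5 minutes" logic, assuming one averaged CPU sample per minute; in practice the platform's own alarms and scaling policies evaluate this for you, and `shouldScaleOut` is a made-up name.

```typescript
// One average-CPU reading per minute, in percent, oldest first.
type CpuSamples = number[];

// True when every one of the last `windowMinutes` samples is above `thresholdPercent`.
function shouldScaleOut(
  samples: CpuSamples,
  thresholdPercent = 70,
  windowMinutes = 5,
): boolean {
  if (samples.length < windowMinutes) return false; // not enough data yet
  const recent = samples.slice(samples.length - windowMinutes);
  return recent.every((cpu) => cpu > thresholdPercent);
}

// A one-minute spike inside the window does not trigger scaling.
const briefSpike = shouldScaleOut([40, 45, 95, 50, 42, 41]);
// Five consecutive minutes above the threshold do.
const sustainedLoad = shouldScaleOut([40, 75, 80, 78, 90, 85]);
```

Requiring a sustained breach rather than a single high reading keeps the fleet from flapping on short bursts.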
This is the boring, scalable, production architecture for 99% of products. Don't reach for serverless until you have a reason.
Your API can scale horizontally indefinitely. PostgreSQL cannot — there's exactly one primary, all writes go there. So:
Add one or more read replicas. Route reads of non-critical, slightly stale data (reports, search, profiles) to a replica. Writes and read-after-write paths stay on the primary.
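// Sequelize with read replica
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize('db', 'user', 'pass', {
  dialect: 'postgres',
  replication: {
    write: { host: 'primary.rds.amazonaws.com' },
    // The original listing is cut off here; this replica host is a placeholder.
    read: [{ host: 'replica-1.rds.amazonaws.com' }],
  },
});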
Replica lag is real — typically 100 ms to a few seconds. Reads immediately after a write may not see the write. For “show user their just-created order” → use primary.
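One way to handle read-after-write is to pin a user's reads to the primary for a short window after that user writes. The sketch below shows only the routing decision; `ReadRouter` and the 5-second window are assumptions made for illustration, not part of Sequelize or any other library.

```typescript
type Target = "primary" | "replica";

// Routes a user's reads to the primary for a short window after that user's last write.
class ReadRouter {
  private readonly lastWriteAt = new Map<string, number>();

  // `pinWindowMs` is an assumed safety margin that should exceed typical replica lag.
  constructor(private readonly pinWindowMs: number = 5000) {}

  recordWrite(userId: string, nowMs: number): void {
    this.lastWriteAt.set(userId, nowMs);
  }

  targetForRead(userId: string, nowMs: number): Target {
    const last = this.lastWriteAt.get(userId);
    if (last !== undefined && nowMs - last < this.pinWindowMs) return "primary";
    return "replica";
  }
}

const readRouter = new ReadRouter(5000);
readRouter.recordWrite("user-1", 1000);
const rightAfterWrite = readRouter.targetForRead("user-1", 1200); // inside the window
const afterWindow = readRouter.targetForRead("user-1", 7000);     // window has passed
const otherUser = readRouter.targetForRead("user-2", 1200);       // never wrote
```

Note that the timestamp map here lives in process memory, which is fine for a demo but is itself instance-local state. With several API instances it would have to move to a shared store or travel with the request (for example in a cookie).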
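Each PostgreSQL connection costs roughly 10 MB of RAM on the server. 100 instances × 20 connections = 2,000 connections = 20 GB just for idle connections. As a rule of thumb, PostgreSQL falls over at around 500 concurrent connections; the exact point depends on instance size and the max_connections setting.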
Put PgBouncer (or AWS RDS Proxy) in front:
API instances (each opens 5 connections to PgBouncer)
│ × 100 instances = 500 connections to PgBouncer
▼
PgBouncer (multiplexes onto 50 real DB connections)
▼
PostgreSQL (only sees 50 connections: happy)
This is the single highest-leverage thing you can do for DB scaling.
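The arithmetic behind those numbers can be written down directly. This is a back-of-the-envelope sketch using the roughly 10 MB per connection figure quoted above, which is an approximation rather than a measured value; `connectionBudget` is a name made up for this example.

```typescript
// Rough per-connection memory cost quoted in the text (an approximation).
const MB_PER_CONNECTION = 10;

interface ConnectionBudget {
  connections: number;
  memoryMb: number;
}

// Total connections and their approximate memory cost on the database server.
function connectionBudget(instances: number, poolSizePerInstance: number): ConnectionBudget {
  const connections = instances * poolSizePerInstance;
  return { connections, memoryMb: connections * MB_PER_CONNECTION };
}

// Without a pooler: every app connection is a real PostgreSQL backend.
const directToDb = connectionBudget(100, 20);

// With a pooler: apps open 5 connections each to the pooler ...
const clientsToPooler = connectionBudget(100, 5);
// ... and the database only sees the pooler's own 50 server-side connections.
const poolerToDb = connectionBudget(1, 50);
```

Under these assumptions the database goes from 2,000 backends (about 20 GB) to 50 backends (about 500 MB), while the application side still has 500 client connections to work with.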
Hit Redis before hitting Postgres. A 1 ms Redis lookup beats a 30 ms DB query.
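async getProduct(id: string) {
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);
  // Cache miss. The original listing is cut off here; productRepo and the 60 s TTL are assumed for illustration.
  const product = await this.productRepo.findById(id);
  await redis.set(`product:${id}`, JSON.stringify(product), 'EX', 60); // ioredis-style arguments
  return product;
}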
Patterns: cache-aside (above), read-through, write-through, write-behind. Cache-aside is the simplest and what 95% of apps need.
The hardest problem in caching is invalidation. Prefer short TTLs over manual invalidation when you can: a TTL still expires stale data even when your invalidation code has a bug.
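Here is a self-contained sketch of cache-aside with TTL-based expiry. `TtlCache` is a tiny stand-in for Redis, `loadProductFromDb` is a hypothetical database call, and the 60-second TTL is an assumed value; the clock is injected so that expiry can be demonstrated without waiting.

```typescript
interface Product {
  id: string;
  name: string;
}

// Tiny TTL cache standing in for Redis so the sketch runs on its own.
class TtlCache {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  constructor(private readonly now: () => number) {}

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: behave as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

let clockMs = 0; // fake clock, advanced manually
let dbReads = 0; // counts how often the "database" is hit
const productCache = new TtlCache(() => clockMs);

// Hypothetical database lookup.
async function loadProductFromDb(id: string): Promise<Product> {
  dbReads += 1;
  return { id, name: "Product " + id };
}

// Cache-aside: check the cache, fall back to the database, then populate the cache.
async function getProduct(id: string): Promise<Product> {
  const key = "product:" + id;
  const cached = productCache.get(key);
  if (cached !== undefined) return JSON.parse(cached) as Product;
  const product = await loadProductFromDb(id);
  productCache.set(key, JSON.stringify(product), 60000); // 60 s TTL, an assumed value
  return product;
}
```

Two calls in a row cost one database read. Once the fake clock passes the TTL, the next call reloads from the database, with no invalidation code involved.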
When one table has 5 billion rows, even with indexes things slow down. PostgreSQL native partitioning by date or tenant_id helps:
CREATE TABLE transactions (
id uuid, created_at timestamp, ...
) PARTITION BY RANGE (created_at);
CREATE TABLE transactions_2026_05 PARTITION OF transactions
FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
Old partitions can be dropped instantly (no expensive DELETE). Queries that filter on created_at only touch the relevant partition.
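Monthly partitions have to be created ahead of time, so teams often generate the DDL from a small script. The helper below is a sketch of that idea; `monthlyPartitionDdl` is a hypothetical function, and it assumes the table name is a trusted identifier because it is interpolated straight into the SQL.

```typescript
// Zero-pads a month number to two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Builds the DDL for one monthly partition of a table range-partitioned on a timestamp.
// `month` is 1-12. The upper bound is exclusive, so December rolls over to January
// of the following year. `table` must be a trusted identifier (no user input).
function monthlyPartitionDdl(table: string, year: number, month: number): string {
  if (!Number.isInteger(month) || month < 1 || month > 12) {
    throw new RangeError("month must be an integer from 1 to 12");
  }
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const from = `${year}-${pad2(month)}-01`;
  const to = `${nextYear}-${pad2(nextMonth)}-01`;
  return (
    `CREATE TABLE ${table}_${year}_${pad2(month)} PARTITION OF ${table}\n` +
    `  FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}

const mayPartition = monthlyPartitionDdl("transactions", 2026, 5);
const decemberPartition = monthlyPartitionDdl("transactions", 2026, 12);
```

For May 2026 this produces the same statement as the hand-written example; running it from a scheduled job a month or two ahead keeps inserts from failing for lack of a matching partition.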
AWS Lambda runs your code only when an event happens. No servers, no containers, no autoscaling configuration. You pay per millisecond of execution.
| Problem | Why |
|---|---|
| Long-running tasks (>15 min) | Hard limit — Lambda kills the function at 15 min |
| Always-on websockets | Stateless invocation, no persistent connection model |
| Heavy CPU work | Pricier than an EC2 box doing the same |
| DB-heavy APIs | Each invocation opens a new DB connection — exhausts connection pool fast |
| Tight latency (<100 ms p99) | Cold starts can add 500–2000 ms |
| Predictable steady traffic | A reserved EC2/ECS instance is much cheaper at constant load |
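When a Lambda hasn't run for a while, the first invocation has to wait while AWS sets up a fresh execution environment: it downloads your code, starts the runtime, and runs your initialisation code (everything outside the handler).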
This is the cold start. For a Node Lambda that's 100–500 ms; with a fat dependency tree it can hit 2 seconds.
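Mitigations: keep the deployment package small (bundle your code and ship only production node_modules), and use provisioned concurrency for latency-sensitive functions.
Every Lambda invocation can open a fresh DB connection. Burst traffic = thousands of connections = Postgres falls over.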
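Solutions: put RDS Proxy (or PgBouncer) in front of the database, and create the DB client outside the handler so warm invocations reuse its connection.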
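A common way to cut the number of connections is to open the connection once per execution environment instead of once per invocation. The sketch below shows that pattern without any AWS or database dependencies: `DbClient` and `connect` are hypothetical stand-ins, and the handler signature is simplified.

```typescript
// Hypothetical database client; `connect` stands in for an expensive TCP + auth handshake.
interface DbClient {
  query(sql: string): Promise<string>;
}

let connectionsOpened = 0;

async function connect(): Promise<DbClient> {
  connectionsOpened += 1;
  return { query: async (sql: string) => "result of " + sql };
}

// Module scope survives between invocations while the execution environment stays warm,
// so this promise is created once per environment rather than once per request.
let clientPromise: Promise<DbClient> | undefined;

function getClient(): Promise<DbClient> {
  if (clientPromise === undefined) clientPromise = connect();
  return clientPromise;
}

// Simplified Lambda-style handler; no AWS types are used here.
async function handler(event: { sql: string }): Promise<string> {
  const client = await getClient();
  return client.query(event.sql);
}
```

This reduces the cost to one connection per warm environment, but a burst that spins up a thousand environments still opens a thousand connections, which is why a proxy in front of the database remains useful.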
You can deploy a NestJS app to Lambda using @vendia/serverless-express:
// lambda.ts
import { NestFactory } from '@nestjs/core';
import { ExpressAdapter } from '@nestjs/platform-express';
import serverlessExpress from '@vendia/serverless-express';
import express from 'express';
import { AppModule } from './app.module'; // module path assumed; the original listing is cut off at this import
Be careful: the whole point of NestJS is rich structure, DI, and complex business logic. If your Lambda cold-starts in 2 seconds because it loads 40 modules, you've defeated the cost model. For a NestJS-style app, ECS Fargate is usually the better target.
Use Lambda for:
Use ECS / EKS for the main API.
Step Functions: visual orchestration of multiple Lambdas with retry, timeout, and branching. Great for "a multi-step process that takes minutes/hours and must survive failures".
StepFunction:
1. Validate input (Lambda)
2. Fan out: process each row (Lambda — parallel)
3. Aggregate results (Lambda)
4. Send notification (Lambda)
retries: 3, on-failure: route to DLQ
EventBridge: pub-sub between AWS services. "When an order is created, fire 5 things": payment, email, fraud check, audit log, analytics. Each subscriber is independent and can fail without breaking the others.
SQS: a reliable queue. Messages persist until consumed and acknowledged. The default for "I want to do this work later, but can't lose it." Pair with Lambda or ECS workers.
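The "persist until acknowledged" behaviour, combined with a retry limit and a dead-letter queue, can be sketched as a small consumer loop. This is an in-process illustration only: `consume` and `QueueMessage` are made-up names, and a real queue service handles redelivery and dead-lettering through its own configuration rather than a loop in your code.

```typescript
interface QueueMessage {
  id: string;
  body: string;
}

// Tries to process one message up to `maxAttempts` times.
// Returns "acked" when the handler succeeded, or "dead-lettered" after the final failure.
async function consume(
  message: QueueMessage,
  handle: (m: QueueMessage) => Promise<void>,
  deadLetterQueue: QueueMessage[],
  maxAttempts = 3,
): Promise<"acked" | "dead-lettered"> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handle(message);
      return "acked"; // only now would the message be removed from the queue
    } catch {
      // Failure: the message stays in the queue and is delivered again.
    }
  }
  // Out of attempts: park the message where a human or a repair job can inspect it.
  deadLetterQueue.push(message);
  return "dead-lettered";
}
```

Because a message can be delivered more than once, handlers used with this pattern should be idempotent: processing the same message twice must not charge a card twice.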
DynamoDB: a key-value store with single-digit-millisecond reads, near-unlimited horizontal scaling, and no fixed schema. Great for session stores, leaderboards, IoT-style write-heavy workloads. Bad for ad-hoc queries: you must design access patterns up front.
API Gateway: sits in front of Lambdas. Handles auth, throttling, custom domains, request transformation. Adds 30–50 ms latency vs ALB → Lambda direct, but is the standard pattern.
CloudFront: a CDN. Caches static assets (JS, CSS, images) at edge locations near users. Also offers Lambda@Edge for tiny per-request transformations.
Rather than reaching for Kubernetes on day one, scale up the simplest thing that works:
| Stage | Traffic | Stack |
|---|---|---|
| MVP | <1 req/sec | One ECS Fargate task or one EC2 box; one RDS; one S3. Deploy via GitHub Actions. |
| Early growth | 1–100 req/sec | 2–4 Fargate tasks behind ALB; RDS with PITR; Redis ElastiCache (1 node); CloudFront over S3. |
| Steady growth | 100–1k req/sec | Fargate auto-scaling 4–20 tasks; RDS with read replica; PgBouncer / RDS Proxy; SQS workers for background work. |
| Scale | 1k–10k req/sec | Multi-AZ everything; multi-region read replicas; partitioned tables; aggressive Redis caching; queues for non-critical writes. |
| Hyperscale | >10k req/sec | Sharded DB; CQRS (separate read & write models); event-sourced core; Kafka for events; some traffic on DynamoDB; multi-region active-active. |
Almost no product ever reaches the bottom row. Most stay at "Steady growth". Pick the right tool for the traffic you actually have, not the traffic you imagine.
A monolithic API does everything in the request:
POST /orders
▶ insert order
▶ charge payment
▶ send email
▶ notify partner
▶ update analytics
▶ return 201
Each step adds latency. If "send email" is slow, the user waits. If "notify partner" fails, the whole request fails.
Event-driven splits the work:
POST /orders
▶ insert order → in DB
▶ publish "OrderCreated" event → EventBridge / Kafka / SQS
▶ return 201 (50 ms)
Subscribers (independent, can be slow, can fail):
▶ payment service charges
▶ email service sends confirmation
▶ partner integration notifies
▶ analytics service records
Benefits: the request returns quickly, and a slow or failing subscriber no longer delays or fails it.
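Costs: a single order now spans several services, so debugging means following one request across all of them. Tag every event with a correlation_id (Chapter 12).
Most modern fintech / e-commerce backends look like this internally.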
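The request path and the subscriber side can be sketched in a few lines. The `EventBus` class below is an in-process stand-in for EventBridge, Kafka or SQS, so unlike a real broker it does not persist anything; all names are invented for this example.

```typescript
interface OrderCreated {
  type: "OrderCreated";
  orderId: string;
  correlationId: string;
}
type Subscriber = (event: OrderCreated) => Promise<void>;

// Minimal in-process bus standing in for a real broker.
class EventBus {
  private readonly subscribers: Subscriber[] = [];
  readonly failures: string[] = [];

  subscribe(subscriber: Subscriber): void {
    this.subscribers.push(subscriber);
  }

  // Delivers to every subscriber; a failure is recorded and does not stop the others.
  async publish(event: OrderCreated): Promise<void> {
    for (const subscriber of this.subscribers) {
      try {
        await subscriber(event);
      } catch (err) {
        this.failures.push(event.correlationId + ": " + String(err));
      }
    }
  }
}

const bus = new EventBus();
const sideEffects: string[] = [];
bus.subscribe(async (e) => { sideEffects.push("email:" + e.correlationId); });
bus.subscribe(async () => { throw new Error("partner API down"); });
bus.subscribe(async (e) => { sideEffects.push("analytics:" + e.correlationId); });

// The request path: persist, publish, respond. It does not wait for subscribers.
const orders: string[] = [];
let pendingDelivery: Promise<void> = Promise.resolve();

function createOrder(orderId: string, correlationId: string): { status: number } {
  orders.push(orderId); // "insert order"
  pendingDelivery = bus.publish({ type: "OrderCreated", orderId, correlationId });
  return { status: 201 };
}
```

The partner integration fails here, yet the order still returns 201 and the email and analytics subscribers still run. The recorded failure carries the correlation ID, which is what lets you find the matching request later.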
Command Query Responsibility Segregation: separate the model that handles writes from the model that handles reads. Writes go to a normalised SQL DB. Reads come from a denormalised read model (Elasticsearch, materialised views, Redis projection) updated asynchronously from events.
You don't need CQRS until you have the problem it solves: a real read/write asymmetry (1000 reads per write), or a search/analytics surface that's hard to serve from your transactional DB. When you hit it, the answer is to project events into a purpose-built read store.
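Projecting events into a read store is easier to picture with a small example. In this sketch the arrays and maps stand in for the SQL write side and the denormalised read store, and every name is hypothetical. The projector is called by hand here, where a real system would run it asynchronously from the event stream.

```typescript
// Write side: an append-only list of events produced by commands.
interface OrderPlaced {
  type: "OrderPlaced";
  orderId: string;
  customerId: string;
  totalCents: number;
}

// Read side: a denormalised per-customer summary, shaped for one specific query.
interface CustomerSummary {
  customerId: string;
  orderCount: number;
  totalSpentCents: number;
}

const eventLog: OrderPlaced[] = [];
const customerSummaries = new Map<string, CustomerSummary>();

// Command handler: validates the write and records an event.
function placeOrder(orderId: string, customerId: string, totalCents: number): void {
  if (totalCents <= 0) throw new RangeError("totalCents must be positive");
  eventLog.push({ type: "OrderPlaced", orderId, customerId, totalCents });
}

// Projector: folds events it has not seen yet into the read model.
let projectedUpTo = 0;
function project(): void {
  while (projectedUpTo < eventLog.length) {
    const e = eventLog[projectedUpTo];
    const current = customerSummaries.get(e.customerId) ?? {
      customerId: e.customerId,
      orderCount: 0,
      totalSpentCents: 0,
    };
    customerSummaries.set(e.customerId, {
      customerId: e.customerId,
      orderCount: current.orderCount + 1,
      totalSpentCents: current.totalSpentCents + e.totalCents,
    });
    projectedUpTo += 1;
  }
}

// Query handler: reads only from the read model, never from the event log.
function getCustomerSummary(customerId: string): CustomerSummary | undefined {
  return customerSummaries.get(customerId);
}
```

Until `project()` has run, a query sees nothing for a customer who has just ordered. That gap is the eventual consistency you accept in exchange for cheap, purpose-built reads.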
If you read one resource per topic, you'll be in the top 5% of your peers:
| Topic | Resource |
|---|---|
| Stateless apps & 12-factor | The 12-factor app website (1 hour) |
| AWS fundamentals | "AWS Solutions Architect — Associate" study guide |
| Caching strategies | AWS docs — Caching patterns whitepaper |
| Read replicas, PgBouncer | PostgreSQL docs — Replication chapter |
| Lambda gotchas | AWS docs — Lambda best practices + Operating Lambda series |
| Event-driven patterns | Martin Fowler — Event-Driven Architecture |
| DynamoDB modelling | Alex DeBrie's DynamoDB Book |
| Distributed systems | Designing Data-Intensive Applications — Kleppmann |
❌ Reaching for Kubernetes on day one: the operational burden is enormous.
❌ Sticky sessions: they lock you out of horizontal scaling.
❌ Serverless for everything: cold starts, DB connection limits, vendor lock-in. Use it where it fits.
❌ No connection pooling: the DB falls over at 500 connections.
❌ Caching with no invalidation strategy: stale data is sometimes worse than slow data.
❌ Microservices for a 5-engineer team: Conway's Law (Chapter 17). You need the team size to justify it.
❌ Premature sharding: sharding is irreversible and slows everything down for years.
❌ In-memory anything: sessions, rate limits, caches, cron, locks.
❌ Same DB instance for OLTP and reporting: reporting queries lock your live API.
❌ Skipping observability before scaling: you can't fix what you can't see.
"Scalable" is not a property of any single technology. It's a property of how your app handles state. Push state out, make every component disposable, and any single piece can be replaced or multiplied. Lambda, ECS, K8s, RDS, DynamoDB — they're all just delivery mechanisms for that one idea.
Build the boring, stateless, horizontally-scalable monolith first. Add event-driven pieces as you find real bottlenecks. Reach for serverless and exotic distributed-systems patterns only when the problem genuinely demands it.
The teams that ship the most are the ones that resisted complexity until it was forced on them.