Warming up the neural circuits...
By the end of this chapter you will:
A bulk upload is the single most destructive endpoint your backend will ever expose. One CSV with 50,000 rows can:
You will get bulk imports right or you will spend a long Sunday writing a rollback script.
| Rows | Pattern | UX |
|---|---|---|
| 1–500 | Synchronous (request blocks until done) | Spinner, then result |
| 500+ | Asynchronous (job + worker) | Upload → "we'll email you" / progress bar |
The threshold is fuzzy, but the rule is firm: if it can take more than ~5 seconds, make it async. Synchronous endpoints holding open connections are how you DOS yourself.
1. Frontend uploads CSV → S3 (presigned URL, Chapter 20)
2. Frontend POSTs /imports { s3_key, type: "transactions" }
3. Backend creates a row in `import_jobs`:
id, owner_id, s3_key, type, status='queued', progress=0,
total_rows=null, error_count=null, error_report_s3_key=null
4. Backend enqueues a job onto the worker queue (Chapter 15)
5. Worker:
- Downloads file from S3 (streamed)
- Parses & validates row by row
- Writes batches of 500 rows in transactions
- Updates progress on the job record every batch
- Collects errors → writes an error_report.csv to S3
- Marks status='completed' (or 'failed')
6. Frontend polls GET /imports/:id (or receives a notification)
7. If errors exist, user downloads the error_report.csv to fix and re-uploadThe user gets immediate feedback ("we got your file") and the long work happens out of band.
A 100 MB CSV loaded into a string is roughly 200 MB in heap. Five concurrent imports is a 1 GB out-of-memory crash. Stream the file row by row.
csv-parse)import { parse } from 'csv-parse';
import { Readable } from 'stream';
async function processCsv(stream: Readable, onRow: (row: any) => Promise<void>)
exceljs streaming reader)import { stream as xlsxStream } from 'exceljs';
const workbookReader = new xlsxStream.xlsx.WorkbookReader(s3ReadStream, {
entries: 'emit',
sharedStrings: 'cache',
}
Never use XLSX.readFile() (the synchronous xlsx library reader) in a worker — it loads the whole workbook into memory. For large files use exceljs streaming or convert to CSV first.
Don't bail on the first error. The user wants ONE report listing all 137 problems, not 137 sequential "fix and re-upload" cycles.
const errors: { row: number; column: string; value: any; message: string }[] = [];
let rowNum = 1; // row 1 = header
for await (
The error report ends up as a CSV the user can download:
row,column,value,message
17,email,not-an-email,Must be a valid email
24,amount,-50,Must be positive
102,partner_ref,abc-123,Already imported (duplicate)This is the difference between "useful tool" and "frustrating tool".
Inserting one row at a time = 50,000 round-trips to PostgreSQL. Inserting all 50,000 in one = one massive lock and rollback risk. The middle ground is batches of 200–1,000.
await this.sequelize.transaction(async (tx) => {
await this.transactionModel.bulkCreate(batch, {
transaction: tx,
validate: false, // we already validated above
returning:
For very large imports, PostgreSQL's COPY is 5–10× faster than INSERT:
import { from as copyFrom } from 'pg-copy-streams';
const stream = pgClient.query(copyFrom(
`COPY transactions(partner_ref, amount, currency) FROM STDIN WITH (FORMAT csv)`,
));
csvStream.pipe(stream);COPY is for trusted, pre-validated data only — it doesn't run model hooks, validators, or set timestamps automatically.
Users get bored watching a progress bar, refresh the page, and click "Upload" again. Without idempotency, every transaction in their CSV exists twice in your database.
Each row has an external reference the partner provides:
ALTER TABLE transactions
ADD CONSTRAINT uniq_partner_ref UNIQUE (partner_id, partner_ref);Use INSERT ... ON CONFLICT DO NOTHING:
await this.transactionModel.bulkCreate(batch, {
ignoreDuplicates: true,
});The user can supply the same file twice and get the same result. Hash the file before processing:
const fileSha = createHash('sha256').update(buffer).digest('hex');
const existing = await this.importJobModel.findOne({
where: { owner_id, file_sha256:
Both patterns combined is best: file-level for “same upload twice”, row-level for partial overlaps between two different files.
Excel saves CSVs with a 3-byte UTF-8 BOM (EF BB BF) at the start. The first column header reads as name instead of name and your validator rejects every row. Set bom: true in csv-parse.
Excel on Windows defaults to Windows-1252 (or even Shift-JIS in Japan), not UTF-8. Names with é, ñ, accents, or any non-ASCII character become ?. Either:
chardet and convert with iconv-litePhone numbers, account numbers, ZIPs:
account_number
00123456789 ← in CSV
123456789 ← what Excel does to it on saveExcel strips leading zeros, converts 1.23E+12 for long numbers, and can mangle dates. Tell users: paste with text-format columns OR use XLSX so the type is preserved.
A comma inside a value must be quoted: "Mumbai, India". A newline inside a value must be quoted too. Hand-rolled CSV splitters that do line.split(',') are wrong — use a library.
Excel dates are stored as days-since-1900 (a number). When the cell type is "date" you read 45417 instead of "2024-04-15". Always coerce explicitly:
const parsed = parseISO(row.date) ;
if (!isValid(parsed)) errors.push({ row: rowNum, ... });A .xlsx file is a zip archive of XML files. It can carry formula injection, macros, and XXE vulnerabilities. Never blindly trust user-uploaded XLSX content.
A .xlsx file is a zip archive of XML files. It can carry:
=SUM(A1:A10) is benign. =cmd|'/C calc.exe'!A0 (DDE — Dynamic Data Exchange) opens an attacker's program when the file is reopened in Excel. If you re-export user-uploaded data, never start a cell with =, +, -, @, tab, or carriage return. Prefix with a single quote ' to neutralise it.
function sanitiseCell(v: string): string {
if (typeof v !== 'string') return v;
if (/^[=+\-@\t\r]/.test(v)) return `'${v
This is CSV injection / formula injection. It's a real OWASP-listed vuln.
.xlsm files contain VBA macros. Reject anything ending .xlsm, .xlsb outright unless you have a very good reason.
Old XML parsers used by XLSX libraries had XXE vulnerabilities (external entity expansion → file read or DoS). Use a maintained library (exceljs, xlsx-populate) and keep it updated.
A 1 KB .xlsx can decompress to 5 GB of XML. Cap decompressed size:
import JSZip from 'jszip';
const zip = await JSZip.loadAsync(buffer, { checkCRC32: true });
let total = 0;
for (const name of Object.keys(zip
Max file size: 20 MB (CSV) / 10 MB (XLSX)
Max rows per file: 50,000
Max jobs per user: 1 active at a time
Max imports per day: 10
Worker timeout: 30 minutes per jobWithout these, one user can fill your queue, your DB, and your storage. With them, abuse is bounded.
Your bulk insert holds row-level locks. If those rows are also being updated by live traffic, you create a queue:
Mitigations:
❌ Synchronous endpoint that takes 30 seconds — the will kill it before it finishes.
❌ Loading the whole CSV into memory — works on your 200-row test file, OOMs on real data.
❌ First-error-and-stop — users hate this. Collect all errors, return all errors.
❌ One huge transaction for 50k rows — long lock, hard rollback, replay all on failure.
❌ Insert one row at a time — 50k round-trips, 30× slower than batches.
❌ No idempotency — duplicate imports happen weekly in real life.
❌ Trusting = in user-exported reports — you just shipped formula injection.
❌ Echoing original filename back to the user as a download link — XSS / phishing vector.
import_jobs row created BEFORE work begins, status = queuedreadFile)=, +, -, @ are sanitised on exportCollect all errors from every row before failing. One upload = one error report. Never stop at the first bad row — users want to fix everything in one pass. And always build for idempotency: the user will upload twice.
Collect all errors from every row before failing. One error = one error report. Never stop at the first bad row — users want to fix everything in one pass. And always idempotency: the user will upload twice.