Overview
As applications grow from hundreds to millions of transactions, the database layer often becomes the most critical bottleneck. What works fine at 1,000 requests per day can grind systems to a halt at 100,000. This post walks through the practical, production-tested techniques we applied to a high-traffic financial application: query optimization, composite indexing strategies, connection pool tuning, and architectural decisions that dramatically reduced CPU load and improved response times.
The patterns discussed here apply broadly to any relational database system (PostgreSQL, MySQL, Aurora), especially in cloud-native and serverless architectures where connection management and query efficiency have an outsized impact on both cost and reliability.
Problem Statement
Our application handled high-frequency financial transactions — account balance updates, transaction history queries, and multi-entity ledger operations running concurrently across thousands of users.
As load grew, we began experiencing:
- High CPU utilization on the database host during peak traffic windows
- Slow dashboard and reporting queries that scanned millions of rows unnecessarily
- Query timeouts on pagination-heavy list endpoints
- Connection exhaustion under burst traffic due to improper pool configuration
- Lock contention on high-write tables, causing cascading latency spikes
The core issue wasn’t the database itself — it was the combination of missing indexes, unoptimized queries, and poor connection management that collectively pushed CPU and I/O far beyond what the workload actually required.
Initial Approaches Tried (What Didn’t Work)
1. Vertical Scaling the Database
The first instinct was to scale up — more CPU, more RAM. While this bought temporary relief, it was expensive and didn’t address the root cause. Within weeks, CPU crept back to 80%+ during peak hours. Throwing hardware at a query problem is not a solution.
2. Adding Single-Column Indexes Reactively
Early indexing was reactive — when a query was slow, we added an index on the column in the WHERE clause. This led to:
- Dozens of redundant single-column indexes that the query planner ignored in favor of sequential scans
- Increased write overhead from maintaining unused indexes
- No improvement on multi-condition queries (e.g., filtering by status AND created_at AND user_id)
3. Using SELECT * and Fetching Data in Application Code
Some services fetched entire rows and filtered fields in application memory. For wide tables with JSON columns, this meant transferring kilobytes per row when only 3-4 columns were needed. Under load, this saturated network I/O and memory on both ends.
4. Naive Connection Handling in Serverless Functions
In a Lambda-based architecture, each function invocation was initially creating a new database connection. With concurrent invocations, this quickly exhausted the database’s max_connections limit, resulting in connection refused errors and cascading failures.
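One general remedy, initializing the client at module scope so warm containers reuse it across invocations, can be sketched as follows (`DbClient` and `createDbClient` are illustrative stand-ins, not names from our codebase):

```typescript
// Sketch: cache the client at module scope so a warm Lambda container
// reuses one connection instead of opening a new one per invocation.
type DbClient = { query: (sql: string) => Promise<unknown> };

let cachedClient: DbClient | null = null;

async function getClient(createDbClient: () => Promise<DbClient>): Promise<DbClient> {
  if (!cachedClient) {
    // Runs once per container lifetime, not once per request
    cachedClient = await createDbClient();
  }
  return cachedClient;
}
```

Section 3 below covers the complementary half of the fix: keeping each container's pool small.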
5. Running Count Queries and Data Queries Sequentially
Paginated list endpoints ran a COUNT(*) query first, then fetched the page of results. For large tables, this doubled the database round-trips unnecessarily.
Why This Final Approach
After profiling slow queries using EXPLAIN ANALYZE and examining index usage with pg_stat_user_indexes, a clear picture emerged: the database was doing far more work than necessary on every query.
The solution was built on four pillars:
- Purposeful composite and covering indexes — designed around actual query patterns, not individual columns
- Query builder discipline — enforcing parameterized, index-aware queries at the code level
- Connection pool configuration — tuned for the serverless execution model
- Read/write separation — routing read-heavy queries to read replicas
This approach addressed the root causes rather than symptoms, and the improvements compounded — better indexes reduced CPU, which reduced latency, which reduced connection hold time, which reduced pool pressure.
Implementation Details
1. Composite Index Design
The most impactful change was replacing scattered single-column indexes with composite indexes designed around query access patterns.
Principle: Index Column Order Matters
The column order in a composite index must match how queries filter data. Place high-cardinality, equality-filtered columns first, and range-filtered columns (like created_at) last.
```sql
-- Example: Queries that filter by status and sort/filter by date

-- BAD: Single-column indexes (often ignored for multi-condition queries)
CREATE INDEX idx_txns_status ON transactions(status);
CREATE INDEX idx_txns_created_at ON transactions(created_at);

-- GOOD: Composite index matching the query pattern
CREATE INDEX idx_txns_status_created_at ON transactions(status, created_at);

-- BETTER: Include user context for user-scoped queries
CREATE INDEX idx_txns_user_status_created_at ON transactions(user_id, status, created_at);
```
Covering Indexes for Read-Heavy Queries
For frequently accessed tables, a covering index includes all columns needed by a query, eliminating the need for a heap fetch entirely. The query planner can satisfy the entire query from the index alone.
```sql
-- Covering index example: entity lookups returning multiple columns
CREATE INDEX idx_accounts_entity_covering
  ON accounts(entity_id, group_type)
  INCLUDE (id, name, balance, currency, status);

-- Queries filtering on (entity_id, group_type) can return all INCLUDE columns
-- directly from the index via an index-only scan, with no table page reads
-- (provided the visibility map is reasonably up to date)
```
Expression Indexes for Computed Lookups
When queries filter on a computed or concatenated value, an expression index enables index scans on what would otherwise be a full-table expression evaluation.
```sql
-- Phone number lookups using country_code + number

-- BAD: Expression on the columns forces a full scan
SELECT * FROM phone_numbers WHERE country_code || phone_number = '+93701234567';

-- GOOD: Expression index makes this query index-scannable
CREATE INDEX idx_phone_full_number ON phone_numbers((country_code || phone_number));
```
Descending Indexes for Latest-First Queries
For “get most recent N records” patterns, a descending index avoids a sort operation:
```sql
-- Opening/closing balance lookups — most recent date first
CREATE INDEX idx_balances_account_date
  ON account_balances(account_id, snapshot_date DESC);
```
2. Query Optimization Patterns
DO: Use Indexed Columns in WHERE — In Index Column Order
```sql
-- Index: (user_id, status, created_at)

-- GOOD: Matches the index prefix
SELECT id, amount, status
FROM transactions
WHERE user_id = $1
  AND status = 'SUCCESS'
  AND created_at >= NOW() - INTERVAL '30 days';

-- BAD: Skips the leading column — the index cannot be used efficiently
SELECT id, amount, status
FROM transactions
WHERE status = 'SUCCESS'
  AND created_at >= NOW() - INTERVAL '30 days';
```
DO: Use Range Expressions That Preserve Indexability
```sql
-- GOOD: Range condition on the indexed column
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'

-- BAD: Function call on the indexed column forces a full scan
WHERE YEAR(created_at) = 2024                          -- MySQL
WHERE DATE_TRUNC('month', created_at) = '2024-01-01'   -- PostgreSQL
```
DO: Prefer LIKE 'prefix%' Over LIKE '%suffix'
```sql
-- GOOD: Prefix match can use the index
WHERE reference_code LIKE 'TXN-%'

-- BAD: Leading wildcard forces a full scan
WHERE reference_code LIKE '%TXN'
```
DO: Use IN (...) for Multi-Value Equality on Indexed Columns
```sql
-- GOOD: Row-value IN list uses the composite index efficiently
WHERE (entity_id, group_type) IN (
  ('user-123', 'WALLET'),
  ('user-123', 'COMMISSION')
)

-- BAD: Equivalent OR chain — may not use indexes as effectively
WHERE entity_id = 'user-123' AND group_type = 'WALLET'
   OR entity_id = 'user-123' AND group_type = 'COMMISSION'
```
DO: Always Specify Columns in SELECT
```sql
-- GOOD: Only fetches the needed columns
SELECT id, user_id, amount, status, created_at FROM transactions WHERE ...

-- BAD: Fetches all columns, including large JSON fields
SELECT * FROM transactions WHERE ...
```
3. Connection Pool Configuration for Serverless
In Lambda-based (or any short-lived process) environments, connection pool misconfiguration is a silent killer. The key insight: the pool must be sized per-instance, not per-application.
```javascript
// Knex.js pool configuration for Lambda environments
const knexConfig = {
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    database: process.env.DB_NAME,
    // NOTE: rejectUnauthorized: false skips certificate verification;
    // prefer supplying a CA bundle in production
    ssl: { rejectUnauthorized: false },
  },
  pool: {
    min: 0,                     // Don't hold idle connections between invocations
    max: 5,                     // Low ceiling — Lambda scales out horizontally, so keep per-container pools small
    idleTimeoutMillis: 30000,   // Release idle connections quickly
    acquireTimeoutMillis: 5000, // Fail fast on connection stalls
    reapIntervalMillis: 1000,   // Frequent pool cleanup
  },
  acquireConnectionTimeout: 5000,
};
```
Key principles:
- min: 0 — prevents idle Lambda containers from holding open connections
- max: 5 — with 100 concurrent Lambdas, this still allows up to 500 connections; plan capacity accordingly
- Use RDS Proxy or PgBouncer as a connection multiplexer to absorb connection bursts from serverless workloads
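The 100 × 5 arithmetic can be expressed as a quick capacity check (an illustrative helper, not part of our codebase):

```typescript
// Total connections a serverless fleet can open:
// concurrent instances × per-instance pool max
function fleetConnectionCeiling(concurrentInstances: number, perInstanceMax: number): number {
  return concurrentInstances * perInstanceMax;
}

// Compare the ceiling against the database's max_connections setting
function exceedsDbLimit(
  concurrentInstances: number,
  perInstanceMax: number,
  dbMaxConnections: number
): boolean {
  return fleetConnectionCeiling(concurrentInstances, perInstanceMax) > dbMaxConnections;
}
```

With PostgreSQL's default max_connections of 100, even a modest fleet overshoots, which is exactly why a proxy layer belongs in front of serverless workloads.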
Separate Read Replica Pool
```javascript
// Route read-heavy queries to the replica
const writePool = new KnexDatabaseService({ host: process.env.DB_WRITE_HOST, maxPool: 5 });
const readPool = new KnexDatabaseService({ host: process.env.DB_READ_HOST, maxPool: 10 });

// Usage
const reportData = await readPool.query(heavyReportingQuery);         // hits the replica
await writePool.transaction(async (trx) => { /* balance update */ }); // hits the primary
```
4. ACID Transactions for Financial Data Integrity
Multi-step financial operations (balance update → ledger entry → transaction record) must be wrapped in a single database transaction to ensure atomicity.
```javascript
// Pattern: atomic multi-table financial operation
await db.transaction(async (trx) => {
  // Step 1: Debit sender account
  await trx('accounts')
    .where({ id: senderId })
    .decrement('balance', amount);

  // Step 2: Credit receiver account
  await trx('accounts')
    .where({ id: receiverId })
    .increment('balance', amount);

  // Step 3: Record ledger entry
  await trx('account_transactions').insert({
    account_id: senderId,
    transaction_type: 'DEBIT',
    amount,
    reference_id: txnId,
    created_at: new Date(),
  });

  // Step 4: Update daily closing balance snapshot
  await trx.raw(`
    INSERT INTO daily_balances (account_id, snapshot_date, closing_balance)
    VALUES (?, CURRENT_DATE, (SELECT balance FROM accounts WHERE id = ?))
    ON CONFLICT (account_id, snapshot_date) DO UPDATE
      SET closing_balance = EXCLUDED.closing_balance
  `, [senderId, senderId]);
});
// All steps committed atomically — or all rolled back on error
```
5. Parallel Query Execution for Paginated Endpoints
Avoid sequential count + data queries on paginated list endpoints:
```javascript
// GOOD: Run count and data queries in parallel
const [countResult, dataResult] = await Promise.all([
  db('transactions')
    .where({ user_id: userId, status: 'SUCCESS' })
    .count('* as total'),
  db('transactions')
    .select('id', 'amount', 'status', 'created_at')
    .where({ user_id: userId, status: 'SUCCESS' })
    .orderBy('created_at', 'desc')
    .limit(pageSize)
    .offset((page - 1) * pageSize),
]);

return {
  data: dataResult,
  total: Number(countResult[0].total),
  page,
  pageSize,
};
```
6. Retry Logic for Transient Failures
Database connections in cloud environments experience transient failures. Implement exponential backoff for retryable errors:
```typescript
async function queryWithRetry<T>(
  queryFn: () => Promise<T>,
  maxRetries = 3,
  initialDelayMs = 1000
): Promise<T> {
  let lastError: Error;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await queryFn();
    } catch (err) {
      lastError = err as Error;
      if (!isRetryableError(err) || attempt === maxRetries) throw err;
      const delay = initialDelayMs * Math.pow(2, attempt); // 1s, 2s, 4s, ...
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError!;
}

function isRetryableError(err: unknown): boolean {
  // Retry on connection errors, not on constraint violations or syntax errors
  const retryableCodes = ['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', '57P03'];
  return retryableCodes.some(code => String(err).includes(code));
}
```
7. Database Migration Management with Alembic
For teams using Python/SQLAlchemy, Alembic provides a structured migration workflow:
```python
from alembic import op

# Each migration file includes upgrade AND downgrade functions
def upgrade():
    op.create_index(
        'idx_transactions_user_status_created_at',
        'transactions',
        ['user_id', 'status', 'created_at'],
        unique=False,
    )

def downgrade():
    op.drop_index('idx_transactions_user_status_created_at', table_name='transactions')
```
Migration best practices:
- Always implement downgrade() — production incidents require fast rollback
- Add indexes in separate migrations from schema changes for cleaner history
- Use --autogenerate against model metadata, but always review the generated diff
- Run alembic upgrade head as part of your CI/CD pipeline before deploying new code
Outcome / Benefits
After implementing these optimizations across all major query paths:
| Metric | Before | After |
|---|---|---|
| Database CPU (peak) | 75–85% | 20–30% |
| Slow query count (>500ms) | Frequent | Rare (near zero under normal load) |
| Average query latency | 300–800ms | 20–80ms |
| Connection errors under burst | Occasional | Eliminated |
| Paginated list endpoint P95 | ~1.2s | ~120ms |
| Report/dashboard query time | 4–8s | 400ms–1.2s |
Beyond raw performance:
- Predictable scaling: Adding load no longer caused exponential latency growth
- Cost reduction: Lower CPU utilization reduced RDS instance costs
- Developer confidence: Query guidelines reduced the chance of introducing regressions
- Operational clarity: Covering indexes made slow query analysis straightforward
Lessons Learned / Pitfalls
1. Don’t Index Everything — Indexes Have Write Costs
Every index you add must be maintained on every INSERT/UPDATE/DELETE. A table with 15 indexes on it will have noticeably slower writes. Audit your indexes regularly with pg_stat_user_indexes and drop indexes with zero or near-zero scans.
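As a starting point for such an audit, a query sketch against pg_stat_user_indexes surfaces candidates for removal (the LIMIT and any scan-count threshold are up to you; counters accumulate since the last statistics reset, so judge accordingly):

```sql
-- Query sketch: surface indexes with zero or near-zero scans
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
```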
2. Index Column Order Is Non-Negotiable
A composite index (a, b, c) cannot efficiently serve a query filtering only on b or c. Design indexes around your most frequent, most costly query patterns — not around individual column mentions.
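To make the prefix rule concrete, here is a small illustrative helper (not from our codebase) that returns which leading index columns a set of equality filters can actually use:

```typescript
// Returns the longest leading run of index columns covered by the
// query's equality filters; columns after the first gap are unusable.
function usableIndexPrefix(indexColumns: string[], filterColumns: Set<string>): string[] {
  const usable: string[] = [];
  for (const col of indexColumns) {
    if (!filterColumns.has(col)) break; // prefix broken: stop here
    usable.push(col);
  }
  return usable;
}
```

Filtering on (b, c) against an index on (a, b, c) yields an empty prefix, which matches what the planner can exploit.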
3. Query Planner Doesn’t Always Use Your Index
Even with the right index, the query planner may choose a sequential scan if statistics are stale or if the result set is large. Run ANALYZE on tables after large bulk loads, and use EXPLAIN (ANALYZE, BUFFERS) to verify index usage.
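A minimal verification sketch, assuming the (user_id, status, created_at) composite index from the implementation section:

```sql
-- Refresh statistics, then confirm the plan uses the expected index
ANALYZE transactions;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, amount, status
FROM transactions
WHERE user_id = 'user-123'
  AND status = 'SUCCESS'
  AND created_at >= NOW() - INTERVAL '30 days';
```

Look for an Index Scan (or Index Only Scan) node naming the composite index; a Seq Scan here means stale statistics, a mismatched predicate, or a result set large enough that the planner prefers scanning.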
4. Connection Pools in Serverless Need Special Attention
Traditional pool sizing (min: 5, max: 50) is designed for long-running processes. In Lambda, each new container brings its own pool — 100 concurrent invocations × 50 max connections = 5,000 connection attempts. Always pair serverless workloads with a connection pooler (RDS Proxy, PgBouncer).
5. Wrapping Everything in Transactions Has a Cost
Transactions hold locks. Long-running transactions on high-write tables cause lock contention and replication lag. Keep transactions short and targeted — only wrap what truly needs atomicity.
6. SELECT * Is More Expensive Than It Looks
On tables with JSON or TEXT columns, fetching all columns can be 10–50x more data than needed. Always specify columns explicitly in production queries.
7. Test Your Indexes With Production-Scale Data
An index that looks optimal on 10,000 rows may behave differently on 50 million rows. Test index effectiveness with representative data volumes before relying on it in production.
8. Monitor Index Bloat Over Time
High-write tables accumulate index bloat from dead tuples. Schedule periodic REINDEX CONCURRENTLY or configure autovacuum aggressively enough to keep indexes lean.
Future Improvements
1. Materialized Views for Heavy Reporting
Dashboard and reporting queries that aggregate millions of rows are prime candidates for materialized views — precomputed result sets refreshed on a schedule or triggered by writes. This moves expensive aggregation off the critical path.
```sql
CREATE MATERIALIZED VIEW daily_transaction_summary AS
SELECT
  DATE_TRUNC('day', created_at) AS day,
  user_id,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_amount
FROM transactions
WHERE status = 'SUCCESS'
GROUP BY 1, 2;

-- Refresh periodically; CONCURRENTLY requires a unique index on the view
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_transaction_summary;
```
2. Explicit Cache Layer for Frequently Read, Rarely Written Data
Static or slow-changing data (configuration, rate tables, product catalogs) is currently read from the database on every request. Adding a Redis cache layer with a short TTL would eliminate database reads for these high-frequency, low-volatility lookups.
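To illustrate the pattern without standing up Redis, here is a minimal in-memory TTL cache sketch (the injectable nowMs clock exists purely for testability; in production, the Redis client's built-in TTL support would replace all of this):

```typescript
// Minimal TTL cache: entries expire ttlMs after being set.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number,
    private nowMs: () => number = () => Date.now()
  ) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.nowMs()) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.nowMs() + this.ttlMs });
  }
}
```

A cache miss falls through to the database; a short TTL bounds staleness for slow-changing data like rate tables.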
3. Query Performance Regression Testing in CI
Add pg_stat_statements snapshots and EXPLAIN plan assertions to the CI pipeline to catch query regressions before they reach production. Tools like pganalyze or custom plan shape tests can flag when an index scan degrades to a sequential scan.
4. Partitioning for Time-Series Tables
High-volume transaction tables grow unboundedly. Table partitioning by time range (monthly or quarterly) keeps partition sizes manageable, enables partition pruning in queries, and makes archival/deletion straightforward.
```sql
CREATE TABLE transactions (
  id TEXT,
  created_at TIMESTAMPTZ NOT NULL,
  amount DECIMAL(15, 2),
  status TEXT
) PARTITION BY RANGE (created_at);

CREATE TABLE transactions_2024_q1
  PARTITION OF transactions
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
```
5. Read-Replica Routing Based on Query Cost
Currently, read replica routing is manual (call the right pool). Automated query routing based on estimated cost (via EXPLAIN) or explicit query tagging would ensure expensive analytical queries never hit the primary without developer discipline.
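A first step toward that could be explicit query tagging. A minimal routing sketch (choosePool and the tag names are illustrative, not from our codebase):

```typescript
// Tag-based routing: analytical queries go to the replica,
// transactional queries stay on the primary.
type QueryTag = 'transactional' | 'analytical';

function choosePool<T>(tag: QueryTag, pools: { primary: T; replica: T }): T {
  return tag === 'analytical' ? pools.replica : pools.primary;
}
```

Cost-based routing would replace the static tag with a decision derived from the query's EXPLAIN estimate, at the price of an extra planning round-trip.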
Conclusion
Database performance optimization is not a one-time task — it’s an ongoing discipline. The biggest wins come from understanding your access patterns first, then designing indexes and queries around those patterns, not the other way around.
The techniques covered here — composite indexes aligned to query predicates, covering indexes for hot read paths, expression indexes for computed lookups, connection pool tuning for serverless, parallel query execution, and ACID-safe transaction patterns — collectively represent the difference between a database that struggles under load and one that scales gracefully.
Start with EXPLAIN ANALYZE. Measure before you optimize. And remember: the best query is the one that does the least work to return the right answer.
Have questions or want to share your own optimization experiences? Drop them in the comments below.