PostgreSQL Performance Optimization: From Basics to Advanced

Master PostgreSQL performance tuning with indexing strategies, query optimization, connection pooling, and monitoring techniques for production databases.

Bahaa Abbas · Monday, December 1, 2025 · 16 min read

Introduction

PostgreSQL is a powerful database, but without proper optimization, even simple queries can become bottlenecks. Whether you're dealing with slow queries, high CPU usage, or connection issues, understanding PostgreSQL's internals and optimization techniques is crucial. This comprehensive guide covers everything from basic indexing to advanced query optimization, helping you build fast, scalable database architectures that can handle millions of records efficiently.

Understanding EXPLAIN and Query Plans

Before optimizing, you need to understand what's slow. EXPLAIN and EXPLAIN ANALYZE show you exactly how PostgreSQL executes your queries. The query planner chooses the most efficient execution path, but you need to read these plans to identify bottlenecks. Look for sequential scans on large tables, expensive sorts, and nested loops on large datasets—these are often optimization opportunities.

Using EXPLAIN
-- Basic EXPLAIN
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';

-- EXPLAIN ANALYZE (actually runs the query)
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 5;

-- Look for these red flags:
-- ❌ Seq Scan on large tables
-- ❌ High cost values
-- ❌ Nested Loop on large datasets
-- ✅ Index Scan or Index Only Scan
-- ✅ Hash Join or Merge Join

Indexing Strategies

Indexes are your first line of defense against slow queries. B-tree indexes (default) work for most cases, but specialized indexes can dramatically improve specific query types. Use partial indexes for frequently filtered subsets, composite indexes for multi-column queries, and covering indexes to enable index-only scans. Remember: indexes speed up reads but slow down writes, so index strategically based on your query patterns.

Index Types and Usage
-- Basic B-tree index (most common)
CREATE INDEX idx_users_email ON users(email);

-- Composite index (order matters!)
CREATE INDEX idx_orders_user_created 
ON orders(user_id, created_at DESC);

-- Partial index (smaller, faster)
CREATE INDEX idx_active_users 
ON users(email) 
WHERE active = true;

-- Covering index (index-only scan)
CREATE INDEX idx_users_email_name 
ON users(email) INCLUDE (name);

-- GIN index for full-text search
CREATE INDEX idx_posts_content 
ON posts USING GIN(to_tsvector('english', content));

-- BRIN index for sequential data
CREATE INDEX idx_logs_created 
ON logs USING BRIN(created_at);
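Because every index adds write overhead, it is worth checking periodically for indexes that are never read. A sketch of such a check using the built-in pg_stat_user_indexes view (the 1000-scan threshold is an arbitrary starting point):

```sql
-- Find rarely-used indexes (candidates for removal)
SELECT
  schemaname,
  relname AS table_name,
  indexrelname AS index_name,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 1000
ORDER BY pg_relation_size(indexrelid) DESC;
```

Note that these counters reset on statistics reset, so check them only after the database has been running long enough to reflect real traffic.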

Query Optimization Techniques

Writing efficient queries is an art. Avoid SELECT *: explicit column lists reduce data transfer and make index-only scans possible. Replace correlated subqueries with JOINs where you can, as they're usually faster. Prefer EXISTS over IN when the subquery returns many rows, and use CTEs (Common Table Expressions) to make complex queries more readable (and occasionally faster). Always filter as early as possible to shrink the working dataset.

Query Optimization Examples
-- ❌ Bad: Subquery in SELECT
SELECT 
  u.name,
  (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as order_count
FROM users u;

-- ✅ Good: Use JOIN
SELECT 
  u.name,
  COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

-- ❌ Bad: Using IN with subquery
SELECT * FROM products 
WHERE category_id IN (
  SELECT id FROM categories WHERE active = true
);

-- ✅ Good: Use EXISTS or JOIN
SELECT p.* FROM products p
WHERE EXISTS (
  SELECT 1 FROM categories c 
  WHERE c.id = p.category_id AND c.active = true
);

-- Use LIMIT with pagination
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 20 OFFSET 0;  -- First page
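One caveat with OFFSET: PostgreSQL still reads and discards all skipped rows, so deep pages get progressively slower. Keyset (cursor-based) pagination avoids this by filtering past the last row seen; a sketch, assuming the previous page ended at a known (created_at, id) pair:

```sql
-- Keyset pagination: filter past the last row of the previous page
-- (values below are illustrative placeholders)
SELECT * FROM posts
WHERE (created_at, id) < ('2025-11-01 12:00:00', 12345)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```

A composite index on (created_at DESC, id DESC) lets this query seek directly to the right position regardless of page depth.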

Connection Pooling

Database connections are expensive to create. Connection pooling reuses connections, dramatically improving performance under load. In Node.js, use pg-pool or Prisma's built-in pooling. Configure pool size based on your database's max connections and application's concurrency needs. A common formula: pool size = (core_count * 2) + effective_spindle_count. Monitor connection usage and adjust as needed.

Connection Pool Setup
import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  // Pool configuration
  max: 20,                    // Maximum pool size
  min: 5,                     // Minimum pool size
  idleTimeoutMillis: 30000,   // Close idle connections
  connectionTimeoutMillis: 2000, // Connection timeout
});

// Using the pool
const getUser = async (id: string) => {
  const client = await pool.connect();
  try {
    const result = await client.query(
      'SELECT * FROM users WHERE id = $1',
      [id]
    );
    return result.rows[0];
  } finally {
    client.release(); // Always release!
  }
};

Vacuuming and Maintenance

PostgreSQL uses MVCC (Multi-Version Concurrency Control), which creates dead tuples over time. VACUUM reclaims this space and updates statistics for the query planner. Autovacuum runs automatically, but for high-write tables, you may need manual vacuuming. ANALYZE updates statistics without cleaning up, helping the planner choose better query plans. Run VACUUM ANALYZE regularly, especially after large data changes.

Database Maintenance
-- Vacuum and analyze specific table
VACUUM ANALYZE users;

-- Full vacuum (locks table, use carefully)
VACUUM FULL users;

-- Check table bloat
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Enable autovacuum (in postgresql.conf)
autovacuum = on
autovacuum_naptime = 1min

Monitoring and Metrics

You can't optimize what you don't measure. Monitor key metrics: slow queries (log queries over 100ms), cache hit ratio (should be >99%), connection count, and lock waits. Use pg_stat_statements extension to identify problematic queries. Set up alerts for connection pool exhaustion, high replication lag, and disk space. Tools like pgAdmin, DataDog, or Prometheus can help visualize these metrics.

Monitoring Queries
-- Enable pg_stat_statements (requires shared_preload_libraries =
-- 'pg_stat_statements' in postgresql.conf and a server restart)
CREATE EXTENSION pg_stat_statements;

-- Find slowest queries
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check cache hit ratio (aim for > 0.99)
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  sum(heap_blks_hit)::float
    / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) as ratio
FROM pg_statio_user_tables;

-- Active connections
SELECT count(*) FROM pg_stat_activity 
WHERE state = 'active';
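To capture the slow-query log mentioned above, logging must be enabled on the server side. A minimal sketch of the relevant postgresql.conf settings (thresholds are starting points, not universal values):

```sql
-- In postgresql.conf (reload with SELECT pg_reload_conf())
log_min_duration_statement = 100   -- log statements taking > 100 ms
log_checkpoints = on               -- surface checkpoint-related stalls
log_lock_waits = on                -- log waits longer than deadlock_timeout
```

Setting log_min_duration_statement = 0 logs every statement, which is useful for short diagnostic windows but too noisy (and too much I/O) for steady-state production use.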

Partitioning for Scale

When tables grow to millions of rows, partitioning can significantly improve performance. Partition by range (dates), list (categories), or hash (distributed load). Queries that filter on the partition key only scan relevant partitions, dramatically reducing I/O. Common use case: partitioning logs or time-series data by month. Partitioning also makes archiving old data easier.
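As a sketch of the time-series case described above, here is a logs table range-partitioned by month (table and column names are illustrative):

```sql
-- Parent table: partitioned by range on the timestamp column
CREATE TABLE logs (
  id BIGSERIAL,
  message TEXT,
  created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per month
CREATE TABLE logs_2025_01 PARTITION OF logs
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE logs_2025_02 PARTITION OF logs
  FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Queries filtering on created_at only scan matching partitions
EXPLAIN SELECT * FROM logs
WHERE created_at >= '2025-02-01' AND created_at < '2025-03-01';

-- Archiving: detach an old partition instead of a slow DELETE
ALTER TABLE logs DETACH PARTITION logs_2025_01;
```

Detaching a partition is a fast metadata operation, which is what makes archiving old data so much cheaper than row-by-row deletion.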

Conclusion

PostgreSQL optimization is an ongoing process. Start with proper indexing and query optimization, implement connection pooling, and establish monitoring to identify issues early. Use EXPLAIN to understand query execution, vacuum regularly to maintain performance, and consider partitioning for very large tables. Remember that premature optimization is still a concern—measure first, then optimize based on real bottlenecks. With these techniques, your PostgreSQL database can efficiently handle millions of records while maintaining sub-second query times.
