Database Design Fundamentals

Why Database Design Matters

A poorly designed database is technical debt that compounds daily. Queries get slower. Data anomalies creep in. Migrations become terrifying. Developers work around the schema instead of with it.

Good database design is not about following rules blindly. It is about understanding the trade-offs between data integrity, query performance, and development velocity, then making intentional choices for your specific workload.

This article covers relational and NoSQL databases, normalization theory with practical examples, indexing fundamentals, and when breaking the rules (denormalization) is the right call.

Relational vs NoSQL Databases

The first decision is choosing your database paradigm. This is not relational vs NoSQL — it is about matching your data model to your access patterns.

Relational Databases (SQL)

Relational databases store data in tables with rows and columns. Relationships between tables are defined through foreign keys. Data integrity is enforced at the database level.

-- Relational schema for an e-commerce platform
CREATE TABLE users (
    id          SERIAL PRIMARY KEY,
    email       VARCHAR(255) UNIQUE NOT NULL,
    name        VARCHAR(100) NOT NULL,
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    status      VARCHAR(20) DEFAULT 'pending',
    total       DECIMAL(10, 2) NOT NULL,
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    id          SERIAL PRIMARY KEY,
    order_id    INTEGER NOT NULL REFERENCES orders(id),
    product_id  INTEGER NOT NULL REFERENCES products(id),
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    unit_price  DECIMAL(10, 2) NOT NULL
);

Strengths: ACID transactions, enforced relationships, complex queries with JOINs, mature tooling, strong consistency.

Best for: Financial data, inventory systems, anything requiring strict consistency and complex queries across related entities.

NoSQL Databases

NoSQL databases come in several flavors, each optimized for different access patterns.

Document stores (MongoDB, CouchDB):

// Document model — same e-commerce data
{
  "_id": "order_12345",
  "userId": "user_789",
  "status": "pending",
  "total": 149.99,
  "items": [
    {
      "productId": "prod_001",
      "name": "Wireless Mouse",
      "quantity": 2,
      "unitPrice": 29.99
    },
    {
      "productId": "prod_002",
      "name": "USB-C Hub",
      "quantity": 1,
      "unitPrice": 89.99
    }
  ],
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Portland",
    "state": "OR",
    "zip": "97201"
  },
  "createdAt": "2026-04-25T10:30:00Z"
}

Key-value stores (Redis, DynamoDB):

Key: "session:abc123"
Value: {"userId": "789", "cart": [...], "expires": 1714060800}

Key: "user:789:profile"
Value: {"name": "Alice", "email": "[email protected]", "tier": "premium"}

Wide-column stores (Cassandra, HBase):

Row Key: "user_789"
Column Family: "orders"
  ├── "order_001": {"total": 149.99, "status": "completed"}
  ├── "order_002": {"total": 79.50, "status": "pending"}
  └── "order_003": {"total": 220.00, "status": "shipped"}

Column Family: "profile"
  ├── "name": "Alice"
  ├── "email": "[email protected]"
  └── "tier": "premium"

Graph databases (Neo4j, Amazon Neptune):

// Graph model — social connections
CREATE (alice:User {name: "Alice"})
CREATE (bob:User {name: "Bob"})
CREATE (post:Post {content: "Hello world"})
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (alice)-[:AUTHORED]->(post)
CREATE (bob)-[:LIKED]->(post)

// Query: Find friends-of-friends who liked Alice's posts
MATCH (alice:User {name: "Alice"})-[:FOLLOWS]->()-[:FOLLOWS]->(fof)
WHERE fof <> alice
MATCH (fof)-[:LIKED]->(post)<-[:AUTHORED]-(alice)
RETURN fof.name, post.content

Comparison Table

Aspect	Relational (PostgreSQL)	Document (MongoDB)	Key-Value (Redis)	Wide-Column (Cassandra)
Schema	Strict, predefined	Flexible, per-document	Schema-less	Column families
Consistency	Strong (ACID)	Configurable	Eventual (single-node strong)	Tunable
Scaling	Vertical primarily	Horizontal (sharding)	Horizontal	Horizontal
Joins	Native, efficient	Manual (application-level)	None	Limited
Best for	Complex relations, transactions	Varied document structures	Caching, sessions	Time-series, high write throughput
Query language	SQL	MQL / aggregation pipeline	GET/SET commands	CQL
Transactions	Full ACID	Multi-document (since 4.0)	Single-key atomic	Lightweight transactions

Normalization

Normalization is the process of organizing data to reduce redundancy and prevent data anomalies. Each normal form builds on the previous one.

First Normal Form (1NF)

Rule: Every column must contain atomic (indivisible) values. No repeating groups or arrays.

VIOLATION of 1NF:
┌────┬──────────┬───────────────────────────┐
│ id │ name     │ phone_numbers             │
├────┼──────────┼───────────────────────────┤
│ 1  │ Alice    │ 555-0101, 555-0102        │
│ 2  │ Bob      │ 555-0201                  │
└────┴──────────┴───────────────────────────┘

FIXED (1NF compliant):
┌────┬──────────┬──────────────┐
│ id │ name     │ phone_number │
├────┼──────────┼──────────────┤
│ 1  │ Alice    │ 555-0101     │
│ 1  │ Alice    │ 555-0102     │
│ 2  │ Bob      │ 555-0201     │
└────┴──────────┴──────────────┘

BETTER (separate table):
users:                    phones:
┌────┬──────────┐         ┌─────────┬──────────────┐
│ id │ name     │         │ user_id │ phone_number │
├────┼──────────┤         ├─────────┼──────────────┤
│ 1  │ Alice    │         │ 1       │ 555-0101     │
│ 2  │ Bob      │         │ 1       │ 555-0102     │
└────┴──────────┘         │ 2       │ 555-0201     │
                          └─────────┴──────────────┘

Second Normal Form (2NF)

Rule: Must be in 1NF, and every non-key column must depend on the entire primary key (no partial dependencies). This only applies to tables with composite primary keys.

VIOLATION of 2NF:
Table: order_items (composite PK: order_id + product_id)
┌──────────┬────────────┬──────────────┬───────────┬───────────┐
│ order_id │ product_id │ product_name │ quantity  │ unit_price│
├──────────┼────────────┼──────────────┼───────────┼───────────┤
│ 1        │ 101        │ Mouse        │ 2         │ 29.99     │
│ 1        │ 102        │ Keyboard     │ 1         │ 79.99     │
│ 2        │ 101        │ Mouse        │ 3         │ 29.99     │
└──────────┴────────────┴──────────────┴───────────┴───────────┘

Problem: product_name depends only on product_id, not on
(order_id + product_id). If "Mouse" is renamed, you must
update every row containing that product.

FIXED (2NF compliant):
order_items:                      products:
┌──────────┬────────────┬────────┐ ┌────────────┬──────────────┬───────┐
│ order_id │ product_id │quantity│ │ product_id │ product_name │ price │
├──────────┼────────────┼────────┤ ├────────────┼──────────────┼───────┤
│ 1        │ 101        │ 2     │ │ 101        │ Mouse        │ 29.99 │
│ 1        │ 102        │ 1     │ │ 102        │ Keyboard     │ 79.99 │
│ 2        │ 101        │ 3     │ └────────────┴──────────────┴───────┘
└──────────┴────────────┴────────┘

Third Normal Form (3NF)

Rule: Must be in 2NF, and no non-key column depends on another non-key column (no transitive dependencies).

VIOLATION of 3NF:
┌────┬──────────┬─────────────┬──────────────────┐
│ id │ name     │ department  │ department_head   │
├────┼──────────┼─────────────┼──────────────────┤
│ 1  │ Alice    │ Engineering │ Carol             │
│ 2  │ Bob      │ Engineering │ Carol             │
│ 3  │ Eve      │ Marketing   │ Dave              │
└────┴──────────┴─────────────┴──────────────────┘

Problem: department_head depends on department, not on the PK.
If Carol leaves, you update multiple rows.

FIXED (3NF compliant):
employees:                        departments:
┌────┬──────────┬────────────┐    ┌────────────┬──────────────────┐
│ id │ name     │ dept_id    │    │ dept_id    │ department_head   │
├────┼──────────┼────────────┤    ├────────────┼──────────────────┤
│ 1  │ Alice    │ 1          │    │ 1          │ Carol             │
│ 2  │ Bob      │ 1          │    │ 2          │ Dave              │
│ 3  │ Eve      │ 2          │    └────────────┴──────────────────┘
└────┴──────────┴────────────┘

Boyce-Codd Normal Form (BCNF)

Rule: Must be in 3NF, and every determinant must be a candidate key. BCNF handles edge cases that 3NF misses when there are multiple overlapping candidate keys.

VIOLATION of BCNF:
Table: student_courses
A student can have one advisor per subject.
An advisor teaches only one subject.

┌─────────┬─────────┬──────────┐
│ student │ subject │ advisor  │
├─────────┼─────────┼──────────┤
│ Alice   │ Math    │ Prof. X  │
│ Alice   │ Physics │ Prof. Y  │
│ Bob     │ Math    │ Prof. X  │
│ Bob     │ Physics │ Prof. Z  │
└─────────┴─────────┴──────────┘

Candidate key: (student, subject)
But advisor -> subject (advisor determines subject)
advisor is a determinant but NOT a candidate key = BCNF violation

FIXED (BCNF compliant):
advisor_subjects:              student_advisors:
┌──────────┬─────────┐        ┌─────────┬──────────┐
│ advisor  │ subject │        │ student │ advisor  │
├──────────┼─────────┤        ├─────────┼──────────┤
│ Prof. X  │ Math    │        │ Alice   │ Prof. X  │
│ Prof. Y  │ Physics │        │ Alice   │ Prof. Y  │
│ Prof. Z  │ Physics │        │ Bob     │ Prof. X  │
└──────────┴─────────┘        │ Bob     │ Prof. Z  │
                              └─────────┴──────────┘

Primary Keys, Foreign Keys, and Indexes

Primary Keys

Every table needs a primary key — a column (or combination of columns) that uniquely identifies each row.

-- Auto-incrementing integer (traditional)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL
);

-- UUID (better for distributed systems)
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL
);

-- Composite primary key
CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

Integer vs UUID trade-offs:

Aspect	Auto-increment Integer	UUID
Size	4-8 bytes	16 bytes
Index performance	Better (sequential)	Worse (random)
Distributed generation	Requires coordination	No coordination needed
Predictability	Predictable (security concern)	Unpredictable
URL readability	`/users/42`	`/users/550e8400-e29b...`

Foreign Keys

Foreign keys enforce referential integrity — you cannot reference a row that does not exist.

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,

    -- Foreign key with cascade behavior
    CONSTRAINT fk_user
        FOREIGN KEY (user_id)
        REFERENCES users(id)
        ON DELETE CASCADE      -- delete orders when user is deleted
        ON UPDATE CASCADE      -- update user_id if user.id changes
);

-- Alternative ON DELETE behaviors:
-- ON DELETE RESTRICT  — prevent deleting user if orders exist
-- ON DELETE SET NULL  — set user_id to NULL when user deleted
-- ON DELETE SET DEFAULT — set user_id to default value

Indexes

Indexes speed up reads at the cost of slower writes and additional storage. Think of them as a book's index — instead of scanning every page, you look up the topic and jump to the right page.

-- B-tree index (default, works for =, <, >, BETWEEN, ORDER BY)
CREATE INDEX idx_users_email ON users(email);

-- Composite index (column order matters)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Supports: WHERE user_id = 1
-- Supports: WHERE user_id = 1 AND status = 'pending'
-- Does NOT support: WHERE status = 'pending' (without user_id)

-- Partial index (only indexes rows matching a condition)
CREATE INDEX idx_active_orders ON orders(created_at)
    WHERE status = 'active';

-- GIN index (for full-text search, arrays, JSONB)
CREATE INDEX idx_products_tags ON products USING GIN(tags);

-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));

How to decide what to index:

-- Step 1: Identify slow queries
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 789 AND status = 'pending'
ORDER BY created_at DESC;

-- If you see "Seq Scan" with high cost, add an index

-- Step 2: Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';

-- Step 3: Create the index and verify
CREATE INDEX idx_orders_user_status_created
    ON orders(user_id, status, created_at DESC);

-- Step 4: Verify improvement
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 789 AND status = 'pending'
ORDER BY created_at DESC;
-- Should now show "Index Scan" with lower cost

Schema Design Principles

Design for Your Queries

The most common mistake is designing a schema based on the data model without considering how the data will be queried.

-- Bad: Normalized but requires 4 JOINs for the most common query
-- "Show user's recent orders with items and product details"
SELECT u.name, o.id, o.total, oi.quantity, p.name
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 789
ORDER BY o.created_at DESC
LIMIT 10;

-- Consider: if this query runs 10,000 times per second,
-- the 4-table JOIN is expensive. Denormalization may be warranted.

Use Appropriate Data Types

-- Use the most specific type possible
CREATE TABLE events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- Timestamps: always use timestamptz (timezone-aware)
    created_at TIMESTAMPTZ DEFAULT NOW(),

    -- Money: use integers (cents) or DECIMAL, never FLOAT
    amount_cents INTEGER NOT NULL,           -- store as cents
    -- OR
    amount DECIMAL(10, 2) NOT NULL,          -- fixed precision

    -- Status: use enums or check constraints
    status VARCHAR(20) NOT NULL CHECK (
        status IN ('pending', 'active', 'completed', 'cancelled')
    ),

    -- IP addresses: use inet type (PostgreSQL)
    ip_address INET,

    -- JSON data: use JSONB (binary, indexable) not JSON
    metadata JSONB DEFAULT '{}'
);

Design for Evolving Requirements

-- Flexible metadata without schema changes
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    category VARCHAR(100) NOT NULL,

    -- JSONB column for variable attributes
    attributes JSONB DEFAULT '{}',
    -- Electronics: {"brand": "Sony", "warranty_months": 24}
    -- Clothing:    {"size": "M", "color": "blue", "material": "cotton"}
    -- Books:       {"isbn": "978-0-13-468599-1", "pages": 464}

    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- You can still index JSONB fields
CREATE INDEX idx_products_brand ON products
    USING GIN ((attributes->'brand'));

-- And query them efficiently
SELECT * FROM products
WHERE attributes->>'brand' = 'Sony'
AND (attributes->>'warranty_months')::int > 12;

Denormalization: When and Why

Denormalization intentionally introduces redundancy to optimize read performance. It is not a sign of bad design — it is a trade-off you make consciously.

When to Denormalize

Read-heavy workloads. If your read-to-write ratio is 100:1, optimize for reads.
Expensive JOINs. When a frequently-run query requires joining 4+ tables.
Reporting/analytics. Precompute aggregates instead of calculating on every request.
Caching at the database level. Store computed values to avoid recalculation.

Common Denormalization Techniques

Precomputed aggregates:

-- Instead of counting every time:
-- SELECT COUNT(*) FROM order_items WHERE order_id = 123;

-- Store the count on the parent
ALTER TABLE orders ADD COLUMN item_count INTEGER DEFAULT 0;

-- Update via trigger
CREATE OR REPLACE FUNCTION update_order_item_count()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE orders SET item_count = item_count + 1
        WHERE id = NEW.order_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE orders SET item_count = item_count - 1
        WHERE id = OLD.order_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_update_item_count
AFTER INSERT OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_item_count();

Duplicated columns (copy for read performance):

-- Instead of JOINing to get product name on every order view:
ALTER TABLE order_items ADD COLUMN product_name VARCHAR(255);
ALTER TABLE order_items ADD COLUMN product_image_url TEXT;

-- Now the order details query needs no JOIN to products table
SELECT oi.product_name, oi.quantity, oi.unit_price
FROM order_items oi
WHERE oi.order_id = 123;

-- Trade-off: if product name changes, you need to update
-- all order_items referencing it (or accept historical name)

Materialized views (PostgreSQL):

-- Precomputed dashboard statistics
CREATE MATERIALIZED VIEW daily_sales AS
SELECT
    DATE(created_at) AS sale_date,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    AVG(total) AS avg_order_value,
    COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at)
ORDER BY sale_date DESC;

-- Refresh periodically (not on every write)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;

-- Query the materialized view (instant results)
SELECT * FROM daily_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days';

Denormalization Risks

Risk	Mitigation
Data inconsistency	Use database triggers or application-level event handlers
Write amplification	Only denormalize read-heavy data
Storage bloat	Monitor table sizes, archive old data
Maintenance burden	Document all denormalized fields and their source of truth

Putting It All Together: Schema Design Checklist

[ ] Identify all entities and their relationships
[ ] Choose primary key strategy (integer vs UUID)
[ ] Normalize to 3NF first
[ ] Define foreign keys with appropriate ON DELETE behavior
[ ] Add NOT NULL constraints where data is required
[ ] Add CHECK constraints for valid values
[ ] Use appropriate data types (timestamptz, decimal, jsonb)
[ ] Identify the top 10 most frequent queries
[ ] Add indexes for those queries
[ ] Consider denormalization for read-heavy hot paths
[ ] Add created_at and updated_at to all tables
[ ] Plan for soft deletes if needed (deleted_at column)
[ ] Test with realistic data volumes (not just 10 rows)
[ ] Run EXPLAIN ANALYZE on critical queries

Key Takeaways

Choose relational for complex relationships and strong consistency. Choose NoSQL when your access patterns are simple and you need horizontal scaling.
Normalize first, denormalize intentionally. Never skip normalization because you think it is too complex.
Indexes are the single most impactful performance tool. Use EXPLAIN ANALYZE to identify where they are needed.
Design for your queries, not just your data model. The schema should make your most common operations fast.
Use the most specific data types possible. Never store money as floats. Always use timezone-aware timestamps.
Denormalization is a valid trade-off, not a code smell. Document it, maintain it, and monitor it.