Why Database Design Matters
A poorly designed database is technical debt that compounds daily. Queries get slower. Data anomalies creep in. Migrations become terrifying. Developers work around the schema instead of with it.
Good database design is not about following rules blindly. It is about understanding the trade-offs between data integrity, query performance, and development velocity, then making intentional choices for your specific workload.
This article covers relational and NoSQL databases, normalization theory with practical examples, indexing fundamentals, and when breaking the rules (denormalization) is the right call.
Relational vs NoSQL Databases
The first decision is choosing your database paradigm. This is not relational vs NoSQL β it is about matching your data model to your access patterns.
Relational Databases (SQL)
Relational databases store data in tables with rows and columns. Relationships between tables are defined through foreign keys. Data integrity is enforced at the database level.
-- Relational schema for an e-commerce platform
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
status VARCHAR(20) DEFAULT 'pending',
total DECIMAL(10, 2) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10, 2) NOT NULL
);
Strengths: ACID transactions, enforced relationships, complex queries with JOINs, mature tooling, strong consistency.
Best for: Financial data, inventory systems, anything requiring strict consistency and complex queries across related entities.
NoSQL Databases
NoSQL databases come in several flavors, each optimized for different access patterns.
Document stores (MongoDB, CouchDB):
// Document model β same e-commerce data
{
"_id": "order_12345",
"userId": "user_789",
"status": "pending",
"total": 149.99,
"items": [
{
"productId": "prod_001",
"name": "Wireless Mouse",
"quantity": 2,
"unitPrice": 29.99
},
{
"productId": "prod_002",
"name": "USB-C Hub",
"quantity": 1,
"unitPrice": 89.99
}
],
"shippingAddress": {
"street": "123 Main St",
"city": "Portland",
"state": "OR",
"zip": "97201"
},
"createdAt": "2026-04-25T10:30:00Z"
}
Key-value stores (Redis, DynamoDB):
Key: "session:abc123"
Value: {"userId": "789", "cart": [...], "expires": 1714060800}
Key: "user:789:profile"
Value: {"name": "Alice", "email": "[email protected]", "tier": "premium"}
Wide-column stores (Cassandra, HBase):
Row Key: "user_789"
Column Family: "orders"
βββ "order_001": {"total": 149.99, "status": "completed"}
βββ "order_002": {"total": 79.50, "status": "pending"}
βββ "order_003": {"total": 220.00, "status": "shipped"}
Column Family: "profile"
βββ "name": "Alice"
βββ "email": "[email protected]"
βββ "tier": "premium"
Graph databases (Neo4j, Amazon Neptune):
// Graph model β social connections
CREATE (alice:User {name: "Alice"})
CREATE (bob:User {name: "Bob"})
CREATE (post:Post {content: "Hello world"})
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (alice)-[:AUTHORED]->(post)
CREATE (bob)-[:LIKED]->(post)
// Query: Find friends-of-friends who liked Alice's posts
MATCH (alice:User {name: "Alice"})-[:FOLLOWS]->()-[:FOLLOWS]->(fof)
WHERE fof <> alice
MATCH (fof)-[:LIKED]->(post)<-[:AUTHORED]-(alice)
RETURN fof.name, post.content
Comparison Table
| Aspect | Relational (PostgreSQL) | Document (MongoDB) | Key-Value (Redis) | Wide-Column (Cassandra) |
|---|---|---|---|---|
| Schema | Strict, predefined | Flexible, per-document | Schema-less | Column families |
| Consistency | Strong (ACID) | Configurable | Eventual (single-node strong) | Tunable |
| Scaling | Vertical primarily | Horizontal (sharding) | Horizontal | Horizontal |
| Joins | Native, efficient | Manual (application-level) | None | Limited |
| Best for | Complex relations, transactions | Varied document structures | Caching, sessions | Time-series, high write throughput |
| Query language | SQL | MQL / aggregation pipeline | GET/SET commands | CQL |
| Transactions | Full ACID | Multi-document (since 4.0) | Single-key atomic | Lightweight transactions |
Normalization
Normalization is the process of organizing data to reduce redundancy and prevent data anomalies. Each normal form builds on the previous one.
First Normal Form (1NF)
Rule: Every column must contain atomic (indivisible) values. No repeating groups or arrays.
VIOLATION of 1NF:
ββββββ¬βββββββββββ¬ββββββββββββββββββββββββββββ
β id β name β phone_numbers β
ββββββΌβββββββββββΌββββββββββββββββββββββββββββ€
β 1 β Alice β 555-0101, 555-0102 β
β 2 β Bob β 555-0201 β
ββββββ΄βββββββββββ΄ββββββββββββββββββββββββββββ
FIXED (1NF compliant):
ββββββ¬βββββββββββ¬βββββββββββββββ
β id β name β phone_number β
ββββββΌβββββββββββΌβββββββββββββββ€
β 1 β Alice β 555-0101 β
β 1 β Alice β 555-0102 β
β 2 β Bob β 555-0201 β
ββββββ΄βββββββββββ΄βββββββββββββββ
BETTER (separate table):
users: phones:
ββββββ¬βββββββββββ βββββββββββ¬βββββββββββββββ
β id β name β β user_id β phone_number β
ββββββΌβββββββββββ€ βββββββββββΌβββββββββββββββ€
β 1 β Alice β β 1 β 555-0101 β
β 2 β Bob β β 1 β 555-0102 β
ββββββ΄βββββββββββ β 2 β 555-0201 β
βββββββββββ΄βββββββββββββββ
Second Normal Form (2NF)
Rule: Must be in 1NF, and every non-key column must depend on the entire primary key (no partial dependencies). This only applies to tables with composite primary keys.
VIOLATION of 2NF:
Table: order_items (composite PK: order_id + product_id)
ββββββββββββ¬βββββββββββββ¬βββββββββββββββ¬ββββββββββββ¬ββββββββββββ
β order_id β product_id β product_name β quantity β unit_priceβ
ββββββββββββΌβββββββββββββΌβββββββββββββββΌββββββββββββΌββββββββββββ€
β 1 β 101 β Mouse β 2 β 29.99 β
β 1 β 102 β Keyboard β 1 β 79.99 β
β 2 β 101 β Mouse β 3 β 29.99 β
ββββββββββββ΄βββββββββββββ΄βββββββββββββββ΄ββββββββββββ΄ββββββββββββ
Problem: product_name depends only on product_id, not on
(order_id + product_id). If "Mouse" is renamed, you must
update every row containing that product.
FIXED (2NF compliant):
order_items: products:
ββββββββββββ¬βββββββββββββ¬βββββββββ ββββββββββββββ¬βββββββββββββββ¬ββββββββ
β order_id β product_id βquantityβ β product_id β product_name β price β
ββββββββββββΌβββββββββββββΌβββββββββ€ ββββββββββββββΌβββββββββββββββΌββββββββ€
β 1 β 101 β 2 β β 101 β Mouse β 29.99 β
β 1 β 102 β 1 β β 102 β Keyboard β 79.99 β
β 2 β 101 β 3 β ββββββββββββββ΄βββββββββββββββ΄ββββββββ
ββββββββββββ΄βββββββββββββ΄βββββββββ
Third Normal Form (3NF)
Rule: Must be in 2NF, and no non-key column depends on another non-key column (no transitive dependencies).
VIOLATION of 3NF:
ββββββ¬βββββββββββ¬ββββββββββββββ¬βββββββββββββββββββ
β id β name β department β department_head β
ββββββΌβββββββββββΌββββββββββββββΌβββββββββββββββββββ€
β 1 β Alice β Engineering β Carol β
β 2 β Bob β Engineering β Carol β
β 3 β Eve β Marketing β Dave β
ββββββ΄βββββββββββ΄ββββββββββββββ΄βββββββββββββββββββ
Problem: department_head depends on department, not on the PK.
If Carol leaves, you update multiple rows.
FIXED (3NF compliant):
employees: departments:
ββββββ¬βββββββββββ¬βββββββββββββ ββββββββββββββ¬βββββββββββββββββββ
β id β name β dept_id β β dept_id β department_head β
ββββββΌβββββββββββΌβββββββββββββ€ ββββββββββββββΌβββββββββββββββββββ€
β 1 β Alice β 1 β β 1 β Carol β
β 2 β Bob β 1 β β 2 β Dave β
β 3 β Eve β 2 β ββββββββββββββ΄βββββββββββββββββββ
ββββββ΄βββββββββββ΄βββββββββββββ
Boyce-Codd Normal Form (BCNF)
Rule: Must be in 3NF, and every determinant must be a candidate key. BCNF handles edge cases that 3NF misses when there are multiple overlapping candidate keys.
VIOLATION of BCNF:
Table: student_courses
A student can have one advisor per subject.
An advisor teaches only one subject.
βββββββββββ¬ββββββββββ¬βββββββββββ
β student β subject β advisor β
βββββββββββΌββββββββββΌβββββββββββ€
β Alice β Math β Prof. X β
β Alice β Physics β Prof. Y β
β Bob β Math β Prof. X β
β Bob β Physics β Prof. Z β
βββββββββββ΄ββββββββββ΄βββββββββββ
Candidate key: (student, subject)
But advisor -> subject (advisor determines subject)
advisor is a determinant but NOT a candidate key = BCNF violation
FIXED (BCNF compliant):
advisor_subjects: student_advisors:
ββββββββββββ¬ββββββββββ βββββββββββ¬βββββββββββ
β advisor β subject β β student β advisor β
ββββββββββββΌββββββββββ€ βββββββββββΌβββββββββββ€
β Prof. X β Math β β Alice β Prof. X β
β Prof. Y β Physics β β Alice β Prof. Y β
β Prof. Z β Physics β β Bob β Prof. X β
ββββββββββββ΄ββββββββββ β Bob β Prof. Z β
βββββββββββ΄βββββββββββ
Primary Keys, Foreign Keys, and Indexes
Primary Keys
Every table needs a primary key β a column (or combination of columns) that uniquely identifies each row.
-- Auto-incrementing integer (traditional)
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL
);
-- UUID (better for distributed systems)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL
);
-- Composite primary key
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Integer vs UUID trade-offs:
| Aspect | Auto-increment Integer | UUID |
|---|---|---|
| Size | 4-8 bytes | 16 bytes |
| Index performance | Better (sequential) | Worse (random) |
| Distributed generation | Requires coordination | No coordination needed |
| Predictability | Predictable (security concern) | Unpredictable |
| URL readability | /users/42 | /users/550e8400-e29b... |
Foreign Keys
Foreign keys enforce referential integrity β you cannot reference a row that does not exist.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL,
-- Foreign key with cascade behavior
CONSTRAINT fk_user
FOREIGN KEY (user_id)
REFERENCES users(id)
ON DELETE CASCADE -- delete orders when user is deleted
ON UPDATE CASCADE -- update user_id if user.id changes
);
-- Alternative ON DELETE behaviors:
-- ON DELETE RESTRICT β prevent deleting user if orders exist
-- ON DELETE SET NULL β set user_id to NULL when user deleted
-- ON DELETE SET DEFAULT β set user_id to default value
Indexes
Indexes speed up reads at the cost of slower writes and additional storage. Think of them as a book's index β instead of scanning every page, you look up the topic and jump to the right page.
-- B-tree index (default, works for =, <, >, BETWEEN, ORDER BY)
CREATE INDEX idx_users_email ON users(email);
-- Composite index (column order matters)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Supports: WHERE user_id = 1
-- Supports: WHERE user_id = 1 AND status = 'pending'
-- Does NOT support: WHERE status = 'pending' (without user_id)
-- Partial index (only indexes rows matching a condition)
CREATE INDEX idx_active_orders ON orders(created_at)
WHERE status = 'active';
-- GIN index (for full-text search, arrays, JSONB)
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
How to decide what to index:
-- Step 1: Identify slow queries
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 789 AND status = 'pending'
ORDER BY created_at DESC;
-- If you see "Seq Scan" with high cost, add an index
-- Step 2: Check existing indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';
-- Step 3: Create the index and verify
CREATE INDEX idx_orders_user_status_created
ON orders(user_id, status, created_at DESC);
-- Step 4: Verify improvement
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 789 AND status = 'pending'
ORDER BY created_at DESC;
-- Should now show "Index Scan" with lower cost
Schema Design Principles
Design for Your Queries
The most common mistake is designing a schema based on the data model without considering how the data will be queried.
-- Bad: Normalized but requires 4 JOINs for the most common query
-- "Show user's recent orders with items and product details"
SELECT u.name, o.id, o.total, oi.quantity, p.name
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 789
ORDER BY o.created_at DESC
LIMIT 10;
-- Consider: if this query runs 10,000 times per second,
-- the 4-table JOIN is expensive. Denormalization may be warranted.
Use Appropriate Data Types
-- Use the most specific type possible
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Timestamps: always use timestamptz (timezone-aware)
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Money: use integers (cents) or DECIMAL, never FLOAT
amount_cents INTEGER NOT NULL, -- store as cents
-- OR
amount DECIMAL(10, 2) NOT NULL, -- fixed precision
-- Status: use enums or check constraints
status VARCHAR(20) NOT NULL CHECK (
status IN ('pending', 'active', 'completed', 'cancelled')
),
-- IP addresses: use inet type (PostgreSQL)
ip_address INET,
-- JSON data: use JSONB (binary, indexable) not JSON
metadata JSONB DEFAULT '{}'
);
Design for Evolving Requirements
-- Flexible metadata without schema changes
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
category VARCHAR(100) NOT NULL,
-- JSONB column for variable attributes
attributes JSONB DEFAULT '{}',
-- Electronics: {"brand": "Sony", "warranty_months": 24}
-- Clothing: {"size": "M", "color": "blue", "material": "cotton"}
-- Books: {"isbn": "978-0-13-468599-1", "pages": 464}
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- You can still index JSONB fields
CREATE INDEX idx_products_brand ON products
USING GIN ((attributes->'brand'));
-- And query them efficiently
SELECT * FROM products
WHERE attributes->>'brand' = 'Sony'
AND (attributes->>'warranty_months')::int > 12;
Denormalization: When and Why
Denormalization intentionally introduces redundancy to optimize read performance. It is not a sign of bad design β it is a trade-off you make consciously.
When to Denormalize
- Read-heavy workloads. If your read-to-write ratio is 100:1, optimize for reads.
- Expensive JOINs. When a frequently-run query requires joining 4+ tables.
- Reporting/analytics. Precompute aggregates instead of calculating on every request.
- Caching at the database level. Store computed values to avoid recalculation.
Common Denormalization Techniques
Precomputed aggregates:
-- Instead of counting every time:
-- SELECT COUNT(*) FROM order_items WHERE order_id = 123;
-- Store the count on the parent
ALTER TABLE orders ADD COLUMN item_count INTEGER DEFAULT 0;
-- Update via trigger
CREATE OR REPLACE FUNCTION update_order_item_count()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
UPDATE orders SET item_count = item_count + 1
WHERE id = NEW.order_id;
ELSIF TG_OP = 'DELETE' THEN
UPDATE orders SET item_count = item_count - 1
WHERE id = OLD.order_id;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_update_item_count
AFTER INSERT OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_item_count();
Duplicated columns (copy for read performance):
-- Instead of JOINing to get product name on every order view:
ALTER TABLE order_items ADD COLUMN product_name VARCHAR(255);
ALTER TABLE order_items ADD COLUMN product_image_url TEXT;
-- Now the order details query needs no JOIN to products table
SELECT oi.product_name, oi.quantity, oi.unit_price
FROM order_items oi
WHERE oi.order_id = 123;
-- Trade-off: if product name changes, you need to update
-- all order_items referencing it (or accept historical name)
Materialized views (PostgreSQL):
-- Precomputed dashboard statistics
CREATE MATERIALIZED VIEW daily_sales AS
SELECT
DATE(created_at) AS sale_date,
COUNT(*) AS order_count,
SUM(total) AS revenue,
AVG(total) AS avg_order_value,
COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at)
ORDER BY sale_date DESC;
-- Refresh periodically (not on every write)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
-- Query the materialized view (instant results)
SELECT * FROM daily_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days';
Denormalization Risks
| Risk | Mitigation |
|---|---|
| Data inconsistency | Use database triggers or application-level event handlers |
| Write amplification | Only denormalize read-heavy data |
| Storage bloat | Monitor table sizes, archive old data |
| Maintenance burden | Document all denormalized fields and their source of truth |
Putting It All Together: Schema Design Checklist
[ ] Identify all entities and their relationships
[ ] Choose primary key strategy (integer vs UUID)
[ ] Normalize to 3NF first
[ ] Define foreign keys with appropriate ON DELETE behavior
[ ] Add NOT NULL constraints where data is required
[ ] Add CHECK constraints for valid values
[ ] Use appropriate data types (timestamptz, decimal, jsonb)
[ ] Identify the top 10 most frequent queries
[ ] Add indexes for those queries
[ ] Consider denormalization for read-heavy hot paths
[ ] Add created_at and updated_at to all tables
[ ] Plan for soft deletes if needed (deleted_at column)
[ ] Test with realistic data volumes (not just 10 rows)
[ ] Run EXPLAIN ANALYZE on critical queries
Key Takeaways
- Choose relational for complex relationships and strong consistency. Choose NoSQL when your access patterns are simple and you need horizontal scaling.
- Normalize first, denormalize intentionally. Never skip normalization because you think it is too complex.
- Indexes are the single most impactful performance tool. Use EXPLAIN ANALYZE to identify where they are needed.
- Design for your queries, not just your data model. The schema should make your most common operations fast.
- Use the most specific data types possible. Never store money as floats. Always use timezone-aware timestamps.
- Denormalization is a valid trade-off, not a code smell. Document it, maintain it, and monitor it.