Data Warehousing: Concepts, Architecture, and Best Practices
A data warehouse is a centralized repository designed specifically for analytical querying and reporting. Unlike operational databases (OLTP) optimized for fast transactions, data warehouses (OLAP) are optimized for complex queries that aggregate and analyze large volumes of historical data. They are the foundation of business intelligence, enabling organizations to make data-driven decisions by analyzing trends, patterns, and performance metrics.
Data warehouses consolidate data from multiple sources (transactional databases, CRM systems, log files, external APIs) into a single, consistent, historical record. They are designed for read-heavy workloads, supporting queries that scan millions or billions of rows. To understand data warehousing properly, it is helpful to be familiar with SQL basics, database normalization, and ETL pipelines.
Operational Database (OLTP) vs Data Warehouse (OLAP)

| OLTP | OLAP |
|---|---|
| Many small transactions | Large, complex queries |
| Current data only | Historical data |
| Row-based storage | Column-based storage |
| Normalized (3NF) | Denormalized (Star/Snowflake) |
| Optimized for writes | Optimized for reads |
| Millisecond response time | Second-to-minute response time |
What Is a Data Warehouse?
A data warehouse is a system that aggregates data from multiple heterogeneous sources into a single, central, consistent data store designed for analytical querying and reporting. It separates analytical workloads from transactional workloads, preventing analytical queries from impacting operational performance.
- Subject-Oriented: Organized around key business subjects (customers, products, sales).
- Integrated: Data from multiple sources is cleaned, transformed, and standardized.
- Time-Variant: Stores historical data to track changes over time.
- Non-Volatile: Data is read-only after loading; no updates or deletes.
- Optimized for Read: Designed for complex queries, not transactions.
Why Data Warehousing Matters
Data warehousing enables organizations to perform analytics that are impossible or impractical on operational databases. It provides a single source of truth for business intelligence.
- Single Source of Truth: Consistent, integrated data from all sources.
- Historical Analysis: Track trends over months or years, not just current state.
- No Impact on Operations: Analytical queries run on the warehouse, not production databases.
- Better Decision Making: Data-driven insights improve business outcomes.
- Performance: Optimized for complex aggregations that would be slow on OLTP systems.
- Data Quality: ETL processes clean and standardize data before loading.
OLTP vs OLAP
| Feature | OLTP (Operational) | OLAP (Analytical) |
|---|---|---|
| Purpose | Support daily transactions | Support business intelligence |
| Queries | Simple, standardized, high frequency | Complex, ad-hoc, low frequency |
| Data | Current, detailed | Historical, summarized |
| Operations | Read, insert, update, delete | Read-only (mostly) |
| Design | Normalized (3NF) | Denormalized (Star/Snowflake) |
| Response Time | Milliseconds | Seconds to minutes |
| Examples | Banking, e-commerce, CRM | Reporting, dashboards, analytics |
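The row-based vs column-based storage difference above can be illustrated with plain Python data structures. This is an illustrative sketch, not a real storage engine: the same four sales records are laid out row-wise (as an OLTP engine stores them) and column-wise (as an OLAP engine stores them).

```python
# Row-based layout: one record per row; ideal for point lookups
# such as "fetch order 1003".
rows = [
    {"order_id": 1001, "product": "widget", "revenue": 19.99},
    {"order_id": 1002, "product": "gadget", "revenue": 34.50},
    {"order_id": 1003, "product": "widget", "revenue": 19.99},
    {"order_id": 1004, "product": "gizmo",  "revenue": 12.00},
]

# Column-based layout: one array per column; an aggregate like
# SUM(revenue) touches only the revenue array, never product or order_id.
columns = {
    "order_id": [1001, 1002, 1003, 1004],
    "product":  ["widget", "gadget", "widget", "gizmo"],
    "revenue":  [19.99, 34.50, 19.99, 12.00],
}

# OLTP-style point lookup favors the row layout:
order = next(r for r in rows if r["order_id"] == 1003)

# OLAP-style aggregation favors the column layout:
total_revenue = sum(columns["revenue"])
print(order["product"], round(total_revenue, 2))  # widget 86.48
```

In a real columnar engine the per-column arrays are also compressed and scanned in large blocks, which is what makes billion-row aggregations feasible.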
Data Warehouse Architecture
1. Simple (Single-Tier) Architecture
In the simplest architecture, source data is loaded directly into the warehouse. Suitable for small implementations.
Source Systems → ETL → Data Warehouse → Reporting
2. Two-Tier Architecture
Adds a staging area for data cleaning and transformation before loading into the warehouse; department-level data marts often sit between the warehouse and the reporting tools.
Source Systems → Staging Area → ETL → Data Warehouse → Data Marts → Reporting
3. Three-Tier Architecture (Enterprise Data Warehouse)
The most common architecture for large enterprises. Includes staging, enterprise warehouse, and data marts for specific business units.
┌─────────────────────────────────────────────────────────────────┐
│ Source Systems │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ ERP │ │ CRM │ │ Logs │ │ Files │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └────────────┴────────────┴────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ ETL/ELT │ │
│ │ (Staging) │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Enterprise Data │ │
│ │ Warehouse │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Data Mart │ │ Data Mart │ │ Data Mart │ │
│ │ (Sales) │ │ (Marketing)│ │ (Finance)│ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Reporting │ │
│ │ & BI Tools │ │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Data Warehouse Design Models
1. Star Schema
The star schema is the simplest and most common data warehouse design. It consists of a central fact table surrounded by dimension tables. This denormalized design is optimized for query performance.
┌─────────────────┐
│ Time_Dim │
│ time_key (PK) │
│ year, month, day│
└────────┬────────┘
│
┌──────────────┐ ┌────────┴────────┐ ┌──────────────┐
│ Product_Dim │ │ Sales_Fact │ │ Customer_Dim │
│ product_key │────│ product_key (FK)│────│ customer_key │
│ name, category│ │ time_key (FK) │ │ name, region │
│ price │ │ customer_key(FK)│ │ segment │
└──────────────┘ │ units, revenue │ └──────────────┘
└─────────────────┘
│
┌────────┴────────┐
│ Store_Dim │
│ store_key (PK) │
│ name, location │
└─────────────────┘
Fact table: Sales_Fact (measures: units sold, revenue)
Dimensions: Time, Product, Customer, Store
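The star schema above can be sketched as runnable DDL and a typical analytical query. This is a minimal illustration using Python's built-in sqlite3 (the sample rows are invented for the example); table and column names follow the diagram, with the Store dimension omitted for brevity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Dimension tables: small, descriptive, surrogate primary keys.
# Fact table: foreign keys to each dimension plus numeric measures.
con.executescript("""
CREATE TABLE Time_Dim     (time_key INTEGER PRIMARY KEY, year INT, month INT, day INT);
CREATE TABLE Product_Dim  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL);
CREATE TABLE Customer_Dim (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT, segment TEXT);
CREATE TABLE Sales_Fact (
    product_key  INTEGER REFERENCES Product_Dim,
    time_key     INTEGER REFERENCES Time_Dim,
    customer_key INTEGER REFERENCES Customer_Dim,
    units        INT,
    revenue      REAL
);
""")
con.executemany("INSERT INTO Time_Dim VALUES (?,?,?,?)",
                [(1, 2024, 1, 15), (2, 2024, 2, 3)])
con.executemany("INSERT INTO Product_Dim VALUES (?,?,?,?)",
                [(1, "Widget", "Hardware", 9.99), (2, "Gizmo", "Hardware", 4.50)])
con.executemany("INSERT INTO Customer_Dim VALUES (?,?,?,?)",
                [(1, "Acme", "EMEA", "Enterprise")])
con.executemany("INSERT INTO Sales_Fact VALUES (?,?,?,?,?)",
                [(1, 1, 1, 10, 99.90), (2, 1, 1, 4, 18.00), (1, 2, 1, 5, 49.95)])

# Typical star-schema query: join the fact table to two dimensions,
# then aggregate the measures by dimension attributes.
results = con.execute("""
    SELECT t.month, p.category, SUM(f.units) AS units, SUM(f.revenue) AS revenue
    FROM Sales_Fact f
    JOIN Time_Dim t    ON t.time_key = f.time_key
    JOIN Product_Dim p ON p.product_key = f.product_key
    GROUP BY t.month, p.category
    ORDER BY t.month
""").fetchall()
print(results)
```

Note that every join is a single hop from the fact table to a dimension, which is why star-schema queries stay fast even as the fact table grows.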
2. Snowflake Schema
A snowflake schema is a normalized version of the star schema where dimension tables are further normalized into sub-dimensions. It saves storage but requires more joins, reducing query performance.
Product_Dim Category_Dim
┌──────────────┐ ┌──────────────┐
│ product_key │──────────────▶│ category_key │
│ name │ │ name │
│ category_key │ └──────────────┘
└──────────────┘
(Product dimension normalized into Product and Category tables)
3. Fact Constellation (Galaxy Schema)
Multiple fact tables sharing dimension tables. Used for complex data warehouses with multiple business processes.
┌─────────────┐ ┌─────────────┐
│ Sales_Fact │ │ Returns_Fact│
└──────┬──────┘ └──────┬──────┘
│ │
└─────────┬─────────┘
│
┌──────┴──────┐
│ Time_Dim │
│ Product_Dim │
│ Customer_Dim│
└─────────────┘
Fact Tables and Dimension Tables
| Feature | Fact Table | Dimension Table |
|---|---|---|
| Purpose | Stores measurable business events | Stores descriptive attributes |
| Content | Numerical measures (sales, quantity) | Textual descriptions (name, category) |
| Size | Large (millions to billions of rows) | Small (hundreds to thousands of rows) |
| Changes | Append-only (new events) | Slowly changing (updates over time) |
| Keys | Composite foreign keys | Surrogate primary key |
ETL vs ELT
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two approaches to moving data into a data warehouse.
| Aspect | ETL | ELT |
|---|---|---|
| Process | Extract → Transform → Load | Extract → Load → Transform |
| Transformation Location | In staging area (before loading) | In data warehouse (after loading) |
| Performance | Slower for large volumes | Faster for large volumes |
| Processing Power | Requires separate ETL servers | Uses data warehouse compute power |
| Best For | Traditional warehouses, complex transformations | Cloud warehouses (Snowflake, BigQuery, Redshift) |
ETL (Traditional):
Source → Extract → Transform (staging) → Load → Warehouse
ELT (Modern Cloud):
Source → Extract → Load → Warehouse → Transform (SQL)
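The ETL flow above can be sketched end to end in a few functions. This is a hypothetical, minimal pipeline (the source records and field names are invented): extract raw records, transform and clean them in Python, then load into a warehouse table. In the ELT variant, the same cleanup would instead run as SQL inside the warehouse after a raw load.

```python
import sqlite3

def extract():
    # Stand-in for reading from a source system (API, CSV, OLTP database).
    return [
        {"customer": " acme ", "amount": "19.99", "region": "emea"},
        {"customer": "Globex", "amount": "34.50", "region": "AMER"},
        {"customer": "",       "amount": "bad",   "region": "emea"},  # dirty row
    ]

def transform(records):
    # Clean and standardize; drop rows that fail basic quality checks.
    clean = []
    for r in records:
        name = r["customer"].strip().title()
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # reject unparseable amounts
        if not name:
            continue  # reject missing customer names
        clean.append((name, amount, r["region"].upper()))
    return clean

def load(rows, con):
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, region TEXT)")
    con.executemany("INSERT INTO sales VALUES (?,?,?)", rows)

con = sqlite3.connect(":memory:")
load(transform(extract()), con)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())
```

The dirty third record never reaches the warehouse, which is the practical meaning of "transform before load" in ETL.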
Slowly Changing Dimensions (SCD)
Dimension data changes over time. Slowly Changing Dimensions strategies handle how these changes are tracked in the data warehouse.
| Type | Description | When to Use |
|---|---|---|
| Type 0 | No changes allowed (fixed) | Historical data that must never change |
| Type 1 | Overwrite old values | When history is not important |
| Type 2 | Add new row with effective dates | When full history tracking is required |
| Type 3 | Add column for previous value | When only previous value is needed |
| Type 4 | Separate historical table | When history tracking affects performance |
Customer dimension with history:
┌──────────────┬─────────────┬──────────┬────────────┬────────────┐
│ customer_key │ customer_id │ name     │ start_date │ end_date   │
├──────────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1            │ 100         │ John Doe │ 2020-01-01 │ 2023-12-31 │
│ 2            │ 100         │ J. Doe   │ 2024-01-01 │ 9999-12-31 │
└──────────────┴─────────────┴──────────┴────────────┴────────────┘
Current record has end_date = 9999-12-31 (or NULL)
Historical records show past values
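The Type 2 update shown in the customer table can be sketched as a small routine. This is an illustrative implementation using sqlite3 (the function name and sentinel date are assumptions, not a standard API): when a tracked attribute changes, the current row is closed by setting its end_date, and a new row is inserted that becomes current.

```python
import sqlite3

OPEN_END = "9999-12-31"  # sentinel meaning "current record"

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    name         TEXT,
    start_date   TEXT,
    end_date     TEXT)""")
con.execute("INSERT INTO customer_dim VALUES (1, 100, 'John Doe', '2020-01-01', ?)",
            (OPEN_END,))

def scd2_update(con, customer_id, new_name, change_date):
    # Find the current (open-ended) row for this natural key.
    cur = con.execute(
        "SELECT customer_key, name FROM customer_dim "
        "WHERE customer_id = ? AND end_date = ?", (customer_id, OPEN_END)).fetchone()
    if cur is None or cur[1] == new_name:
        return  # no current row, or nothing changed
    # Close the current row as of the day before the change.
    con.execute("UPDATE customer_dim SET end_date = date(?, '-1 day') "
                "WHERE customer_key = ?", (change_date, cur[0]))
    # Insert the new current row; a fresh surrogate key is assigned.
    con.execute("INSERT INTO customer_dim (customer_id, name, start_date, end_date) "
                "VALUES (?, ?, ?, ?)", (customer_id, new_name, change_date, OPEN_END))

scd2_update(con, 100, "J. Doe", "2024-01-01")
for row in con.execute("SELECT * FROM customer_dim ORDER BY customer_key"):
    print(row)
# (1, 100, 'John Doe', '2020-01-01', '2023-12-31')
# (2, 100, 'J. Doe', '2024-01-01', '9999-12-31')
```

Both rows share the natural key customer_id = 100 but have distinct surrogate keys, so facts loaded before 2024 keep joining to the historical name.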
Popular Data Warehouse Solutions
| Platform | Type | Key Features |
|---|---|---|
| Snowflake | Cloud-native | Separate compute and storage, auto-scaling, zero-copy cloning |
| Amazon Redshift | Cloud | Columnar storage, massive parallel processing, integration with AWS |
| Google BigQuery | Cloud | Serverless, petabyte-scale, built-in machine learning |
| Azure Synapse | Cloud | Unified analytics, integration with Power BI, data lake |
| PostgreSQL (Data Warehouse) | Open source | Columnar extensions (Citus, TimescaleDB), cost-effective |
Common Data Warehousing Mistakes to Avoid
- Not Understanding Business Requirements: Build what business needs, not what seems technically interesting.
- Poor Data Quality: Garbage in, garbage out. Invest in data cleaning in ETL.
- Overly Complex ETL: Keep ETL pipelines simple, maintainable, and documented.
- No Data Lineage: Document where data came from and how it was transformed.
- Ignoring Performance: Design for query performance from the start. Use star schemas, indexes, partitioning.
- Not Involving Business Users: Business users must validate that the warehouse meets their needs.
- No Data Governance: Define ownership, security, and quality standards.
Data Warehousing Best Practices
- Start with Business Questions: Design based on what questions business users need to answer.
- Use Star Schema for Performance: Denormalized star schemas generally outperform normalized snowflake schemas for query speed.
- Implement Data Quality Checks: Validate data at extraction and loading stages.
- Document Everything: Data lineage, ETL logic, and business definitions must be documented.
- Involve Business Users: Validate requirements and results with business stakeholders.
- Plan for Growth: Design for scalability. Data volume grows quickly.
- Automate ETL: Manual processes are error-prone. Automate and monitor.
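The data-quality and automation points above can be sketched as simple load-time validation rules. This is a hypothetical example (the rule names and batch are invented): each check returns the rows that violate it, and the load is rejected if any rule fires.

```python
def check_not_null(rows, field):
    # Rows where a required field is missing or empty.
    return [r for r in rows if r.get(field) in (None, "")]

def check_positive(rows, field):
    # Rows where a numeric measure is missing, non-numeric, or <= 0.
    return [r for r in rows if not isinstance(r.get(field), (int, float)) or r[field] <= 0]

def validate(rows):
    failures = {
        "customer_id not null": check_not_null(rows, "customer_id"),
        "revenue positive":     check_positive(rows, "revenue"),
    }
    # Keep only the rules that actually found bad rows.
    return {rule: bad for rule, bad in failures.items() if bad}

batch = [
    {"customer_id": 100, "revenue": 19.99},
    {"customer_id": None, "revenue": 5.00},   # fails the not-null check
    {"customer_id": 101, "revenue": -3.00},   # fails the positive check
]
problems = validate(batch)
if problems:
    print("Load rejected:", {rule: len(bad) for rule, bad in problems.items()})
```

In an automated pipeline the same checks would run on every batch and page an operator instead of printing, but the shape is the same: validate first, load only what passes.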
Frequently Asked Questions
- What is the difference between a data warehouse and a database?
  A database is designed for OLTP (transactions) and stores current data. A data warehouse is designed for OLAP (analytics) and stores historical data from multiple sources.
- What is the difference between a data warehouse and a data lake?
  A data warehouse stores structured, processed data ready for analysis. A data lake stores raw data in any format (structured, semi-structured, unstructured). Data lakes are often used as staging areas for data warehouses.
- What is a data mart?
  A data mart is a subset of a data warehouse focused on a specific business department (sales, marketing, finance). It contains summarized data relevant to that department.
- What is the difference between ETL and ELT?
  ETL transforms data before loading into the warehouse. ELT loads raw data first, then transforms inside the warehouse. ELT is preferred for cloud warehouses due to their scalability.
- What is a slowly changing dimension?
  A slowly changing dimension (SCD) is a dimension whose attributes change slowly over time. SCD strategies (Type 1, 2, 3, 4) define how these changes are tracked.
- What should I learn next after data warehousing?
  After mastering data warehousing, explore ETL pipelines, data lakes, business intelligence tools, and data governance for comprehensive data management.
