
SCD Type 2: the history the business didn't know it needed

Ivan Luminaria
DWH Architect · Project Manager · Oracle DBA & Performance Tuning · PL/SQL Senior & Mentor

The sales director shows up at the Monday morning meeting with a simple question: “How many customers did we have in the North region last June?”

The DWH’s answer: silence.

Not because the system was down, or the table was missing. The data was there, technically. But it was wrong. The DWH returned the customers currently in the North region — not the ones that were there in June. Because every night, the loading process overwrote the customer master data with current values, erasing any trace of what came before.

A customer who was in the North region in June and moved to the Central region in September? As far as the DWH was concerned, that customer had always been in the Central region. History didn’t exist.


The project and the original model #

The context was a data warehouse in the insurance sector — claims management and customer portfolio. The source system held a master record for each customer: name, region, assigned agent, risk class, policy type.

The DWH dimension was modeled like this:

CREATE TABLE dim_customer (
    customer_id     NUMBER(10)    NOT NULL,
    name            VARCHAR2(100) NOT NULL,
    region          VARCHAR2(50)  NOT NULL,
    agent           VARCHAR2(100),
    risk_class      VARCHAR2(20),
    policy_type     VARCHAR2(50),
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id)
);

The nightly ETL was a simple MERGE: if the customer exists, update all fields; if not, insert.

MERGE INTO dim_customer d
USING stg_customer s ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET
    d.name        = s.name,
    d.region      = s.region,
    d.agent       = s.agent,
    d.risk_class  = s.risk_class,
    d.policy_type = s.policy_type
WHEN NOT MATCHED THEN INSERT (
    customer_id, name, region, agent, risk_class, policy_type
) VALUES (
    s.customer_id, s.name, s.region, s.agent, s.risk_class, s.policy_type
);

Simple, clean, fast. And completely wrong for a data warehouse.

This is what Kimball calls SCD Type 1 — Slowly Changing Dimension Type 1. Overwrite the old value with the new one. No history, no versioning. The current value erases the previous one.

For an OLTP system it’s perfect: you always want the current address, the updated phone number, the valid email. But a data warehouse isn’t a transactional system. A data warehouse is a time machine. And a time machine that overwrites the past is useless.


What you lose with Type 1 #

The sales director wasn’t the only one asking questions the DWH couldn’t answer. Here’s a sample of requests that piled up over three months:

  • “How many customers moved from High risk class to Low in the last year?” — Impossible. The previous risk class no longer exists.
  • “Has agent Rossi lost customers compared to last quarter?” — Impossible. If a customer was reassigned to agent Bianchi, there’s no trace they ever belonged to Rossi.
  • “Did the South region’s revenue drop or did customers just relocate?” — Impossible to tell. If a 200K customer moved from South to Central, the South region’s revenue drops — but not because business is bad. The customer simply changed address.

Every time the answer was the same: “The system doesn’t keep history.” Which in business language means: “We don’t know.”

At some point the CFO requested a quarterly analysis report comparing customer portfolio composition between Q1 and Q2. The BI team tried to build it. It took three days. The result was unreliable because Q1 data no longer existed — it had been overwritten with Q2 data. The report was comparing Q2 with Q2 dressed up as Q1.

That was the moment that triggered the restructuring project.


SCD Type 2: the principle #

Type 2 doesn’t overwrite. It versions.

When an attribute changes, the current record is closed — it gets an end validity date — and a new record is inserted with the updated values and a new start validity date. The old record stays in the database, intact, with all the values it had when it was current.

To make this work you need three additional elements in the dimension table:

  1. A surrogate key — an identifier generated by the DWH, distinct from the source system’s natural key. This is needed because the same customer will have multiple records (one per version), so the natural key is no longer unique.
  2. Validity dates, valid_from and valid_to, defining the time interval during which each version of the record was current.
  3. A current-version flag, is_current, for fast retrieval of the active version without filtering on dates.

The new dimension table #

CREATE TABLE dim_customer (
    customer_key    NUMBER(10)    NOT NULL,
    customer_id     NUMBER(10)    NOT NULL,
    name            VARCHAR2(100) NOT NULL,
    region          VARCHAR2(50)  NOT NULL,
    agent           VARCHAR2(100),
    risk_class      VARCHAR2(20),
    policy_type     VARCHAR2(50),
    valid_from      DATE          NOT NULL,
    valid_to        DATE          NOT NULL,
    is_current      CHAR(1)       DEFAULT 'Y' NOT NULL,
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key)
);

CREATE INDEX idx_dim_customer_natural ON dim_customer (customer_id, is_current);
CREATE INDEX idx_dim_customer_validity ON dim_customer (customer_id, valid_from, valid_to);

CREATE SEQUENCE seq_dim_customer START WITH 1 INCREMENT BY 1;

The customer_key is the surrogate key — generated by the sequence, never taken from the source system. The customer_id is the natural key — used to link the different versions of the same customer.

I set valid_to for current records to DATE '9999-12-31' — a standard convention that simplifies temporal queries. When you look for the record valid at a certain date, the filter WHERE reference_date BETWEEN valid_from AND valid_to works without special cases.


The ETL logic #

Type 2 ETL has two phases: first close the records that changed, then insert the new versions. The order matters — if you insert before closing, there’s a moment when two “current” versions of the same customer exist.

Phase 1: identify and close modified records #

UPDATE dim_customer d
SET    d.valid_to   = TRUNC(SYSDATE) - 1,
       d.is_current = 'N'
WHERE  d.is_current = 'Y'
AND    EXISTS (
    SELECT 1
    FROM   stg_customer s
    WHERE  s.customer_id = d.customer_id
    AND   (s.region      != d.region
        OR s.agent       != d.agent
        OR s.risk_class  != d.risk_class
        OR s.policy_type != d.policy_type
        OR s.name        != d.name)
);

The subquery compares every tracked attribute. If even one is different, the current record is closed: valid_to is set to yesterday and is_current becomes ‘N’. A plain UPDATE with EXISTS is deliberate here: you can’t do this with a MERGE in Oracle, because columns referenced in the ON clause (and is_current would have to be one of them) cannot also be updated in the WHEN MATCHED branch — Oracle raises ORA-38104.

A practical note: comparison with != doesn’t handle NULLs. If agent can be NULL, you need NULL-safe comparison functions. In Oracle I use DECODE:

WHERE DECODE(s.region, d.region, 0, 1) = 1
   OR DECODE(s.agent, d.agent, 0, 1) = 1
   OR DECODE(s.risk_class, d.risk_class, 0, 1) = 1
   -- ...

DECODE treats two NULLs as equal — exactly the behavior you need.
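The same pitfall is easy to demonstrate outside Oracle. Here’s a minimal sketch in Python with SQLite, whose NULL-safe operator IS NOT plays the role DECODE plays above (table names are invented for the demo):

```python
# Sketch (SQLite, not Oracle): why plain != misses NULL-to-value changes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim (customer_id INT, agent TEXT)")
conn.execute("CREATE TABLE stg (customer_id INT, agent TEXT)")
# In the dimension the agent is NULL; the staging row now assigns one.
conn.execute("INSERT INTO dim VALUES (1, NULL)")
conn.execute("INSERT INTO stg VALUES (1, 'Rossi')")

# Plain != : NULL != 'Rossi' evaluates to NULL, so the row is filtered out
# and the change is silently missed.
missed = conn.execute("""
    SELECT COUNT(*) FROM stg s JOIN dim d USING (customer_id)
    WHERE s.agent != d.agent
""").fetchone()[0]

# NULL-safe comparison: IS NOT treats NULL as a comparable value,
# like DECODE does in Oracle.
caught = conn.execute("""
    SELECT COUNT(*) FROM stg s JOIN dim d USING (customer_id)
    WHERE s.agent IS NOT d.agent
""").fetchone()[0]

print(missed, caught)  # 0 1
```

With !=, a customer gaining or losing an agent never triggers a new version; with the NULL-safe form it does.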

Phase 2: insert new versions #

INSERT INTO dim_customer (
    customer_key, customer_id, name, region, agent,
    risk_class, policy_type, valid_from, valid_to, is_current
)
SELECT seq_dim_customer.NEXTVAL,
       s.customer_id,
       s.name,
       s.region,
       s.agent,
       s.risk_class,
       s.policy_type,
       TRUNC(SYSDATE),
       DATE '9999-12-31',
       'Y'
FROM   stg_customer s
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dim_customer d
    WHERE  d.customer_id = s.customer_id
    AND    d.is_current = 'Y'
);

This INSERT catches two cases: entirely new customers (who don’t exist in dim_customer) and customers whose current version was just closed in Phase 1 (who therefore no longer have a record with is_current = 'Y').

The valid_from is today’s date. The valid_to is “end of time” — 9999-12-31. The customer_key is generated by the sequence.
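To see the two phases working end to end, here’s a compact sketch of the same logic in Python with SQLite standing in for Oracle: a single tracked attribute (region), SQLite’s NULL-safe IS NOT in place of DECODE, and the surrogate key generated by SQLite’s INTEGER PRIMARY KEY instead of a sequence.

```python
# Two-phase SCD Type 2 load, SQLite stand-in for the Oracle ETL above.
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key (auto-assigned)
    customer_id  INTEGER NOT NULL,      -- natural key
    region       TEXT    NOT NULL,
    valid_from   TEXT    NOT NULL,
    valid_to     TEXT    NOT NULL,
    is_current   TEXT    NOT NULL DEFAULT 'Y'
);
CREATE TABLE stg_customer (customer_id INTEGER, region TEXT);
""")

def load(today: date) -> None:
    """Phase 1: close changed current rows. Phase 2: insert new versions."""
    yesterday = (today - timedelta(days=1)).isoformat()
    conn.execute("""
        UPDATE dim_customer
        SET    valid_to = ?, is_current = 'N'
        WHERE  is_current = 'Y'
        AND    EXISTS (SELECT 1 FROM stg_customer s
                       WHERE  s.customer_id = dim_customer.customer_id
                       AND    s.region IS NOT dim_customer.region)
    """, (yesterday,))
    conn.execute("""
        INSERT INTO dim_customer
               (customer_id, region, valid_from, valid_to, is_current)
        SELECT s.customer_id, s.region, ?, '9999-12-31', 'Y'
        FROM   stg_customer s
        WHERE  NOT EXISTS (SELECT 1 FROM dim_customer d
                           WHERE  d.customer_id = s.customer_id
                           AND    d.is_current = 'Y')
    """, (today.isoformat(),))
    conn.commit()

# Day 1: customer 2001 arrives in the North region.
conn.execute("INSERT INTO stg_customer VALUES (2001, 'North')")
load(date(2025, 1, 15))
# Months later: the same customer moves to Central.
conn.execute("UPDATE stg_customer SET region = 'Central'")
load(date(2025, 7, 10))

rows = conn.execute("""
    SELECT region, valid_from, valid_to, is_current
    FROM   dim_customer WHERE customer_id = 2001 ORDER BY valid_from
""").fetchall()
print(rows)
# [('North', '2025-01-15', '2025-07-09', 'N'),
#  ('Central', '2025-07-10', '9999-12-31', 'Y')]
```

The second load closes the North version at the day before the change and opens the Central version, exactly the pattern the Oracle ETL produces.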


The data: before and after #

Let’s look at a concrete example. Customer 2001 — “Alfa Insurance Ltd” — is in the North region, assigned to agent Rossi, risk class Medium.

In July the customer is reassigned to agent Bianchi. In October the risk class changes from Medium to High.

With Type 1 (the previous model), in October dim_customer contains a single row:

CUSTOMER_ID  NAME                  REGION  AGENT    RISK_CLASS
-----------  --------------------  ------  -------  ----------
2001         Alfa Insurance Ltd    North   Bianchi  High

No trace of Rossi. No trace of Medium risk class. As far as the DWH knows, this customer has always belonged to agent Bianchi with High risk class.

With Type 2, in October dim_customer contains three rows:

KEY   CUSTOMER_ID  NAME                  REGION  AGENT    CLASS   VALID_FROM  VALID_TO    CURRENT
----  -----------  --------------------  ------  -------  ------  ----------  ----------  -------
1001  2001         Alfa Insurance Ltd    North   Rossi    Medium  2025-01-15  2025-07-09  N
1002  2001         Alfa Insurance Ltd    North   Bianchi  Medium  2025-07-10  2025-10-04  N
1003  2001         Alfa Insurance Ltd    North   Bianchi  High    2025-10-05  9999-12-31  Y

Three versions of the same customer. Each version tells a piece of the story: who the agent was, what the risk class was, and during which period. The dates don’t overlap. The is_current flag identifies the active version.


Temporal queries #

Now the sales director can get his answer.

How many customers in the North region in June? #

SELECT COUNT(DISTINCT customer_id) AS north_customers_june
FROM   dim_customer
WHERE  region = 'North'
AND    DATE '2025-06-15' BETWEEN valid_from AND valid_to;

The query is straightforward: get all records that were valid on June 15, 2025 and filter by region. No CASE WHEN, no conditional logic, no approximations.

Customers who changed risk class in the last year #

SELECT c1.customer_id,
       c1.name,
       c1.risk_class  AS previous_class,
       c2.risk_class  AS current_class,
       c1.valid_to + 1 AS change_date
FROM   dim_customer c1
JOIN   dim_customer c2
       ON  c1.customer_id = c2.customer_id
       AND c1.valid_to + 1 = c2.valid_from
WHERE  c1.risk_class != c2.risk_class
AND    c1.valid_to >= ADD_MONTHS(TRUNC(SYSDATE), -12)
ORDER BY change_date DESC;

Two consecutive versions of the same customer, joined by transition date. If the risk class differs between the two versions, the customer changed class. The change date is the day after the previous version was closed.

Q1 vs Q2 portfolio comparison #

SELECT region,
       COUNT(DISTINCT CASE
           WHEN DATE '2025-03-31' BETWEEN valid_from AND valid_to
           THEN customer_id END) AS customers_q1,
       COUNT(DISTINCT CASE
           WHEN DATE '2025-06-30' BETWEEN valid_from AND valid_to
           THEN customer_id END) AS customers_q2
FROM   dim_customer
WHERE  DATE '2025-03-31' BETWEEN valid_from AND valid_to
   OR  DATE '2025-06-30' BETWEEN valid_from AND valid_to
GROUP BY region
ORDER BY region;

A single table scan, two distinct counts filtered by date. The CFO gets his quarterly report — the real one, not the one comparing Q2 with itself.


The fact table and surrogate keys #

A point that’s often underestimated: the fact table must use the surrogate key, not the natural key.

CREATE TABLE fact_claim (
    claim_key       NUMBER(10)    NOT NULL,
    customer_key    NUMBER(10)    NOT NULL,  -- FK to the specific version
    date_key        NUMBER(8)     NOT NULL,
    amount          NUMBER(15,2),
    claim_type      VARCHAR2(50),
    CONSTRAINT pk_fact_claim PRIMARY KEY (claim_key),
    CONSTRAINT fk_fact_customer FOREIGN KEY (customer_key)
        REFERENCES dim_customer (customer_key)
);

The customer_key in the fact points to the version of the customer that was current at the time of the claim. If a claim occurs in May, when the customer still belonged to agent Rossi, the fact points to the version with agent Rossi. If another claim occurs in September, with the customer now under agent Bianchi, the fact points to the version with agent Bianchi.

The result is that every fact is associated with the correct dimensional context for the moment it occurred. Query May’s claims and you see agent Rossi. Query September’s claims and you see agent Bianchi. No temporal logic in the query — the direct JOIN between fact and dimension returns the right context.
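How does the fact ETL find that surrogate key? One way, sketched here in Python + SQLite (stg_claim is a hypothetical staging table; dimension names match the article), is to join staging facts to the dimension on the natural key plus the validity window containing the event date. A lookup on is_current = 'Y' also works for same-day loads, but the date-based lookup stays correct for late-arriving facts.

```python
# Resolving the surrogate key during a fact load (SQLite stand-in).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER, agent TEXT,
    valid_from   TEXT, valid_to TEXT
);
CREATE TABLE stg_claim  (customer_id INTEGER, claim_date TEXT, amount REAL);
CREATE TABLE fact_claim (customer_key INTEGER, claim_date TEXT, amount REAL);

-- Two versions of customer 2001: Rossi until July 9, Bianchi after.
INSERT INTO dim_customer VALUES (1001, 2001, 'Rossi',   '2025-01-15', '2025-07-09');
INSERT INTO dim_customer VALUES (1002, 2001, 'Bianchi', '2025-07-10', '9999-12-31');

INSERT INTO stg_claim VALUES (2001, '2025-05-20', 1500.0);  -- Rossi era
INSERT INTO stg_claim VALUES (2001, '2025-09-03',  800.0);  -- Bianchi era

-- The lookup: the version whose validity window contains the claim date.
INSERT INTO fact_claim
SELECT d.customer_key, s.claim_date, s.amount
FROM   stg_claim s
JOIN   dim_customer d
       ON  d.customer_id = s.customer_id
       AND s.claim_date BETWEEN d.valid_from AND d.valid_to;
""")

resolved = conn.execute(
    "SELECT customer_key, claim_date FROM fact_claim ORDER BY claim_date"
).fetchall()
print(resolved)  # [(1001, '2025-05-20'), (1002, '2025-09-03')]
```

The May claim lands on the Rossi version, the September claim on the Bianchi version; no temporal logic is needed afterwards.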

-- Claims by agent, with the correct context at the time of the claim
SELECT d.agent,
       COUNT(*)      AS num_claims,
       SUM(f.amount) AS total_amount
FROM   fact_claim f
JOIN   dim_customer d ON f.customer_key = d.customer_key
GROUP BY d.agent
ORDER BY total_amount DESC;

No temporal clause. The surrogate key JOIN does all the work.


How much Type 2 grows #

The cost of Type 2 is dimension table growth. With Type 1, each customer is one row. With Type 2, each customer can have N rows — one for each tracked attribute change.

In the insurance project the numbers looked like this:

Metric                           Value
-------------------------------  ------------------------------------------
Active customers                 ~120,000
Tracked attributes               4 (region, agent, risk class, policy type)
Average change rate              ~8% of customers/year
dim_customer rows after 1 year   ~140,000
dim_customer rows after 3 years  ~180,000
dim_customer rows after 5 years  ~220,000

From 120K to 220K in five years: an 83% increase, which sounds like a lot in percentage terms but is negligible in absolute terms. 220K rows are nothing for Oracle; lookups on the surrogate key index stay in the millisecond range.

The issue arises when you have millions of customers with high change rates. In that case you monitor growth, consider partitioning the dimension, and most importantly choose carefully which attributes to track. Not every attribute deserves Type 2. The customer’s phone number? Type 1, overwrite. The sales region? Type 2, because it affects revenue analysis.

The choice of which attributes to track with Type 2 is a business decision, not a technical one. Ask the business: “If this field changes, do you need to know what the previous value was?” If the answer is yes, it’s Type 2. If it’s no, it’s Type 1.


When you don’t need Type 2 #

Not every dimension needs history. I’ve seen projects where every dimension was Type 2 “just in case” — the result was a needlessly complex model, slow ETL, and nobody ever querying the history of the “payment_type” or “sales_channel” dimension.

Type 2 has a cost: ETL complexity, table growth, the need to manage surrogate keys in the fact table. It’s a cost worth paying when the business needs history. If it doesn’t, Type 1 is the right choice.

There are also cases where Type 2 isn’t enough. If you need to know not just what changed but also who made the change and why, then you need an audit trail — a separate table with a complete change log. Type 2 tracks versions, not causes.

And for dimensions with very frequent changes — prices that change daily, scores that update hourly — Type 2 can generate unsustainable growth. In those cases you consider Type 6 (a combination of Types 1, 2 and 3) or mini-dimension approaches.

But for the most common case — customer master data, products, employees, locations — Type 2 is the right tool. Simple enough to implement without exotic frameworks, powerful enough to give the business back the dimension it was missing: time.


What I learned #

The sales director didn’t know he needed history until he needed it. And when he needed it, the DWH didn’t have it.

That’s the point. You don’t implement Type 2 because “it’s best practice” or because “Kimball says so in chapter 5.” You implement it because a data warehouse without history is an operational database with a star schema bolted on top. It works for current month reports, but it can’t answer the question that sooner or later someone will ask: “What was it like before?”

The question always comes. The only question is whether your DWH is ready to answer.


Glossary #

Surrogate key — A numeric identifier generated by the data warehouse, distinct from the source system’s natural key. In SCD Type 2 it’s essential because the same record can have multiple versions, making the natural key no longer unique.

Fact table — The central table in a star schema containing numeric measures (amounts, quantities, counts) and foreign keys to dimension tables. Each row represents a business event or transaction.

Kimball — Ralph Kimball, author of the data warehouse design methodology based on dimensional modeling, star schemas and bottom-up ETL processes. His framework classifies Slowly Changing Dimensions into types 0 through 7.

MERGE — A SQL statement that combines INSERT and UPDATE in a single operation: if the record exists it updates it, if it doesn’t it inserts it. Often called an “upsert,” it’s a core ETL mechanism for loading dimensions.

Star schema — A data model typical of data warehouses: a central fact table connected to multiple dimension tables via foreign keys. It simplifies analytical queries and optimizes aggregation performance.