What are Teradata Surrogate Keys?
A surrogate key in Teradata maps the natural keys of source systems to a unique key, usually an integer value.
Usually, one or more natural key columns are mapped to a surrogate key worthy of an INTEGER. Often a consecutive number is generated.
We can use the ROW_NUMBER() function for a surrogate key in Teradata or an IDENTITY column. Click below for the article about identity columns:
An IDENTITY column’s advantage is that the system generates a new surrogate key when a row is inserted into the key table.
However, it is essential to define the IDENTITY column to create no duplicates.
We must define the IDENTITY column with GENERATED ALWAYS and NO CYCLE.
GENERATED always prevents the surrogate key column from being updated. NO CYCLE precludes the reuse of values. If the highest possible number for an integer data type is reached, Teradata can generate no new surrogate keys, and an error is reported.
Detailed information about IDENTITY columns is available in table DBC.IdCol
Disaster Reload of Key Tables with IDENTITY columns
If a key table that uses IDENTITY columns has to be reloaded, please pay attention to the following to avoid duplicates:
- The key table needs to be re-created with new START and MINVALUE numbers to exclude the numbers generated before to avoid the same number being generated again.
- The identity column definition needs to be changed from “GENERATED ALWAYS” to “GENERATED BY DEFAULT” to be able to reload the keys that Teradata generated before.
- New surrogate keys will be generated only when NULL values are passed to the identity column.
Here is an example where the highest value of the IDENTITY column is 200,000:
CREATE MULTISET TABLE TheKeys ( NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE -999999999999999 MAXVALUE 999999999999999 NO CYCLE), ) UNIQUE PRIMARY INDEX (NATURAL_KEY);
If we need to reload the key table, the following DDL is required:
CREATE MULTISET TABLE TheKeys ( NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE 200000 MAXVALUE 999999999999999 NO CYCLE), ) UNIQUE PRIMARY INDEX (NATURAL_KEY);
Why should you use Surrogate Keys?
There are several reasons why replacing the natural keys with Teradata surrogate keys is good.
Different source systems deliver the same natural key for different information.
You may have more than one source system delivering the same value in the natural key for the same target table but with a different meaning.
This requires using surrogate keys because keeping the natural key from both sources would not allow you to distinguish the objects.
Different source systems deliver a different natural key for the same information.
But it can also be that the same information comes from different source systems with other natural keys. For example, the customers of a bank may be kept in different source systems. Surrogate keys are ideally suited if it is necessary to represent these as a single customer in the data warehouse.
Unfortunately, many Teradata data warehouse implementations decide against surrogate keys and put effort into integrating the information into the reports.
This approach will ultimately cause more costs than having the right surrogate key design from the beginning (don’t forget, such an approach requires repeating the logic of combining object information into every report).
Exchange of Source System leads to the Delivery of new Natural Keys.
If the source system’s natural keys are used, and you undergo a replacement of the original system, you end up in a bad situation.
In this case, let’s compare the effort caused by it.
If we have surrogate keys available, we must adjust only the mapping table (Natural Keys -> Surrogate Keys). All surrogate keys have to be assigned once to the new natural keys.
If we only have the natural keys available in our data warehouse. The old natural keys must be replaced everywhere by the new natural keys!
You should never tightly couple your data model to the source system structures. An invoice table should be called Invoice and not something like SAP-Invoice. Data Warehousing is not the 1:1 storage of operational data.
At the beginning of a project, it is always more agile to avoid surrogate keys and integrate source system tables 1:1.
However, this only shifts costs to the end of the chain. Changes that have to be made afterward are many times more expensive!
Teradata Surrogates Keys and Performance
If properly used, surrogate keys may increase your performance as you can replace several natural key columns with an integer column.
Often, natural key columns are character columns. From a performance point of view, this makes no difference for the hashing of the primary index. However, there is a negative effect on performance with joins since Teradata must take character sets, etc., into account here.
The number of columns and data types of the natural key can negatively impact performance or have a noticeable effect.