What are Teradata Surrogate Keys?
There are several definitions of what a surrogate key is, but on the whole, they all say the same thing. A Teradata Surrogate Key maps a natural key to an artificial key which is usually of the data type INTEGER or BIGINT. While the natural key can consist of several columns, the surrogate key is one column. Typically, the surrogate key is generated automatically and is represented by an ascending number.
In Teradata, we can use the ROW_NUM() function to programmatically generate the surrogate key or an IDENTITY column for which Teradata automatically generates numbers. What possibilities we have to generate numbers using IDENTITY columns is described in this article:
The IDENTITY column has the advantage over the programmatic method in that the numbers are generated by the system and are consistent. With the programmatic method, it can happen that errors creep in which can be fatal, especially with surrogate keys.
But also with the IDENTITY column, it is important to define it correctly to prevent the creation of duplicates. For this, the two parameters GENERATED ALWAYS and NO CYCLE are necessary. GENERATED ALWAYS prevents that own values can be inserted into the IDENTITY column. NO CYCLE prevents the counter from starting again at the lowest value when the largest allowed number has been generated. Instead, Teradata generates an error when we want to insert more rows.
Detailed information about currently used IDENTITY columns is available in table DBC.IdCol
Disaster Reload of Key Tables with IDENTITY columns
If it is necessary to reload a key table, take care of the following to prevent duplicates from being created:
- The key table needs to be re-created with new START and MINVALUE definitions to exclude already existing numbers to avoid the same number being generated again.
- The IDENTITY column definition needs to be changed from GENERATED ALWAYS to GENERATED BY DEFAULT to be able to insert the keys that Teradata generated earlier.
- New surrogate keys will be generated only when NULL values are passed to the IDENTITY column.
Here is an example where the highest value of the IDENTITY column is 200,000:
CREATE MULTISET TABLE TheKeys ( NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE -999999999999999 MAXVALUE 999999999999999 NO CYCLE), ) UNIQUE PRIMARY INDEX (NATURAL_KEY);
If we need to reload the key table, the following DDL is required:
CREATE MULTISET TABLE TheKeys ( NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE 200000 MAXVALUE 999999999999999 NO CYCLE), ) UNIQUE PRIMARY INDEX (NATURAL_KEY);
Why should we use Surrogate Keys?
There are several reasons why replacing the natural keys with Teradata surrogate keys is suitable.
Different source systems deliver the same natural key for different information.
We may have more than one source system delivering the same value in the natural key for the same target table but with a different meaning. This situation requires using surrogate keys because keeping the natural key from both sources would not allow us to distinguish the objects.
Different source systems deliver a different natural key for the same information.
It can happen that the same information comes from different source systems with different natural keys. For example, the customer information of a bank may be kept in various source systems. Surrogate keys are ideally suited if it is necessary to represent these as a single customer in the data warehouse.
Unfortunately, many Teradata data warehouse implementations decide against surrogate keys. This only shifts the effort of clean integration to the back (data marts, reports). Such an approach will ultimately cause more costs than having the right surrogate key design from the beginning (don’t forget, such an approach requires repeating the logic of combining object information into every data mart or report).
Exchange of Source System leads to the Delivery of new Natural Keys.
If the source system’s natural keys are used, and we undergo a replacement of the original system, we end up in a nasty situation.
If we have surrogate keys available, we must adjust only the mapping table (Natural Keys -> Surrogate Keys). All surrogate keys have to be assigned once to the new natural keys.
If we only have the natural keys available in our data warehouse. The old natural keys must be replaced everywhere by the new natural keys!
We should never tightly couple your data model to the source system structures. An invoice table should be called Invoice and not something like SAP-Invoice. Data Warehousing is not the 1:1 storage of operational data.
At the beginning of a project, it is always more agile to avoid surrogate keys and integrate source system tables 1:1. However, this only shifts costs to the end of the development chain. Changes that have to be made afterward are many times more expensive!
Teradata Surrogates Keys and Performance
If properly used, surrogate keys may slightly increase performance as we can replace several natural key columns with one integer column, but only if there was a character set conversion in joins happening, otherwise, the hashing algorithm’s benefit will be minuscule. Still, there might be an enormous performance advantage when switching to surrogate keys from character column natural keys as described here: