The Importance of Teradata Surrogate Keys

Roland Wenzlofsky

April 28, 2023

minutes reading time


What are Teradata Surrogate Keys?

A Teradata Surrogate Key is an artificial key that maps to a natural key. It is usually of the data type INTEGER or BIGINT and is represented by a single column. The natural key can consist of multiple columns. The surrogate key is generated automatically and is represented by an ascending number.

We can use the ROW_NUM() function in Teradata to programmatically generate the surrogate key or an IDENTITY column for which Teradata automatically generates numbers. What possibilities we have to generate numbers using IDENTITY columns are described in this article:

The IDENTITY column surpasses the programmatic approach in producing consistent numbers. The programmatic approach is vulnerable to errors, which is particularly risky for surrogate keys.

Correctly defining the IDENTITY column is crucial in preventing duplicate creations. To achieve this, two parameters, GENERATED ALWAYS and NO CYCLE, play a vital role. The GENERATED ALWAYS parameter prohibits the insertion of custom values into the IDENTITY column. In contrast, the NO CYCLE parameter prevents the counter from restarting at the lowest value after generating the maximum allowed number. As a result, Teradata generates an error message when attempts are made to insert additional rows beyond the limit.

For comprehensive details on the present IDENTITY columns in use, refer to the DBC.IdCol table.

Disaster Reload of Key Tables with IDENTITY columns

To avoid creating duplicates, ensure the following steps are taken when reloading a key table:

  • The key table needs to be re-created with new START and MINVALUE definitions to exclude already existing numbers to avoid the same number being generated again.
  • The IDENTITY column definition needs to be changed from GENERATED ALWAYS to GENERATED BY DEFAULT to be able to insert the keys that Teradata generated earlier.
  • New surrogate keys will be generated only when NULL values are passed to the IDENTITY column.

The IDENTITY column has a maximum value of 200,000 in the following example:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY
   (START WITH 1 INCREMENT BY 1 
    MINVALUE -999999999999999 
    MAXVALUE 999999999999999 
    NO CYCLE),
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);

To reload the key table, execute the following DDL statement:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY
   (START WITH 1 INCREMENT BY 1 
    MINVALUE 200000 
    MAXVALUE 999999999999999 
    NO CYCLE),
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);

Why should we use Surrogate Keys?

There are multiple benefits to substituting Teradata surrogate keys for natural keys.

Different source systems deliver the same natural key for different information.

If multiple source systems provide identical values for a specific target table’s natural key but with distinct meanings, surrogate keys become necessary to differentiate between objects. Retaining the natural key from both sources would not suffice.

Different source systems deliver different natural keys for the same information.

Sometimes, identical data can originate from separate source systems that utilize distinct natural keys. A prime example of this is with banks, where customer data can be stored across multiple sources. Surrogate keys are particularly useful for consolidating this information into a singular customer entity within the data warehouse.

Many Teradata data warehouse implementations refrain from using surrogate keys, which leads to the burden of integrating data cleanly being shifted to the back-end (data marts, reports). This approach increases costs in the long run, as it requires duplicating the process of combining object information in every data mart or report. It is important to note that implementing the right surrogate key design from the beginning can prevent these issues.

The Exchange of Source System leads to the Delivery of new Natural Keys.

Using the source system’s natural keys while replacing the original system can result in a problematic situation.

The mapping table (Natural Keys -> Surrogate Keys) requires adjustment only if surrogate keys are present. The new natural keys must be assigned surrogate keys only once.

We must replace the old natural keys with the new ones throughout our data warehouse if we only possess the natural keys.

Our data model should not be closely linked to the structures of the source system. It is appropriate to name an invoice table “Invoice” rather than “SAP-Invoice”. Data Warehousing does not involve storing operational data in a one-to-one manner.

At project initiation, it is advisable to eschew surrogate keys and instead integrate source system tables one-to-one, as this fosters agility. Nonetheless, deferring this approach incurs greater costs toward the end of the development cycle, with subsequent changes becoming significantly more expensive.

Teradata Surrogates Keys and Performance

Surrogate keys can enhance performance by replacing multiple natural key columns with a single integer column. However, this benefit only applies if the character set conversions occur during joins, as the hashing algorithm’s benefit in other cases would be negligible. Nevertheless, switching from character column natural keys to surrogate keys could result in a significant performance advantage, as outlined here:

  • Avatar
    Aleksey Svitin says:

    Hi,

    It’s another excellent article. Thank you.

    I would also add
    – Surrogate ids are important elements for Data vault modeling
    – Surrogate ids are must have for tables with complex natural key or for combining different sources(like Original_Id + Source_Id) as Roland says.
    – For many cases, the performance is better with surrogate id because they could be smaller in size (e.g. one INT column vs a couple of varchar columns + source_id), thus joining/grouping is faster.
    – The most important performance benefit is related to”attribute” or dimension tables which is used by “fact” tables. It’s not obvious. Let’s assume we have “small” Country table. Every country has unique string code. We can use this code as a primary key.
    Let’s assume that we have a “big” fact table (translations, mobile calls or anything else) which should have reference to Country table.
    If we use Country_Code as PK when we will use it in dependent fact table as well. But string Country_Code will be always bigger in size than alternative numeric surrogate key. So this difference will be multiplied by number of rows in fact table ( which could be in billions). So it’s very important to have as small PK as possible for dimensions and surrogate keys help a lot here.
    – Personally, I like the approach, when a system should use own “small” internal entity identifiers (generated internally) everywhere inside and GUIDs or natural keys for External communication. It gives both flexibility, strong performance and solid data model.

    It’s just my experience.

    Have a good day!

    Best regards,
    Aleksey Svitin.

    • Hi Aleksey,

      Thanks for your comment.

      Right now I am “eyewitnessing” a total misdesign of a data mart where not using surrogate keys leads to significant problems.

      The data is both delivered from the source system and sent back to it and re-sent a second time. But the format is changed when the data is sent a second time. This is really about little things like spaces etc.

      Since the information from the source system has not been normalized (more than one piece of information is stored in a column), the result is that every join in every SQL must be done with string functions. From a performance point of view a disaster. The optimizer cannot use statistics and fights with full table scans. An elegant solution with surrogate keys at the beginning of the ETL chain could have prevented this.

      – Roland

  • Terdata and few td consultants have fooled organisations with this concept and complicated simple things. Simple surrogate key is to maintain slowly changing dimensions and not for integration of different sources as said in this post. If integration was purpose like example of customer lifecycle then its a separate topic of master data management, dont try to mix all master data , metadata(gcfr) and all things in one and complicate things. I have seen organisations sufferinf becoz of such design and majorly some pakistani consultants have been ruling these confusions and securing their jobs and sucking clients money.
    Honestly teradata is amazing product but design it right way and keep it simple

    • I give you an example, and would kindly ask you to tell me how you would design the model without surrogate keys: You have two billing systems, invoices have to be stored in the same entity (“Invoice”). The natural keys of both billing systems are overlapping. How would you design the “Invoice” table without surrogate keys?

      • Avatar
        Boris Mogilevsky says:

        Conceptually, surrogate keys have a say but using Identity columns for this purpose is a big problem. Their definitions are tied to Hash map which will change in case if the Box needs to be re-configured, for example when there is an upgrade. I personally was in a project where everything has been built using Surrogate Keys based on Identity and there were lots of issues.

        • Avatar
          Алексей Свитин says:

          Hi Boris,

          As a solution, I can suggest to generate Ids during ETL stage instead of relying on Identity column ( which has many side effects) . We use Informatica for that purpose. It works pretty well.

          With best regards,
          Aleksei Svitin

  • can you please let me know about bkey and bmap genearation? and waht they a re

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >