fbpx

Teradata Surrogate Keys Guidelines

By Roland Wenzlofsky

November 19, 2019


What are Teradata Surrogate Keys?

A surrogate key in Teradata is used to map the natural keys of source systems to a unique key, usually an integer value.

Usually, one or more natural key columns are mapped to a surrogate key worthy of an INTEGER. Often a consecutive number is generated.

For a surrogate key in Teradata, the ROW_NUMBER() function can be used, or an IDENTITY column. Click here for the article about identity columns:

An IDENTITY column’s advantage is that a new surrogate key is generated by the system as soon as a row is inserted into the key table.

However, it is important to define the IDENTITY column so that no duplicates are created.

For this, the IDENTITY column must be defined with GENERATED ALWAYS and NO CYCLE.

GENERATED always prevents the surrogate key column from being updated. NO CYCLE prevents the reuse of values. If the highest possible number for an integer data type is reached, no new surrogate keys can be generated, and an error is reported.

Detailed information about IDENTITY columns is available in table DBC.IdCol

Disaster Reload of Key Tables with IDENTITY columns

If a key table that uses IDENTITY columns has to be reloaded, please pay attention to the following to avoid duplicates:

  • The key table needs to be re-created with a new START and MINVALUE numbers to exclude the numbers generated before to avoid the same number to be generated again.
  • The identity column definition needs to be changed from “GENERATED ALWAYS” to “GENERATED BY DEFAULT” to be able to reload the keys that were generated before.
  • New surrogate keys will be generated only when NULL values are passed to the identity column.

Here is an example where the highest value of the IDENTITY column is 200,000:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY
   (START WITH 1 INCREMENT BY 1 
    MINVALUE -999999999999999 
    MAXVALUE 999999999999999 
    NO CYCLE),
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);

If we need to reload the key table the following DDL is required:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY
   (START WITH 1 INCREMENT BY 1 
    MINVALUE 200000 
    MAXVALUE 999999999999999 
    NO CYCLE),
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);

Why should you use Surrogate Keys?

There are several reasons why it is good to replace the natural keys with Teradata surrogate keys.

Different source systems deliver the same natural key for different information.

You may have more than one source system delivering the same value in the natural key for the same target table but with a different meaning.

This fact requires you to use surrogate keys because keeping the natural key from both sources would not allow you to distinguish the objects.

Different source systems deliver a different natural key for the same information.

But it can also be that the same information comes from different source systems with different natural keys. For example, the customers of a bank may be kept in different source systems. If it is necessary to represent these as a single customer in the data warehouse, surrogate keys are perfectly suited.

Unfortunately, many Teradata data warehouse implementations decide against surrogate keys and put the effort of integrating the information into the reports.

This approach will ultimately cause more costs than having a right surrogate key design from the beginning (don’t forget, such an approach requires repeating the logic of combining object information into every single report).

Exchange of Source System leads to the Delivery of new Natural Keys.

If the source system’s natural keys are used, and you undergo a replacement of the original system, you end up in a bad situation.

In this case, let’s compare the effort caused by it.

If we have surrogate keys available, only the mapping table (Natural Keys -> Surrogate Keys) has to be adjusted. All surrogate keys have to be assigned once to the new natural keys.

If we only have the natural keys available in our data warehouse, this usually means a complete re-design. The old natural keys must be replaced everywhere by the new natural keys!

In general, you never should tightly couple your data model to the source system structures. An invoice table should be called Invoice and not something like SAP-Invoice. Data Warehousing is not the 1:1 storage of operational data.

At the beginning of a project, it is always more agile to avoid surrogate keys and integrate source system tables 1:1.

However, this only shifts costs to the end of the chain. Changes that have to be made afterward are many times more expensive!

Teradata Surrogates Keys and Performance

If properly used, surrogate keys may increase your performance as you can replace several natural key columns by an integer column.

Often, natural key columns are character columns. From a performance point of view, this makes no difference for the hashing of the primary index. Where there is a negative effect on performance, however, is with joins, since Teradata must take character sets, etc. into account here.

Depending on the number of columns and data types of the natural key, this can negatively impact performance or a noticeable effect.

__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"var(--tcb-skin-color-0)"}},"gradients":[]},"original":{"colors":{"62516":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Previous Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Next Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"rgb(255, 0, 0)"}},"gradients":[]}}]}__CONFIG_colors_palette__
GET OUR TERADATA BOOK

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

  • Avatar
    Aleksey Svitin says:

    Hi,

    It’s another excellent article. Thank you.

    I would also add
    – Surrogate ids are important elements for Data vault modeling
    – Surrogate ids are must have for tables with complex natural key or for combining different sources(like Original_Id + Source_Id) as Roland says.
    – For many cases, the performance is better with surrogate id because they could be smaller in size (e.g. one INT column vs a couple of varchar columns + source_id), thus joining/grouping is faster.
    – The most important performance benefit is related to”attribute” or dimension tables which is used by “fact” tables. It’s not obvious. Let’s assume we have “small” Country table. Every country has unique string code. We can use this code as a primary key.
    Let’s assume that we have a “big” fact table (translations, mobile calls or anything else) which should have reference to Country table.
    If we use Country_Code as PK when we will use it in dependent fact table as well. But string Country_Code will be always bigger in size than alternative numeric surrogate key. So this difference will be multiplied by number of rows in fact table ( which could be in billions). So it’s very important to have as small PK as possible for dimensions and surrogate keys help a lot here.
    – Personally, I like the approach, when a system should use own “small” internal entity identifiers (generated internally) everywhere inside and GUIDs or natural keys for External communication. It gives both flexibility, strong performance and solid data model.

    It’s just my experience.

    Have a good day!

    Best regards,
    Aleksey Svitin.

    • Hi Aleksey,

      Thanks for your comment.

      Right now I am “eyewitnessing” a total misdesign of a data mart where not using surrogate keys leads to significant problems.

      The data is both delivered from the source system and sent back to it and re-sent a second time. But the format is changed when the data is sent a second time. This is really about little things like spaces etc.

      Since the information from the source system has not been normalized (more than one piece of information is stored in a column), the result is that every join in every SQL must be done with string functions. From a performance point of view a disaster. The optimizer cannot use statistics and fights with full table scans. An elegant solution with surrogate keys at the beginning of the ETL chain could have prevented this.

      – Roland

  • Terdata and few td consultants have fooled organisations with this concept and complicated simple things. Simple surrogate key is to maintain slowly changing dimensions and not for integration of different sources as said in this post. If integration was purpose like example of customer lifecycle then its a separate topic of master data management, dont try to mix all master data , metadata(gcfr) and all things in one and complicate things. I have seen organisations sufferinf becoz of such design and majorly some pakistani consultants have been ruling these confusions and securing their jobs and sucking clients money.
    Honestly teradata is amazing product but design it right way and keep it simple

    • I give you an example, and would kindly ask you to tell me how you would design the model without surrogate keys: You have two billing systems, invoices have to be stored in the same entity (“Invoice”). The natural keys of both billing systems are overlapping. How would you design the “Invoice” table without surrogate keys?

      • Avatar
        Boris Mogilevsky says:

        Conceptually, surrogate keys have a say but using Identity columns for this purpose is a big problem. Their definitions are tied to Hash map which will change in case if the Box needs to be re-configured, for example when there is an upgrade. I personally was in a project where everything has been built using Surrogate Keys based on Identity and there were lots of issues.

        • Avatar
          Алексей Свитин says:

          Hi Boris,

          As a solution, I can suggest to generate Ids during ETL stage instead of relying on Identity column ( which has many side effects) . We use Informatica for that purpose. It works pretty well.

          With best regards,
          Aleksei Svitin

  • can you please let me know about bkey and bmap genearation? and waht they a re

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >