Data Modeling Considerations

In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Early releases of Teradata followed the rules of relational data modeling strictly and did not allow row duplicates in a table; Teradata MULTISET tables were not yet available.

As logical data modelers, we would expect duplicate rows only in the physical data model. From a business point of view, we can't imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is, in our opinion, the correct approach to dealing with row duplicates.

When the ANSI standard permitted the usage of row duplicates, Teradata added this possibility; the ANSI standard's view on duplicates forced Teradata to introduce MULTISET tables (row duplicates allowed) alongside the traditional SET tables (no row duplicates allowed).

This background information clarifies why the default in Teradata session mode is SET tables, while the default in ANSI session mode is MULTISET tables.
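
The following minimal sketch illustrates the session-mode defaults; the table name is a placeholder of ours:

-- The SET/MULTISET keyword is omitted on purpose:
-- in Teradata (BTET) session mode, this creates a SET table;
-- in ANSI session mode, the same statement creates a MULTISET table.
CREATE TABLE Customer_Default
(
CustomerId INTEGER NOT NULL,
CustomerName VARCHAR(100)
) PRIMARY INDEX (CustomerId);

Explicitly writing CREATE SET TABLE or CREATE MULTISET TABLE avoids this ambiguity.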

Differences between Teradata SET and MULTISET Tables

  • In contrast to the Teradata MULTISET table, SET tables forbid duplicate rows inserted with an INSERT statement or created with an UPDATE statement.
  • SET tables can have a negative performance impact in certain cases.
  • For INSERT INTO ... SELECT * FROM <TABLE> statements, duplicate rows are automatically discarded for SET tables (no error occurs). Nevertheless, INSERT INTO <TABLE> VALUES (a, b) statements will abort and return an error message for duplicate rows (see the example after this list).
  • There is no way to change an existing SET table into a MULTISET table; we don’t know why this limitation exists.
  • SET tables can hurt performance:
    • Each time a row is inserted or updated, Teradata checks whether this would violate row uniqueness.
    • This check is called the DUPLICATE ROW CHECK, and it will severely degrade performance if many rows with the same primary index are inserted.
    • The number of comparisons grows quadratically with the number of rows per primary index value, as each new row must be checked against all rows already stored with the same rowhash.
  • SET tables' performance is not degraded when a UPI (Unique Primary Index) is defined on the table.
    • As the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done. Instead of a UPI, we can use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.
    • In other words, any index or constraint that ensures uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK.
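
Here is a minimal sketch of both behaviors, using a hypothetical table; error 2802 ("Duplicate row error") is Teradata's standard message for this situation:

CREATE SET TABLE DuplicateDemo
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO DuplicateDemo VALUES (1, 100); -- succeeds
INSERT INTO DuplicateDemo VALUES (1, 100); -- aborts with error 2802 (duplicate row)

-- The same duplicate arriving via INSERT ... SELECT is silently discarded:
INSERT INTO DuplicateDemo
SELECT PK, DESCR
FROM DuplicateDemo; -- no error, and no rows are added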

The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to find all SET tables on a Teradata system is to query the view DBC.TablesV:

Finding SET and MULTISET Tables

SELECT * FROM DBC.TABLESV 
WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables

SELECT * FROM DBC.TABLESV 
WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables 
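
Since the tuning effort should focus on the largest tables, the result can be combined with table sizes. This is a sketch, assuming the standard view DBC.TableSizeV; CurrentPerm is summed over all AMPs:

SELECT t.DatabaseName, t.TableName, SUM(s.CurrentPerm) AS CurrentPermBytes
FROM DBC.TablesV t
JOIN DBC.TableSizeV s
ON t.DatabaseName = s.DatabaseName AND t.TableName = s.TableName
WHERE t.checkopt = 'N' AND t.TableKind = 'T'
GROUP BY t.DatabaseName, t.TableName
ORDER BY CurrentPermBytes DESC;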

For all large SET tables, it is worth considering for performance reasons whether uniqueness can be better achieved with SQL code, such as a GROUP BY statement. The performance improvement depends on the number of rows per PI; the number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per PI. In general, we do not recommend using SET tables to remove unwanted duplicates; this leads to hard-to-maintain code and hides errors that often go undetected for a long time. We consider it good practice always to define the intended table type rather than relying on session semantics (ANSI or Teradata).
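
A minimal sketch of this approach, with hypothetical table names: the duplicates are removed explicitly in SQL before the rows reach a MULTISET target, so no DUPLICATE ROW CHECK is needed:

-- Sales_Stage may contain row duplicates; Sales is a MULTISET table.
INSERT INTO Sales
SELECT SaleId, CustomerId, Amount
FROM Sales_Stage
GROUP BY SaleId, CustomerId, Amount;

GROUP BY over all selected columns removes full row duplicates just like DISTINCT; on Teradata, GROUP BY has traditionally been the faster choice when there are many duplicates per value.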

Be aware that Teradata supports a direct change from a non-unique to a unique primary index only when the table is empty. This introduces some additional administrative burden: reserve additional database space if necessary, create a new SET table with a unique primary index, copy the data from the existing table, drop the existing table, rename the new one, collect any additional metadata necessary, and release the additional database space.
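
A sketch of these steps in SQL; table and column names are placeholders, and we assume the primary index values are already unique (otherwise the INSERT will fail):

-- 1. Create the new table with a unique primary index.
CREATE SET TABLE MyTable_New
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (PK);

-- 2. Copy the data from the existing table.
INSERT INTO MyTable_New SELECT * FROM MyTable;

-- 3. Drop the old table and rename the new one.
DROP TABLE MyTable;
RENAME TABLE MyTable_New TO MyTable;

-- 4. Re-collect statistics and any other metadata.
COLLECT STATISTICS COLUMN (PK) ON MyTable;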

Below is an example INSERT statement which shows the negative effect of a SET table:

First, we create the same table once as a SET table and once as a MULTISET table:

CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

Next, we remember the session number of our query:

SELECT SESSION;
7376827 

As the last step, we populate both tables with random data, using a huge table called BigTable to drive the number of rows and restricting the generated primary index to only 500 distinct values. This ensures that sufficient DUPLICATE ROW CHECKS are generated for the SET table:

INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

If we compare the consumed resources of both INSERT statements, we immediately see the enormous difference:

SELECT TotalIOCount, CPUSeconds FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827;

             TotalIOCount   CPUSeconds
Set            126,827.00        20.15
Multiset         3,556.00         0.96

Difference in IOs and CPU seconds

What to consider when loading

Only the MultiLoad utility is compatible with SET tables. The FastLoad utility automatically filters out all row duplicates during loading (except for NOPI tables).

We can use FastLoad's automatic filtering of row duplicates to our advantage to remove duplicates from large tables: a FastExport followed by a FastLoad does exactly this. Another possibility, if the table is not too large, is to create a copy as a SET table and copy the content over from the MULTISET table.
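
A sketch of the SET-table copy approach, with placeholder names; the SET semantics silently discard the row duplicates during the INSERT ... SELECT:

-- Create an empty SET copy with the same structure as the MULTISET table.
CREATE SET TABLE Orders_Dedup AS Orders WITH NO DATA;

-- Copy the content; row duplicates are discarded by the SET table.
INSERT INTO Orders_Dedup SELECT * FROM Orders;

-- Swap the tables once the copy is verified.
DROP TABLE Orders;
RENAME TABLE Orders_Dedup TO Orders;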


Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata consultant and performance trainer. Born in Austria's capital, Vienna, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunications sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager, and therefore knows like no other the problems and obstacles that make many data warehouse projects fail, as well as the tricks and tips that will help you succeed.

Comments

  • If we have a NUPI and the table is defined as a SET table, will the duplicate row check happen in this scenario when the row is not unique?

    • In this case, the duplicate row check will happen. If you want to avoid it, you could, for example, add a unique secondary index.
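
      A hypothetical sketch of this suggestion, based on the demo table from the article; a USI covering both columns guarantees row uniqueness:

      CREATE UNIQUE INDEX (PK, DESCR) ON TMP_SET;
      -- Teradata can now rely on the index for uniqueness instead of the DUPLICATE ROW CHECK.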

  • Hello Roland,
    If I have a SET table with a UPI, you write:
    "However, the performance of the SET table is not degraded when the UPI (Unique Primary Index) is defined for the table. Because the UPI itself guarantees uniqueness, the DUPLICATE ROW CHECK is not performed."

    However, duplicate rows (inserted by INSERT ... SELECT) were discarded silently, similarly to the DUPLICATE ROW CHECK, and only non-duplicate rows were inserted.
    (This behavior differs from that of a MULTISET table with a UPI, where an error is reported (INSERT Failed 2801) and, of course, no rows are inserted.)

    Could you clarify the difference between (with respect to the above) your statement
    1. "... For INSERT INTO ... SELECT * FROM statements, duplicate rows are automatically discarded for SET tables (no error) ..." – I mean particularly 'automatically'.
    and
    2. the DUPLICATE ROW CHECK for the SET table?

    Thank you in advance
    Henri

  • Abel Debela says:

    Thank you so much. This article clearly helped me understand the difference.

  • Khurram Nazir says:

    This is why we have SET tables (Does NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  • Subbareddy K says:

    Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables
    Comment: Checkopt = 'N' is possible for MULTISET tables as well. If the table is MULTISET with a unique index (UPI or USI), then Checkopt = 'N'.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables
    Comment: Checkopt = 'Y' means the table is always MULTISET, but not vice versa.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >