Data Modeling Considerations

In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Early releases of Teradata followed the rules of relational data modeling tightly and did not allow row duplicates in a table, and Teradata Multiset Tables were not available.

As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is, in our opinion, the correct approach to deal with row duplicates.

When the ANSI standard permitted the usage of row duplicates, Teradata added this possibility; the ANSI standard view on duplicates forced Teradata to introduce SET tables (no row duplicates allowed).

The above-mentioned background information clarifies why the default in Teradata mode is SET tables, and the default in ANSI mode is MULTISET tables.

Differences between Teradata Set and Multiset Tables

  • In contrast to the Teradata MULTISET table, SET tables forbid duplicate rows inserted with an INSERT statement or created with an UPDATE statement.
  • SET tables can have a negative performance impact in certain cases.
  • For INSERT INTO SELECT * FROM <TABLE> statements, duplicate rows are automatically for SET tables (no error occurs). Nevertheless, INSERT INTO <TABLE> VALUES (a, b) statements will abort and throw a mistake message for duplicate rows.
  • There is no way to change an existing SET table into a MULTISET table; we don’t know why this limitation exists.
  • Set tables can hurt performance:
    • Each time a row is inserted or updated, Teradata checks if this would violate row uniqueness.
    • This check is called the DUPLICATE ROW CHECK, and will severely degrade performance if many rows with the same primary index are inserted.
    • The number of checks increases exponentially with each row added.
  • SET tables’ performance is not degraded when there is a UPI (Unique primary index) defined on the table.
    • As the UPI itself ensures (key) uniqueness, no DUPLICATE ROW CHECK is done. Instead of a UPI, we can use a USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.
    • Any index or constraint which ensures uniqueness (and therefore allows Teradata to bypass the DUPLICATE ROW CHECK) bypasses the duplicate row check!

The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to find all SET tables on a Teradata system is to query the view DBC.TablesV:

Finding SET and MULTISET Tables

WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables

WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables 

For all large SET tables, it is worth considering for performance reasons whether uniqueness can be better achieved with SQL code such as a GROUP BY statement. The performance improvement depends on the number of values per PI; the amount of DUPLICATE ROW CHECKS grows exponentially with the number of rows per PI. In general, we do not recommend using SET tables to remove unwanted duplicates; this leads to hard to maintain code and hides errors that are often not detected for a long time. We consider a good practice always to define the intended table type rather than relying on session semantics (ANSI or Teradata).

Be aware that Teradata supports a direct change from a non-unique to a unique primary index only when the table is empty. This introduces some additional administrative burden (e.g., reserve, if necessary, additional database space, create a new SET table with a unique primary index, copy the data from the existing table, drop the existing table, rename the new one, collect any additional metadata necessary, and release, the additional database space).

Below is an example INSERT statement which shows the negative effect of a SET table:

First we create the same table as SET and MULTISET table:



Next, we remember the session number of our query:


As the last step, we populate both tables with random data from a huge table called BigTable, with only 500 different PI values. This ensures that sufficient DUPLICATE ROW CHECKS are generated for the SET table:

RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

If we compare the consumed resources of both INSERT statements, we immediately see the enormous difference:

Set 126.827,00 20,15
Multiset 3.556,00  0,96
Difference in IOs and CPU seconds

What to consider when loading

Only the Multiload utility is compatible with set tables. The Fastload utility automatically filters all row duplicates during loading (except for NOPI tables).

We can use the automatic filtering of row duplicates on Fastload to our advantage to remove duplicates from large tables. A FastExport followed by a Fastload does exactly this. Another possibility is to create a copy as a SET table and copy the content from the MULTISET table if the table is not so large.

Buy now at Amazon
  • If we have a NUPI , and Table is defined as SET table , In this scenario Row Check will happened or Not when row is not unique .

    • In this case, the duplicate row check will happen. If you want to avoid it, you could, for example, add a unique secondary index.

  • Hello Roland,
    If I have a SET table with UPI and you write:
    “However, the performance of the SET table is not degraded when the UPI (Unique Primary Index) is defined for the table. Because the UPI itself guarantees uniqueness, the DUPLICATE ROW CHECK check is not performed.”

    However, duplicate rows ( inserted by IaS ) were discarded (silently) similarly to DUPLICATE ROW CHECK and only non-duplicate rows were inserted.
    (This behavior differs from the behavior of the MULTISET table with UPI, where a reported error (INSERT Failed 2801) and of course no rows were inserted.)

    You could clarify the difference between (and with respect to the above) your statement
    1. “… For commands INSERT INTO SELECT * FROM , duplicate rows are automatically for SET tables (no error) …” – I mean particulary ‘automatically’.
    2. DUPLICATE ROW CHECK for the SET table?

    Thank you in advance

  • Avatar
    Abel Debela says:

    Thank you so much. This article clearly helped me understand the difference.

  • Avatar
    Khurram Nazir says:

    This is why we have SET tables (Does NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  • Avatar
    Subbareddy K says:

    Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘N’ AND TABLEKIND = ‘T’; — Set Tables
    Comment: Checkopt= ‘N’ is possible for MULTISET tables also.If the table is MULTISET with Unique index(UPI or USI) then Checkopt=’N’.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘Y’ AND TABLEKIND = ‘T’; — Multiset tables
    Comment: Checkopt=’Y’ means tables is always MULTISET but not in reverse order(Vice Varsa).

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like