Using Teradata MULTISET tables correctly

Roland Wenzlofsky

June 9, 2022

Data Modeling Considerations and Teradata MULTISET Tables

In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Early releases of Teradata followed the rules of relational data modeling strictly and did not allow row duplicates in a table. At that time, the concept of MULTISET tables did not exist.

As logical data modelers, we expect duplicate rows only in the physical data model. From a business point of view, we can’t imagine a scenario that requires storing row-level duplicates. In our opinion, data cleansing as part of the ETL/ELT chain is the correct approach to deal with row duplicates.

When the ANSI standard permitted storing row duplicates, Teradata followed suit: the ANSI view on duplicates forced Teradata to introduce MULTISET tables, which can store row duplicates, alongside the traditional SET tables, which cannot.

The background described above should clarify why Teradata creates MULTISET tables by default in ANSI mode and SET tables in Teradata mode.
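
Because the default table kind depends on the session mode, a CREATE TABLE statement without an explicit SET or MULTISET keyword behaves differently in each mode. A minimal sketch; the table name is illustrative:

CREATE TABLE Demo_Default -- becomes a SET table in Teradata mode, a MULTISET table in ANSI mode
(
PK INTEGER NOT NULL
) PRIMARY INDEX (PK);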

Differences between Teradata Set and Multiset Tables

  • In contrast to Teradata MULTISET tables, SET tables forbid duplicate rows, whether inserted with an INSERT statement or created with an UPDATE statement.
  • SET tables can harm performance in some instances.
  • For INSERT INTO SELECT * FROM <TABLE> statements, duplicate rows are silently discarded for SET tables (no error occurs). Nevertheless, INSERT INTO <TABLE> VALUES (a, b) statements will abort and throw an error message for duplicate rows (see the sketch after this list).

    Hint for filling tables with INSERT INTO … SELECT while ensuring referential integrity through a primary key definition: with SET tables, row duplicates are removed before the integrity check is performed. Errors in the data (duplicates) can therefore go unnoticed even though they are present. If the table is later changed to a MULTISET table, the primary key check will suddenly be triggered by unexpected key duplicates (row duplicates are automatically key duplicates as well). So we have to be careful here.
  • There is no way to change an existing SET table into a MULTISET table; we don’t know why this limitation exists.
  • Below are more details on why SET tables can hurt performance:
    • Teradata checks each row when being inserted or updated to see if it would violate row uniqueness.
    • This check is called the DUPLICATE ROW CHECK and will severely degrade performance if we insert many rows with the same primary index.
    • The number of checks required grows quadratically with the number of rows per primary index value: each new row must be compared against every row that already shares its row hash, so inserting n such rows costs roughly n*(n-1)/2 comparisons.
  • SET tables’ performance does not deteriorate when a UPI (Unique primary index) exists on the table.
    • As the UPI ensures (key) uniqueness, Teradata does not do a DUPLICATE ROW CHECK. Instead of a UPI, we can use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.
    • Any index or constraint that ensures uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK entirely.
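
A minimal sketch of the different duplicate handling on a SET table; the table name is illustrative:

CREATE SET TABLE DupDemo
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO DupDemo VALUES (1, 100); -- first copy of the row: succeeds
INSERT INTO DupDemo VALUES (1, 100); -- aborts with error 2802 (duplicate row)

INSERT INTO DupDemo
SELECT 1, 100;                       -- same duplicate row, but silently discarded (no error)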

The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to get a first overview on a Teradata system is to query the view DBC.TablesV:

Discovering SET and MULTISET Tables

SELECT * FROM DBC.TablesV
WHERE CheckOpt = 'N' AND TableKind = 'T'; -- SET tables, but also MULTISET tables with a unique index (see the comments below)

SELECT * FROM DBC.TablesV
WHERE CheckOpt = 'Y' AND TableKind = 'T'; -- always MULTISET tables
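
As the reader comments below point out, CheckOpt = 'N' does not distinguish SET tables from MULTISET tables with a unique index. A minimal sketch of a more precise query, assuming we want only the tables on which the DUPLICATE ROW CHECK actually fires (tables with CheckOpt = 'N' and no unique index at all, i.e., SET tables with a non-unique primary index):

SELECT DISTINCT t.DatabaseName, t.TableName
FROM DBC.TablesV t
WHERE t.TableKind = 'T'
AND t.CheckOpt = 'N'
AND NOT EXISTS
(
SELECT 1
FROM DBC.IndicesV i
WHERE i.DatabaseName = t.DatabaseName
AND i.TableName = t.TableName
AND i.UniqueFlag = 'Y' -- any unique index lets Teradata skip the duplicate row check
);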

For all large SET tables, it is worth considering for performance reasons whether we can achieve uniqueness with SQL code such as a GROUP BY statement, as shown below. The performance improvement depends on the number of rows per PI value; the number of DUPLICATE ROW CHECKS grows quadratically with it. We do not recommend using SET tables to remove unwanted duplicates; such an approach leads to hard-to-maintain code and hides errors that often go undetected for a long time. We always consider it good practice to define the intended table type explicitly rather than relying on session semantics (ANSI or Teradata mode).
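
A minimal sketch of the GROUP BY approach; the table and column names are illustrative:

INSERT INTO TargetTable
SELECT PK, DESCR
FROM StagingTable
GROUP BY PK, DESCR; -- removes row duplicates up front, so the target can stay MULTISET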

Be aware that Teradata supports a direct change from a non-unique to a unique primary index only when the table is empty. This restriction introduces some additional administrative burden. We must (see the sketch after this list):

  • Reserve additional database space
  • Create a new SET table with a Unique Primary Index
  • Copy the data from the existing table into the new table
  • Drop the existing table
  • Rename the new table to the old name
  • Collect statistics
  • Release the additional disk space
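
A minimal sketch of these steps; the table names are illustrative:

CREATE SET TABLE MyTable_New
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (PK);

INSERT INTO MyTable_New
SELECT * FROM MyTable; -- fails with error 2801 if PK is not unique in the source

DROP TABLE MyTable;

RENAME TABLE MyTable_New TO MyTable;

COLLECT STATISTICS ON MyTable COLUMN (PK);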

Below is an example INSERT statement that shows the negative effect of a SET table when many rows share the same primary index value. First, we create two almost identical tables. The only difference is that one of them is a SET table, the other a MULTISET table:

CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

Next, we note the session id of our query:

SELECT SESSION;
7376827 

As the last step, we populate both tables with random data drawn from a vast table called BigTable, using only 500 distinct PI values. Since we use only a few different primary index values, we force Teradata to perform many duplicate row checks:

INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

If we compare the consumed resources of both INSERT statements, we immediately see the enormous difference:

SELECT TotalIOCount, CPUSeconds FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827;

              TotalIOCount   CPUSeconds
Set             126,827.00        20.15
Multiset          3,556.00         0.96

The difference in IOs and CPU seconds

As you can see in the table above, the difference between the SET and MULTISET table is enormous. It is not that the load is distributed unevenly; the required CPU seconds and IOs are many times higher in absolute numbers. The duplicate row checks explain the increase in CPU seconds; the additional IOs result from the constant restructuring of the data blocks, which are filled and split whenever they reach their maximum size.

What to consider when loading

Of the bulk load utilities, only MultiLoad is fully compatible with SET tables; the FastLoad utility automatically filters out all row duplicates during loading (except for NOPI tables), regardless of the table type.

We can exploit Fastload’s automatic duplicate filtering to remove duplicates from large tables: a FastExport followed by a Fastload does precisely this. Another possibility, if the table is not too large, is to create a copy as a SET table and copy the content of the MULTISET table into it, as sketched below.
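
A minimal sketch of the SET table copy approach; the table names are illustrative:

CREATE SET TABLE MyTable_Dedup AS MyTable WITH NO DATA;

INSERT INTO MyTable_Dedup
SELECT * FROM MyTable; -- row duplicates are silently discarded by the SET table

DROP TABLE MyTable;
RENAME TABLE MyTable_Dedup TO MyTable;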

  • If we have a NUPI and the table is defined as a SET table, will the duplicate row check happen when a row is not unique?

    • In this case, the duplicate row check will happen. If you want to avoid it, you could, for example, add a unique secondary index.

  • Hello Roland,
    I have a SET table with a UPI, and you write:
    “However, the performance of the SET table is not degraded when the UPI (Unique Primary Index) is defined for the table. Because the UPI itself guarantees uniqueness, the DUPLICATE ROW CHECK check is not performed.”

    However, duplicate rows (inserted by INSERT … SELECT) were discarded silently, similarly to the DUPLICATE ROW CHECK, and only non-duplicate rows were inserted.
    (This behavior differs from that of a MULTISET table with a UPI, where an error is reported (INSERT Failed 2801) and of course no rows are inserted.)

    Could you clarify the difference (with respect to the above) between your statement
    1. “… For commands INSERT INTO SELECT * FROM , duplicate rows are automatically discarded for SET tables (no error) …” – I mean particularly ‘automatically’.
    and
    2. the DUPLICATE ROW CHECK for the SET table?

    Thank you in advance
    Henri

  • Abel Debela says:

    Thank you so much. This article clearly helped me understand the difference.

  • Khurram Nazir says:

    This is why we have SET tables (which do NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  • Subbareddy K says:

    Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables
    Comment: CheckOpt = 'N' is possible for MULTISET tables also. If the table is a MULTISET table with a unique index (UPI or USI), then CheckOpt = 'N'.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables
    Comment: CheckOpt = 'Y' means the table is always MULTISET, but not vice versa.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >