Advantages of the Teradata Multiset Table - Design Guide for Performance

Advantages of the Teradata Multiset Table

Data Modeling considerations

In relational data modeling theory, each entity needs a primary key, and each primary key value uniquely identifies one object. Teradata initially followed these rules strictly: row duplicates were not allowed in a table, and Teradata Multiset Tables were not available.

As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.

When the ANSI standard permitted row duplicates, Teradata added this option as well. The ANSI standard's view on duplicates led Teradata to support both SET tables (no row duplicates allowed) and MULTISET tables (row duplicates allowed).

This background makes clear why the default in Teradata session mode is SET tables, while the default in ANSI session mode is MULTISET tables.
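Because the default depends on the session mode, it is safer to state the table type explicitly in the DDL. A minimal sketch (table and column names are made up for illustration):

```sql
-- Stating MULTISET explicitly makes the table type independent
-- of whether the session runs in Teradata or ANSI mode:
CREATE MULTISET TABLE sales_txn
(
  txn_id INTEGER NOT NULL,
  amount DECIMAL(12,2)
) PRIMARY INDEX (txn_id);
```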

Let me repeat the key aspects of SET and MULTISET tables:

Differences between Teradata Multiset Table and Set Table

  • In contrast to the Teradata MULTISET table, SET tables forbid duplicate rows to be inserted with an INSERT statement or created with an UPDATE statement.
  • For INSERT INTO … SELECT statements, duplicate rows are silently discarded for SET tables (no error occurs). In contrast, an INSERT INTO <TABLE> VALUES (a, b) statement that would create a duplicate row aborts and throws an error message.
  • There is no way to change an existing SET table into a MULTISET table. I don't know why this limitation exists…
  • There is a negative performance impact for SET tables.
    • Each time a row is inserted or updated, Teradata checks if this would violate row uniqueness.
    • This test is called the DUPLICATE ROW CHECK, and it severely degrades performance when many rows with the same primary index value are inserted.
    • The total number of checks grows quadratically: each new row must be compared against all rows already stored under the same primary index value!
  • The performance of SET tables is not degraded when a UPI (Unique Primary Index) is defined for the table.
    • Because the UPI itself guarantees uniqueness, no DUPLICATE ROW CHECK is needed. Instead of a UPI, we can also use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.
    • Any index or constraint that guarantees uniqueness allows Teradata to skip the DUPLICATE ROW CHECK.
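The different duplicate handling of the two statement types can be sketched as follows (the table name is made up; error 2802 is Teradata's duplicate row error):

```sql
-- Hypothetical demo table: SET with a non-unique primary index.
CREATE SET TABLE demo_set
(
  pk INTEGER NOT NULL
) PRIMARY INDEX (pk);

INSERT INTO demo_set VALUES (1);               -- ok

-- A VALUES insert of an identical row aborts:
INSERT INTO demo_set VALUES (1);               -- fails with error 2802 (duplicate row)

-- An INSERT ... SELECT silently discards duplicates instead:
INSERT INTO demo_set SELECT pk FROM demo_set;  -- completes without error, 0 rows inserted
```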

SET tables are good candidates for performance improvement. The easiest way to find all SET tables on a Teradata system is to query the table “TABLES” in database “DBC” :

Identify the Teradata Multiset Table and Set Table

SELECT * FROM DBC.TABLES 
WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables

SELECT * FROM DBC.TABLES 
WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables 

All tables where uniqueness can be guaranteed programmatically – for example, by a GROUP BY over all columns in the loading statement – can be switched from SET to MULTISET. In this way, we can achieve performance improvements.
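As a sketch of this approach (table and column names are made up), a GROUP BY over all columns removes row-level duplicates before they reach the MULTISET table:

```sql
-- De-duplicate during loading instead of relying on the SET property:
INSERT INTO target_multiset (pk, descr)
SELECT pk, descr
FROM staging_table
GROUP BY pk, descr;  -- keeps exactly one copy of each distinct row
```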

The magnitude of the improvement depends on the number of rows per primary index value.

Don't forget: the total number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per Primary Index (PI) value!

Here is an example to prove the performance penalty of a SET table with many duplicate primary index values:

In our example, we create two identical tables:

CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

In the next step we check our session id, as we will need it to analyze the resource usage:

SELECT SESSION;
--> 7376827 

We insert random data into the SET and the MULTISET table, but use only 500 different primary index values to provoke a measurable performance impact.

The "descr" column is filled with random values so that hardly any full row duplicates occur, as those would be filtered by the SET table:

INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS pk,
RANDOM(1,999999999) AS descr
FROM BigTable
;

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS pk,
RANDOM(1,999999999) AS descr
FROM BigTable 
;

We compare the CPU seconds and DISK IOs for the SET and the MULTISET table:

SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
             Total IO     CPU Seconds
Set          126,827.00   20.15
Multiset       3,556.00    0.96

The above example shows that SET tables consume much more Disk IOs and CPU seconds.
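To attribute the figures to the individual INSERT statements, the relevant DBQL columns can be queried directly. A sketch, assuming query logging is enabled and using the session id from above:

```sql
-- TotalIOCount and AMPCPUTime are the DBQL columns behind the figures above.
SELECT QueryText, TotalIOCount, AMPCPUTime
FROM DBC.DBQLOGTBL
WHERE SessionID = 7376827
ORDER BY StartTime;
```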

Set Tables / Multiset Table and the Loading Utilities

While Multiload is compatible with SET and MULTISET tables, this is not the case for the Fastload Utility. Fastload will filter row duplicates whether the table is a SET table or a MULTISET table.

If we want to get rid of row duplicates in a MULTISET table, we can use a combination of FastExport and Fastload; the Fastload will filter the row duplicates.

Another possibility to remove row duplicates is to create a copy of the MULTISET table as a SET table and copy the rows with an INSERT … SELECT statement; the SET table's duplicate handling discards the duplicates.
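A sketch of this approach (table names are made up, and it assumes the explicit SET keyword in CREATE TABLE … AS overrides the source table's type):

```sql
-- Create an empty SET copy of the (hypothetical) MULTISET table:
CREATE SET TABLE orders_dedup AS orders WITH NO DATA;

-- Copying via INSERT ... SELECT silently drops row-level duplicates:
INSERT INTO orders_dedup SELECT * FROM orders;

-- Swap the tables afterwards:
DROP TABLE orders;
RENAME TABLE orders_dedup TO orders;
```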

Roland Wenzlofsky
 

Roland Wenzlofsky is a computer science graduate and Data Warehouse professional who has worked with the Teradata database system for more than 15 years. He is experienced in the banking and telecommunications industries, with a strong focus on performance optimization.

  • Subbareddy K says:

    Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set Tables
    Comment: CheckOpt = 'N' is possible for MULTISET tables also. If the table is MULTISET with a unique index (UPI or USI), then CheckOpt = 'N'.

    SELECT * FROM DBC.TABLES WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables
    Comment: CheckOpt = 'Y' means the table is always MULTISET, but not vice versa.

  • Khurram Nazir says:

    This is why we have SET tables (Does NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  • Abel Debela says:

    Thank you so much. This article clearly helped me understand the difference.

  • Henri says:

    Hello Roland,
    If I have a SET table with UPI and you write:
    “However, the performance of the SET table is not degraded when the UPI (Unique Primary Index) is defined for the table. Because the UPI itself guarantees uniqueness, the DUPLICATE ROW CHECK check is not performed.”

    However, duplicate rows ( inserted by IaS ) were discarded (silently) similarly to DUPLICATE ROW CHECK and only non-duplicate rows were inserted.
    (This behavior differs from the behavior of the MULTISET table with UPI, where a reported error (INSERT Failed 2801) and of course no rows were inserted.)

    Could you clarify the difference between (with respect to the above) your statement
    1. “… For commands INSERT INTO SELECT * FROM , duplicate rows are automatically for SET tables (no error) …” – I mean particularly ‘automatically’.
    and
    2. the DUPLICATE ROW CHECK for the SET table?

    Thank you in advance
    Henri

  • SREHMAN says:

    If we have a NUPI and the table is defined as a SET table, will the duplicate row check happen or not when the row is not unique?
