Teradata Multiset vs. Set Tables

In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Teradata initially followed the rules of relational data modeling tightly and did not allow row duplicates in a table.

As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.

When the ANSI standard permitted row duplicates, Teradata added this option as well. The ANSI view on duplicates led Teradata to support both SET tables (no row duplicates allowed) and MULTISET tables (row duplicates allowed).

This background explains the defaults: in Teradata session mode, tables are created as SET tables by default, while in ANSI session mode the default is MULTISET.
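Because the default depends on the session mode, it is safer to state the table type explicitly in the DDL. A minimal sketch (the table name is made up for illustration):

```sql
-- Explicitly declaring SET or MULTISET avoids surprises when the same
-- DDL runs in Teradata mode (default: SET) and ANSI mode (default: MULTISET).
CREATE MULTISET TABLE demo_table
(
  id INTEGER NOT NULL
) PRIMARY INDEX (id);
```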

Let me repeat the key aspects of SET and MULTISET tables:

  • In contrast to MULTISET tables, SET tables forbid duplicate rows from being inserted with an INSERT statement or created with an UPDATE statement.
  • For INSERT INTO … SELECT statements, duplicate rows are silently discarded on SET tables (no error occurs). An INSERT INTO <TABLE> VALUES (a, b) statement that would create a duplicate row, however, aborts with an error message.
  • An existing SET table cannot be changed into a MULTISET table; it has to be dropped and recreated. I don’t know why this limitation exists…
  • SET tables carry a performance penalty. Each time a row is inserted or updated, Teradata checks whether this would violate row uniqueness. This test is called the DUPLICATE ROW CHECK, and it severely degrades performance when many rows with the same primary index value are inserted: each new row must be compared against all existing rows with the same row hash, so the number of comparisons grows quadratically with the number of rows per primary index value.
  • However, SET table performance does not suffer when the table has a UPI (Unique Primary Index). Because the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done.

Instead of a UPI, we can also use a USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.

Any index or constraint that ensures uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK entirely.
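As a sketch of this point (table and column names are made up for illustration), a SET table with a UPI avoids the duplicate row check:

```sql
-- SET table with a Unique Primary Index: the UPI itself guarantees
-- uniqueness, so Teradata skips the DUPLICATE ROW CHECK on every
-- insert and update.
CREATE SET TABLE customer
(
  customer_id   INTEGER NOT NULL,
  customer_name VARCHAR(100)
) UNIQUE PRIMARY INDEX (customer_id);
```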

SET tables are good candidates for performance improvement. The easiest way to find candidates on a Teradata system is to query the table “TABLES” in database “DBC”:

SELECT * FROM DBC.TABLES WHERE checkopt = 'N' AND TABLEKIND = 'T';  -- Set tables

SELECT * FROM DBC.TABLES WHERE checkopt = 'Y' AND TABLEKIND = 'T';  -- Multiset tables

Note that checkopt = 'N' is not a reliable indicator of a SET table: a MULTISET table with a unique index (UPI or USI) also has checkopt = 'N'. Only checkopt = 'Y' always identifies a MULTISET table (see the comments below).

All tables where uniqueness is already ensured programmatically – for example, by a GROUP BY over all columns in the loading statement – can be recreated as MULTISET instead of SET. In this way, we can gain performance. The magnitude of the improvement depends on the number of rows per primary index value.
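A minimal sketch of this approach (table names are hypothetical): since a SET table cannot be altered to MULTISET, a new MULTISET table is created and loaded with a deduplicating GROUP BY, after which it can replace the original:

```sql
-- Recreate the SET table as MULTISET; uniqueness is now ensured by the
-- GROUP BY over all columns instead of the DUPLICATE ROW CHECK.
CREATE MULTISET TABLE sales_new
(
  pk    INTEGER NOT NULL,
  descr INTEGER NOT NULL
) PRIMARY INDEX (pk);

INSERT INTO sales_new
SELECT pk, descr
FROM sales_old
GROUP BY pk, descr;  -- removes row-level duplicates programmatically
```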

Again: the number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per Primary Index (PI) value! With n rows sharing a row hash, roughly n*(n-1)/2 comparisons are needed in total.

Here is an example that demonstrates the performance penalty of a SET table with many duplicate primary index values:

In our example, we create two identical tables:

CREATE SET  TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET  TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

-- In the next step we check our session id, as we will need it to analyze the resource usage:

SELECT SESSION;

7376827

-- We insert random data into the SET and MULTISET tables, but use only 500 distinct primary index values
-- to create some performance impact. The "descr" column has to be unique, as row-level duplicates would be filtered out.
INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS pk,
RANDOM(1,999999999) AS descr
FROM
<ANY_TABLE_WITH_ABOUT_500000_RECORDS>;

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS pk,
RANDOM(1,999999999) AS descr
FROM
<ANY_TABLE_WITH_ABOUT_500000_RECORDS>;

-- We compare the CPU seconds and disk IOs for the SET and the MULTISET table:

SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;

Table      Total IO     CPU Seconds
SET        126,827.00   20.15
MULTISET   3,556.00     0.96

The example above shows that the SET table consumes far more disk IOs and CPU seconds than the MULTISET table.

Set Tables / Multiset Tables and the Loading Utilities

While MultiLoad is compatible with both SET and MULTISET tables, this is not the case for the FastLoad utility: FastLoad filters out row duplicates regardless of whether the table is a SET or a MULTISET table.

If we want to get rid of row duplicates in a MULTISET table, we can therefore use a combination of FastExport and FastLoad; the FastLoad step filters out the row duplicates.
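As an alternative that stays inside SQL (table names are made up for illustration), the same deduplication can be sketched with an INSERT … SELECT DISTINCT into a fresh MULTISET table, avoiding the export/import round trip:

```sql
-- Copy the MULTISET table without row-level duplicates.
CREATE MULTISET TABLE tmp_multiset_dedup
(
  pk    INTEGER NOT NULL,
  descr INTEGER NOT NULL
) PRIMARY INDEX (pk);

INSERT INTO tmp_multiset_dedup
SELECT DISTINCT pk, descr
FROM TMP_MULTISET;  -- DISTINCT removes row-level duplicates
```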

Is there anything missing in this blog post, or is something wrong that we should correct?

Please leave us a comment!

Teradata Multiset vs. Set Tables – Usage Guidelines for the Expert, written by Roland Wenzlofsky on March 10, 2014.

10 COMMENTS

  1. This is why we have SET tables (do NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  2. Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘N’ AND TABLEKIND = ‘T’; — Set Tables
    Comment: Checkopt = 'N' is possible for MULTISET tables also. If the table is MULTISET with a unique index (UPI or USI), then Checkopt = 'N'.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘Y’ AND TABLEKIND = ‘T’; — Multiset tables
    Comment: Checkopt = 'Y' means the table is always MULTISET, but not vice versa.

  3. Nice explanation, but the statements below are wrong:

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘N’ AND TABLEKIND = ‘T’; — Set Tables
    Comment: Checkopt = 'N' is possible for MULTISET tables also. If the table is MULTISET with a unique index (UPI or USI), then Checkopt = 'Y'.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘Y’ AND TABLEKIND = ‘T’; — Multiset tables
    Comment: Checkopt = 'Y' means the table is always MULTISET, but not vice versa.

  4. Hi Roland

    In the very last sentence, it should be the SET table which is using more CPU & IO.
    It's a typo, I guess.


    Nitin

  5. Hi,

    The two sentences below are contradictory… Can you please clarify my doubt on the conversion of a SET to a MULTISET table?
    1. Don’t forget that there is no way to change a SET table to a MULTISET table after it has been created.
    2. In case you already ensure uniqueness programmatically (with a GROUP BY statement, for example), you can switch easily from SET to MULTISET tables, which could give you some performance gain.

    Thanks in Advance.

    Regards,
    Sunil

    • Hi. I think my description is not clear enough. I will improve this article.

      What I really tried to say with point 2 is that you can avoid using SET tables and use MULTISET tables instead if you already ensure row uniqueness with a GROUP BY statement on all columns.

      I hope this is clearer now.

      Roland
