Data Modeling Considerations and Teradata MULTISET Tables
In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Early releases of Teradata followed the rules of relational data modeling tightly and did not allow row duplicates in a table. At that time, the concept of MULTISET tables did not exist.
As logical data modelers, we anticipate duplicate rows only in the physical data model. From a business point of view, we can’t imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is, in our opinion, the correct approach to deal with row duplicates.
When the ANSI standard permitted storing row duplicates, Teradata followed suit; the ANSI view on duplicates forced Teradata to introduce MULTISET tables, which can store row duplicates.
The background described above should clarify why Teradata creates MULTISET tables by default in ANSI mode and SET tables in Teradata mode.
Differences between Teradata Set and Multiset Tables
- In contrast to a Teradata MULTISET table, a SET table forbids duplicate rows, whether inserted with an INSERT statement or created with an UPDATE statement. Teradata returns an error message if we try to insert duplicates.
- SET tables can harm performance in some instances.
- For INSERT INTO <TABLE> SELECT … statements, duplicate rows are silently discarded for SET tables (no error occurs). By contrast, INSERT INTO <TABLE> VALUES (a, b) statements abort and throw an error message for duplicate rows.
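A minimal sketch of both behaviors, using a hypothetical single-column SET table (table name and error number are illustrative of Teradata session mode):

```sql
CREATE SET TABLE TmpSet (col1 INTEGER NOT NULL) PRIMARY INDEX (col1);

INSERT INTO TmpSet VALUES (1);

-- Silently discarded: the duplicate is dropped, no error is raised
INSERT INTO TmpSet SELECT 1;

-- Aborts with a duplicate row error in Teradata session mode
INSERT INTO TmpSet VALUES (1);
```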
Hint: Be careful when filling tables with INSERT INTO … SELECT while also enforcing key uniqueness through a primary key definition. With SET tables, row duplicates are removed before the uniqueness check is performed, so duplicates in the data may go unnoticed even though they are present. If the table is later changed to a MULTISET table, the primary key check will suddenly be triggered because there are unexpected key duplicates (row duplicates are, by definition, also key duplicates). So we have to be careful here.
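The masking effect might look like this; table, column, and constraint names are purely illustrative:

```sql
-- SET table with a PRIMARY KEY constraint
CREATE SET TABLE Sales (
    SaleId INTEGER NOT NULL,
    Amount DECIMAL(10,2) NOT NULL,
    CONSTRAINT pk_sales PRIMARY KEY (SaleId)
) PRIMARY INDEX (SaleId);

-- The source delivers the same row twice; the SET table silently
-- drops the second copy, so no key violation is ever reported.
INSERT INTO Sales
SELECT SaleId, Amount FROM StagingSales;

-- If Sales is later recreated as a MULTISET table, the same load
-- aborts: the duplicate row now reaches the primary key check.
```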
- There is no way to change an existing SET table into a MULTISET table; we don’t know why this limitation exists.
- Below are more details on why SET tables can hurt performance:
- Teradata checks each row when being inserted or updated to see if it would violate row uniqueness.
- This check is called the DUPLICATE ROW CHECK and will severely degrade performance if we insert many rows with the same primary index.
- Each new row must be compared against every row already stored under the same row hash, so the total number of checks grows quadratically with the number of rows per primary index value.
- SET tables’ performance does not deteriorate when a UPI (Unique Primary Index) is defined on the table.
- As the UPI ensures (key) uniqueness, Teradata does not do a DUPLICATE ROW CHECK. Instead of a UPI, we can use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.
- Any index or constraint that guarantees uniqueness allows Teradata to skip the DUPLICATE ROW CHECK entirely.
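Two hedged sketches of table definitions that let Teradata skip the DUPLICATE ROW CHECK (table and column names are our own examples, not from the original):

```sql
-- A UPI enforces uniqueness through the index itself:
CREATE SET TABLE OrdersUpi (
    OrderId    INTEGER NOT NULL,
    CustomerId INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (OrderId);

-- Alternatively, a USI on a table with a non-unique primary index
-- achieves the same effect:
CREATE SET TABLE OrdersUsi (
    OrderId    INTEGER NOT NULL,
    CustomerId INTEGER NOT NULL
) PRIMARY INDEX (CustomerId)
UNIQUE INDEX (OrderId);
```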
The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to find all SET tables on a Teradata system is to query the view DBC.TablesV:
Discovering SET and MULTISET Tables
SELECT * FROM DBC.TablesV WHERE CheckOpt = 'N' AND TableKind = 'T'; -- SET tables

SELECT * FROM DBC.TablesV WHERE CheckOpt = 'Y' AND TableKind = 'T'; -- MULTISET tables
For all large SET tables, it is worth considering for performance reasons whether we can achieve uniqueness with SQL code such as a GROUP BY instead. The potential improvement depends on the number of rows per PI value, since the number of DUPLICATE ROW CHECKS grows quadratically with the rows per PI. We do not recommend using SET tables to remove unwanted duplicates; such an approach leads to hard-to-maintain code and hides errors that often go undetected for a long time. We also consider it good practice to always define the intended table type explicitly rather than relying on session semantics (ANSI or Teradata mode).
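Deduplicating explicitly in SQL, instead of relying on SET semantics, could look like this (table and column names are illustrative):

```sql
-- One copy per distinct row, loaded into a MULTISET table
INSERT INTO TargetMultiset (pk, descr)
SELECT pk, descr
FROM StagingTable
GROUP BY pk, descr;
```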
Be aware that Teradata supports a direct change from a non-unique to a unique primary index only when the table is empty. This restriction introduces some additional administrative burdens. We must:
- Reserve additional database space
- Create a new SET table with a Unique Primary Index
- Copy the data from the existing table into the new table
- Drop the existing table
- Rename the new table to the old name
- Collect statistics
- Release the additional disk space
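The steps above can be sketched in SQL as follows; table and column names are our own placeholders:

```sql
-- 1) Create the new SET table with a Unique Primary Index
CREATE SET TABLE Sales_New (
    SaleId INTEGER NOT NULL,
    Amount DECIMAL(10,2) NOT NULL
) UNIQUE PRIMARY INDEX (SaleId);

-- 2) Copy the data from the existing table
INSERT INTO Sales_New SELECT * FROM Sales;

-- 3) Drop the old table and rename the new one
DROP TABLE Sales;
RENAME TABLE Sales_New TO Sales;

-- 4) Collect statistics on the renamed table
COLLECT STATISTICS ON Sales COLUMN (SaleId);
```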
Below is an example INSERT statement that shows the negative effect of a SET table for many duplicate rows. First, we create two almost identical tables. The only difference is that one of them is a SET table, the other a MULTISET table:
CREATE SET TABLE TMP_SET
(
    PK    INTEGER NOT NULL,
    DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
    PK    INTEGER NOT NULL,
    DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
Next, we note the session id of our query:
SELECT SESSION;

7376827
As the last step, we populate both tables with random data from a vast table called BigTable, with only 500 different PI values. Since we use only a few different primary index values, we force Teradata to perform many duplicate row checks:
INSERT INTO TMP_MULTISET
SELECT RANDOM(1,500) AS x, RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT RANDOM(1,500) AS x, RANDOM(1,999999999) AS descr
FROM BigTable;
If we compare the consumed resources of both INSERT statements, we immediately see the enormous difference:
SELECT TotalIOCount, CPUSeconds
FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827;
As you can see in the graph above, the difference between the SET and MULTISET tables is enormous. It is not that the load is distributed unevenly; rather, the required CPU seconds and IOs are many times higher in absolute numbers. The duplicate row checks explain the increase in CPU seconds, while the additional IOs stem from the permanent restructuring of the data blocks, which are filled and split whenever they reach their maximum size.
What to consider when loading
Of the load utilities, only Multiload preserves row duplicates and is therefore the natural fit for MULTISET tables. The Fastload utility automatically filters out all row duplicates during loading (except for NOPI tables).
We can exploit Fastload’s automatic duplicate filtering to remove duplicates from large tables: a FastExport followed by a Fastload does precisely this. Another possibility, if the table is not too large, is to create a copy as a SET table and copy the content of the MULTISET table into it.
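The SET-table copy approach might look like this sketch; the table name is illustrative, and we assume the SET keyword overrides the source table kind in a CREATE TABLE … AS statement:

```sql
-- Create an empty SET copy of the MULTISET table
CREATE SET TABLE Sales_Dedup AS Sales WITH NO DATA;

-- SET semantics silently drop the row duplicates during the copy
INSERT INTO Sales_Dedup SELECT * FROM Sales;
```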