Data Modeling Considerations
In relational data modeling theory, each entity needs a primary key, and each primary key value uniquely identifies an object. Early releases of Teradata followed the rules of relational data modeling strictly and did not allow row duplicates in a table; Teradata MULTISET tables were not yet available.
As logical data modelers, we would expect duplicate rows only in the physical data model. From a business point of view, it is hard to imagine a scenario that requires storing row-level duplicates. In our opinion, data cleansing as part of the ETL/ELT chain is the correct approach for dealing with row duplicates.
When the ANSI standard permitted the use of row duplicates, Teradata added this possibility: the ANSI view on duplicates led Teradata to introduce MULTISET tables (row duplicates allowed) alongside the traditional SET tables (no row duplicates allowed).
This background explains why the default in Teradata session mode is SET tables, while the default in ANSI session mode is MULTISET tables.
Differences between Teradata Set and Multiset Tables
- In contrast to Teradata MULTISET tables, SET tables forbid duplicate rows, whether inserted with an INSERT statement or created by an UPDATE statement.
- SET tables can have a negative performance impact in certain cases.
- For INSERT INTO <TABLE> SELECT ... statements, duplicate rows are automatically discarded for SET tables (no error occurs). However, an INSERT INTO <TABLE> VALUES (a, b) statement that would create a duplicate row aborts with an error message.
- There is no way to change an existing SET table into a MULTISET table (or vice versa); we don't know why this limitation exists.
- Set tables can hurt performance:
- Each time a row is inserted or updated, Teradata checks if this would violate row uniqueness.
- This check is called the DUPLICATE ROW CHECK, and it severely degrades performance if many rows with the same primary index value are inserted.
- The number of checks grows quadratically with the number of rows per primary index value, because each new row must be compared against every row already stored with the same row hash.
- SET tables' performance is not degraded when a unique primary index (UPI) is defined on the table.
- Because the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done. Instead of a UPI, we can use a unique secondary index (USI), or any column with a UNIQUE or PRIMARY KEY constraint.
- In short: any index or constraint that guarantees uniqueness allows Teradata to skip the DUPLICATE ROW CHECK entirely.
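The behavioral differences above can be sketched with a small example; the table name is illustrative, and the error code shown is the one Teradata typically raises for duplicate rows:

```sql
CREATE SET TABLE DemoSet
( PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO DemoSet VALUES (1, 100);  -- succeeds
INSERT INTO DemoSet VALUES (1, 100);  -- aborts: duplicate row error (2802)
INSERT INTO DemoSet SELECT 1, 100;    -- duplicate is silently discarded
```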
The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to find all SET tables on a Teradata system is to query the view DBC.TablesV:
Finding SET and MULTISET Tables
```sql
-- SET tables
SELECT * FROM DBC.TablesV WHERE CheckOpt = 'N' AND TableKind = 'T';

-- MULTISET tables
SELECT * FROM DBC.TablesV WHERE CheckOpt = 'Y' AND TableKind = 'T';
```
For all large SET tables, it is worth considering for performance reasons whether uniqueness can be achieved more cheaply in SQL, for example with a GROUP BY step. The potential improvement depends on the number of rows per PI value, since the number of DUPLICATE ROW CHECKS grows quadratically with it. In general, we do not recommend using SET tables to remove unwanted duplicates; this leads to hard-to-maintain code and hides errors that often go undetected for a long time. We consider it good practice always to define the intended table type explicitly rather than relying on session semantics (ANSI or Teradata mode).
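Deduplicating explicitly in SQL might look like the following sketch, assuming a hypothetical staging table Staging and a MULTISET target table Target:

```sql
-- Remove duplicates in the load statement instead of
-- relying on a SET table's DUPLICATE ROW CHECK
INSERT INTO Target (PK, DESCR)
SELECT PK, DESCR
FROM Staging
GROUP BY PK, DESCR;  -- keeps exactly one copy of each distinct row
```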
Be aware that Teradata supports a direct change from a non-unique to a unique primary index only when the table is empty. This introduces some additional administrative burden: reserve additional database space if necessary, create a new SET table with a unique primary index, copy the data from the existing table, drop the existing table, rename the new one, re-collect statistics and any other required metadata, and release the additional database space.
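The steps above can be sketched as follows; the table and column names are illustrative:

```sql
-- Create the replacement table with a unique primary index
CREATE SET TABLE MyTable_New
( PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (PK);

-- Copy the data, then swap the tables
INSERT INTO MyTable_New SELECT * FROM MyTable;
DROP TABLE MyTable;
RENAME TABLE MyTable_New TO MyTable;

-- Re-collect statistics on the new table
COLLECT STATISTICS COLUMN (PK) ON MyTable;
```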
Below is an example INSERT statement which shows the negative effect of a SET table:
First we create the same table as SET and MULTISET table:
```sql
CREATE SET TABLE TMP_SET
( PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
( PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
```
Next, we remember the session number of our query:
```sql
SELECT SESSION;
-- 7376827
```
As the last step, we populate both tables with random data from a huge table called BigTable, restricting the primary index to only 500 distinct values. This ensures that sufficient DUPLICATE ROW CHECKS are triggered for the SET table:
```sql
INSERT INTO TMP_MULTISET
SELECT RANDOM(1,500)       AS x,
       RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT RANDOM(1,500)       AS x,
       RANDOM(1,999999999) AS descr
FROM BigTable;
```
If we compare the consumed resources of both INSERT statements, we immediately see the enormous difference:
```sql
SELECT TotalIOCount, CPUSeconds
FROM DBC.DBQLogTbl
WHERE SessionID = 7376827;
```
What to consider when loading
Only the MultiLoad utility is compatible with SET tables. The FastLoad utility automatically filters out all row duplicates during loading (except for NoPI tables).
We can use FastLoad's automatic filtering of row duplicates to our advantage to remove duplicates from large tables: a FastExport followed by a FastLoad does exactly this. Another possibility, if the table is not too large, is to create a copy as a SET table and copy the contents over from the MULTISET table.
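For a table that is not too large, the SET-table copy mentioned above might look like this sketch (table names are illustrative):

```sql
-- Create an empty SET copy of the MULTISET table's definition
CREATE SET TABLE MyTable_Dedup AS MyTable WITH NO DATA;

-- INSERT ... SELECT into a SET table silently discards duplicate rows
INSERT INTO MyTable_Dedup SELECT * FROM MyTable;
```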