Data Modeling considerations
In relational data modeling theory, each entity needs a primary key whose values uniquely identify each object. Teradata initially followed the rules of relational data modeling strictly: row duplicates were not allowed in a table, and Teradata Multiset Tables did not exist.
As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.
When the ANSI standard permitted row duplicates, Teradata added this option as well. Since then, Teradata has supported both SET tables (row duplicates not allowed) and MULTISET tables (row duplicates allowed).
Against this background, it is easy to remember that the default in Teradata mode is SET tables, while the default in ANSI mode is MULTISET tables.
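The session mode dependency is easy to demonstrate. In the sketch below the table names are illustrative; the point is that omitting the SET/MULTISET keyword leaves the table kind to the session mode, so it is good practice to state it explicitly:

```sql
-- In Teradata (BTET) session mode, this creates a SET table;
-- in ANSI session mode, the very same statement creates a MULTISET table.
CREATE TABLE ImplicitKind
(
  PK INTEGER NOT NULL
) PRIMARY INDEX (PK);

-- Stating the table kind explicitly makes the DDL independent of the session mode:
CREATE MULTISET TABLE ExplicitKind
(
  PK INTEGER NOT NULL
) PRIMARY INDEX (PK);
```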
Let me repeat the key aspects of SET and MULTISET tables:
Differences between Teradata Multiset Table and Set Table
- In contrast to the Teradata MULTISET table, SET tables forbid duplicate rows to be inserted with an INSERT statement or created with an UPDATE statement.
- For INSERT INTO … SELECT statements, duplicate rows are silently discarded for SET tables (no error occurs). However, an INSERT INTO <TABLE> VALUES (a, b) statement that would create a duplicate row aborts and returns an error message.
- There is no way to change an existing SET table into a MULTISET table. I don't know why this limitation exists…
- There is a negative performance impact for SET tables.
- Each time a row is inserted or updated, Teradata checks if this would violate row uniqueness.
- This test is called the DUPLICATE ROW CHECK, and it severely degrades performance if many rows with the same primary index value are inserted.
- The number of checks grows quadratically with the number of rows per primary index value: each new row must be compared against all existing rows with the same row hash!
- The performance of SET tables is not degraded when a UPI (Unique Primary Index) is defined on the table.
- As the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done. Instead of a UPI, we can also use a USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.
- Any index or constraint that guarantees uniqueness allows Teradata to skip the DUPLICATE ROW CHECK entirely.
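Because an existing SET table cannot be altered into a MULTISET table, the usual workaround is to rebuild it. A sketch, with hypothetical table names:

```sql
-- 1. Create a new table with the same structure, but as MULTISET.
CREATE MULTISET TABLE Sales_New
(
  PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

-- 2. Copy the data.
INSERT INTO Sales_New SELECT * FROM Sales;

-- 3. Swap the tables.
DROP TABLE Sales;
RENAME TABLE Sales_New TO Sales;
```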
SET tables are good candidates for performance improvement. The easiest way to find all SET tables on a Teradata system is to query the table “TABLES” in database “DBC”:
Identify the Teradata Multiset Table and Set Table
-- Set tables
SELECT * FROM DBC.TABLES WHERE CheckOpt = 'N' AND TableKind = 'T';

-- Multiset tables
SELECT * FROM DBC.TABLES WHERE CheckOpt = 'Y' AND TableKind = 'T';
All tables where uniqueness can be ensured programmatically – as is the case, for example, with a GROUP BY statement in the loading process – can be switched from SET to MULTISET. In this way, we can gain performance.
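For example, the deduplication can be moved from the table definition into the load statement itself. The staging and target table names below are hypothetical:

```sql
-- GROUP BY over all columns removes row-level duplicates programmatically,
-- so the target can safely be a MULTISET table (no DUPLICATE ROW CHECK needed).
INSERT INTO TargetTable (PK, DESCR)
SELECT PK, DESCR
FROM StagingTable
GROUP BY PK, DESCR;
```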
The magnitude of the improvement depends on the number of rows per primary index value.
Don't forget: The number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per Primary Index (PI) value!
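To judge in advance how badly a SET table candidate would be hit, we can inspect the row distribution per primary index value. A sketch (table and column names are placeholders):

```sql
-- Primary index values with many rows are the worst case for the DUPLICATE ROW CHECK.
SELECT PK, COUNT(*) AS RowsPerPIValue
FROM MyTable
GROUP BY PK
ORDER BY RowsPerPIValue DESC;
```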
Here is an example to prove the performance penalty of a SET table with many duplicate primary index values:
In our example, we create two identical tables:
CREATE SET TABLE TMP_SET
(
  PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
  PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
In the next step we check our session id, as we will need it to analyze the resource usage:
SELECT SESSION; --> 7376827
We insert random data into the SET and the MULTISET table, but use only 500 distinct Primary Index values to provoke a measurable performance impact.
The “descr” column is filled with random values so that no complete row duplicates occur (the SET table would silently filter them):
INSERT INTO TMP_MULTISET
SELECT RANDOM(1,500) AS x, RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT RANDOM(1,500) AS x, RANDOM(1,999999999) AS descr
FROM BigTable;
We compare the CPU seconds and DISK IOs for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
[Result table: Total IO and CPU Seconds for TMP_SET vs. TMP_MULTISET]
The above example shows that the SET table consumes considerably more disk IOs and CPU seconds than the MULTISET table.
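Instead of selecting all DBQL columns, the comparison can be narrowed to the relevant counters; the column names below follow DBC.DBQLogTbl:

```sql
SELECT QueryText, AMPCPUTime, TotalIOCount
FROM DBC.DBQLOGTBL
WHERE SessionID = 7376827
ORDER BY StartTime;
```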
Set Tables / Multiset Table and the Loading Utilities
While Multiload is compatible with both SET and MULTISET tables, this is not the case for the Fastload utility: Fastload always filters row duplicates, whether the target is a SET table or a MULTISET table.
If we want to get rid of row duplicates in a MULTISET table, we can combine FastExport and Fastload; the Fastload step filters the row duplicates.
Another way to remove row duplicates is to create a copy of the MULTISET table as a SET table and copy the rows with an INSERT…SELECT statement.
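A sketch of this approach, reusing the demo tables from above; the SET table silently discards the duplicates during the INSERT…SELECT:

```sql
-- Copy the structure into a SET table and let the INSERT...SELECT drop duplicates.
CREATE SET TABLE TMP_DEDUP
(
  PK    INTEGER NOT NULL,
  DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO TMP_DEDUP SELECT * FROM TMP_MULTISET;
```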