Data Modeling Considerations for Teradata: SET and MULTISET Tables

Roland Wenzlofsky

April 21, 2023

minutes reading time


Data Modeling Considerations and Teradata MULTISET Tables

Relational data modeling dictates that each entity must possess a distinct primary key for identification purposes. Teradata’s initial iterations rigidly followed this principle and disallowed the replication of rows within a table. The concept of MULTISET tables had not yet emerged during this era.

As a logical data modeler, we only expect duplicate rows to exist in the physical data model. We cannot conceive of a business situation where storing duplicate rows at the row level would be necessary. We believe addressing duplicate rows through data cleansing during the ETL/ELT process is appropriate.

Teradata implemented the storage of row duplicates when it became permitted by the ANSI standard. However, introducing MULTISET tables was necessary to adhere to the ANSI standard’s stance on duplicates. These tables have the capability to store multiple copies of the same row.

Based on the aforementioned background, Teradata defaults to creating MULTISET tables in ANSI mode and SET tables in Teradata mode.

Differences between Teradata Set and Multiset Tables

  • In contrast to the Teradata MULTISET table, SET tables forbid duplicate rows inserted with an INSERT statement or created with an UPDATE statement. Teradata returns an error message if we try to insert duplicates.
  • SET tables can harm performance in some instances.
  • For INSERT INTO SELECT * FROM <TABLE> statements, duplicate rows are declined automatically for SET tables (no error occurs). Nevertheless, INSERT INTO <TABLE> VALUES (a, b) statements will abort and throw an error message for duplicate rows.

    Hint if we populate tables with INSERT INTO … SELECT and at the same time ensure referential integrity using a primary key definition: With SET tables, row duplicates are removed before the integrity check is performed. Therefore it is possible that errors in the data (duplicates) are not noticed, although they are present. Suppose the table is changed to a MULTISET table at some point. In that case, the primary key check will suddenly be triggered because there are unexpected key duplicates (if there are row duplicates, these are automatically also key duplicates). So we have to be careful here.

  • There is no way to change an existing SET table into a MULTISET table; we don’t know why this limitation exists.
  • Below are more details on why SET tables can hurt performance:
    • Teradata checks each row when being inserted or updated to see if it would violate row uniqueness.
    • This check is called the DUPLICATE ROW CHECK and will severely degrade performance if we insert many rows with the same primary index.
    • The number of checks required increases exponentially with each duplicate row added.
  • SET tables’ performance does not deteriorate when a UPI (Unique primary index) exists on the table.
    • As the UPI ensures (key) uniqueness, Teradata does not do a DUPLICATE ROW CHECK. Instead of a UPI, we can use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.
    • Any index or constraint which ensures uniqueness (and therefore allows Teradata to bypass the DUPLICATE ROW CHECK) bypasses the duplicate row check!

The analysis of SET tables for excessive DUPLICATE ROW CHECKS is part of a Teradata system’s performance tuning. The easiest way to find all SET tables on a Teradata system is to query the view DBC.TablesV:

Discovering SET and MULTISET Tables

SELECT * FROM DBC.TABLESV WHERE checkopt = ‘N’ AND TABLEKIND = ‘T’; — Set TablesSELECT * FROM DBC.TABLESV WHERE checkopt = ‘Y’ AND TABLEKIND = ‘T’; — Multiset tables 

Use SQL code like a GROUP BY statement for large SET tables to achieve uniqueness and improve performance. The performance benefit will vary based on the number of values per PI, as the number of DUPLICATE ROW CHECKS increases exponentially with the number of rows per PI. Avoid using SET tables to eliminate duplicates, as this can result in cumbersome code that may hide errors for an extended period. It is better to define the table type explicitly rather than relying on session semantics like ANSI or Teradata.

If the table is empty, Teradata only allows for direct conversion of a non-unique to a unique primary index. This limitation results in increased administrative responsibilities, which include:

  • Reserve additional database space
  • Create a new SET table with a Unique Primary Index
  • Copy the data from the existing table into the new table
  • Drop the existing table
  • Rename the new table to the old name
  • Collect statistics
  • Release the additional disk space

An example INSERT statement demonstrates the negative impact of duplicate rows in a SET table. Initially, we generate two nearly identical tables, with the sole distinction being that one is a SET table and the other is a MULTISET table:

CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

We note the query session.

SELECT SESSION;
7376827

Finally, we populate both tables with randomized data sourced from the BigTable, which contains only 500 unique PI values. Due to our limited primary index values, Teradata is compelled to execute numerous duplicate row checks.

INSERT INTO TMP_MULTISET
SELECT RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM BigTable;

Comparing the resource consumption between the two INSERT statements reveals a significant disparity.

SELECT TotalIOCount, CPUSeconds FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;

TotalIOCountCPUSeconds
Set 126.827,00 20,15
Multiset 3.556,00  0,96

The difference in IOs and CPU seconds

The disparity between SET and MULTISET tables is pronounced, as illustrated in the graph above. This discrepancy is not due to uneven load distribution but rather the significantly higher number of required CPU seconds and IOs. The elevated CPU seconds can be attributed to duplicate row checks. Moreover, the excess CPU seconds stem from permanently restructuring data blocks, which are filled and split upon reaching maximum capacity.

What to consider when loading

Multiload is the sole batch load utility that is compatible with SET tables. In contrast, Fastload automatically filters out all row duplicates upon loading into Primary Index Tables (not NOPI tables).

Large tables can have row duplicates automatically filtered through the use of Fastload. This tool can effectively remove duplicates.

A FastExport, followed by a Fastload, can also achieve this. For smaller tables, an alternative option is to create a SET table and copy the content from the MULTISET table.

Frequently Asked Questions: Creating Tables in Teradata

This FAQ will cover the most frequently asked questions regarding table creation in Teradata.

How do I create a table in Teradata?

Creating a table in Teradata involves using the CREATE TABLE statement with the desired table name, followed by the columns and their data types enclosed in parentheses.

Here’s a basic example:

CREATE TABLE department (
    department_id INTEGER,
    department_name VARCHAR(50),
    manager_id INTEGER,
    location VARCHAR(100)
) PRIMARY INDEX (department_id);

This will create a table named ‘department’ with four columns: department_id, department_name, manager_id, and location.

How to create a table in Teradata with the primary key?

To create a table with a primary key, include the PRIMARY KEY constraint in the CREATE TABLE statement. A primary key is a unique identifier for each record in a table and ensures that no duplicate records exist.

Here’s an example:

CREATE TABLE product (
    product_id INTEGER,
    product_name VARCHAR(50),
    price DECIMAL(10, 2),
    category VARCHAR(50),
    PRIMARY KEY (product_id)
) PRIMARY INDEX (category);

The product_id column is set as the primary key in the above example.

How can I create a table without Primary Index?

Creating a table without a Primary Index in Teradata is a straightforward process. Use the CREATE TABLE statement, followed by the table name and the columns with their respective data types enclosed in parentheses.

Here’s another example:

CREATE TABLE customer (
    customer_id INTEGER,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    phone VARCHAR(15)
);

This will create a table named ‘customer’ with five columns: customer_id, first_name, last_name, email, and phone, and no Primary Index.

How to create a table in Teradata using another table?

You can create a table in Teradata using another table as a template. This is useful when creating a new table with the same structure as an existing one. The CREATE TABLE … AS statement allows you to create a new table based on the structure and data of another table.

Here’s an example:

CREATE TABLE customer_backup AS customer WITH NO DATA;

This will create a new table named ‘customer_backup’ with the same structure as the ‘customer’ table but without any data.

If you want to create a new table with both the structure and data from an existing table, add the “WITH DATA” clause:

CREATE TABLE customer_copy AS customer WITH DATA;

This will create a new table named ‘customer_copy’ with the same structure and data as the ‘customer’ table.

  • If we have a NUPI , and Table is defined as SET table , In this scenario Row Check will happened or Not when row is not unique .

    • In this case, the duplicate row check will happen. If you want to avoid it, you could, for example, add a unique secondary index.

  • Hello Roland,
    If I have a SET table with UPI and you write:
    “However, the performance of the SET table is not degraded when the UPI (Unique Primary Index) is defined for the table. Because the UPI itself guarantees uniqueness, the DUPLICATE ROW CHECK check is not performed.”

    However, duplicate rows ( inserted by IaS ) were discarded (silently) similarly to DUPLICATE ROW CHECK and only non-duplicate rows were inserted.
    (This behavior differs from the behavior of the MULTISET table with UPI, where a reported error (INSERT Failed 2801) and of course no rows were inserted.)

    You could clarify the difference between (and with respect to the above) your statement
    1. “… For commands INSERT INTO SELECT * FROM , duplicate rows are automatically for SET tables (no error) …” – I mean particulary ‘automatically’.
    and
    2. DUPLICATE ROW CHECK for the SET table?

    Thank you in advance
    Henri

  • Avatar
    Abel Debela says:

    Thank you so much. This article clearly helped me understand the difference.

  • Avatar
    Khurram Nazir says:

    This is why we have SET tables (Does NOT allow duplicate rows) and MULTISET tables (row duplicates allowed) on Teradata.

  • Avatar
    Subbareddy K says:

    Small correction to my first comment.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘N’ AND TABLEKIND = ‘T’; — Set Tables
    Comment: Checkopt= ‘N’ is possible for MULTISET tables also.If the table is MULTISET with Unique index(UPI or USI) then Checkopt=’N’.

    SELECT * FROM DBC.TABLES WHERE checkopt = ‘Y’ AND TABLEKIND = ‘T’; — Multiset tables
    Comment: Checkopt=’Y’ means tables is always MULTISET but not in reverse order(Vice Varsa).

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >