At this point, it is assumed you have settled all issues regarding the Teradata physical data model and table design, so that your tables are ready for regular use. It is now time to tell Teradata what to expect from a table before it is used in an SQL statement. The direction is clear: give the optimizer the information it needs to estimate its path through a query as close to reality as possible. A free and powerful mechanism for doing this is the collection of statistics.
The Scope of the Statistics Collection
Statistics can be collected on a single column as well as on a combination thereof.
The physical table design determines part of what you collect, but apart from that, it should be driven by table usage practice.
The first set of collection options should always be chosen:
on the dummy column PARTITION, irrespective of your table being partitioned or not
on the Primary Index columns (in combination)
on the Secondary Index columns (in combination)
on the Columns used for partitioning
The second, usage-driven set of collect options consists of
Join columns: columns that are typically used to join the table to others.
WHERE condition columns: columns frequently used to select data subsets from the table
Some columns, in our experience, do not qualify for statistics collection and should therefore be omitted unless strong case-by-case evidence indicates otherwise:
time and date columns in table landscapes of daily changes
Code, type or flag columns hardly used in follow-up queries
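The recommended collections above can be written as follows. This is only a sketch for a hypothetical table SalesFact with primary index (CustomerId), partitioning column SalesDate, and a frequently filtered column CountryCode; substitute your own table and column names:

```sql
-- Statistics on the dummy PARTITION column (cheap, always useful)
COLLECT STATISTICS ON SalesFact COLUMN (PARTITION);

-- Statistics on the Primary Index column combination
COLLECT STATISTICS ON SalesFact COLUMN (CustomerId);

-- Statistics on the column used for partitioning
COLLECT STATISTICS ON SalesFact COLUMN (SalesDate);

-- Usage-driven: a column frequently used in WHERE conditions
COLLECT STATISTICS ON SalesFact COLUMN (CountryCode);
```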
In order to understand how queries execute on a Teradata System, it is important to have an understanding of the parallel architecture of Teradata and how the components making up the system are interacting with each other.
The Teradata architecture described here is the same for each relevant release of Teradata. Additional features introduced over time are not covered, as they do not add to the general understanding of the system architecture.
I will not go into great detail here. If you are interested in more detailed information, I recommend reading the Teradata Manuals, which you can download for free. I will limit this post to information sufficient to understand parallelism on Teradata. The three main components are:
– Parsing Engine (PE)
– Access Module Processor (AMP)
– Virtual Disks
The Parsing Engine
The parsing engine is a process responsible for the following tasks:
– Session handling (session authorization, log on, log off)
– Parsing the SQL statement (syntax checking, looking up the used database objects)
– Optimization of the SQL statement (statistics are crucial at this time)
– Creation of the execution plan
– Dispatching the execution plan to the AMPs
– Some other tasks, such as character set conversion, which are not relevant for our purpose of understanding how SQL statements are executed.
The Parsing Engine is the piece of software communicating with the client applications (SQL Assistant, BTEQ, etc.)
Each Teradata system has several Parsing Engines running at the same time. The number of Parsing Engines can be increased dynamically by the system if needed. Each Parsing Engine can only handle a limited number of sessions.
The Teradata AMP
Each Teradata AMP is a process handling a portion of the database system on its own.
AMPs do the work on a Teradata system, and each AMP manages all work related to the rows assigned to it.
The main tasks of an AMP are storing rows, retrieving rows, sorting, and aggregation.
Each AMP returns its result set to the Parsing Engine (via the BYNET; see the details below). As we already know, the Parsing Engine communicates directly with the client tools, passing the complete result set to the client software, such as SQL Assistant.
Each AMP has its own memory and handles exactly one virtual disk. We are not interested in the concrete implementation details here, as they are not needed to understand how the components interact with each other.
AMPS and Parallelism
AMPs are the foundation for parallelism on Teradata. The goal of each Teradata system is to distribute work evenly across all available AMPs. If we can reach this aim, we will have a performance-optimized system regarding parallelism.
Finally, the BYNET is the network component linking the Parsing Engine and the AMPs; all communication between them travels over it.
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don't let my statement above fool you into thinking that GROUP BY is always better from a performance point of view. When a column has many distinct values, the preaggregation step of GROUP BY is not very efficient: Teradata has to sort the data to remove duplicates, but hardly any rows are eliminated. In this case, it may be better to do the redistribution first, i.e., use DISTINCT. Only if there are many duplicate values is GROUP BY probably the better choice, because most rows are already removed locally in the preaggregation step, before redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or a few duplicates only
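As a minimal illustration (assuming a hypothetical table Transactions with a CustomerId column), both statements return the same answer set, but they execute differently:

```sql
-- Few or no duplicates expected: DISTINCT redistributes rows immediately
SELECT DISTINCT CustomerId FROM Transactions;

-- Many duplicates expected: GROUP BY preaggregates on each AMP first,
-- so far fewer rows are sent across the BYNET
SELECT CustomerId FROM Transactions GROUP BY 1;
```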
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing can cause individual AMPs to run out of space.
If this happens, you will probably have better luck with GROUP BY, as duplicates are already removed in a first step and less data is moved across the AMPs.
The Teradata Collect Statistics Using Sample feature explained
Collecting sample statistics on a Teradata system helps to cut workload. Used cautiously, it can produce execution plans as good as those based on full statistics.
What are sample statistics? By default, the optimizer uses a sample of 2% of the rows to get information about the data demographics. We can easily change the sample size by running the statement below (the new sample size is valid until you log off your session):
DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;
When can sample statistics be applied? For sample statistics, it is crucial that the table is not skewed. Skew can be checked by comparing the per-AMP space usage of the table in the DBC tables.
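One possible way to measure table skew is the following sketch; DBC.TableSizeV is the standard view for per-AMP permanent space usage, and the database and table names are placeholders:

```sql
-- Skew factor: how far the biggest AMP deviates from the average.
-- 0% means a perfectly even distribution; values above ~5% indicate skew.
SELECT
    TableName,
    100 * (1 - AVG(CurrentPerm) / MAX(CurrentPerm)) AS SkewFactorPct
FROM DBC.TableSizeV
WHERE DatabaseName = 'MyDatabase'
  AND TableName    = 'MyTable'
GROUP BY TableName;
```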
If the skew is small (<= 5%), sample statistics can be used. I would not be concerned about collecting full statistics on small tables, but collecting comprehensive statistics on big tables consumes a lot of CPU and disk I/Os; such tables should be considered candidates for sample statistics:
COLLECT STATISTICS USING SAMPLE ON <TABLE> COLUMN (<COLUMN>);
Sample statistics are not possible on global temporary tables, join indexes, and volatile tables.
There are two major problems when using functions on join columns.
The Teradata Optimizer cannot make use of statistics: string manipulation functions like SUBSTR() and TRIM(), the LIKE predicate, COALESCE(), and CASE WHEN expressions force the optimizer to fall back on heuristics for its estimations. If you cannot avoid using a function, at least try different strategies and figure out which one allows for the best execution plan. Nevertheless, needing functions in your joins usually means that you have a weak physical data model: either it is technically outdated, inaccurate with respect to the logical data model, or it has to serve two masters.
Ask yourself: why is there no separate column that stores the essential information in the form you need it?
Using functions on your join columns forces the optimizer to do a full table scan, slowing down your queries
Not every function has the same negative impact on performance. I have experienced different results for LIKE 'a%' versus SUBSTR(column,1,1) = 'a'.
You have to experiment a little with the various alternatives and choose the one that impairs performance the least. The best approach, though, is to avoid functions on any join column by design.
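As a sketch of the design-level fix (the tables Orders and CustomerSegments and the extra column CustomerPrefix are hypothetical examples, not from any real schema):

```sql
-- Problematic: the function on the join column prevents statistics
-- usage and typically forces a full table scan.
SELECT o.*
FROM Orders o
JOIN CustomerSegments s
  ON SUBSTR(o.CustomerCode, 1, 2) = s.SegmentCode;

-- Better: persist the derived value as its own column once,
-- then join on the plain column so statistics can be used.
ALTER TABLE Orders ADD CustomerPrefix CHAR(2);
UPDATE Orders SET CustomerPrefix = SUBSTR(CustomerCode, 1, 2);

SELECT o.*
FROM Orders o
JOIN CustomerSegments s
  ON o.CustomerPrefix = s.SegmentCode;
```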
In relational data modeling theory, each entity needs a primary key. Each primary key value uniquely identifies an object. Teradata initially followed the rules of relational data modeling tightly and did not allow row duplicates in a table.
As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.
When the ANSI standard permitted row duplicates, Teradata added this option as well. The ANSI view on duplicates led Teradata to support both SET tables (no row duplicates allowed) and MULTISET tables (row duplicates allowed).
This background makes clear why the default in Teradata mode is SET tables, while the default in ANSI mode is MULTISET tables.
Let me repeat the key aspects of SET and MULTISET tables:
In contrast to MULTISET tables, SET tables forbid duplicate rows to be inserted with an INSERT statement or created with an UPDATE statement.
SET tables can have a negative performance impact.
For INSERT INTO <TABLE> SELECT ... statements, duplicate rows are automatically discarded for SET tables (no error occurs). INSERT INTO <TABLE> VALUES (a, b) statements, on the other hand, will abort and throw an error message for duplicate rows.
There is no way to change an existing SET table into a MULTISET table. I don’t know why this limitation exists…
There is a negative performance impact for SET tables. Each time a row is inserted or updated, Teradata checks whether this would violate row uniqueness. This test is called the DUPLICATE ROW CHECK, and it will severely degrade performance if many rows with the same primary index value are inserted. The number of checks grows quadratically with the number of rows per primary index value, because each new row has to be compared against all rows already stored with the same row hash!
Still, the performance of SET tables is not degraded when a UPI (Unique Primary Index) is defined on the table. As the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done.
Instead of a UPI, we can also use a USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.
In short: any index or constraint that guarantees uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK.
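For illustration, here is a sketch of a SET table whose UPI makes the duplicate row check unnecessary (table and column names are made up):

```sql
-- The UPI already guarantees uniqueness, so Teradata skips the
-- DUPLICATE ROW CHECK on inserts and updates.
CREATE SET TABLE Customer
(
    CustomerId INTEGER NOT NULL,
    Name       VARCHAR(100)
) UNIQUE PRIMARY INDEX (CustomerId);
```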
SET tables are good candidates for performance improvement. The easiest way to find all SET tables on a Teradata system is to query the table "TABLES" in database "DBC":
SELECT * FROM DBC.TABLES WHERE CheckOpt = 'N' AND TableKind = 'T'; -- SET tables
SELECT * FROM DBC.TABLES WHERE CheckOpt = 'Y' AND TableKind = 'T'; -- MULTISET tables
All tables where uniqueness is guaranteed programmatically – as is the case when the loading process deduplicates with a GROUP BY statement – can be switched from SET to MULTISET. In this way, we can gain performance. The magnitude of the improvement depends on the number of rows per primary index value.
Again: the number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per Primary Index (PI) value!
Here is an example to prove the performance penalty of a SET table with many duplicate primary index values:
In our example, we create two identical tables:
CREATE SET TABLE TMP_SET ( PK INTEGER NOT NULL, DESCR INTEGER NOT NULL ) PRIMARY INDEX (PK);
CREATE MULTISET TABLE TMP_MULTISET ( PK INTEGER NOT NULL, DESCR INTEGER NOT NULL ) PRIMARY INDEX (PK);
-- In the next step we check our session id, as we will need it to analyze the resource usage:
SELECT SESSION;
-- We insert random data into the SET and the MULTISET table but use only 500 distinct
-- Primary Index values to cause some impact on performance. The DESCR column has to be
-- unique, as row-level duplicates would be filtered out:
INSERT INTO TMP_MULTISET SELECT RANDOM(1,500) AS PK, RANDOM(1,999999999) AS DESCR FROM <ANY_TABLE_WITH_ABOUT_500000_RECORDS>;
INSERT INTO TMP_SET SELECT RANDOM(1,500) AS PK, RANDOM(1,999999999) AS DESCR FROM <ANY_TABLE_WITH_ABOUT_500000_RECORDS>;
-- We compare the CPU seconds and disk I/Os for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
The example above shows that the SET table consumes far more disk I/Os and CPU seconds.
Set Tables / Multiset Tables and the Loading Utilities
While MultiLoad is compatible with both SET and MULTISET tables, this is not the case for the FastLoad utility: FastLoad filters row duplicates regardless of whether the target is a SET table or a MULTISET table.
If we want to get rid of row duplicates in a MULTISET table, we can use a combination of FastExport and FastLoad: the FastLoad will filter out the row duplicates.
Is there anything missing in this blog post, or is something wrong that we should correct?
Roland Wenzlofsky is a graduated computer scientist and Data Warehouse professional working with the Teradata database system for more than 15 years. He is experienced in the fields of banking and telecommunication with a strong focus on performance optimization.
My mission is to make DWHPro a leading independent network for professionals working with Teradata from all walks of business life, across cultures, countries, employment statuses, industries or roles.