Teradata Table Design Checklist
Creating tables is an ever-recurring task, so I thought it was time to offer a checklist.
- Primary Index or NOPI?
- Is the Primary Index distributing rows evenly?
- Is the correct character set used?
- Are character sets and data types matching across tables?
- Can the Primary Index be unique?
- Should the table be MULTISET?
- Should secondary indexes be created?
- Can temporary tables be normalized?
- Is the correct data block size used?
1. Primary Index Or NOPI Table?
The Primary Index is used for various purposes. It is usually optimized for even row distribution and used as the primary access path to the data.
But there are scenarios where creating a table without a primary index (NOPI) could be beneficial.
Staging is one example:
By defining stage tables without a primary index, the load performance can be increased (for example, fastloads will not need to sort and distribute rows in the final loading phase).
Whether NOPI tables improve the overall load performance depends on the later steps in the ETL process and the impact of the missing primary index.
You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata
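A minimal sketch of such a stage table, assuming a fastload target (table and column names are illustrative):

```sql
-- Stage table without a primary index (NOPI): fastload can append rows
-- without sorting and redistributing them in the final loading phase.
CREATE MULTISET TABLE stage_customer
(
    customer_id   BIGINT,
    customer_name VARCHAR(200) CHARACTER SET LATIN
)
NO PRIMARY INDEX;
```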
2. Is The Primary Index Skewed?
The goal is to have a Primary Index that distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data.
Don’t rely on small amounts of test data as a base for your primary index decision.
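To check how evenly a candidate primary index distributes rows, a query along these lines can be run against production-like data (table and column names are illustrative):

```sql
-- Count rows per AMP for the candidate primary index column.
-- A heavily skewed result shows up as one AMP with far more rows than the rest.
SELECT  HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS amp_no,
        COUNT(*)                                  AS row_count
FROM    customer
GROUP BY 1
ORDER BY 2 DESC;
```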
Data demographics of production data can change significantly. I have experienced this several times; I will give you one real-life example (my client was a European telco):
The primary index of the main customer table was mapped 1:1 from the corresponding source system column. Let’s call it “customer_id”.
Several files with customer test data were delivered during the project’s development phase, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.
The whole physical data model was optimized to let call-center employees look up the customer value in real time during their calls (i.e., tactical workload).
Furthermore, the customer table could be joined efficiently with other related tables, such as call details, to support reporting.
Everything looked perfect until we entered the production phase. The customer table’s load times increased from about 30 minutes to 6 hours on the first day with a subsequent “no more disk space” error.
What had happened?
All delivered files were related to “contract” customers during the test phase. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…
The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
In rare cases, you might deliberately design a skewed primary index, for example, to force all rows into one data block on one AMP.
By doing this, you can achieve rowhash locking instead of table locks and have other advantages. Please read our article All Rows on one AMP by Design for details.
3. Which CHARACTER SET To Use?
While it saves time to skip such considerations, it never pays off in the end. Of course, you could define every single character column as UNICODE.
But such an approach has two disadvantages: you waste valuable disk space and decrease the performance of requests against these tables.
Such a sloppy approach can lead to a disaster. Here is one example from real life. It recently happened in a European central bank:
We had to deal with tables carrying several hundreds of columns. Load times (fast load) were horrible: 6 hours for a ridiculous 50GB of data!
When I was called to the project to analyze these performance issues, it turned out that every single column was defined with the character set UNICODE.
Some simple calculations revealed that data rows were wide: Only 1-2 rows fit into each data block.
A redesign of dozens of tables was required. After changing most columns to the character set LATIN and keeping only a few UNICODE columns, load times dropped drastically because more rows could be packed into each data block.
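The resulting pattern might look like this sketch, where only columns that genuinely need multi-language content stay UNICODE (names are illustrative):

```sql
CREATE MULTISET TABLE customer_address
(
    address_id  BIGINT NOT NULL,
    street      VARCHAR(100) CHARACTER SET LATIN,   -- LATIN: 1 byte per character
    city        VARCHAR(50)  CHARACTER SET LATIN,
    name_native VARCHAR(100) CHARACTER SET UNICODE  -- UNICODE (2 bytes/char) only where needed
)
PRIMARY INDEX (address_id);
```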
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever
4. Are Data Types And CHARACTER SETS Matching Across Tables?
Data type consistency is crucial but often overlooked.
Data Type and Character Set Conversions during query executions are expensive operations (CPU time) and prevent AMP local activities (such as row-key-based merge joins). Rows from different tables joined on their primary index are only co-located on the same AMP if data types match. The same holds for character sets.
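A sketch of two tables designed for an AMP-local join (names are illustrative):

```sql
-- Both tables define customer_id as BIGINT and use it as the primary index,
-- so rows with the same customer_id hash to the same AMP and can be joined
-- without redistribution or data type conversion.
CREATE MULTISET TABLE customer
(
    customer_id BIGINT NOT NULL,
    customer_nm VARCHAR(100) CHARACTER SET LATIN
)
UNIQUE PRIMARY INDEX (customer_id);

CREATE MULTISET TABLE call_detail
(
    call_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,   -- identical type to customer.customer_id
    call_ts     TIMESTAMP(0)
)
PRIMARY INDEX (customer_id);
```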
5. Can A Unique Primary Index Be used?
Whenever possible, define your primary index as unique. Skipping uniqueness just because it is more convenient (“no problem with duplicates”) is sloppy development.
A unique primary index may not always be possible, for example, in row partitioned tables, if the partition column is not included in the primary index. In this case, you can achieve uniqueness by adding a USI (unique secondary index).
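A sketch of this situation, assuming a daily row-partitioned table (names and the date range are illustrative):

```sql
-- The partitioning column sales_dt is not part of the primary index,
-- so the PI cannot be declared unique; a USI on sales_id enforces
-- uniqueness instead.
CREATE MULTISET TABLE sales
(
    sales_id BIGINT NOT NULL,
    sales_dt DATE NOT NULL,
    amount   DECIMAL(18,2)
)
PRIMARY INDEX (sales_id)
PARTITION BY RANGE_N(sales_dt BETWEEN DATE '2020-01-01'
                     AND DATE '2029-12-31' EACH INTERVAL '1' DAY)
UNIQUE INDEX (sales_id);
```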
There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:
- NUPI tables drastically reduce the possibilities of the Teradata Optimizer to use the Nested Join strategy.
- No duplicate row checks (see point 6) are needed in the case of SET tables, as the UPI guarantees uniqueness.
6. Can MULTISET Tables Be Used?
Always use MULTISET tables unless you need duplicate rows to be rejected.
To understand duplicate row checks, you must know how Teradata distributes rows across all AMPs using hashing.
Each time a row is inserted into a table, the parsing engine determines a hash value and uses the hashmap to determine the AMP that should write the row to its logical disk.
To do this, a message consisting of the TableId (which identifies the table uniquely), the RowHash (the hash value), and the values of the Primary Index columns is sent to the AMP.
Why is the value of the primary index columns sent?
Two different Primary Index values can generate the same row hash. It’s rare, but it happens.
This means that to sort a row into the right place in the data block, both the hash value and the values of the primary index columns must be compared.
Since no two identical rows may be inserted into a SET table, each new row must be compared against all existing rows with the same rowhash (“duplicate row check”).
The number of required comparisons grows quadratically with the number of rows per rowhash, so skewed insert statements can overload one or more AMPs.
If you cannot define a MULTISET table, a duplicate row check can be prevented by creating a unique secondary index!
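A minimal sketch of the recommended default (names are illustrative):

```sql
-- MULTISET skips the duplicate row check on insert; if uniqueness is still
-- required, enforce it with a UPI (or a USI) instead of a SET table.
CREATE MULTISET TABLE transaction_stage
(
    txn_id BIGINT NOT NULL,
    txn_ts TIMESTAMP(0),
    amount DECIMAL(18,2)
)
PRIMARY INDEX (txn_id);
```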
You can find more details about SET and MULTISET tables here: Teradata Multiset versus Set Tables
7. Should A Secondary Index Be Created?
Delete secondary indexes and join indexes from tables before loading them with many rows, even if the loading method does not strictly require it (as with transactional load methods such as BTEQ and TPump).
It is usually better to remove the indexes, load the tables, and re-create the indexes afterward.
Here is an example of why this approach is advantageous.
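The drop/load/re-create pattern might look like this sketch (index and table names are illustrative):

```sql
-- 1. Drop the NUSI before the mass load so it is not maintained row by row.
DROP INDEX idx_customer_city ON customer;

-- 2. Run the fastload / INSERT ... SELECT here.

-- 3. Re-create the index in one pass and refresh its statistics.
CREATE INDEX idx_customer_city (city_cd) ON customer;
COLLECT STATISTICS ON customer INDEX idx_customer_city;
```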
Once I had to solve the following problem:
A user had started an insert statement, which was not finished after 2 hours. Finally, this user became impatient and aborted the query.
This led to a rollback. I was notified the following day when the query was already in “ABORTED” status for another 10 hours.
I analyzed the situation:
- The inserted primary index values were skewed, and all rows ended up on the same AMP (mostly NULL values)
- The table was a SET table; this worsened the situation because of “duplicate row checks”.
- There were 4 NUSIs defined in the table.
I decided to cancel the rollback because this query had a massive impact on the performance of the Teradata system.
This is always an emergency solution because the table is then in an inconsistent state!
The combination of the SET table, four secondary indices, and the enormous skew had led to this catastrophic result.
The significant number of duplicate row checks performed by only one AMP was a considerable problem. But of course, the maintenance of 4 NUSIs also had an enormous negative effect on the system’s performance in this situation.
This problem could have been prevented with the above method.
8. Can Columns Be Normalized On The Fly With Temporary Tables?
If you are using temporary tables (volatile, global temporary, or physical tables), you should consider improving all areas mentioned above (data types, character sets, primary index, etc.) to best support your workload.
It is also a good practice to normalize columns in temporary tables. This prevents functions from being applied to these columns to obtain the required part of the information.
Functions applied to columns usually create a problem in joins as well as in WHERE conditions:
- They often prevent the use of statistics in these columns.
- They require a redistribution of the rows in joins.
Make columns atomic. Splitting them helps improve join performance, as you can avoid join conditions such as “ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)”.
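One way to do this is to apply the function once while materializing a volatile table, so all later joins use a plain, statistics-friendly column (names are illustrative):

```sql
-- Extract the atomic part of the key once, instead of applying SUBSTRING
-- in every join condition; the primary index co-locates rows for the join.
CREATE VOLATILE MULTISET TABLE vt_orders AS
(
    SELECT  SUBSTRING(order_key FROM 1 FOR 5) AS customer_cd,
            order_key,
            order_amt
    FROM    orders
)
WITH DATA
PRIMARY INDEX (customer_cd)
ON COMMIT PRESERVE ROWS;

-- Later joins can now use a plain equality condition:
-- SELECT ... FROM customer c JOIN vt_orders o ON c.customer_cd = o.customer_cd;
```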
9. Does The Data Block Size Support Your Type Of Workload?
Starting with Teradata 14, it is possible to create data blocks up to a size of 1MB.
Please note the following:
The larger the data blocks, the more data can be copied from the storage device to main memory in one read operation. However, more main memory is needed per data block.
In the worst case, Teradata reads a 1MB data block to return a single row.
Smaller data blocks are usually recommended for the tactical workload on a Teradata system, as individual rows are queried.
Larger data blocks are recommended for strategic workloads (including a full table scan and aggregations).
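A sketch of a table tuned for strategic, scan-heavy workload (the block size value and names are illustrative):

```sql
-- Table options such as DATABLOCKSIZE go between the table name and the
-- column list; 1MB blocks require Teradata 14 or later.
CREATE MULTISET TABLE sales_history,
    DATABLOCKSIZE = 1048576 BYTES
(
    sales_id BIGINT NOT NULL,
    sales_dt DATE,
    amount   DECIMAL(18,2)
)
PRIMARY INDEX (sales_id);
```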
– “Some simple calculations revealed that data rows were huge: Only 1-2 rows were fitting into each data block.”
Is there a SQL way to find out how many rows of a table are stored in each data block?
On a side note, a Google search for “dwhpro the ultimate Teradata create table checklist” didn’t return the expected results.
If you are on Teradata 14.10 or above and have statistics version 6 enabled, you will get this information with the SHOW STATISTICS statement.
COLLECT SUMMARY STATISTICS
/* Version */ 6,
/* AvgRowsPerBlock */ 337.500000,
/* AvgBlockSize (bytes) */ 256512.000000,
/* AvgRowSize */ 64033.000000,
The only problem is that the number AvgRowsPerBlock is currently wrong (a bugfix from Teradata will follow). But you can easily calculate the average number of rows per block from the two correct measures:
(AvgBlockSize / AvgRowSize) = Average Rows per Block
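The statements to collect and display these values might look like this sketch (the table name is illustrative, and the exact SHOW syntax may vary by release):

```sql
-- Collect summary statistics, then display them including the
-- AvgBlockSize and AvgRowSize measures shown above.
COLLECT SUMMARY STATISTICS ON sales_history;
SHOW SUMMARY STATISTICS VALUES ON sales_history;
```

Using the sample output above: 256512.0 / 64033.0 ≈ 4 rows per block, not the 337.5 reported by AvgRowsPerBlock.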
Another option would be to use the Ferret utility, but this requires access to the Teradata node (and probably a Linux root user).
Hope this answered your question.
Thanks for the comment about Google. In my experience, it can take some weeks until Google gets the search results right.