9 Things to consider when creating a Teradata Table
Teradata Table Design Checklist
The creation of tables is an ever-recurring task, so I thought it was time to offer a checklist.
- Primary Index or NOPI?
- Is the Primary Index distributing rows evenly?
- Is the correct character set used?
- Are character set and data types matching across tables?
- Can the Primary Index be unique?
- Should the table be MULTISET?
- Should secondary indices be available?
- Can temporary tables be normalized?
- Is the correct data block size used?
1. Primary Index or NOPI Table?
The Primary Index serves various purposes. It is usually optimized for even row distribution and used as the main access path to the data.
But there are scenarios where it can be beneficial to create a table without a primary index (NOPI).
Staging is one example:
By defining stage tables without a primary index, load performance can be increased (for example, a FastLoad will not need to sort and distribute rows in its final loading phase).
Whether NOPI tables improve the overall load performance depends, of course, on the later steps of the ETL process and on the impact of the missing primary index.
You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata
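A NOPI staging table is declared by replacing the primary index clause with NO PRIMARY INDEX. A minimal sketch (table and column names are illustrative):

```sql
-- Staging table without a primary index: FastLoad can append rows
-- without sorting or redistributing them in its final phase.
CREATE MULTISET TABLE stage_customer
(
    customer_id   BIGINT,
    customer_name VARCHAR(200) CHARACTER SET UNICODE,
    load_ts       TIMESTAMP(0)
)
NO PRIMARY INDEX;
```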
2. Is the Primary Index skewed?
The goal is to have a Primary Index which distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data.
Don't rely on small amounts of test data as a base for your primary index decision.
Data demographics of production data can change significantly. I experienced this several times; I will give you one real-life example (my client was a European telco):
The main customer table's primary index was mapped 1:1 from the correspondent source system column, let's call it "customer_id".
During the development phase of the project, several files with customer test data were delivered, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.
The whole physical data model was optimized to support call-center employees in looking up the customer value in real time during their calls (i.e., tactical workload).
Furthermore, the customer table could be joined very efficiently with other related tables, such as call details, to support reporting.
Everything looked perfect until we entered the production phase. On the first day, the load times for the customer table increased from about 30 minutes to 6 hours with a subsequent "no more disk space" error.
What had happened?
During the test phase, all delivered files were related to "contract" customers. No one ever thought about testing "prepaid" customers. It turned out that all prepaid customers were assigned to a single "customer_id" in the source system...
The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
In rare cases, you would like to "design" a skewed primary index, for example, if you want to force all rows into one data block of one AMP.
Doing this, you can achieve rowhash locking instead of table locks and have other advantages. For details, please read our article All Rows on one AMP by Design.
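Before settling on a primary index, you can check how a candidate column would distribute rows across the AMPs with the hash functions. A sketch, assuming a hypothetical customer table and column:

```sql
-- Row counts per AMP for a candidate primary index column.
-- A heavily skewed column shows one AMP with a much higher count.
SELECT
    HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS amp_no,
    COUNT(*)                                  AS row_count
FROM customer
GROUP BY 1
ORDER BY 2 DESC;
```

Running this against production-like data would have exposed the single "customer_id" shared by all prepaid customers before go-live.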
3. Which CHARACTER SET to use?
While it saves time to leave out such considerations, in the end, it never pays off. Of course, you could define each single character column as UNICODE.
But such an approach comes with two disadvantages: you will waste valuable disk space and decrease the performance of requests related to these tables.
Such a sloppy approach can lead to a disaster. Here is one example from real life. It happened recently at a major European bank:
We had to deal with tables, each of them carrying several hundreds of columns. Load times (fast load) were horrible: 6 hours for a ridiculous 50GB of data!
When I was called to the project to analyze these performance issues, it turned out that every single column was defined with character set UNICODE.
Some simple calculations revealed that the data rows were huge: only 1-2 rows fit into each data block.
A redesign of dozens of tables was required. By changing all columns to the character set LATIN, and only keeping a few UNICODE columns, the load times were reduced drastically as more rows could be packed into each data block.
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success - The Best ever
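Character sets are assigned per column in the DDL. A sketch (column sizes are illustrative): LATIN stores one byte per character, UNICODE two, so defaulting everything to UNICODE doubles the space for columns that never hold non-Latin data.

```sql
CREATE MULTISET TABLE account
(
    account_id  BIGINT NOT NULL,
    iban        CHAR(34)     CHARACTER SET LATIN,   -- 34 bytes
    holder_name VARCHAR(200) CHARACTER SET UNICODE  -- up to 400 bytes
)
PRIMARY INDEX (account_id);
```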
4. Are Data Types and CHARACTER SETS matching across Tables?
Data type consistency is a crucial point but often overlooked.
Data Type and Character Set Conversions during query executions are expensive operations (CPU time) and prevent AMP local activities (such as row-key based merge joins). Rows from different tables joined on their primary index are only co-located on the same AMP if data types match. The same holds true for character sets.
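The point is easiest to see in the DDL of two tables that are frequently joined. A sketch with illustrative names: because customer_id has the identical type in both tables, matching rows hash to the same AMP and can be merge-joined without conversion or redistribution.

```sql
CREATE MULTISET TABLE customer
(
    customer_id BIGINT NOT NULL,
    segment     VARCHAR(20) CHARACTER SET LATIN
)
UNIQUE PRIMARY INDEX (customer_id);

CREATE MULTISET TABLE call_detail
(
    call_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,   -- same type as customer.customer_id
    duration_s  INTEGER
)
PRIMARY INDEX (customer_id);
```

Had call_detail declared customer_id as, say, VARCHAR, every join would first have to convert the column and redistribute the rows.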
5. Can a Unique Primary Index be defined?
If possible, always define your primary index as unique. Skipping uniqueness merely because it is more convenient ("no problem with duplicates") is sloppy development.
A unique primary index may not always be possible, for example in the case of row-partitioned tables, if the partition column is not included in the primary index. In this case, you can achieve uniqueness by adding a USI (unique secondary index).
There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:
- NUPI tables drastically reduce the possibilities of the Teradata Optimizer to use the Nested Join strategy.
- No duplicate row checks (see point 6) are needed in the case of SET tables, as the UPI guarantees uniqueness.
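The row-partitioned case can be sketched as follows (names and date ranges are illustrative): the partition column sale_date is not part of the primary index, so the PI cannot be unique, and a USI restores the uniqueness guarantee.

```sql
CREATE MULTISET TABLE sales
(
    sale_id   BIGINT NOT NULL,
    sale_date DATE   NOT NULL,
    amount    DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2020-01-01'
                      AND DATE '2029-12-31' EACH INTERVAL '1' MONTH)
UNIQUE INDEX (sale_id);   -- USI enforcing uniqueness across partitions
```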
6. Can MULTISET tables be used?
If you don't have to make sure that duplicate rows are rejected, always use MULTISET tables.
To understand what duplicate row checks are, you need to know how Teradata distributes the data over the AMPs using hashing.
Each time a row is inserted into a table, the parsing engine determines a hash value and uses the hashmap to determine the AMP that should write the row to its logical disk.
To do this, a message consisting of the TableId (uniquely identifying the table), the RowHash (the hash value), and the content of the primary index columns is sent to the AMP.
Why is the value of the primary index columns sent?
This is because two different Primary Index values can generate the same row hash. It's rare, but it happens.
This means that in order to sort a row into the data block at the right place, not only the hash value must be compared, but also the content of the primary index columns.
Since no two identical rows may be inserted into a SET table, each new row to be inserted must be compared against all existing rows with the same rowhash ("duplicate row check").
The number of required comparisons grows quadratically with the number of rows per rowhash and, with skewed insert statements, finally leads to one or more AMPs being overloaded.
If you cannot define a MULTISET table, then a duplicate row check can be prevented by creating a unique secondary index!
You can find more details about SET and MULTISET tables here: Teradata Multiset versus Set Tables
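The only change needed is the keyword in the DDL. A sketch with illustrative names: the MULTISET keyword tells Teradata to skip the duplicate row check entirely and keep duplicates if they occur.

```sql
-- MULTISET: no duplicate row check on insert; duplicate rows are kept.
CREATE MULTISET TABLE web_click
(
    session_id BIGINT,
    click_ts   TIMESTAMP(6),
    url        VARCHAR(500) CHARACTER SET LATIN
)
PRIMARY INDEX (session_id);
```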
7. Should a Secondary Index be defined?
Delete secondary indexes and join indexes from tables before loading them with a large number of rows, even if the loading method does not require this (transactional load methods such as BTEQ, TPUMP).
It is usually better to remove the indexes, load the tables, and then create the indexes.
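The drop-load-recreate pattern can be sketched like this (index and table names are hypothetical):

```sql
-- 1. Remove the NUSI before the mass insert.
DROP INDEX idx_cust_name ON customer;

-- 2. Load the rows without per-row index maintenance.
INSERT INTO customer
SELECT * FROM stage_customer;

-- 3. Recreate the NUSI in one pass over the loaded table.
CREATE INDEX idx_cust_name (customer_name) ON customer;
```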
Here is an example of why this approach is advantageous.
Once I had to solve the following problem:
A user had started an insert statement, which was not finished after 2 hours. Finally, this user became impatient and aborted the query.
This led to a rollback. I was notified the following day when the query was already in "ABORTED" status for another 10 hours.
I analyzed the situation:
- The inserted primary index values were skewed and all rows ended up on the same AMP (mostly NULL values)
- The table was a SET table; this was further worsening the situation because of "duplicate row checks" being done
- There were 4 NUSIs defined on the table
I decided to cancel the rollback because this query had a massive impact on the performance of the Teradata system.
This is always an emergency solution because the table is then in an inconsistent state!
The combination of the SET table, 4 secondary indices, and the enormous skew had led to this catastrophic result.
The huge number of duplicate row checks performed by only one AMP was a major problem. But of course, the maintenance of 4 NUSIs additionally had an enormous negative effect on the performance of the system in this situation.
This problem could have been prevented with the above method.
8. Can Columns be normalized on the fly with Temporary Tables?
If you are using temporary tables (volatile, global temporary or physical tables), you should at first consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.
It is also a good practice to normalize columns in temporary tables. This prevents functions from having to be applied to these columns to obtain the required part of the information.
Functions applied to columns usually create a problem in joins as well as in WHERE conditions:
- They often prevent the use of statistics on these columns.
- They require a redistribution of the rows in joins.
Make columns atomic. Splitting helps to improve join performance, as you can avoid join conditions such as "ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)".
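Splitting a composite key while materializing a temporary table might look like this (table and column names, and the 5-character prefix, are purely illustrative):

```sql
-- Volatile table that splits a composite key into atomic columns,
-- so later joins can use plain equality conditions.
CREATE VOLATILE TABLE vt_contract AS
(
    SELECT
        SUBSTRING(contract_key FROM 1 FOR 5) AS region_code,
        SUBSTRING(contract_key FROM 6)       AS contract_no,
        contract_value
    FROM contract
)
WITH DATA
PRIMARY INDEX (region_code, contract_no)
ON COMMIT PRESERVE ROWS;
```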
9. Does the Data Block Size support your type of Workload?
Starting with Teradata 14, it is possible to create data blocks of up to 1MB.
Please note the following:
The larger the data blocks, the more data can be copied from mass storage to main memory in one read operation. However, more main memory is needed per data block.
In the worst case, Teradata reads a 1MB block of data to return a single row.
For tactical workload on a Teradata system, smaller data blocks are usually recommended, as individual rows are queried.
Larger data blocks are recommended for strategic workloads (which often include a full table scan and aggregations).
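The block size is set per table with the DATABLOCKSIZE option. A sketch for a strategic, scan-heavy table (names are illustrative):

```sql
-- 1MB blocks (Teradata 14+) favor full table scans and aggregations;
-- tactical single-row lookups would be better served by smaller blocks.
CREATE MULTISET TABLE sales_history,
    DATABLOCKSIZE = 1048576 BYTES
(
    sale_id   BIGINT NOT NULL,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
PRIMARY INDEX (sale_id);
```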
--"Some simple calculations revealed that data rows were huge: Only 1-2 rows were fitting into each data block."
Is there a SQL way to find out how many rows of a table are being stored in each data block?
On a side note, a Google search for "dwhpro the ultimate teradata create table checklist" didn't return the expected results.
If you are on Teradata 14.10 or above and have statistics version 6 enabled, you will get this information with the SHOW STATISTICS statement.
COLLECT SUMMARY STATISTICS
/* Version */ 6,
/* AvgRowsPerBlock */ 337.500000,
/* AvgBlockSize (bytes) */ 256512.000000,
/* AvgRowSize */ 64033.000000,
The only problem is that the number AvgRowsPerBlock is currently wrong (a bugfix from Teradata will follow). But you can easily calculate the average number of rows per block from the other two measures, which are correct:
(AvgBlockSize / AvgRowSize) = Average Rows per Block
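Applied to the SHOW STATISTICS output above, the calculation looks like this:

```python
# Recompute the average rows per block from the two correct measures,
# since the reported AvgRowsPerBlock (337.5) is currently wrong.
avg_block_size = 256512.0  # AvgBlockSize (bytes)
avg_row_size = 64033.0     # AvgRowSize (bytes)

avg_rows_per_block = avg_block_size / avg_row_size
print(round(avg_rows_per_block, 2))  # 4.01 -- about 4 rows per block
```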
Another option would be to use the Ferret utility, but this would require access to the Teradata node (and probably a Linux root user).
Hope this answered your question.
Thanks for the comment about Google. In my experience, it can take some weeks until Google gets the search results right.