The Ultimate Teradata CREATE TABLE Checklist
The creation of tables in an ever-recurring task, so I thought it was time to offer a checklist.1
Will you define a Primary Index?
The Primary Index is used for various applications. It usually is designed for even row distribution and as the main access path to your data.
But there are scenarios where it’s could be beneficial to create a table without a primary index (NOPI).
Staging is one example: By defining a stage tables without a primary index, we can speed upload processing (for example, fastloads will not need to sort and distribute rows in the final load phase). If this helps to improve overall load performance, of course, depends on the later steps, you have to consider if and how they get slowed down because of a missing primary index.
More details about the usage of NOPI tables you can find here: NOPI Tables on Teradata2
Will your Primary Index be skewed?
Almost always, your main goal is to have a Primary Index which distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data. Don’t rely on small amounts of test data as a base for your primary index decision. Data demographics of production data can change significantly. I experienced this several times; I will give you one example of an epic fail from real life:
Once upon a time in a European Telecommunication company: The main customer table’s primary index was mapped 1:1 from the correspondent source system column, let’s call it “customer_id”. During the development phase of the project, several files with customer test data were delivered, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.
The whole physical data model was optimized to support shop members to look up the customer status during calls (i.e. tactical workload). Furthermore, the client’s table could be joined very efficient with other related tables, such as call details to support reporting.
Everything looked perfect until we entered production phase. On the first day, the load times for the customer table increased from about 30 minutes to 6 hours with a subsequent “no more disk space” error.
What happened? During the test phase, all delivered files were related to “contract” customers. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…
The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
In rare cases, you would like to “design” a skewed primary index, for example, if you want to force all rows into one data block of one AMP. This way you could achieve rowhash locking instead of table locks and other advantages. For details, please read our article All Rows on one AMP by Design.3
Did you choose the correct CHARACTER SET for each Column?
While it saves time to leave out such considerations, in the end, it never pays off. Of course, you could define each single character column as UNICODE. But such an approach comes with two disadvantages: you will waste valuable disk space and decrease the performance of requests related to these tables.
If you are unlucky, such a sloppy approach will lead to a disaster. Here another example from real life. It recently happened in a major European bank:
We had to deal with tables, each of them carrying several hundreds of columns. Load times (fastload) were horrible: 6 hours for a ridiculous 50GB of data!
When I was called to the project to analyze these performance issues, it turned out that each single column was defined with character set UNICODE. Some simple calculations revealed that data rows were huge: Only 1-2 rows were fitting into each data block.
Put briefly, a redesign of dozens of table DDL’s was required. By changing all columns to character set LATIN, and only keeping a few as UNICODE, load times could be reduced drastically as more rows could be packed into each block.
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever4
Do Data Types and CHARACTER SETS match across all Tables?
Data type consistency is a crucial point but often overseen. Data Type and Character Set Conversions during query executions are expensive operations (CPU time) and may prevent AMP local activities (such as joins). Rows from different tables joined via their primary index are only co-located on the same AMP if data types match. The same holds true for character sets.5
Can the Primary Index be UNIQUE?
I would recommend: always define your primary index unique if possible. Again, it’s a very sloppy way of developing if you skip uniqueness because it’s more convenient (“no problem with duplicates”).
Hint: This may not always be possible, for example in the case of row partitioned tables, if the partition column is not included in the primary index.
There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:
- NUPI tables drastically reduce the possibilities of the Teradata Optimizer to use the Nested Join strategy.
- No duplicate row checks are needed in the case of SET tables, as the UPI guarantees uniqueness.
Can the Table be a MULTISET Table?
If you don’t have to make sure that duplicate rows are declined, always use MULTISET tables. They don’t need a costly duplicate row check when data is inserted. Furthermore, in the case of a skewed primary index, the difference in performance, when adding rows, can become huge.
More details about SET and MULTISET tables you can find here: Teradata Multiset versus Set Tables7
When is the best Time to add an Index?
If you have secondary indexes or join indexes defined on your table, it’s always a good idea to drop them before inserting big amounts of data into already populated tables. You can recreate the indexes afterward, and it will be much cheaper, especially if something goes wrong, like in the following example:
Recently, I worked on the following problem:
Somebody had executed an insert statement. As is was still not finished after hours, the same person decided to abort the session. As a result, a rollback started.
I was informed the next day when the request was executing already 12 hours…
I rapidly collected the facts:
- The inserted primary index values were skewed; all rows went to the same AMP (an enormous amount of NULL values)
- The table was a SET table; this was further worsening the situation
- There were 4 NUSIs defined on the table
I decided to cancel the rollback as it was clear that it would run for at least 30 more hours!
Without the 4 NUSIs, the rollback would have been much faster. But maintaining five rows per table row (4 index sub-table rows plus the base table row) was increasing this disaster.8
Temporary Tables: Can Columns be made atomic?
If you are working with any temporary tables (volatile, global temporary or physical), you should always consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.
Furthermore, it’s a good practice to split columns which contain different types of information into separate columns. Make them atomic. Splitting will help to improve the performance of joining, as you can avoid join conditions such as “ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5).9
Does the Data Block Size support your Rows?
Since Teradata 14 we can define data blocks as big as 1024KB. If you increase data block size, the performance improvement can be high if you are dealing with enormous rows or have a lot of strategic queries scanning all rows.