Teradata Table Design Checklist
Since creating tables is such a frequent task, I have put together a checklist worth working through.
- Primary Index or NOPI?
- Is the Primary Index distributing rows evenly?
- Is the correct character set used?
- Are character sets and data types matching across tables?
- Can the Primary Index be unique?
- Should the table be MULTISET?
- Should secondary indices be available?
- Can temporary tables be normalized?
- Is the correct data block size used?
1. Primary Index Or NOPI Table?
The Primary Index serves multiple purposes: it is the main access path to the data, and it should be chosen to distribute rows evenly across the AMPs.
However, designing a table without a primary index can be advantageous in some situations.
Staging is a prime illustration:
Defining stage tables without a primary index can boost load performance: FastLoad does not have to sort and redistribute rows during the final loading phase.
The effect of NOPI tables on load performance is contingent upon subsequent stages in the ETL process and the consequences of a lack of a primary index.
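As a minimal sketch (database, table, and column names are invented for illustration), a staging table without a primary index looks like this:
-- Staging table without a primary index (NOPI):
-- FastLoad can append incoming rows without sorting or redistributing them.
CREATE MULTISET TABLE StageDB.Customer_Stg
(
    customer_id   BIGINT,
    customer_name VARCHAR(200) CHARACTER SET UNICODE,
    load_ts       TIMESTAMP(0)
)
NO PRIMARY INDEX;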
Find additional information on using NOPI tables here: NOPI Tables on Teradata.
2. Is The Primary Index Skewed?
Achieving a balanced distribution of table rows across all AMPs requires knowing your data; only then can a suitable Primary Index be chosen.
Avoid using minimal test data as the foundation for your primary index determination.
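If production-like data is already available, a quick hash-based check shows how a primary index candidate would distribute across the AMPs (database, table, and column names below are placeholders):
SELECT HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS amp_no,
       COUNT(*)                                  AS rows_on_amp
FROM   TheDatabase.Customer
GROUP BY 1
ORDER BY 2 DESC;
-- A heavily skewed candidate shows one AMP carrying far more rows than the rest.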
The demographics of production data can differ significantly from those of test data, as I have observed on multiple occasions. Here is a concrete example from a project with a European telecommunications client: the primary index of the central customer table had been taken over, one-to-one, from the corresponding column in the source system, a column called “customer_id”.
Customer test data files were delivered during the project’s development phase to support the selected primary index. The rows were evenly distributed across all AMPs, resulting in excellent load performance.
The physical data model was optimized to facilitate call-center employees in real-time customer value lookup during tactical workload.
For reporting, the customer table could also be joined efficiently with related tables, such as call details.
The project appeared flawless until the commencement of production. On the initial day of production, the loading period for the customer table drastically increased from roughly 30 minutes to 6 hours. This was followed by an error indicating insufficient disk space.
What occurred?
During the testing phase, only files about “contract” customers were delivered, and no consideration was given to testing files for “prepaid” customers. It was subsequently discovered that all prepaid customers had been assigned a single “customer_id” in the source system.
The outcome was a major redesign, a delay of several months in the project timeline, and serious damage to our reputation.
It is advisable only to employ a skewed primary index in rare circumstances, particularly when the goal is to consolidate all records in one data block associated with a single AMP.
Our article “All Rows on One AMP by Design” explains how this enables rowhash locking and other advantages.
3. Which CHARACTER SET To Use?
Neglecting the choice of character set may save time initially, but it will cost you later. Of course, you can simply define every character column as UNICODE.
However, this approach has two drawbacks: it wastes precious disk space and slows down queries against those tables.
Carelessness can result in a catastrophe, as demonstrated by an actual incident at a European central bank.
We encountered tables containing hundreds of columns that caused slow load times, taking an absurd six hours to load just 50GB of data.
When I was assigned to the project to assess the performance issues, I discovered that all columns were defined with the UNICODE character set.
Calculations indicated that the data rows were wide, with only one to two rows fitting into each data block.
Dozens of tables were redesigned: most columns were changed to the LATIN character set, and only a select few remained UNICODE. This significantly reduced load times because each data block could now hold many more rows.
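As an illustration of that redesign (column names are invented), most columns are switched to LATIN, and only genuinely international free-text columns keep UNICODE:
CREATE MULTISET TABLE TheDatabase.Party
(
    party_id     BIGINT NOT NULL,
    country_code CHAR(2)      CHARACTER SET LATIN NOT CASESPECIFIC,  -- codes never need UNICODE
    iban         VARCHAR(34)  CHARACTER SET LATIN,                   -- technical identifiers: LATIN
    party_name   VARCHAR(200) CHARACTER SET UNICODE                  -- free text may need UNICODE
)
PRIMARY INDEX (party_id);
-- A UNICODE VARCHAR(200) can occupy up to 400 bytes per row, twice its LATIN counterpart.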
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever
4. Are Data Types And CHARACTER SETS Matching Across Tables?
Ensuring consistency of data types is critical, yet it is frequently overlooked.
During query execution, converting data types and character sets can be costly in terms of CPU time and can hinder AMP local activities, such as row-key-based merge joins. For rows from different tables that are joined on their primary index to be co-located on the same AMP, their data types and character sets must match.
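A minimal sketch of this point, with invented tables: both sides of a primary index join define the join column with the same type and character set.
CREATE MULTISET TABLE TheDatabase.Customer
(
    customer_id   VARCHAR(20)  CHARACTER SET LATIN NOT NULL,
    customer_name VARCHAR(200) CHARACTER SET UNICODE
)
PRIMARY INDEX (customer_id);

CREATE MULTISET TABLE TheDatabase.Call_Detail
(
    customer_id VARCHAR(20) CHARACTER SET LATIN NOT NULL,  -- identical type and character set
    call_ts     TIMESTAMP(0),
    duration_s  INTEGER
)
PRIMARY INDEX (customer_id);
-- If Call_Detail.customer_id were UNICODE (or a numeric type), joining the two tables on
-- customer_id would force a conversion and typically a redistribution step instead of an
-- AMP-local, row-key-based merge join.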
5. Can A Unique Primary Index Be Used?
Define your primary index as unique whenever possible. Skipping uniqueness for convenience and ignoring duplicates is an unprofessional approach to development.
A unique primary index may not be feasible in row-partitioned tables when the partitioning column is not part of the primary index; the primary index then has to remain non-unique. In such cases, a unique secondary index (USI) can be added to enforce uniqueness.
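A sketch of that situation with invented names: the table is partitioned by a date column that is not part of the primary index, so the primary index stays non-unique and a USI enforces uniqueness.
CREATE MULTISET TABLE TheDatabase.Sales
(
    sale_id   BIGINT        NOT NULL,
    sale_date DATE          NOT NULL,
    amount    DECIMAL(18,2)
)
PRIMARY INDEX (sale_id)  -- cannot be UNIQUE because sale_date is not part of the PI
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2020-01-01'
                                AND     DATE '2029-12-31'
                                EACH INTERVAL '1' MONTH);

-- The unique secondary index restores the uniqueness guarantee:
CREATE UNIQUE INDEX (sale_id) ON TheDatabase.Sales;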
Skipping uniqueness can result in various drawbacks that may not always be apparent at first glance. Here are a few instances:
- A NUPI drastically reduces the Teradata Optimizer's options for using the nested join strategy.
- No duplicate row checks (see point 6) are needed in the case of SET tables, as the UPI guarantees uniqueness.
6. Can MULTISET Tables Be Used?
Utilize MULTISET tables if you allow duplicate rows.
To comprehend the duplicate row checks, familiarity with Teradata’s hashing-based distribution across all AMPs is essential.
When a row is inserted into a table, the parsing engine calculates a hash value and looks it up in the hash map to determine which AMP will write the row to its logical disk.
A message containing the TableId (which uniquely identifies the table), the row hash, and the primary index column values is sent to that AMP.
What is the reason for sending the primary index column values?
Although infrequent, it is possible for two distinct Primary Index values to produce an identical row hash.
To properly arrange a row within the data block, comparing the hash value and primary index column content is necessary.
To ensure that a SET table does not contain identical rows, each new row must undergo a “duplicate row check,” which involves comparing it to all existing values.
The number of comparisons grows quadratically with the number of rows per row hash, which overloads individual AMPs when insert statements are skewed.
You can also avoid the duplicate row check without switching to a MULTISET table: if a unique secondary index exists, uniqueness is already guaranteed, and the check is not needed.
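Both variants as a sketch (all names invented): a MULTISET table that accepts duplicates and skips the duplicate row check, and a SET table whose unique secondary index makes the check unnecessary.
-- Duplicates allowed, no duplicate row check during inserts:
CREATE MULTISET TABLE TheDatabase.Web_Click
(
    session_id BIGINT NOT NULL,
    click_ts   TIMESTAMP(6),
    url        VARCHAR(500) CHARACTER SET LATIN
)
PRIMARY INDEX (session_id);

-- SET table with a USI: uniqueness is already guaranteed by the index,
-- so the expensive duplicate row check is not performed.
CREATE SET TABLE TheDatabase.Customer_Contact
(
    contact_id  BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    contact_ts  TIMESTAMP(0)
)
PRIMARY INDEX (customer_id);

CREATE UNIQUE INDEX (contact_id) ON TheDatabase.Customer_Contact;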
Learn more about SET and MULTISET tables by visiting the Teradata Multiset versus Set Tables link.
7. Should A Secondary Index Be Created?
Remove secondary and join indexes from tables before loading them with large amounts of data, regardless of the loading technique (including transactional methods such as BTEQ and TPump).
Ideally, tables are loaded without indexes, and the indexes are created afterward.
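A sketch of the pattern with invented names: drop the NUSI before the mass insert and recreate it once the data is loaded.
-- 1. Drop the secondary index before the mass insert
DROP INDEX (status_code) ON TheDatabase.BigTable;

-- 2. Load the data without index maintenance overhead
INSERT INTO TheDatabase.BigTable
SELECT * FROM StageDB.BigTable_Stg;

-- 3. Recreate the secondary index afterward
CREATE INDEX (status_code) ON TheDatabase.BigTable;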
The following incident, which I was once called in to resolve, illustrates the benefits of this strategy:
A user had started an INSERT statement that was still running after 2 hours; growing impatient, the user finally aborted the query.
The next day, after the query had already spent another 10 hours in “ABORTED” status, I was informed of the ongoing rollback.
I assessed the situation:
- The inserted primary index values were skewed (mostly NULL), so nearly all rows ended up on the same AMP.
- The table was a SET table; this worsened the situation because of “duplicate row checks”.
- There were 4 NUSIs defined in the table.
I canceled the rollback due to the substantial impact of this query on the Teradata system’s performance.
This is always an emergency solution since it renders the table inconsistent.
The catastrophic outcome was caused by the combination of the SET table, four secondary indices, and significant skew.
The excessive number of duplicate row checks executed by a single AMP was the main problem; maintaining the four NUSIs slowed the system down even further.
This issue could have been avoided with the aforementioned approach.
8. Can Columns Be Normalized On The Fly With Temporary Tables?
Suppose you utilize temporary tables of any kind. In that case, optimizing all relevant aspects, such as data types, character sets, and primary indexes, is advisable to ensure optimal performance for your workload.
Normalizing the columns of temporary tables is recommended so that functions are not needed later to extract the required information from them.
Applying functions to columns can pose issues in both joins and WHERE conditions.
- They often prevent the use of statistics in these columns.
- They require a redistribution of the rows in joins.
Make columns atomic. Splitting them improves join performance because you can avoid join conditions such as “ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)”.
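A sketch of this idea with a volatile table (the compound key format and all names are invented):
-- Split a compound key such as 'DE-1234567890' into atomic columns while filling the table,
-- so later joins need no SUBSTRING in the join condition.
CREATE MULTISET VOLATILE TABLE Customer_Tmp
(
    country_code CHAR(2)     CHARACTER SET LATIN NOT NULL,
    customer_no  VARCHAR(20) CHARACTER SET LATIN NOT NULL,
    revenue      DECIMAL(18,2)
)
PRIMARY INDEX (country_code, customer_no)
ON COMMIT PRESERVE ROWS;

INSERT INTO Customer_Tmp
SELECT SUBSTRING(compound_key FROM 1 FOR 2),
       SUBSTRING(compound_key FROM 4),
       revenue
FROM   TheDatabase.Source_Customer;
-- Joins can now use plain equality on country_code and customer_no.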
9. Does The Data Block Size Support Your Type Of Workload?
Teradata 14 allows for creating data blocks with a maximum size of 1MB.
Kindly take note of the following:
Larger data blocks allow data to be copied more efficiently from storage into main memory, but every block that is read also occupies main memory.
In an extreme case, Teradata may read a full 1MB data block just to retrieve a single row.
Smaller data blocks are therefore preferred for tactical workloads, which typically access individual rows.
Bigger data blocks are advisable for strategic workloads such as full table scans and aggregations.
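A sketch of how the block size is set per table via the standard CREATE TABLE options (table and column names are invented):
-- Strategic, scan-heavy table: largest possible data blocks
CREATE MULTISET TABLE TheDatabase.Sales_History,
    MAXIMUM DATABLOCKSIZE
(
    sale_id   BIGINT NOT NULL,
    sale_date DATE,
    amount    DECIMAL(18,2)
)
PRIMARY INDEX (sale_id);

-- Tactical, single-row lookups: smallest data blocks
CREATE MULTISET TABLE TheDatabase.Customer_Lookup,
    MINIMUM DATABLOCKSIZE
(
    customer_id BIGINT NOT NULL,
    segment     CHAR(2) CHARACTER SET LATIN
)
UNIQUE PRIMARY INDEX (customer_id);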
Hi Roland,
–“Some simple calculations revealed that data rows were huge: Only 1-2 rows were fitting into each data block.”
Is there a SQL way to find out how many rows of a table are being stored in each data block?
On a side note, a Google search for “dwhpro the ultimate Teradata create table checklist” didn’t return the expected results.
Regards,
Virendra
Hi Virendra,
If you are on Teradata 14.10 or above and have statistics version 6 enabled, you will get this information with the SHOW STATISTICS statement.
COLLECT SUMMARY STATISTICS
ON TheDatabase.TheTable
VALUES
(
/* Version */ 6,
/* AvgRowsPerBlock */ 337.500000,
/* AvgBlockSize (bytes) */ 256512.000000,
/* AvgRowSize */ 64033.000000,
);
The only problem is that the number AvgRowsPerBlock is currently wrong (a bugfix from Teradata will follow). But you can easily calculate the average number of rows per block from the two measures that are correct:
AvgBlockSize / AvgRowSize = Average Rows per Block (with the values above: 256512 / 64033, roughly 4 rows per block).
Another option would be to use the Ferret utility, but this requires access to the Teradata node (and probably a Linux root user).
Hope this answered your question.
Thanks for the comment about Google. In my experience, it can take a few weeks until Google gets the search results right.
Roland