The creation of tables is an ever-recurring task, so I thought it was time to offer a checklist.
The Primary Index is used for various purposes. It is usually optimized for even row distribution and serves as the main access path to the data.
But there are scenarios where it could be beneficial to create a table without a primary index (NOPI).
Staging is one example:
By defining stage tables without a primary index, load performance can be increased (for example, a FastLoad will not need to sort and distribute rows in the final phase of loading).
Whether NOPI tables improve the overall load performance depends, of course, on the later steps in the ETL process and the impact of a missing primary index.
You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata
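As a minimal sketch (all names are hypothetical), a NOPI staging table is simply created without a primary index clause:

-- Staging table without a primary index; FastLoad can append rows
-- without sorting and redistributing them by rowhash
CREATE MULTISET TABLE stage_db.stg_customer
(
  customer_id  BIGINT,
  customer_nm  VARCHAR(200)
)
NO PRIMARY INDEX;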
The goal is to have a Primary Index which distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data.
Don't rely on small amounts of test data as a base for your primary index decision.
Data demographics of production data can change significantly. I have experienced this several times; I will give you one real-life example (my client was a European telco):
The main customer table's primary index was mapped 1:1 from the corresponding source system column; let's call it “customer_id”.
During the development phase of the project, several files with customer test data were delivered, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.
The whole physical data model was optimized to support call-center employees looking up the customer value in real-time during their calls (i.e., tactical workload).
Furthermore, the customer table could be joined very efficiently with other related tables, such as call details, to support reporting.
Everything looked perfect until we entered the production phase. On the first day, the load times for the customer table increased from about 30 minutes to 6 hours with a subsequent “no more disk space” error.
What had happened?
During the test phase, all delivered files were related to “contract” customers. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…
The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
In rare cases, you would like to “design” a skewed primary index, for example, if you want to force all rows into one data block of one AMP.
By doing this, you can achieve rowhash locking instead of table locking, among other advantages. For details, please read our article All Rows on one AMP by Design.
While it saves time to skip such design considerations, in the end, it never pays off. Take character sets: of course, you could define every single character column as UNICODE.
But such an approach comes with two disadvantages: you will waste valuable disk space and decrease the performance of requests related to these tables.
Such a sloppy approach can lead to disaster. Here is one example from real life; it recently happened at a major European bank:
We had to deal with tables, each of them carrying several hundred columns. Load times (FastLoad) were horrible: 6 hours for a ridiculous 50GB of data!
When I was called to the project to analyze these performance issues, it turned out that every single column was defined with character set UNICODE.
Some simple calculations revealed how wide the data rows were: only 1-2 rows fit into each data block.
A redesign of dozens of tables was required. By changing all columns to character set LATIN and keeping only a few UNICODE columns, the load times were reduced drastically, as more rows could be packed into each data block.
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever
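As a hedged sketch (names are hypothetical), the rule of thumb is to keep LATIN as the default and reserve UNICODE for the columns that genuinely need it:

CREATE MULTISET TABLE db1.customer
(
  customer_id  BIGINT,
  last_nm      VARCHAR(100) CHARACTER SET LATIN,    -- 1 byte per character
  first_nm     VARCHAR(100) CHARACTER SET LATIN,
  comment_txt  VARCHAR(500) CHARACTER SET UNICODE   -- 2 bytes per character, only where needed
)
PRIMARY INDEX (customer_id);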
Data type consistency is a crucial point, but it is often overlooked.
Data type and character set conversions during query execution are expensive operations (CPU time) and prevent AMP-local activities (such as row-key based merge joins). Rows from different tables joined on their primary index are only co-located on the same AMP if data types match. The same holds true for character sets.
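A small sketch of the idea (names hypothetical): tables joined on their primary indexes should define the join columns with identical data types and character sets, so that rows with equal values land on the same AMP and no conversion is needed:

-- customer_id has the same type in both tables: equal values hash to the
-- same AMP, enabling a row-key based merge join without redistribution
CREATE TABLE db1.customer (customer_id BIGINT, customer_nm VARCHAR(100)) PRIMARY INDEX (customer_id);
CREATE TABLE db1.orders   (order_id BIGINT, customer_id BIGINT)          PRIMARY INDEX (customer_id);

SELECT c.customer_nm, o.order_id
FROM db1.customer c
JOIN db1.orders o
  ON c.customer_id = o.customer_id;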
If possible, always define your primary index as unique. It's a very sloppy way of developing if you skip uniqueness just because it's more convenient (“no problems with duplicates”).
A unique primary index may not always be possible, for example in the case of row partitioned tables, if the partition column is not included in the primary index. In this case, you can achieve uniqueness by adding a USI (unique secondary index).
There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:
If you don't have to ensure that duplicate rows are rejected, always use MULTISET tables.
To understand what duplicate row checks are, you need to know how Teradata distributes the data over the AMPs using hashing.
Each time a row is inserted into a table, the parsing engine determines a hash value and uses the hashmap to determine the AMP that should write the row to its logical disk.
To this end, a message consisting of the TableId (to uniquely identify the table), the RowHash (the hash value), and the content of the Primary Index columns is sent to the AMP.
Why is the value of the primary index columns sent?
This is because two different Primary Index values can generate the same row hash. It's rare, but it happens.
This means that in order to sort a row into the data block at the right place, not only the hash value must be compared, but also the content of the primary index columns.
Since no two identical rows may exist in a SET table, each new row to be inserted must be compared against all existing rows with the same rowhash (the “duplicate row check”).
The number of required comparisons grows quadratically with the number of rows sharing a rowhash, which finally leads to one or more AMPs being overloaded by skewed INSERT statements.
If you cannot define a MULTISET table, then a duplicate row check can be prevented by creating a unique secondary index!
You can find more details about SET and MULTISET tables here: Teradata Multiset versus Set Tables
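A minimal sketch of both options (hypothetical names):

-- A MULTISET table skips the duplicate row check entirely
CREATE MULTISET TABLE db1.sales
(
  sale_id      BIGINT,
  customer_id  BIGINT,
  sale_amt     DECIMAL(18,2)
)
PRIMARY INDEX (customer_id);

-- If a SET table is required, a USI on a unique column replaces the
-- duplicate row check with a cheap USI uniqueness check
CREATE SET TABLE db1.sales_set
(
  sale_id      BIGINT,
  customer_id  BIGINT,
  sale_amt     DECIMAL(18,2)
)
PRIMARY INDEX (customer_id);

CREATE UNIQUE INDEX (sale_id) ON db1.sales_set;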
Drop secondary indexes and join indexes from tables before loading them with a large number of rows, even if the loading method does not require this (transactional load methods such as BTEQ or TPump).
It is usually better to remove the indexes, load the tables, and then create the indexes.
Here is an example of why this approach is advantageous.
Once I had to solve the following problem:
A user had started an INSERT statement, which had not finished after 2 hours. Finally, the user became impatient and aborted the query.
This led to a rollback. I was notified the following day, when the query had already been in “ABORTED” status for another 10 hours.
I analyzed the situation:
I decided to cancel the rollback because this query had a massive impact on the performance of the Teradata system.
This is always an emergency solution because the table is then in an inconsistent state!
The combination of the SET table, 4 secondary indices, and the enormous skew had led to this catastrophic result.
The huge number of duplicate row checks performed by only one AMP was a major problem. But of course, the maintenance of 4 NUSIs additionally had an enormous negative effect on the performance of the system in this situation.
This problem could have been prevented with the method described above.
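As a sketch, the drop/load/recreate pattern looks like this (table, column, and index names are hypothetical):

-- 1. Drop the NUSIs before the bulk load
DROP INDEX (customer_status) ON db1.customer;

-- 2. Load the table (FastLoad, INSERT ... SELECT, etc.)

-- 3. Recreate the NUSIs after the load has finished
CREATE INDEX (customer_status) ON db1.customer;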
If you are using temporary tables (volatile, global temporary, or physical tables), you should first consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.
It is also good practice to normalize columns in temporary tables. This prevents functions from having to be applied to these columns to obtain the required part of the information.
Functions applied to columns usually create a problem in joins as well as in WHERE conditions:
Make them atomic. Splitting will help to improve join performance, as you can avoid join conditions such as “ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)”.
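A hedged sketch of this idea (names hypothetical): split the composite key once, when materializing the volatile table, so all later joins can use the atomic parts directly:

CREATE VOLATILE TABLE vt_orders AS
(
  SELECT
    SUBSTRING(order_cd FROM 1 FOR 5)  AS region_cd,  -- atomic part, usable in joins
    SUBSTRING(order_cd FROM 6)        AS order_no,
    order_amt
  FROM db1.orders
)
WITH DATA
PRIMARY INDEX (region_cd)
ON COMMIT PRESERVE ROWS;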
Starting with Teradata 14 it is possible to create data blocks up to a size of 1MB.
Please note the following:
The larger the data blocks, the more data can be copied from disk into main memory in one read operation. However, more main memory is needed per data block.
In the worst case, Teradata reads a 1MB block of data to return a single row.
For tactical workload on a Teradata system, smaller data blocks are usually recommended, as individual rows are queried.
Larger data blocks are recommended for strategic workloads (which often include a full table scan and aggregations).
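Since Teradata 14, the block size can also be set per table via the DATABLOCKSIZE option; a hedged sketch with hypothetical names:

-- Large blocks for a table that is mostly read via full table scans
CREATE MULTISET TABLE db1.sales_history,
  DATABLOCKSIZE = 1048576 BYTES
(
  sale_id   BIGINT,
  sale_dt   DATE,
  sale_amt  DECIMAL(18,2)
)
PRIMARY INDEX (sale_id);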
With Teradata 16.00, some interesting features in the area of space management were introduced, which can bring enormous improvements to the daily administration and loading of the data warehouse.
In my current environment, it regularly happens that the daily loading of the data warehouse is interrupted because there is not enough permanent space available in the databases.
As we all know, the maximum available space is defined when a database is created. Here is the syntax to create a database with 80 GB of perm space:
CREATE DATABASE Customer AS PERM = 80000000000;
The defined PermSpace is divided evenly across all AMPs; on a system with 8 AMPs, each AMP may hold at most 10GB of this database's data. This even distribution of the available PermSpace is based on the assumption that the rows of all tables are distributed evenly via the primary index. But what happens if this is not the case?
Let's assume that we want to load a table with 11GB of data, and the designer of the table structure has chosen the Primary Index so clumsily that all rows are assigned to exactly one AMP. Since the space limit on each AMP is 10GB,
we will receive the following error message:
2644: No more room in database Customer.
This means that although the remaining 7 AMPs each have 10GB available, the table cannot be loaded.
The maximum available space in a database is shown in table DBC.DATABASESPACE in the MaxPermSpace column. The currently occupied space can be seen in Column CurrentPermSpace. In CurrentPermSpace, however, only the entire occupied space of the database is displayed, not the occupied space per AMP.
I've often seen the surprised faces of developers when the error message “2644: No more room in database Customer” appeared even though there was enough free space (MaxPermSpace – CurrentPermSpace).
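The per-AMP picture can be checked via the view DBC.DiskSpaceV, which returns one row per AMP (Vproc); a small sketch:

SELECT Vproc, MaxPerm, CurrentPerm
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'Customer'
ORDER BY Vproc;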
In order to reduce or completely prevent such problems, Teradata 16.00 offers the possibility to globally define the available space.
For this purpose, the new DBC table GlobalDBSpace was introduced, together with an extended syntax for creating a database:
CREATE DATABASE Customer AS PERM = 80000000000 SKEW = 10 PERCENT;
The 10 PERCENT Skew given when creating the database means the following:
The total PermSpace available is still 80GB; that does not change. However, each of the 8 available AMPs is allowed to hold up to 11GB of this database's data, as long as the absolute upper limit of 80GB for the database is not exceeded.
In our example from above, in which the clumsy selection of the primary index would insert all rows on exactly one AMP, the error “2644 No more room in database Customer” would no longer occur. However, such a situation is recorded in the Software Event Log.
While the feature described above simplifies the handling of skewed tables, another feature introduced with Teradata 16.00 has the purpose of preventing loads from crashing if not enough database space has been defined:
This is defined via the DBSControl setting GlobalSpaceSoftLimit and specified as a percentage.
In principle, a percentage is defined here by which the defined PermSpace may be exceeded without the error “2644: No more room in database Customer” occurring.
For example, if the database is defined on an 8 AMP system with 80GB PermSpace, and a soft limit of 10%, up to 11GB PermSpace can be used per AMP.
Again, an event is triggered and written to the Software Event Log. This allows database administrators to analyze the situation.
Today I am going to show you how to identify the root cause when tuning a query that utilizes an unnecessary amount of IO, what the UII indicator is, and how a WHERE clause placed on a VIEW is evaluated.
This case shows how a WHERE clause placed on a VIEW over 2 partitioned tables resulted in excessive block reads, with the WHERE condition applied only afterward. You need to be fully aware of the amount of data a query reads versus the amount of data it really needs.
Performance tuning boosted the performance fivefold: the query completes 4-5 times faster and utilizes 4-5 times fewer system resources.
Here is the target query for today, which populates a large fact table of the data mart:
– INSERT INTO … SELECT FROM …
– The query is run on a daily basis
– Over 1000M rows inserted (1 billion)
Query performance statistics stored in the Teradata QueryLog signify that the query is suspicious and performing badly in terms of unnecessary IO, excessive spool space, and CPU utilization.
Take a look at Teradata performance metrics:
What knowledge can we discover from this data?
What is particularly wrong and how can we tune and improve the query?
First of all, let us describe what UII (Unnecessary IO indicator) is and what it can tell us.
UII is calculated as SumIO / (SumCPU * 1000). The metric is used to get an idea of how efficiently a query consumes CPU and IO resources. If UII is relatively high, it could mean that many data blocks are read but only a relatively small proportion of them is actually processed.
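As a hedged sketch, UII can be computed from the DBQL data in DBC.QryLogV (adjust the date filter to your logging setup):

SELECT QueryID,
       TotalIOCount,
       AMPCPUTime,
       TotalIOCount / NULLIFZERO(AMPCPUTime * 1000) AS UII
FROM DBC.QryLogV
WHERE CAST(CollectTimeStamp AS DATE) = CURRENT_DATE - 1
ORDER BY UII DESC;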
Let us examine source View DDL through which the target table is populated.
A-ha! Here we see that the latest 6 months are extracted from the view.
The tables are partitioned by the columns used in the WHERE clause:
But only the last month's data is inserted into the target table! The INSERT statement looks like this:
So, the query simply reads 5 months of data in vain, for no particular reason. The thing is that the whole data set is read before the final query applies any WHERE clause.
It means every data block under this VIEW is read first, and then the non-relevant data is eliminated. But why read unnecessary data? Let us figure out how to force Teradata to extract only the relevant partitions.
Here is the answer:
1. We might put the WHERE clause from the final query INSIDE the source VIEW DDL.
2. We might create an entirely new VIEW for this ETL process, in case other users do not want any changes to the existing VIEW (sketched below).
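A hypothetical reconstruction of option 2 (all names invented): a dedicated ETL view that only exposes the partitions the load really needs, so partition elimination can take effect before the data is read:

REPLACE VIEW dm.v_transactions_etl AS
SELECT t.*
FROM dm.transactions t                            -- row-partitioned by txn_dt
WHERE t.txn_dt >= ADD_MONTHS(CURRENT_DATE, -1);   -- last month only

INSERT INTO dm.fact_sales
SELECT * FROM dm.v_transactions_etl;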
This optimization gave us a fivefold boost in performance. The query completes 4-5 times faster and utilizes 4-5 times fewer system resources.
Detailed query metrics below:
Artemiy Kozyr is a Data Engineer at Sberbank, Moscow, with a Master's degree in CS.
He has 5 years of experience in data warehousing, ETL, and visualization for financial institutions.
Decomposable columns are columns that can be split into smaller units. The presence of decomposable columns means that the physical data model is not normalized.
If we don't use a decomposable column for data retrieval, there is no negative impact on performance.
Still, if the column is used for data access, it should be decomposed into several columns which can be utilized independently for indexed access. You will get increased performance because there will be fewer full table scans.
Additionally, if the columns created from a decomposable column are NUSI columns, the possibilities for NUSI bitmapping are increased:
Here is an example (hypothetical; assume ProductCode is a composite of a 2-character product category followed by a 4-character product group):
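-- Access by category: a function on the column rules out indexed access
SELECT * FROM db1.transactions
WHERE SUBSTRING(ProductCode FROM 1 FOR 2) = 'AB';

-- Access by product group: a non-leading LIKE pattern cannot use an index either
SELECT * FROM db1.transactions
WHERE ProductCode LIKE '__CDEF%';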
In both cases, a full table scan is typically required to access the data.
By decomposing the column ProductCode, we give the Optimizer a chance to use existing NUSIs:
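A sketch of the decomposed variant (hypothetical names):

-- NUSIs on the atomic columns; the Optimizer can combine them via NUSI bitmapping
CREATE INDEX (product_category) ON db1.transactions;
CREATE INDEX (product_group)    ON db1.transactions;

SELECT *
FROM db1.transactions
WHERE product_category = 'AB'
  AND product_group    = 'CDEF';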
Often we can't change the physical data model, and we have to find workarounds to avoid the negative performance impact of decomposable columns. Here is a simple but efficient solution:
Table Product is just a lookup containing all distinct ProductCodes. The Optimizer will probably do a full table scan on the Product Table with a subsequent merge join to the Transaction table. As the Product table is small, the cost of full table scan is negligible.
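Sketched with hypothetical names, the lookup approach lets the function be applied to the small table only:

SELECT t.*
FROM db1.transactions t
JOIN db1.product p
  ON t.ProductCode = p.ProductCode                  -- join on the full code
WHERE SUBSTRING(p.ProductCode FROM 1 FOR 2) = 'AB'; -- function hits only the small lookup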
Still, the first choice is to avoid the implementation of decomposable columns whenever possible.
All the relational algebra operations, such as Select, Project, Join, Union, Intersection, Difference, Product, and Division, can also be performed on the Relational Database Model. Operations on the Relational Database Model are facilitated with the help of different conditional expressions, various key attributes, pre-defined constraints, etc., for example, selecting the information of customers who have been living in some city for more than 20 years.
Constraints: Constraints are logical rules that are used to ensure data consistency or to prevent certain unacceptable operations on the data.
Relational Model: Summary
This data model is used by any business that sells products and services, including banks, insurance companies, credit card companies, and brokerages.
A comprehensive implementation plan is drawn up by the design architects, covering:
– Mapping the source data columns to the destination data columns
– Converting the logical model into the physical data model using modeling tools