A decomposable column is a column that can be split into smaller units; its existence means the physical data model is not fully normalized.
If a decomposable column is not used for data retrieval, it has no negative impact on performance.
However, if the column is used for data access, it should be decomposed into several columns that can be used independently for indexed access. Performance increases because fewer full table scans are needed.
Additionally, if the columns created from a decomposable column are NUSI columns, the possibilities for NUSI bitmapping increase.
Here is an example:
In both cases, a full table scan is typically required to access the data.
By decomposing the column ProductCode, we give the Optimizer a chance to use existing NUSIs:
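As an illustrative sketch (the Transaction table layout and the ProductType/ProductGroup column names are hypothetical), the difference might look like this:

```sql
-- Before: ProductCode packs several attributes into one string,
-- so filters need SUBSTRING() and typically cause a full table scan:
SELECT *
FROM   Transaction
WHERE  SUBSTRING(ProductCode, 1, 2) = 'TV';

-- After: the column is decomposed into atomic columns,
-- each of which can carry its own NUSI:
CREATE INDEX (ProductType)  ON Transaction;
CREATE INDEX (ProductGroup) ON Transaction;

SELECT *
FROM   Transaction
WHERE  ProductType  = 'TV'      -- indexed access becomes possible,
  AND  ProductGroup = 'OLED';   -- and NUSI bitmapping can combine both
```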
Often we can’t change the physical data model, and we have to find workarounds to avoid the negative performance impact of decomposable columns. Here is a simple but efficient solution:
Table Product is just a lookup containing all distinct ProductCodes. The Optimizer will probably do a full table scan on the Product table with a subsequent merge join to the Transaction table. As the Product table is small, the cost of its full table scan is negligible.
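A sketch of this workaround (column type and the filter condition are hypothetical):

```sql
-- The small lookup table holds all distinct ProductCodes:
CREATE TABLE Product
(
   ProductCode CHAR(10) NOT NULL
) UNIQUE PRIMARY INDEX (ProductCode);

-- Instead of applying SUBSTRING() to the huge Transaction table,
-- we filter the tiny Product table and join back:
SELECT t.*
FROM   Transaction t
JOIN   Product p
  ON   t.ProductCode = p.ProductCode
WHERE  SUBSTRING(p.ProductCode, 1, 2) = 'TV';
```

The expensive SUBSTRING() filter now runs only over the small lookup table, while the join back to Transaction can use an equality condition.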
Still, the first choice is to avoid the implementation of decomposable columns whenever possible.
Compression is a popular way of saving IOs because it allows more rows to be packed into each data block. Data blocks are the smallest unit transferred between disk and main memory, and moving them is the most expensive operation in any database system.
Compression can be applied automatically by the system (block-level compression) or manually by the user (multivalue compression, MVC). Another option is third-party compression software, such as the one offered by Prise Ltd. (check it out at https://www.prisetools.com/products/compress-wizard).
While developers tend to use multivalue compression (MVC), other opportunities to cut resource usage are often overlooked or left out for various reasons. Very often the PDM, i.e., the proper implementation of data types and character sets, is neglected.
Compression depends on the underlying data types. Therefore, before starting any compression activity, we have to make sure that the correct data types and character sets are used in the PDM. Otherwise, any redesign of the PDM will force us to rework compressions.
While the correct choice of data types and compression both help to cut resource usage, the right data type and character set choices have some advantages that compression does not.
The right data types reduce resource usage independently of the data content, while compression gives no space or resource benefit for unique columns.
The choice of correct data types always helps to cut resource usage (by allowing more rows to be packed into each data block).
The selection of the right data types and character sets is part of physical data modeling. Unfortunately, this step is often skipped because it causes costs early in the implementation process.
Often, data types are taken over directly from the stage area into the PDM of your core model, and from there moved straight into the data marts.
While this approach avoids costs at the beginning of the implementation process, we don't recommend it. The costs are only deferred to a later stage, when it is much harder to make changes to the PDM.
Just because a column is defined as DECIMAL(38,n) in the source system doesn't mean it will ever hold such huge numbers. Remember, the space usage of decimals ranges from 1 byte (up to DECIMAL(2,n)) to 16 bytes (DECIMAL(19,n) through DECIMAL(38,n)). The same is true for integer values (BYTEINT -> 1 byte, BIGINT -> 8 bytes).
In my experience, many decimals could be stored in barely 4 bytes (DECIMAL(9,n) or smaller), but because of laziness, time pressure, and cost reasons, developers often store them in a much larger decimal than needed.
A single decimal, improved in the way described above, needs 4 bytes less per table row. Consider the possible space reduction for a table containing billions of rows, and don't forget that this improvement comes before applying any compression!
Similarly, using character set LATIN instead of UNICODE helps to cut space usage of a character field by 50% per character. Often I even see simple codes, such as “ACTIVE,” “SUSPENDED,” or “CANCELED” stored as UNICODE characters!
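A sketch of the difference (table and column names are hypothetical):

```sql
-- Sloppy: definitions taken over 1:1 from the source system
CREATE TABLE Sales_Sloppy
(
   SalesId    DECIMAL(38,0) NOT NULL,             -- widest decimal available
   StatusCode VARCHAR(20) CHARACTER SET UNICODE   -- 2 bytes per character
) PRIMARY INDEX (SalesId);

-- Right-sized: same content, far fewer bytes per row
CREATE TABLE Sales_Tuned
(
   SalesId    DECIMAL(9,0) NOT NULL,              -- only 4 bytes
   StatusCode VARCHAR(20) CHARACTER SET LATIN     -- 1 byte per character
) PRIMARY INDEX (SalesId);
```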
As a Teradata consultant, you should never leave out this important optimization step. Yes, it takes time when we create the PDM, but it pays off in the end.
I think I have given you enough arguments to insist on these activities. Don't forget about them in your next project!
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.
The creation of tables is an ever-recurring task, so I thought it was time to offer a checklist.1
The Primary Index is used for various applications. It usually is designed for even row distribution and as the main access path to your data.
But there are scenarios where it can be beneficial to create a table without a primary index (NOPI).
Staging is one example: by defining a stage table without a primary index, we can speed up load processing (for example, FastLoad will not need to sort and redistribute rows in the final load phase). Whether this improves overall load performance depends on the subsequent steps; you have to consider if and how they are slowed down by the missing primary index.
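A minimal sketch of such a stage table (table and column names are hypothetical):

```sql
-- NO PRIMARY INDEX lets FastLoad append rows without
-- the final sort/redistribution phase:
CREATE MULTISET TABLE Stage_Customer
(
   CustomerId   INTEGER,
   CustomerName VARCHAR(100) CHARACTER SET LATIN
) NO PRIMARY INDEX;
```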
You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata2
Almost always, your main goal is a primary index that distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data. Don't rely on small amounts of test data as the basis for your primary index decision. The data demographics of production data can differ significantly. I have experienced this several times; here is one example of an epic fail from real life:
Once upon a time in a European telecommunications company: the main customer table's primary index was mapped 1:1 from the corresponding source system column; let's call it "customer_id". During the development phase of the project, several files with customer test data were delivered, and they supported the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.
The whole physical data model was optimized to let shop staff look up a customer's status during calls (i.e., tactical workload). Furthermore, the customer table could be joined very efficiently with other related tables, such as call details, to support reporting.
Everything looked perfect until we entered the production phase. On the first day, the load time for the customer table increased from about 30 minutes to 6 hours, followed by a "no more disk space" error.
What happened? During the test phase, all delivered files were related to “contract” customers. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…
The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
In rare cases, you would like to “design” a skewed primary index, for example, if you want to force all rows into one data block of one AMP. This way you could achieve rowhash locking instead of table locks and other advantages. For details, please read our article All Rows on one AMP by Design.3
While it saves time to leave out such considerations, it never pays off in the end. Of course, you could define every single character column as UNICODE. But such an approach comes with two disadvantages: you waste valuable disk space and decrease the performance of requests against these tables.
If you are unlucky, such a sloppy approach leads to disaster. Here is another example from real life; it recently happened at a major European bank:
We had to deal with tables, each of them carrying several hundred columns. Load times (FastLoad) were horrible: 6 hours for a ridiculous 50 GB of data!
When I was called into the project to analyze these performance issues, it turned out that every single column was defined with character set UNICODE. Some simple calculations revealed that the data rows were huge: only 1-2 rows fit into each data block.
Put briefly, a redesign of dozens of table DDLs was required. By changing most columns to character set LATIN and keeping only a few as UNICODE, load times were reduced drastically, as more rows could be packed into each block.
This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever4
Data type consistency is a crucial point but is often overlooked. Data type and character set conversions during query execution are expensive operations (CPU time) and may prevent AMP-local activities (such as joins). Rows from different tables joined via their primary index are only co-located on the same AMP if the data types match. The same holds true for character sets.5
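A sketch of a co-located join (table and column names are hypothetical):

```sql
-- The join below stays AMP-local only because both primary index
-- columns have exactly the same data type (INTEGER = INTEGER):
CREATE TABLE Customer
(
   CustomerId INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (CustomerId);

CREATE TABLE CallDetail
(
   CallId     INTEGER NOT NULL,
   CustomerId INTEGER NOT NULL   -- same type as Customer.CustomerId
) PRIMARY INDEX (CustomerId);

-- If CallDetail.CustomerId were, say, DECIMAL(18,0) or a UNICODE
-- string, rows would hash differently or need a conversion first:
SELECT c.CustomerId, d.CallId
FROM   Customer c
JOIN   CallDetail d
  ON   c.CustomerId = d.CustomerId;
```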
My recommendation: always define your primary index as unique if possible. Again, it's a very sloppy way of developing to skip uniqueness just because it's more convenient ("no problem with duplicates").
Hint: This may not always be possible, for example in the case of row partitioned tables, if the partition column is not included in the primary index.
There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:
If you don't need the system to reject duplicate rows, always use MULTISET tables. They avoid the costly duplicate row check when data is inserted. Furthermore, in the case of a skewed primary index, the performance difference when adding rows can become huge.
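The two definitions side by side (table and column names are hypothetical):

```sql
-- SET table: every INSERT triggers a duplicate row check
CREATE SET TABLE Transactions_Set
(
   TxId   INTEGER NOT NULL,
   TxDate DATE
) PRIMARY INDEX (TxId);

-- MULTISET table: no duplicate row check, cheaper inserts
CREATE MULTISET TABLE Transactions_Multi
(
   TxId   INTEGER NOT NULL,
   TxDate DATE
) PRIMARY INDEX (TxId);
```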
You can find more details about SET and MULTISET tables here: Teradata Multiset versus Set Tables7
If you have secondary indexes or join indexes defined on your table, it's usually a good idea to drop them before inserting large amounts of data into an already populated table. You can recreate the indexes afterward, and the whole operation will be much cheaper, especially if something goes wrong, as in the following example:
Recently, I worked on the following problem:
Somebody had executed an INSERT statement. As it was still not finished after several hours, the same person decided to abort the session. As a result, a rollback started.
I was informed the next day, when the rollback had already been running for 12 hours...
I rapidly collected the facts:
I decided to cancel the rollback as it was clear that it would run for at least 30 more hours!
Without the 4 NUSIs, the rollback would have been much faster. But having to maintain five rows per table row (4 index sub-table rows plus the base table row) magnified this disaster.8
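The drop-load-recreate pattern that avoids this kind of disaster might be sketched like this (table, index, and column names are hypothetical):

```sql
-- 1. Drop the NUSIs before the mass insert:
DROP INDEX TxDate_NUSI ON Transactions;

-- 2. Load the data without index maintenance overhead
--    (and without index maintenance during a possible rollback):
INSERT INTO Transactions
SELECT * FROM Stage_Transactions;

-- 3. Recreate the NUSIs afterward:
CREATE INDEX TxDate_NUSI (TxDate) ON Transactions;
```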
If you are working with any temporary tables (volatile, global temporary or physical), you should always consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.
Furthermore, it's good practice to split columns that contain different types of information into separate columns, making them atomic. Splitting helps to improve join performance, as you can avoid join conditions such as "ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)".9
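Sketched with hypothetical tables (the column PK_Prefix is an invented name for the extracted, atomic part of the key):

```sql
-- Composite key: the join needs SUBSTRING(),
-- so no primary index join is possible
SELECT t1.*
FROM   Table1 t1
JOIN   Table2 t2
  ON   t1.PK = SUBSTRING(t2.PK, 1, 5);

-- Atomic columns: the prefix is stored in its own column,
-- allowing a plain equality join
SELECT t1.*
FROM   Table1 t1
JOIN   Table2 t2
  ON   t1.PK = t2.PK_Prefix;
```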
Since Teradata 14, we can define data blocks as big as 1024 KB. Increasing the data block size can yield a large performance improvement if you are dealing with enormous rows or have many strategic queries scanning all rows.
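A sketch of a table created with an enlarged maximum block size (table and column names are hypothetical; check the permitted maximum on your system):

```sql
CREATE MULTISET TABLE BigScan_Table,
   DATABLOCKSIZE = 1048576 BYTES   -- 1024 KB blocks for scan-heavy workload
(
   Id      INTEGER NOT NULL,
   Payload VARCHAR(60000) CHARACTER SET LATIN
) PRIMARY INDEX (Id);
```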
As a Teradata SQL tuner, we have to deal with skew most of the time. In my experience, up to 90% of SQL performance issues are related to skew. But this article is not about skew; sometimes a slight change is enough to produce miracles.
In this article, I show the possible impact of wrongly chosen character sets on SQL performance, a topic which is often overlooked or ignored.
Recently, while checking the QryLog table, my attention was drawn to a DELETE statement. It was consuming a huge amount of CPU seconds and causing a lot of IOs. In fact, each day this one statement caused more IOs than all the other workload together.
Even worse, it looked like the resource consumption was growing rapidly (on average, 50 million rows were added each day), as you can see in the chart below:
The mentioned DELETE statement executes against a huge table “TheTable”. “TheTable” contains almost 3 billion rows. A subquery is used to identify the rows to be deleted (please note that table “TheVolatileTable”, being used for filtering, contains exactly 1 row):
Below you can see the table definitions of both involved tables:
This is the table used for filtering the rows to be deleted (it is populated with exactly 1 row):
In the original setup, the volatile table is created with this syntax:
CREATE TABLE <name> AS (<SUBQUERY>) WITH DATA;
This is a convenient method for creating and populating a table in one step. Still, the author of the query overlooked that it's important to take care of data types and character sets. On the affected Teradata system, character columns are created with character set UNICODE by default. Unfortunately, the table "TheTable" defines these character columns as character set LATIN, causing a mismatch.
The resulting execution plan for the bad performing DELETE statement looks like this:
There are several issues with this execution plan. First of all, the join columns of the huge table “TheTable” have to be converted from LATIN to UNICODE. This has to be done for each row: 3 billion times!
This is an expensive operation, which is consuming a lot of CPU time.
But that is not all: the huge table is spooled into 50 hash partitions in preparation for the subsequent join. A lot of resources are needed to spool such a huge table containing billions of rows. Unfortunately, this is the preferred option for the Optimizer because it has to deal with the character set conversion.
Below you can see the amount of IOs and CPU seconds being used:
The tuning of this query was simple, but very effective. By manually creating the volatile table, I was able to choose character set LATIN for the character columns, matching the character sets of both tables:
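A sketch of the manual creation (the column name and the inserted value are hypothetical; the key point is the explicit CHARACTER SET LATIN):

```sql
CREATE VOLATILE TABLE TheVolatileTable
(
   FilterCol VARCHAR(20) CHARACTER SET LATIN NOT NULL
) PRIMARY INDEX (FilterCol)
ON COMMIT PRESERVE ROWS;

-- Populate with the single filter row:
INSERT INTO TheVolatileTable VALUES ('ABC');
```

Because the character sets now match, the join columns of the huge table no longer need conversion.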
Below you can see the improved execution plan:
In the improved execution plan, the Teradata Optimizer is doing a dynamic hash join. Dynamic hash joins don’t spool to rehash the join columns, but calculate the row hash “on the fly”, row by row. No spooling of the 3 billion rows is required!
This small change reduced resource consumption immensely:
(Table: IO count, CPU time, and spool usage before and after the change.)
This tuning success shows how it is possible to achieve huge improvements with very little effort. Always identify your worst workload at first and try to improve it. Tuning your worst queries should always be one of your major goals.
Wide textual information is often stored in columns of data type VARCHAR, which can hold up to 32,000 characters (UNICODE) or 64,000 characters (LATIN).
The effect of storing large texts is that only a relatively small number of rows fits into each data block. Full table scans become more expensive, because more data blocks are required to store the whole set of data.
Luckily, there is another possibility to store extensive textual information:
Character large objects (data type CLOB) can be a great alternative for storing comprehensive textual information. The data of a CLOB column is stored as a separate sub-table. Separating them from the base table can give a significant performance advantage for queries which are accessing the table with a full table scan: More rows can fit into each data block.
As a consequence, fewer data blocks have to be moved from disk to FSG cache, vastly reducing the number of disk IOs.
Here is an example table; we assume that column "Desc" of each table row contains extensive textual information:
CREATE TABLE Varchar_Table
(
ID INTEGER NOT NULL,
Col1 CHAR(1),
Desc VARCHAR(32000)
) PRIMARY INDEX (ID);
In the absence of a secondary index, all data blocks have to be moved from disk to the FSG cache, even if we are only interested in the content of column ID. As we assume that all rows contain extensive textual information in column Desc, this is expensive from a resource consumption point of view:
SELECT ID FROM Varchar_Table WHERE Col1 = 'X';
By changing the data type from VARCHAR to CLOB, resource consumption (disk IOs) can be decreased significantly: many more rows are packed into each data block, so fewer blocks have to be moved from disk into the FSG cache:
CREATE TABLE CLOB_Table
(
ID INTEGER NOT NULL,
Col1 CHAR(1),
Desc CHARACTER LARGE OBJECT
) PRIMARY INDEX (ID);
SELECT ID FROM CLOB_Table WHERE Col1 = 'X'; -- fewer blocks have to be moved from disk to FSG cache
No performance advantage is measurable if the CLOB column is part of the select list. In this case, the base table plus the sub-table holding the CLOB data have to be accessed:
SELECT ID, Desc FROM CLOB_Table WHERE Col1 = 'X';
Continuing the Data Type considerations, here are more issues that demand attention from time to time.
Aim for nullability settings that comply with the data model and are unified across all tables. A mandatory column left open for NULL entries has a cost both on the storage side and on the Optimizer side: an extra presence bit per column has to be reserved, and a NULL option has to be considered when building the SQL execution plan.
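A minimal sketch (table and column names are hypothetical):

```sql
-- Columns that are mandatory by the data model are declared NOT NULL,
-- so the Optimizer need not plan for NULLs in them:
CREATE TABLE Customer
(
   CustomerId   INTEGER NOT NULL,
   CustomerName VARCHAR(100) CHARACTER SET LATIN NOT NULL,
   MiddleName   VARCHAR(100) CHARACTER SET LATIN   -- genuinely optional
) UNIQUE PRIMARY INDEX (CustomerId);
```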
Double-check whenever you extract parts of one source table attribute into a target table attribute so that you do not create phantom variance and incongruence of values. Assume a source attribute ORDER_NUMBER can be structured differently depending on where the order originated from:
AAA.123.456O or BBB.123.456. Assume further that the "456" positions represent a customer type code. The "O" stands for "order"; the application creates it that way. If you merely extract the number using a substring starting from position 5, your result will not be entirely numeric (which you might not even notice as you load it into a character field!), and there will be invalid customer type codes. Surprisingly wrong reporting results for customer type aggregations, despite no load failures and passed quantitative checks, will keep you busy at a later stage.
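A minimal sketch of the pitfall, using the literal values from the example above:

```sql
-- Extracting blindly from position 5 drags the trailing 'O' along:
SELECT SUBSTRING('AAA.123.456O' FROM 5);   -- '123.456O' (not numeric!)

-- Codes without the suffix behave differently:
SELECT SUBSTRING('BBB.123.456' FROM 5);    -- '123.456'
```

The two source formats thus produce structurally different extracts, which a character target column silently accepts.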
This is one of the noblest yet hardest-to-achieve goals of a warehouse, at least as long as we humans are allowed to type in values in source system applications. Nevertheless, we need to reach out for it, just as we tidy up our apartments, not in the expectation of a one-time final orderliness.
In contrast to cleanliness, which refers to the adequate processing of data, I like to call sobriety a state of redundancy-free and simplified attribute content down to the point where it cannot be pruned further without loss of information.
First, values should be either all written out or all coded and abbreviated. A threefold categorization of "A," "B," and "DEFAULT" could be converted to "A," "B," and "D" without loss of information, cutting the size down to one character byte instead of several. Taking this further: could dozens of full-text entries be converted to official codes so that your table doesn't take in extra calories?
An anonymized real-world example:
An ENTITY_NAME, let’s say a branch name, is stored like this, thousands of times each:
“ABCD – ABCD Division of ABCD Industries Fantasia”
Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?
By this, I could cut down the ENTITY_NAME from VARCHAR(200) to CHAR(4). The savings can be estimated at (average length of current entries - 4 + 2) bytes per record, the +2 accounting for the VARCHAR length bytes that are no longer needed. The table I derived this example from had 177 million records at the time I wrote these lines.
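The lookup-table approach might be sketched like this (table and column names are hypothetical):

```sql
-- The fact table stores only the 4-character code ...
CREATE TABLE Branch_Fact
(
   EntityCode CHAR(4) CHARACTER SET LATIN NOT NULL
) PRIMARY INDEX (EntityCode);

-- ... and a tiny lookup table keeps the full name exactly once:
CREATE TABLE Entity_Lookup
(
   EntityCode CHAR(4) CHARACTER SET LATIN NOT NULL,
   EntityName VARCHAR(200) CHARACTER SET LATIN NOT NULL
) UNIQUE PRIMARY INDEX (EntityCode);
```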
Second, summarize and prune. Although this changes the table content somewhat, it will not alter its meaning, while fewer distinct values emerge. These then have a higher likelihood of being subject to compression. If relevant, you will need fewer partitions. Address and name fields are always a good place to start.
Derived from what I just saw on a production table, imagine a street field mapped 1 to 1 from the source where we have stored “Main Street,” “Main Str.” as well as “Mainstreet” thousands of times each. Once a standard form of street name representation is defined, the distinct values can probably be cut in half.
Even if you are of the opinion that the blanks or abbreviations have to be preserved because they carry a subtle difference in meaning, a hundred different ways of saying "no input" don't have to be. Cut down all kinds of meaningless input to no more than two values: one meaning NULL and one meaning that the original input is unusable, be it because of refusal, pranks, or severe typing errors that cannot be reverse-engineered.
Propose and plan this and the chances are that your reporting and analytics department colleague will add you as a new friend.
A Success Story to Inspire You
Recently I tested data type resizing across dozens of tables. These tables were already compressed and mostly well partitioned and indexed. Unfortunately, they had been implemented under either significant uncertainty about the actual content or a lack of due diligence in this respect. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.