The goal of Teradata SQL tuning is to cut resource usage. There are two measures we have to pay attention to:
These are absolute values. They are not influenced by any concurrent system activity and stay stable for a particular execution plan.
Don't use execution times as your optimization target. Many irrelevant factors will affect run times:
Session blocking, workload delays, heavy concurrent workload, etc.
The most expensive task for any RDBMS is to move data from the mass storage devices to memory.
Many of the techniques & ideas described below reduce the number of transferred data blocks (from disk to memory and vice versa). Some of them help to cut CPU consumption.
The most important optimization task is to aid the Teradata Optimizer with complete and correct statistics.
We have to pay particular attention to 3 basic situations and always collect full statistics:
By following these three rules, many problems related to missing statistics will vanish.
The goal of our Teradata SQL tuning activities is to find the right balance between good query plans and the time needed to maintain useful statistical information.
The simplest way to discover missing statistics is by turning on diagnostics:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
The above statement appends statistics recommendations to the end of each explained statement:
EXPLAIN SELECT * FROM <Table> WHERE Column = 'value';
Test each recommendation separately. Don't add all recommendations at once; wisely choose the ones which improve query performance.
Ensure that no plan step has “no confidence”. Steps with “no confidence” are a sure sign of heuristics being applied to non-indexed columns, something to avoid at all costs!
The Optimizer is doing an excellent job of detecting stale statistics and extrapolating.
Before release 14.10, Teradata detected table growth by comparing two random AMP samples. One taken during statistics collection, the other during query execution.
Since Teradata Release 14.10, deleted and inserted rows can be tracked (the UDI counts). For extrapolation, the Optimizer will prefer UDI counts over random AMP sample comparison.
Still, the best way to ensure up-to-date statistics is to collect them often.
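For tables whose statistics are already defined, the whole set can be refreshed in one statement. A minimal sketch, using the <Table> placeholder:

```sql
-- Re-collect every statistic already defined on the table in one go
COLLECT STATISTICS ON <Table>;
```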
If we need to find stale statistics, we have to compare estimations against the real row count. There are two places where we can use this information:
The statistics histograms give us almost the same information the Optimizer has when it creates the execution plan:
Why did I say “almost the same information”? Estimates improve each time the Optimizer gets new insights. These can come from earlier join steps, single table predicates, and aggregations.
Furthermore, different methods of extrapolation will adjust the estimations shown in the histograms.
If you need the estimations after extrapolation you can use the following statement:
SHOW CURRENT STATISTICS VALUES ON <Table>;
Here is a “single table predicate” example. It demonstrates the usage of derived estimations:
SELECT * FROM <Table1> t01 INNER JOIN <Table2> t02 ON t01.Key = t02.Key WHERE t01.Column IN (1,2,3);
The retrieve step estimation for <Table1> is 3 distinct values. This information flows into the joining step. If the average number of rows per value for both tables is about 1 (unique data), the resulting spool for the join is three rows.
A straightforward approach to detect stale statistics is this:
SELECT * FROM <Table> WHERE Column = 1;
As value 1 is not available in <Column>, the Optimizer will estimate the result set to be the average number of rows per value. I guess many of you expected zero rows?
More information is available here: Teradata Statistics Basics
My last piece of advice:
Keep a copy of both the execution plans – before and after the change – to see the impact of your change.
When it comes to Primary Index choice, we must create a fair balance, weighing opposing requirements against each other:
Even data distribution and join performance.
Two tables can only be joined AMP-locally if the rows of both tables are on the same AMP, which is the case when both tables share the same primary index.
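Whether a primary index distributes the rows evenly can be checked with a simple aggregation over the row hash of the primary index columns. A sketch (table and column names are placeholders):

```sql
-- Rows per AMP for the given primary index; large differences
-- between AMPs indicate skewed data distribution
SELECT
    HASHAMP(HASHBUCKET(HASHROW(PI_Column))) AS amp_no,
    COUNT(*)                                AS row_count
FROM <Table>
GROUP BY 1
ORDER BY 2 DESC;
```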
Design your queries so that the Primary Index is used for joining as much as possible, as this is the cheapest way of joining.
AMP-local joining is also possible if the join condition includes columns that are not in the primary index. But if the join condition does not include all primary index columns, the rows of one or both tables have to be relocated (and maybe sorted).
If you need a different primary index to improve performance, you can use volatile tables or temporary tables. Create them with the same structure and content as the original tables but with the needed primary index. The use of temporary tables is particularly useful if your query consists of many join steps.
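Such a copy with a different primary index could look like this (table and column names are hypothetical):

```sql
-- Volatile copy of the original table, redistributed by the join column
CREATE VOLATILE TABLE VT_Orders AS
(
    SELECT * FROM Orders
)
WITH DATA
PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;

-- Statistics on the new primary index help the following join steps
COLLECT STATISTICS COLUMN(CustomerId) ON VT_Orders;
```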
Indexes give the Optimizer more data access paths. They improve highly selective retrieve actions. Unfortunately, indexes consume permanent disk space, and they require maintenance when the underlying base table changes.
Recommendation:
If indexes are not used by the Optimizer, and not useful in the PDM design, drop them immediately. They will only waste space and resources.
A unique secondary index (USI) allows for direct row access (like the primary index). A non-unique secondary index (NUSI) requires a full index subtable scan on all AMPs.
The NUSI can be an advantage over base table access if the index subtable is smaller than the base table. Covering NUSIs are more useful than non-covering ones: no base table lookups are needed, and no extra costs are created.
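For illustration, here is how both index types are created (table and column names are hypothetical); remember that the NUSI needs statistics on the index column to be considered:

```sql
-- USI: direct row access, similar to the primary index
CREATE UNIQUE INDEX (OrderNo) ON Orders;

-- NUSI: accessed via an all-AMP index subtable scan;
-- without statistics on the index column it will not be used
CREATE INDEX (OrderStatus) ON Orders;
COLLECT STATISTICS COLUMN(OrderStatus) ON Orders;
```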
The difference between an index and partitioning is that indexes are sub-tables, while partitioning is another way of structuring the base table itself. Partitioning allows the Optimizer to limit access to the data blocks of a partition. The advantage of partitioning is that partition elimination is always applied. Index usage has preconditions. For example, the NUSI will not be used without statistics collected on the index columns.
Still, there is a significant advantage of indexing: we create the index, we check if it's used, and we drop it if it's not. Have you ever partitioned a 200 terabyte table just for testing? I guess this is not what we like to do.
Another disadvantage of partitioning comes to my mind:
Join performance worsens if partitions of both tables don't match or one of the tables is not partitioned.
Whenever working with partitioning, you have to keep the data warehouse architecture in mind. Decide if your solution fits into it. Partitioned tables use different join techniques. If tables have different partitions, this has an adverse impact on join performance!
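As a sketch of what a partitioned table can look like, here is a hypothetical sales table partitioned by month:

```sql
CREATE TABLE Sales
(
    SalesId   INTEGER NOT NULL,
    SalesDate DATE    NOT NULL,
    Amount    DECIMAL(18,2)
)
PRIMARY INDEX (SalesId)
PARTITION BY RANGE_N (
    SalesDate BETWEEN DATE '2020-01-01' AND DATE '2024-12-31'
              EACH INTERVAL '1' MONTH,
    NO RANGE  -- catch-all partition for dates outside the defined ranges
);
```

A query restricting SalesDate to a few months only touches the data blocks of those partitions (partition elimination).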
Conclusion: There is no one-size-fits-all method. You have to check what works best for you!
Often query performance improves when the query is rewritten. Here are some ideas:
Query rewriting is a very powerful way to improve performance, sometimes even when all other techniques fail. It is possible to rewrite a query in a purely technical way, but understanding its business logic (“can I replace this left join with an inner join?”) reveals far more tuning opportunities.
See also our article on query rewriting here: Teradata Tuning – Query Rewriting
I was not sure if I should add the physical data model to this list, for one reason: often we can't make any significant changes in the physical database design. Too many consumers on top of the database (such as reporting tools and data marts) would require a redesign.
Still, if we can improve the physical model, this is one of the most productive changes.
The best advice I can give:
Keep your core model normalized and denormalize in your data marts. Many of you will not agree; I can live with this. My experience is that early denormalization causes bad performance, especially when done without apparent reason.

Check all columns of all tables to see if the information stored can be broken down further, and ensure that join columns have the same data types and character sets. Columns containing more than one piece of information force the user to use expression joins. Most likely, the Optimizer will then use neither statistics nor the primary index for joining.
Ensure that all primary key columns are defined as NOT NULL. Add default values where appropriate. If you store codes, there is no reason to use UNICODE. It will just waste space.
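Put together, such a column definition could look like this (a hypothetical customer table):

```sql
CREATE TABLE Customer
(
    CustomerId  INTEGER NOT NULL,      -- primary key column: NOT NULL
    CountryCode CHAR(2) CHARACTER SET LATIN
                DEFAULT 'XX' NOT NULL  -- LATIN is enough for codes
)
PRIMARY INDEX (CustomerId);
```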
Apply multi value compression to all tables: more rows fit into each data block. Data blocks are the smallest unit transferred between disk and memory, so more rows per data block lead to fewer disk I/Os and better performance.
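A minimal multi value compression sketch (table, columns, and compressed values are hypothetical):

```sql
CREATE TABLE CallDetail
(
    CallId      INTEGER NOT NULL,
    -- frequent values are stored once in the table header,
    -- not in every row
    CallType    CHAR(13) COMPRESS ('LOCAL', 'MOBILE', 'INTERNATIONAL'),
    DurationSec INTEGER  COMPRESS (0)
)
PRIMARY INDEX (CallId);
```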
Please consider that the above advice can only be a loose collection of ideas about how to fix a broken data model.
There are several unique optimization opportunities which you should consider:
Observing a query while it's running allows detecting the critical steps. Most people use Viewpoint for real-time monitoring. I prefer another tool called dbmon. The author is a guy from Teradata Austria (I hate the slowness of Viewpoint).
Bad SQL performance is usually caused by either skewed processing or wrong estimations:
That's my way of real-time monitoring:
I wait for a step that is either skewed or in which the estimated input rows to a join don't make sense (such as two spools with millions of rows being joined with a product join). I concentrate my optimization on these steps.
If the skew is the issue, I will analyze the join column value skew. If estimations are wrong, I will go over the statistics and make sure that they are up to date.
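To find skewed join column values, a top-n frequency check is usually enough. A sketch (table and column names are placeholders):

```sql
-- The ten most frequent values of the join column; a few values
-- with huge counts explain skewed join steps
SELECT JoinColumn, COUNT(*) AS row_count
FROM <Table>
GROUP BY 1
QUALIFY ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) <= 10;
```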
Issues with statistics can be fixed quickly. Skew issues can be quite stubborn. Query rewriting is always my last option unless I find something foolish and easy to repair.
Always measure resource usage before and after the optimization. As I said earlier: query run times are not a reliable measure!
Here is a SQL query you can use in your daily work to extract appropriate steps from the query log. You have to set a different QUERYBAND for each query version you are running to be able to distinguish them. You need “select” access to “DBC.DBQLOGTBL”.
SET QUERY_BAND = 'Version=1;' FOR SESSION;

SELECT
    AMPCPUTIME,
    (FIRSTRESPTIME - STARTTIME DAY(2) TO SECOND(6)) AS RUNTIME,
    SPOOLUSAGE / 1024**3 AS SPOOL_IN_GB,
    CAST(100 - ((AMPCPUTIME / (HASHAMP()+1)) * 100 / NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
    MAXAMPCPUTIME * (HASHAMP()+1) AS CPU_IMPACT,
    AMPCPUTIME * 1000 / NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM DBC.DBQLOGTBL
WHERE QUERYBAND = 'Version=1;';
The query will return:
The goal is to cut total CPU usage, consumed spool space and skew.
Tactical workload requires a very particular skillset. The best tuner will fail if he doesn't recognize that he is dealing with a tactical workload. I have seen complete projects failing because developers ignored this important fact.
The skillset required for the strategic workload is unique.
I strongly recommend reading this post, which explains all the details:
The worst case is the following: two tables are joined, and neither has statistics. To make sure that there are not even summary statistics at the table level, we create both tables from scratch and insert the rows. The two tables are filled from a dummy table which serves as the basis for all our examples:
CREATE SET TABLE DWHPRO.STATS1, FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO,
    MAP = TD_MAP1
(
    PI   INTEGER NOT NULL,
    COL1 INTEGER NOT NULL,
    COL2 INTEGER NOT NULL
)
PRIMARY INDEX ( PI );
We insert 50 rows into the table stats1
CREATE SET TABLE DWHPRO.STATS2, FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO,
    MAP = TD_MAP1
(
    PI   INTEGER NOT NULL,
    COL1 INTEGER NOT NULL,
    COL2 INTEGER NOT NULL
)
PRIMARY INDEX ( PI );
We insert 150 rows into the table stats2
As the following SQL shows, there are no statistics on either table. The Optimizer must therefore estimate the cardinality of the two tables on the basis of a random AMP sample over the primary index:
SHOW STATISTICS VALUES ON STATS1;
[3624] There are no statistics defined for the table
SHOW STATISTICS VALUES ON STATS2;
[3624] There are no statistics defined for the table
We will perform a simple INNER JOIN of the two tables on a column that is not indexed:
EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock DWHPRO.STATS2 in TD_MAP1 for read.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs in TD_Map1. The size of Spool 2 is estimated with low confidence to be 96 rows (2,400 bytes). The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way of an all-rows scan, which is joined to DWHPRO.STATS2 by way of an all-rows scan with no residual conditions. Spool 2 and DWHPRO.STATS2 are joined using a dynamic hash join, with a join condition of ("COL1 = DWHPRO.STATS2.COL1"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 600 rows (27,000 bytes). The estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.01 seconds.
From the execution plan, we can see that the estimated number of rows in the result set is 600. Furthermore, we can see that the estimated number of rows in table STATS1 is 96. However, we have to halve this number because the table is duplicated to all AMPs (the test system has 2 AMPs). That is, the Optimizer estimates the number of rows in table STATS1 at 48, which is quite close to the real number of 50 rows.
To get the number of estimated rows in table Stats2 we can use the following trick:
EXPLAIN SELECT * FROM STATS2;

1) First, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read.
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS2 by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 156 rows (5,148 bytes). The estimated time for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.01 seconds.
We can see from the execution plan that the number of estimated rows is 156.
The question arises: how is the estimate of 600 rows for the result set derived? The answer is that the following heuristic is applied:
SQRT(number of rows of the bigger table) * (Number of rows from the smaller table). The result is rounded up to the nearest whole number. (Many thanks to Diether Noeth who gave me this information in the Teradata forum):
Let's do the math for our example:
SQRT(156) = 12,489996; 12,489996 * 48 = 599,519808, which is rounded up to 600.
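The heuristic can be recomputed directly in SQL:

```sql
-- SQRT(rows of the bigger table) * rows of the smaller table,
-- rounded up to the next whole number
SELECT CEILING(SQRT(156) * 48);  -- 600
```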
If there are current statistics on at least the cardinality of the tables, these are used for the heuristic:
COLLECT STATS COLUMN(PI) ON STATS1;
COLLECT STATS COLUMN(PI) ON STATS2;

EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock DWHPRO.STATS2 in TD_MAP1 for read.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by way of an all-rows scan with no residual conditions into Spool 2 (all_amps), which is duplicated on all AMPs in TD_Map1. The size of Spool 2 is estimated with high confidence to be 100 rows (2,500 bytes). The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way of an all-rows scan, which is joined to DWHPRO.STATS2 by way of an all-rows scan with no residual conditions. Spool 2 and DWHPRO.STATS2 are joined using a dynamic hash join, with a join condition of ("COL1 = DWHPRO.STATS2.COL1"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 613 rows (27,585 bytes). The estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.01 seconds.
The estimate for the result set has increased to 613 rows. But has it improved? Let's just run the query:
SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;
-> SELECT completed. 0 rows returned.
As we can see, the query does not return any rows at all. So the estimation has even gotten a bit worse due to the statistics at the table level.
It can be a bit confusing to look at both execution plans again: the retrieve step in the first plan (without statistics) has only low confidence, the retrieve step in the second plan (with statistics) has high confidence, and still the estimate with low confidence is better.
This was the first part about Join Estimations. In the next articles, we will take statistics step by step and see how this affects the optimizer's estimates.
Heuristics are applied whenever there are no collected statistics available for a nonindexed column. Heuristics are rules of thumb to estimate the number of selected data rows. They are used in WHERE condition predicates and for join planning.
Heuristics estimate the number of selected data rows to be a certain percentage of the table cardinality. Maybe you wonder how the optimizer knows the table cardinality?
The optimizer can use collected summary statistics or collected primary index statistics to estimate table cardinality. In the absence of collected statistics, the optimizer still can retrieve a random AMP sample to estimate table cardinality.
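Summary statistics (table cardinality without column histograms) can be collected explicitly; a minimal sketch using the <Table> placeholder:

```sql
-- Table-level statistics only: cardinality, average row size,
-- rows per data block
COLLECT SUMMARY STATISTICS ON <Table>;
```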
All heuristic estimations done by the Teradata Optimizer are based on the estimated table cardinality.
In this post, we will analyze several queries, and their estimates based on heuristics. Furthermore, we will learn how to replace heuristic estimations with more accurate ones; we want to avoid heuristic estimates, as they are only providing inaccurate statistical information to the Teradata Optimizer.
Statistics on nonindexed columns play a significant role in WHERE condition predicates. We will start with very simple examples to teach the base concepts of heuristics and will introduce more complex setups afterward.
Before we dive deeper into the details of heuristics, one important note:
Don’t forget: The following information is related to nonindexed columns. Whenever we refer to the “number of rows selected” or a “percentage of rows selected,” the table row estimation is based on collected summary or index statistics, or on a random AMP sample. Just keep in mind, when we refer to table row counts, that we mean the estimations, not the real numbers.
Here is the test table we will use in all our examples. The table row count is 100.000, which is the base for all heuristic estimations the optimizer is doing.
We will start our tour with a very simple example, having only one WHERE condition predicate for a nonindexed column (“segment” is neither the primary index nor part of any other index of the customer table).
Here is our test query, selecting all customers of segment 1:

SELECT * FROM Customer WHERE Segment = 1;
We collected statistics on the primary index column of table “customer”. The row count estimation derived from these statistics is 100.000.
We did not collect statistics on the nonindexed column “segment”
The real number of rows in “segment 1” is 20.000.
As column “segment” is a nonindexed column, the optimizer applies heuristic estimations.
The number of selected rows is estimated to be 10% of the table rows.
The hard-coded rule is 10%, without any exception:
100.000 * 10% = 10.000 rows
The optimizer underestimated the real number of data rows by 50%! If we had 0 rows in segment 1, the estimation would still be 10.000 rows, and in this case the optimizer would drastically overestimate the number of selected rows.
We consider the above “10% rule” as the base estimation the optimizer uses whenever there is only one WHERE condition predicate available for a single value.
Our first example carved out why heuristic estimates should be avoided at all costs: they are often far from reality.
In the next example, we will add another WHERE condition predicate for a second nonindexed column to the query.
We did not collect statistics on any of the columns used in the predicates. This is our test query:
SELECT * FROM Customer WHERE SEGMENT = 1 AND Age = 25;
The heuristic rule applied for nonindexed columns which are AND combined is:
For each AND combined predicate, we multiply the initial estimate by 75%.
Let’s calculate the estimation for our example query:
10% of 100.000 data rows = 10.000 rows. This is the base estimation. We multiply the base estimation by 75%, i.e. 10.000 * 75%. The result is 7.500 rows.
We could extend our example and add one more WHERE condition predicate; this would lead us to the following total estimate:
((100.000 * 10%) *75%) * 75% = 5.625 data rows.
Finally, by adding one more additional WHERE condition predicate, we would be down to 4.219 rows.
Each additional “AND”-combined WHERE condition predicate decreases the number of estimated rows:
In the next example, we will learn how heuristic estimates are implemented for WHERE condition predicates on different OR combined columns.
SELECT * FROM Customer WHERE SEGMENT = 1 OR Age = 25;
The rule for OR combined predicates for different columns is as follows:
The estimate for each predicate is the base estimation we defined before (i.e. 10% of the table rows)
We know that the estimated number of rows in the customer table is 100.000.
The estimation for the result set of the query is (100.000 * 10%) + (100.000 * 10%), or 20.000 rows (2 times the base estimation)
Like in the previous example, estimations will be either too high or too low, and most probably never will match the real number of rows being selected.
Of course, there is an upper limit for the estimation. If we combine 11 conditions with “OR”, the estimate will not be 110%, but 100%.
These were the simplest heuristic rules. Knowing them already gives you very good insight, and they will help you understand execution plan estimations.
The heuristic estimations for OR combined predicates on the same columns are more complex.
Nevertheless, we decoded the applied algorithm and carved out the basic rules and building blocks.
First of all, the optimizer distinguishes between ranges of values and single values.
We will start with the estimations for single value predicates.
SELECT * FROM Customer WHERE Age in (20,22,24);
SELECT * FROM Customer WHERE Age = 20 OR AGE = 22 OR Age = 24;
The optimizer will assume that 10% of the table rows will be selected, but only for the first two predicates!
As soon as there is a second predicate, 1% of the table cardinality is added for every predicate (including the first one). And starting with the third predicate, no further base estimation (10% of the table cardinality) is added.
For our example this means:
(100.000 * 10%) for the first predicate
(100.000 * 10%) for the second predicate
(3 * (100.000 * 1%)) = 3.000 for the three predicates.
The total estimation, therefore, is 23.000 rows.
Estimations for Ranges of Values
Here are examples for range predicates. There is no gap between selected values.
SELECT * FROM Customer WHERE Age in (20,21,22);
SELECT * FROM Customer WHERE Age BETWEEN 20 AND 22;
SELECT * FROM Customer WHERE Age = 20 OR AGE = 21 OR Age = 22;
If our WHERE condition contains only one range predicate, the estimation will be 20% of the table rows, in our example, this is 20.000 rows.
It doesn’t matter how many values the range contains. The estimation will equally be 20% for this query:
SELECT * FROM Customer WHERE Age BETWEEN 1 AND 1000;
If we add a second range to our query, estimation rises to 2 times 20% or 40%:
If we add a third range, a more detailed estimation is applied.
The estimate for the first and second range predicate is 10% each (remember, in the previous example with two ranges, it was 20% per range predicate).
No 10% estimations will be added after the second range predicate. Instead, similarly to the estimations for single values, the number of distinct values in all ranges is counted, and for each value the estimation is increased by 1%.
SELECT * FROM Customer WHERE Age IN (10,11,12) OR AGE IN (20,21,22) OR AGE IN (30,31,32);
For our example this means:
(100.000 * 10%) for the first range predicate
(100.000 * 10%) for the second range predicate
(9 * (100.000 * 1%)) = 9.000 for the nine values.
The total estimation, therefore, is 29.000 rows.
The estimation for one range predicate is always 20%, the number of distinct range values is not changing the estimation.
The estimation for two range predicates increases to 40%, and the number of distinct range values is not changing the estimation.
If our query has 3 or more range predicates, the estimation for the first 2 ranges is 20% (10% plus 10%), and each distinct value of all ranges increases the estimation by 1%.
We will now come to the question: “What is the impact on heuristic estimations if we collect statistics on one or more of the WHERE condition predicates?”
In this case, the optimizer considers all WHERE conditions with collected statistics and chooses the estimation of the one with the highest selectivity as the starting point for further estimations.
In other words: The estimations will not start with the usual heuristic estimation (10%, 20%, etc.), but with a much better estimation, based on collected statistics. Here is an example:
In the example below, we first collect statistics on the column Age:
COLLECT STATISTICS COLUMN(Age) ON Customer;
We assume that the estimation for customers of age 25 is 5.000 data rows.
We execute the select statement:
SELECT * FROM Customer WHERE Segment = 1 AND Age = 25;
Both columns are nonindexed.
As we collected statistics on exactly one column, they will be the starting point for the estimations, namely 5.000 data rows (the selectivity for Age = 25).
Like in our previous example related to AND combined conditions, for the second column the optimizer will apply heuristics:
5.000 * 75% = 3.750 data rows.
Whenever there is at least one column with collected statistics available, this can drastically increase the quality of the estimation.
I have another example for you, this time we combine three columns with “AND”, two of them having collected statistics, one without statistics.
Of course, all columns are, as always, nonindexed columns. But I assume this is clear by now.
We collect statistic on two columns:
COLLECT STATISTICS COLUMN(CustomerAge) ON Customer;
COLLECT STATISTICS COLUMN(Gender) ON Customer;
The collected estimation for customer age is 5.000 rows, the estimation for Gender=’U’ (for value “unknown”) is just 100 rows.
Like in the previous example, the optimizer picks the column with the highest selectivity to make its initial estimation: in our example, this is 100 rows.
Although we have a second column with collected statistics, it is of no use to the optimizer: the second and third column estimations are again calculated with the above 75% rule:
(100 * 75%) * 75% = 56.25 rows (which will be rounded to 57 rows).
SELECT * FROM Customer WHERE CustomerId = 1 AND CustomerAge = 25 AND Gender=’U’;
Here is the last example demonstrating a combination of “OR” combined columns with and without statistics (sorry, no graphic available):
We collect statistics on CustomerAge.
COLLECT STATISTICS COLUMN(CustomerAge) ON Customer;
SELECT * FROM Customer WHERE CustomerId = 1 OR CustomerAge = 25;
The optimizer will simply sum up estimates taken from collected statistics and heuristic estimates for columns without collected statistics:
Again, the estimation for CustomerAge=25 is 5.000 rows. Heuristics for CustomerId=1 give 10.000 rows (100.000 * 10%).
The total estimation is 15.000 rows.
We could go on forever with examples for heuristic estimations, but we will stop here with some final examples for frequently used WHERE condition predicates:
Teradata Collect Statistics Using Sample
Collect Statistics in Teradata – Part 2
Teradata Statistics Basics
As experienced Teradata tuning specialists, we are convinced that the two most important factors to obtain good system performance are the choice of a good primary index and the timely and appropriate collection of statistics.
Teradata uses a cost based optimizer. If we collect stats in Teradata, we provide the Optimizer with the data demographics it needs to generate good execution plans.
All cost based optimizers require statistical information about the data and hardware, such as CPU, Disk, Memory, and Processors.
The optimizer will create several plans to choose from and use the one with the lowest usage of resources. It will not evaluate each possible plan (which could be billions) but uses algorithms to identify the best candidates and chooses the best out of them.
The costs for each plan are derived from table row count of the physical tables and spools generated by operations such as selections, joins, and projections.
Accurate estimations are crucial to getting optimal plans; providing statistical information is therefore critical for performance optimization.
Teradata distinguishes between statistics about tables and columns. Table statistics are, for example, the number of rows, the average row size, and the number of rows per data block.
Column statistics go into more detail and store information about each column value (the column histograms) or the whole column (summary column statistics).
Summary column statistics are, for example, the number of distinct values, the lowest and the highest value.
Detailed column statistics are, for example, the number of rows per value, and information about the most common column values, which is used by the optimizer for skew detection.
The steps of interest for a Teradata Tuning Specialist are selections, joins, and aggregations.
We can think of retrieve steps as the equivalent of the WHERE condition predicates of our query. Aggregation steps are generated by GROUP BY statements.
Statistics impact the order in which Teradata executes joins, and they determine which join strategy is used.
When we talk about join plan, we have to consider join types and data relocation steps:
Each time we join rows from two tables together, they have to be located on the same AMP. In the best case, both tables have the same primary index, and no data relocation is needed. In the worst case, both tables have to be rehashed by the join columns.
The Optimizer uses statistics to answer the following questions:
Selections
Is it cheaper to use an index or a full table scan? The Optimizer will always use the primary index or a Unique Secondary Index; no statistics are considered, as it's anyway the cheapest way to access table rows.
Statistics, however, help the Optimizer decide whether to do a full table scan or to use a nonunique secondary index, a join index, or a hash index.
Cost estimation for partitioned tables relies on statistics on the partitioning columns and on the so-called “dummy” PARTITION column. We will explain this in detail later when we talk about best practices for statistics collection.
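For a row-partitioned table, this typically translates into statements like the following sketch (Sales_PPI and SalesDate are hypothetical names; the system-derived PARTITION column is genuine Teradata syntax):

```sql
-- Statistics on the partitioning column:
COLLECT STATISTICS COLUMN (SalesDate) ON Sales_PPI;

-- Statistics on the so-called "dummy" PARTITION column,
-- which tells the Optimizer how many partitions are populated:
COLLECT STATISTICS COLUMN (PARTITION) ON Sales_PPI;
```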
Join Steps
The Optimizer has to determine which join type to use (such as merge join, product join, or hash join) and what the cheapest way is to bring the rows of two tables to the same AMP.
Furthermore, it has to find the optimal join order. These three variables, which have to be optimized at the same time, leave plenty of room for a bad execution plan.
Only current statistics ensure that the optimizer can create the best execution plan possible.
As a join preparation step, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Some join types also require the tables to be sorted by row hash. Each of these preparation steps causes costs that have to be weighted against each other.
Statistics give the Optimizer greater confidence in its estimations and thus allow it to take more risk when creating the execution plan.
Here is one example:
The hash join performs great if the smaller table of a join can be kept entirely in memory, but performs badly if the memory is not sufficient. Without statistics, the Optimizer may avoid the hash join altogether. Information about table cardinalities and the available main memory is used to optimize the execution plan.
To summarize: Without statistics, query execution times would be unpredictable. They are the heart of the system!
See also: Teradata Statistics Basics
Statistics provide the Optimizer with the data demographics it needs to generate good execution plans. It creates several candidate plans and chooses the one with the lowest estimated cost.
All cost based optimizers use statistical information about data and hardware.
Teradata uses statistical information about tables (summary statistics) and about columns. Table metrics include, for example, the number of rows, the average row size, and the number of rows per data block.
Column statistics are more detailed: they include the number of rows per value, the number of distinct values, the lowest and highest value, and information about the most common column values, which the Optimizer uses for join planning.
An execution plan consists of join, retrieve, and aggregation steps. Retrieve steps are caused by WHERE conditions, aggregation steps by GROUP BY clauses.
Statistics influence the order in which the Optimizer executes the plan steps and which technique it uses.
The Optimizer uses statistics to answer the following questions:
Is it cheaper to use an index or a full table scan? The Optimizer will readily use a primary index or a Unique Secondary Index. Statistics, however, help it decide whether to do a full table scan or use a nonunique secondary index, and whether nonunique secondary index bit mapping should be utilized.
Which join type should be used (such as merge join, product join, or hash join)? When two rows are joined, they must be located on the same AMP, which without any preparation step is only the case if the join columns are the primary index of both tables. The Optimizer has to find the cheapest way to bring the rows together: for example, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Last but not least, the Optimizer has to decide on the ideal join order.
The examples above show how statistics influence the execution plan.
Statistics allow the Optimizer to create riskier execution plans by increasing the confidence in its estimations.
As an example, consider again the hash join, which performs great if the smaller table can be kept entirely in memory but badly if it cannot.
Without collected statistics, the Optimizer might avoid the hash join to mitigate the risk of bad performance.
Collected statistics are only a starting point for estimations. They are adjusted throughout the optimization process as more information from various database constraints, such as CHECK constraints, referential integrity constraints, and query predicates, becomes available. The adjusted estimations are called the “derived statistics”.
The Optimizer further refines the derived statistics each time it gets new details from:
– A join step
– Single table predicates (a WHERE condition)
– An aggregation step
Here is one example:
Assume we have a WHERE condition like “WHERE column IN (1,2,3)”. The Optimizer derives that the result set will contain at most 3 distinct values of this column and uses this refined piece of information from now on to improve the estimations of the subsequent execution steps.
Without collected statistics, the Optimizer relies solely on random-AMP sampling when it builds the execution plan. This approach gives good estimations if the data is evenly distributed (unique or nearly unique columns), but it results in wrong estimates for skewed data.
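Whether a column is skewed enough to mislead random-AMP sampling can be checked with a query like this sketch (table and column names are invented; HASHROW, HASHBUCKET, and HASHAMP are standard Teradata hash functions):

```sql
-- Rows per AMP for a given column: strongly uneven counts mean a
-- single random AMP is not representative of the whole table:
SELECT HASHAMP(HASHBUCKET(HASHROW(CustomerId))) AS AmpNo
     , COUNT(*) AS RowsPerAmp
FROM Sales
GROUP BY 1
ORDER BY 2 DESC;
```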
Biased-value histogram entries store exact cardinalities. With up-to-date statistics, the number of rows returned by a retrieve step on a biased value should therefore match the histogram exactly.
In other words: the Optimizer can determine the number of rows being retrieved simply by looking up the biased value in the histogram.
Unfortunately, this is not always the case.
I created a test scenario, populating a test table with 50,000 rows containing 100 distinct dates:
CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT 'YYYY-MM-DD' NOT NULL
) PRIMARY INDEX (TheDate);

INSERT INTO Statistics_Estimation1
SELECT DATE '2010-01-01' + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;
To have fresh statistics, I executed the statement below. The system can store all 100 distinct dates of our 50,000 rows as biased histogram values.
This is an important requirement of our test setup, as it ensures that the estimated number of rows should always match the retrieved number of rows exactly:
COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);
We can now check the histograms by issuing the statement below:
SHOW STATISTICS VALUES ON Statistics_Estimation1 ;
/** Biased: Value, Frequency **/
/* 1 */ DATE '2010-01-01', 515,
/* 2 */ DATE '2010-01-02', 498,
/* 3 */ DATE '2010-01-03', 510,
/* 4 */ DATE '2010-01-04', 482,
/* 5 */ DATE '2010-01-05', 490,
/* 6 */ DATE '2010-01-06', 518,
/* 7 */ DATE '2010-01-07', 510,
/* 8 */ DATE '2010-01-08', 512,
/* 9 */ DATE '2010-01-09', 493,
/* 10 */ DATE '2010-01-10', 513,
/* 11 */ DATE '2010-01-11', 498,
/* 12 */ DATE '2010-01-12', 513,
/* 13 */ DATE '2010-01-13', 509,
/* 14 */ DATE '2010-01-14', 500,
/* 15 */ DATE '2010-01-15', 491,
/* 16 */ DATE '2010-01-16', 489,
/* 17 */ DATE '2010-01-17', 481,
/* 18 */ DATE '2010-01-18', 508,
/* 19 */ DATE '2010-01-19', 496,
/* 20 */ DATE '2010-01-20', 505,
/* 21 */ DATE '2010-01-21', 498,
/* 22 */ DATE '2010-01-22', 496,
/* 23 */ DATE '2010-01-23', 505,
/* 24 */ DATE '2010-01-24', 504,
/* 25 */ DATE '2010-01-25', 485,
/* 26 */ DATE '2010-01-26', 508,
/* 27 */ DATE '2010-01-27', 494,
/* 28 */ DATE '2010-01-28', 507,
/* 29 */ DATE '2010-01-29', 495,
/* 30 */ DATE '2010-01-30', 501,
/* 31 */ DATE '2010-01-31', 486,
/* 32 */ DATE '2010-02-01', 517,
/* 33 */ DATE '2010-02-02', 501,
/* 34 */ DATE '2010-02-03', 505,
/* 35 */ DATE '2010-02-04', 496,
/* 36 */ DATE '2010-02-05', 514,
/* 37 */ DATE '2010-02-06', 492,
/* 38 */ DATE '2010-02-07', 496,
/* 39 */ DATE '2010-02-08', 504,
/* 40 */ DATE '2010-02-09', 486,
/* 41 */ DATE '2010-02-10', 509,
/* 42 */ DATE '2010-02-11', 490,
/* 43 */ DATE '2010-02-12', 499,
/* 44 */ DATE '2010-02-13', 487,
/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
/* 60 */ DATE '2010-03-01', 518,
/* 61 */ DATE '2010-03-02', 491,
/* 62 */ DATE '2010-03-03', 484,
/* 63 */ DATE '2010-03-04', 510,
/* 64 */ DATE '2010-03-05', 511,
/* 65 */ DATE '2010-03-06', 494,
/* 66 */ DATE '2010-03-07', 493,
/* 67 */ DATE '2010-03-08', 500,
/* 68 */ DATE '2010-03-09', 531,
/* 69 */ DATE '2010-03-10', 497,
/* 70 */ DATE '2010-03-11', 512,
/* 71 */ DATE '2010-03-12', 481,
/* 72 */ DATE '2010-03-13', 500,
/* 73 */ DATE '2010-03-14', 510,
/* 74 */ DATE '2010-03-15', 487,
/* 75 */ DATE '2010-03-16', 511,
/* 76 */ DATE '2010-03-17', 515,
/* 77 */ DATE '2010-03-18', 513,
/* 78 */ DATE '2010-03-19', 517,
/* 79 */ DATE '2010-03-20', 516,
/* 80 */ DATE '2010-03-21', 516,
/* 81 */ DATE '2010-03-22', 504,
/* 82 */ DATE '2010-03-23', 493,
/* 83 */ DATE '2010-03-24', 510,
/* 84 */ DATE '2010-03-25', 486,
/* 85 */ DATE '2010-03-26', 495,
/* 86 */ DATE '2010-03-27', 497,
/* 87 */ DATE '2010-03-28', 481,
/* 88 */ DATE '2010-03-29', 501,
/* 89 */ DATE '2010-03-30', 486,
/* 90 */ DATE '2010-03-31', 512,
/* 91 */ DATE '2010-04-01', 487,
/* 92 */ DATE '2010-04-02', 501,
/* 93 */ DATE '2010-04-03', 490,
/* 94 */ DATE '2010-04-04', 508,
/* 95 */ DATE '2010-04-05', 518,
/* 96 */ DATE '2010-04-06', 478,
/* 97 */ DATE '2010-04-07', 494,
/* 98 */ DATE '2010-04-08', 503,
/* 99 */ DATE '2010-04-09', 494,
/* 100 */ DATE '2010-04-10', 512,
Now we execute the EXPLAIN for the query below:
EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';
1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-14'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-15'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-16'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.
The Optimizer retrieves the requested primary index values via single-AMP primary index accesses, as the number of selected rows is small compared with the total number of table rows.
Did you notice that the estimation is 1,530 rows? Isn't that strange? If we sum up the three histogram values for our three dates (2010-02-14, 2010-02-15, and 2010-02-16), we see that it should be 1,528 rows:
/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
513+517+498 = 1528
Furthermore, the Optimizer estimates 510 rows for each single-AMP access. Did we detect a bug in the Optimizer?
I would say no.
What happens is that an internal optimization takes place, which may produce slightly wrong estimates but makes the estimation process faster. The Optimizer simply adds up the values of all matching histogram entries and divides the sum by their number. In our example:
1528 / 3 = 509.33
The ceiling of the result is used to distribute the estimation evenly across the single-AMP steps, i.e. 510, 510, and 510.
As a result, we end up with an estimation that is slightly too high (3 x 510 = 1,530 instead of 1,528).
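The rounding can be replayed directly in SQL (CEILING has been available since Teradata 14.0):

```sql
SELECT 513 + 517 + 498                        AS SumOfBiasedValues  -- 1528
     , CEILING((513 + 517 + 498) / 3.000)     AS PerStepEstimate    -- 510
     , 3 * CEILING((513 + 517 + 498) / 3.000) AS TotalEstimate;     -- 1530
```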
We can easily demonstrate this by increasing the number of retrieved rows, forcing the Optimizer to do a full table scan instead of single-AMP lookups. By extending the date range to 2010-02-28, the plan changes:
EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';
1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
"(DWHPRO.Statistics_Estimation1.TheDate >= DATE '2010-02-14') AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE '2010-02-28')")
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.
As you can verify by summing up the histogram values below, the estimation of 7,438 now matches the number of rows being retrieved almost exactly.
/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
Whenever you read an execution plan, try to figure out how the estimations are built. Over time, you will get better and better at understanding estimations and will be able to solve statistics problems.
See also:
Teradata Statistics – A Case Study
Teradata Statistics Extrapolation for Equality Predicates