Teradata SQL Tuning – 6 Golden Rules you must never forget!

1. Ensuring the Completeness and Correctness of Statistics

Teradata SQL Tuning begins by providing the optimizer with the statistics it needs. This must always be done as a first step. Statistics influence the execution plan. It is not only important to define all required statistics, but also to keep them as up-to-date as possible.

There is a simple reason why I put statistics first: Often performance problems are solved simply by changing the execution plan.

Which columns require which type of statistics (full statistics collection, random-AMP samples, or sample statistics) depends primarily on the data demography.

Random-AMP sampling

Random AMP sampling is usually sufficient for columns with many different values. The table should contain significantly more rows than there are AMPs in the system.
The following applies to random-AMP sampling: The more distinct the column values are, the better the estimate will be.

If the column values are skewed, there is a risk that the random-AMP sample will be taken from an AMP that is not representative of the data. Skewed data leads to overestimation or underestimation of the number of rows.

Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.

Random-AMP sampling only takes place on indexed columns.

Therefore, you should always collect statistics on non-indexed columns that are used in WHERE conditions. Whether full or sample statistics are collected depends on how unique the column values are.

Fully collected Statistics

Fully collected statistics are required for skewed column values and for columns with a small number of distinct values (NUPI and NUSI).
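
For example (the table and column names here are hypothetical), a full collection and a sampled collection look like this; the exact USING options depend on your Teradata release:

-- Full statistics for a skewed, low-cardinality column (typical NUPI/NUSI candidate)
COLLECT STATISTICS COLUMN (CustomerStatus) ON Customer;

-- Sample statistics for a nearly unique, non-indexed column used in WHERE conditions
COLLECT STATISTICS USING SAMPLE COLUMN (AccountNumber) ON Customer;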

Statistics for PPI Tables

The Teradata Optimizer has unique statistic requirements for PPI tables.

The following statistics should additionally be collected on PPI tables:

Dummy Column “PARTITION”

Statistics on the dummy column "PARTITION" tell the optimizer the number of empty and non-empty partitions and the number of rows per partition.

Dummy column “PARTITION” + PI column

These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions. Statistics on the dummy column "PARTITION" + PI allow the optimizer to estimate the cost of sliding-window and rowkey-based merge joins and of dynamic partition elimination steps.
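
For a partitioned table, say SalesFact with primary index SalesId and partitioning on SalesDate (hypothetical names), the two additional collections could look like this:

-- Statistics on the dummy column "PARTITION"
COLLECT STATISTICS COLUMN (PARTITION) ON SalesFact;

-- Statistics on the dummy column "PARTITION" plus the primary index column
COLLECT STATISTICS COLUMN (PARTITION, SalesId) ON SalesFact;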

The following statement can be used to determine which statistics the optimizer would additionally like to have for a given SQL statement:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

This statement displays a list of suggested statistics at the end of the Execution Plan (given by the EXPLAIN statement) and the Optimizer’s opinion of their value (confidence levels).

By gradually adding these statistics, you can test their influence on the execution plan.
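
A typical usage pattern is sketched below (assuming the diagnostic is switched off again once you are done):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

-- Run EXPLAIN on the statement you want to tune; the plan now ends with a list
-- of suggested COLLECT STATISTICS statements and their confidence levels.

DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;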

Identify Stale Statistics

There are several ways to identify stale statistics. The easiest is to split the SQL statement into partial statements and test each one individually, comparing the estimated number of rows (as shown in the EXPLAIN output) with the actual number of rows each partial query returns.

The above-described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.

Here’s an example:

SELECT t01.*
FROM
<Tablename1> t01
INNER JOIN
<Tablename2> t02
ON
t01.PK = t02.PK
WHERE
t01.<Column name> = 1 AND t02.<Column name> = 2;
The above query can be divided into two parts for testing:

SELECT * FROM <Tablename1> WHERE <Column name> = 1;
SELECT * FROM <Tablename2> WHERE <Column name> = 2;

If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.

2. Teradata SQL Tuning and the Primary Index Choice

Teradata SQL Tuning is not complete without choosing the best possible Primary Index. Design your queries so that the primary index can be used in the joins.

All join columns must be part of the primary index. If even one column of the primary index is missing from the join condition, the result is a different row hash (the order in which the columns are defined in the primary index doesn't matter).

Still, the join condition can contain additional columns. These are then applied as residual conditions after locating the rows via row hash.

To execute a join between two tables, the rows of both tables must be co-located on the same AMP.

This is true when they have the same primary index. In this case, the optimizer can use a join strategy that requires no redistribution of rows.

If for some reason you cannot change the primary indexes, or if you need a specific primary index for a particular SQL statement, create a volatile table (or a true temporary table) with the same structure and content as the original table but with a different primary index.

This is the only way to ensure that all AMPs start and finish their work simultaneously. This is exactly what parallel data processing is all about.
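
A minimal sketch of the volatile-table approach described above, with made-up table and column names; the copy gets the primary index needed for the join and disappears at the end of the session:

CREATE VOLATILE TABLE Sales_VT AS
(
   SELECT * FROM Sales
)
WITH DATA
PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;

COLLECT STATISTICS COLUMN (CustomerId) ON Sales_VT;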

There is something else to consider when selecting the primary index: since every change to the value of a primary index column results in a new row hash, such rows must be redistributed.
This can still be acceptable if it happens only now and then and affects only a few rows. However, if the primary index is very volatile, it has a negative effect on performance.

3. Teradata SQL Tuning with Indexing & Partitioning

Using indexes or partitioning is another way to improve query performance.

While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to structure the rows on the mass storage device.

Whether partitioning or indexing is the better option depends on the workload type. Sometimes a combined use of partitioning and indexes is the best approach (i.e., creating an index on a partitioned table).

Tactical queries normally access one or a few data rows. The USI is therefore often an optimal choice if the access can’t be covered with a primary index. Also, a join index that represents a copy of the base table and only changes the primary index can be helpful here.
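
A sketch of such a single-table join index (table and column names are made up); it is essentially a copy of the base table, re-hashed by the column used for the tactical access:

CREATE JOIN INDEX Transactions_JI AS
SELECT TransactionId, CustomerNumber, TransactionDate, Amount
FROM Transactions
PRIMARY INDEX (CustomerNumber);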

Partitioned tables are often used for strategic queries in which a series of data (for example, the sales for the current year) is aggregated.

Disadvantages of Indexing

When we work with indexing techniques, we need to keep an eye on the entire data warehouse architecture and decide whether our solution fits. Indexes can have a negative impact on the ETL process for several reasons.

Loading tools such as Fastload require secondary indexes and join indexes to be removed before loading.

The index sub-tables have to be maintained by Teradata: INSERT, DELETE, and UPDATE statements require that, in addition to the actual table, the index sub-table is updated.

If potentially useful indexes are not used by the optimizer and they are not helpful in the entire PDM design, drop them immediately. You’re wasting space and resources.

4. Query Rewriting

The performance of a query can often be improved by rewriting the query.

Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.

Here are a few examples:

  • EXISTS instead of IN
  • Modularizing a large SQL statement
  • UNION ALL instead of UNION
  • DISTINCT instead of GROUP BY
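
As an illustration of the first rewrite, here is a hedged sketch with made-up table names; both formulations return the same result here (CustomerId is assumed to be NOT NULL), but the optimizer may choose different plans:

-- Original formulation with IN
SELECT *
FROM Customer c
WHERE c.CustomerId IN (SELECT o.CustomerId FROM Orders o);

-- Rewritten with EXISTS
SELECT *
FROM Customer c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerId = c.CustomerId);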

Since I do not have to understand the content and business logic of the query, I do not need to contact its author. Purely technical optimizations are usually not that risky.

Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding business logic often opens up additional optimization opportunities.

Rewriting of queries often solves performance problems, even when all other techniques have failed.

5. Teradata SQL Tuning with Real-Time Monitoring

Teradata SQL tuning requires watching the query while it is running. Monitoring a query in Viewpoint at runtime helps to identify the critical steps.

In Viewpoint you should look for the following steps:

  • Steps that take a long time to finish and consume a huge amount of resources.
  • Steps that are skewed, especially if the skew factor reaches 99%, as this means that all work is done by only one AMP or a few AMPs.

Analyzing the bad query

We have to think about the root cause of the bad query step. Here are a few ideas:

  • Does the base table have a skewed primary index (static skew)?
  • Does the redistribution of rows cause skew (dynamic skew)?
  • Are massive amounts of data redistributed because of poor join preparation due to missing or stale statistics?
  • Are there many hash collisions during the execution of the INSERT statement?

Stale and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of incorrect join methods (for example, product join instead of merge join).

6. Comparison of Resource Usage

Another important task in Teradata SQL Tuning is measuring resource consumption before and after the optimization of the query.

Query run times are misleading because they can be affected by the simultaneous workload of other queries over which we have no control.

In performance tuning, we prefer to use absolute key figures that are independent of what else is running on our Teradata system.

Below is a query that gives you detailed insight into how well each version of the SQL statement to be optimized performs. To execute this query, you must have read access to the table DBC.DBQLOGTBL (or the related views):

Do not forget to give each version of the query to be optimized its own query band so that you can uniquely identify it in the table DBC.DBQLOGTBL:

SET QUERY_BAND = 'Version=1;' FOR SESSION;
SELECT
   AMPCPUTIME,
   (FIRSTRESPTIME-STARTTIME DAY(2) TO SECOND(6)) RUNTIME,
   SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
   CAST(100-((AMPCPUTIME/(HASHAMP()+1))*100/NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
   MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
   AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
   DBC.DBQLOGTBL
WHERE
     QUERYBAND = 'Version=1;'
;

The above query will return the following measures:

  • The total CPU Usage
  • The Spool Space needed
  • The LHR (ratio between CPU and IO usage)
  • The CPU Skew
  • The Skew Impact on the CPU

Our goal is to decrease total CPU usage, consumed spool space, and CPU skew. The LHR is optimally around 1.00.
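
To compare two versions of a query, a sketch like the following can be used: run the rewritten statement under a second query band and then select both log rows, using the same measures as in the query above:

SET QUERY_BAND = 'Version=2;' FOR SESSION;

-- ... run the rewritten query here, then compare both versions:

SELECT QUERYBAND, AMPCPUTIME, SPOOLUSAGE/1024**3 AS SPOOL_IN_GB, MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT
FROM DBC.DBQLOGTBL
WHERE QUERYBAND IN ('Version=1;', 'Version=2;');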

See also:
Teradata Golden Tuning Tipps 2017 – Take your Skills to the next Level!
Teradata Tuning – Query Rewriting
Tactical Workload Tuning on Teradata
The Primary Index Choice
Tutorial Point Tuning Ideas


Join Estimations without any Statistics

The worst case is the following: two tables are joined, and neither of them has any statistics. To make sure that there are not even summary statistics at the table level, we create both tables from scratch and then insert rows. Both tables are filled from a dummy table that serves as the basis for all our examples:

CREATE SET  TABLE DWHPRO.STATS1 ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      PI INTEGER NOT NULL,
      COL1 INTEGER NOT NULL,
      COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );

We insert 50 rows into the table STATS1.
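
The post does not show the dummy table or the INSERT statements themselves. A sketch that generates 50 rows of a comparable shape could look like this (the concrete values differ from the original test data, so the estimates and the empty join result shown below will not be reproduced exactly):

INSERT INTO DWHPRO.STATS1
SELECT
   ROW_NUMBER() OVER (ORDER BY calendar_date),          -- PI
   ROW_NUMBER() OVER (ORDER BY calendar_date) MOD 10,   -- COL1
   ROW_NUMBER() OVER (ORDER BY calendar_date) MOD 5     -- COL2
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50;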

CREATE SET  TABLE DWHPRO.STATS2 ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      PI INTEGER NOT NULL,
      COL1 INTEGER NOT NULL,
      COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );

We insert 150 rows into the table STATS2.

As the following SQL shows, there are no statistics on either table. The Optimizer must, therefore, estimate the cardinality of the two tables on the basis of a random AMP sample over the primary index:

SHOW STATISTICS VALUES ON STATS1;
[3624] There are no statistics defined for the table  
SHOW STATISTICS VALUES ON STATS2;
[3624] There are no statistics defined for the table  

We will perform a simple INNER JOIN of the two tables on a column that is not indexed:

EXPLAIN SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

 1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
      DWHPRO.STATS2 in TD_MAP1 for read.
   4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
      way of an all-rows scan with no residual conditions into Spool 2
      (all_amps), which is duplicated on all AMPs in TD_Map1.  The size
      of Spool 2 is estimated with low confidence to be 96 rows (2,400
      bytes).  The estimated time for this step is 0.00 seconds.
   5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
      way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
      of an all-rows scan with no residual conditions.  Spool 2 and
      DWHPRO.STATS2 are joined using a dynamic hash join, with a join
      condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
      Spool 1 (group_amps), which is built locally on the AMPs.  The
      size of Spool 1 is estimated with no confidence to be 600 rows (
      27,000 bytes).  The estimated time for this step is 0.01 seconds.
   6) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

From the execution plan, we can see that the estimated number of rows in the result set is 600. Furthermore, we can see that the estimated size of the spool holding table STATS1 is 96 rows. We have to halve this number because the table is duplicated to both AMPs of the 2-AMP test system. That is, the optimizer estimates the number of rows in table STATS1 at 48, which is quite close to the real number of 50 rows.

To get the number of estimated rows in table Stats2 we can use the following trick:

EXPLAIN SELECT * FROM STATS2;
1) First, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read.
   3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS2 by
      way of an all-rows scan with no residual conditions into Spool 1
      (group_amps), which is built locally on the AMPs.  The size of
      Spool 1 is estimated with low confidence to be 156 rows (5,148
      bytes).  The estimated time for this step is 0.01 seconds.
   4) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

We can see from the execution plan that the number of estimated rows is 156.

The question arises: how is the estimate of 600 rows for the result set derived? The answer is that the following heuristic is applied:

SQRT(number of rows of the bigger table) * (number of rows of the smaller table), rounded up to the nearest whole number. (Many thanks to Diether Noeth, who gave me this information in the Teradata forum.)

Let’s do the math for our example:

SQRT(156) = 12.489996, and 12.489996 * 48 = 599.519808, which is rounded up to 600.
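
You can verify the arithmetic directly in SQL, since SQRT is a built-in Teradata function:

SELECT SQRT(156) * 48;   -- returns approximately 599.52, which is rounded up to 600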

If at least current statistics on the cardinality of the tables are available, these are used in the heuristic:

COLLECT STATS COLUMN(PI) ON STATS1;
COLLECT STATS COLUMN(PI) ON STATS2;

EXPLAIN SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;
 1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
      DWHPRO.STATS2 in TD_MAP1 for read.
   4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
      way of an all-rows scan with no residual conditions into Spool 2
      (all_amps), which is duplicated on all AMPs in TD_Map1.  The size
      of Spool 2 is estimated with high confidence to be 100 rows (
      2,500 bytes).  The estimated time for this step is 0.00 seconds.
   5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
      way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
      of an all-rows scan with no residual conditions.  Spool 2 and
      DWHPRO.STATS2 are joined using a dynamic hash join, with a join
      condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
      Spool 1 (group_amps), which is built locally on the AMPs.  The
      size of Spool 1 is estimated with no confidence to be 613 rows (
      27,585 bytes).  The estimated time for this step is 0.01 seconds.
   6) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

The estimate for the result set has increased to 613 rows. But has it improved? Let’s just run the query:

SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1
-> SELECT completed. 0 rows returned.

As we can see, the query does not return any rows at all. So the estimation has actually become slightly worse with the table-level statistics in place.

It can be a bit confusing to look at both execution plans again: the retrieve step in the first plan (without statistics) has only low confidence, the retrieve step in the second plan (with statistics) has high confidence, and still the low-confidence estimate is the better one.

This was the first part about Join Estimations. In the next articles, we will take statistics step by step and see how this affects the optimizer’s estimates.


Teradata Statistics – Heuristic Estimations

Heuristics are applied whenever there are no collected statistics available for a nonindexed column. Heuristics are rules of thumb used to estimate the number of selected data rows. They are applied to WHERE condition predicates and in join planning.

Heuristics estimate the number of selected data rows as a certain percentage of the table cardinality. Maybe you wonder how the optimizer knows the table cardinality?

The optimizer can use collected summary statistics or collected primary index statistics to estimate table cardinality. In the absence of collected statistics, the optimizer still can retrieve a random AMP sample to estimate table cardinality.

All heuristic estimations done by the Teradata Optimizer are based on the estimated table cardinality.
In this post, we will analyze several queries and their heuristics-based estimates. Furthermore, we will learn how to replace heuristic estimations with more accurate ones; we want to avoid heuristic estimates, as they provide only inaccurate statistical information to the Teradata Optimizer.

Statistics on nonindexed columns play a significant role in WHERE condition predicates. We will start with very simple examples to teach the base concepts of heuristics and will introduce more complex setups afterward.

Before we dive deeper into the details of heuristics, one important note:

Don't forget: the following information relates to nonindexed columns. Whenever we refer to the "number of rows selected" or a "percentage of rows selected," the underlying table row count is based on collected summary or index statistics, or on a random-AMP sample. So whenever we refer to table row counts, we mean the estimations, not the real numbers.

Here is the test table we will use in all our examples. The table row count is 100,000, which is the base for all heuristic estimations the optimizer makes.

We will start our tour with a very simple example, having only one WHERE condition predicate on a nonindexed column ("segment" is neither the primary index nor any other index of the customer table).

Here is our test query:
SELECT * FROM Customer WHERE SEGMENT = 1;

We collected statistics on the primary index column of the table "Customer". The row count estimation derived from these statistics is 100,000.

We did not collect statistics on the nonindexed column "segment".

The real number of rows in segment 1 is 20,000.

As column “segment” is a nonindexed column, the optimizer applies heuristic estimations.

The number of selected rows is estimated to be 10% of the table rows.

The hard-coded rule is 10%, without any exception:

100,000 * 10% = 10,000 rows

The optimizer underestimated the real number of data rows by 50%! If we had 0 rows in segment 1, the estimation would still be 10,000 rows, and in that case the optimizer would drastically overestimate the number of selected rows.

We consider the above "10% rule" to be the base estimation the optimizer uses whenever there is only one WHERE condition predicate on a single value.
Our first example shows why heuristic estimates should be avoided at all costs: they are often far away from reality.

In the next example, we will add another WHERE condition predicate for a second nonindexed column to the query.

We did not collect statistics on any of the columns used in the predicates. This is our test query:

SELECT * FROM Customer WHERE SEGMENT = 1 AND Age = 25;

The heuristic rule applied to AND-combined predicates on nonindexed columns is:

For each additional AND-combined predicate, the estimate is multiplied by 75%.

Let’s calculate the estimation for our example query:

10% of 100,000 data rows = 10,000 rows. This is the base estimation. We multiply the base estimation by 75%, i.e., 10,000 * 75%. The result is 7,500 rows.

We could extend our example and add one more WHERE condition predicate; this would lead us to the following total estimate:

((100,000 * 10%) * 75%) * 75% = 5,625 data rows.

Finally, by adding one more WHERE condition predicate, we would be down to 4,219 rows.

Each additional AND-combined WHERE condition predicate therefore decreases the estimated number of rows by a further 25%: for our 100,000-row table the estimates are 10,000, 7,500, 5,625, 4,219, and so on.

In the next example, we will learn how heuristic estimates are applied to OR-combined WHERE condition predicates on different columns:

SELECT * FROM Customer WHERE SEGMENT = 1 OR Age = 25;


The rule for OR combined predicates for different columns is as follows:

The estimate for each predicate is the base estimation we defined before (i.e., 10% of the table rows).

We know that the estimated number of rows in the customer table is 100,000.

The estimation for the result set of the query is (100,000 * 10%) + (100,000 * 10%), or 20,000 rows (two times the base estimation).

Like in the previous example, estimations will be either too high or too low, and most probably never will match the real number of rows being selected.

Of course, there is an upper limit for the estimation. If we combine 11 conditions with “OR”, the estimate will not be 110%, but 100%.

These were the simplest heuristic rules. Knowing them already gives you very good insight, and they will help you understand execution plan estimations.
The heuristic estimations for OR-combined predicates on the same column are more complex.

Nevertheless, we decoded the applied algorithm and carved out the basic rules and building blocks.

First of all, the optimizer distinguishes between ranges of values and single values.

We will start with the estimations for single value predicates.

SELECT * FROM Customer WHERE Age in (20,22,24);
SELECT * FROM Customer WHERE Age = 20 OR AGE = 22 OR Age = 24;


The optimizer will assume that 10% of the table rows will be selected, but only for the first two predicates.

As soon as there is a second predicate, an additional 1% of the table cardinality is added for every predicate (including the first one). From the third predicate on, no further base estimation (10% of the table cardinality) is added.
For our example this means:

(100,000 * 10%) for the first predicate
(100,000 * 10%) for the second predicate

(3 * (100,000 * 1%)) = 3,000 for the three predicates.

The total estimation, therefore, is 23,000 rows.

Estimations for Ranges of Values

Here are examples for range predicates. There is no gap between selected values.

SELECT * FROM Customer WHERE Age in (20,21,22);
SELECT * FROM Customer WHERE Age BETWEEN 20 AND 22;
SELECT * FROM Customer WHERE Age = 20 OR AGE = 21 OR Age = 22;


If our WHERE condition contains only one range predicate, the estimation will be 20% of the table rows; in our example, this is 20,000 rows.

It doesn't matter how many values the range contains. The estimation will equally be 20% for this query:

SELECT * FROM Customer WHERE Age BETWEEN 1 AND 1000;


If we add a second range to our query, the estimation rises to 2 times 20%, or 40%.


If we add a third range, a more detailed estimation is applied.
The estimate for the first and second range predicates is 10% each (remember, in the previous example with two ranges, it was 20% per range predicate).

No further 10% estimations are added after the second range predicate. Instead, similarly to the estimation for single values, the number of distinct values in the ranges is counted, and for each value the estimation is increased by 1%.

SELECT * FROM Customer WHERE Age IN (10,11,12) OR AGE IN (20,21,22) OR AGE IN (30,31,32);

For our example this means:

(100,000 * 10%) for the first range predicate
(100,000 * 10%) for the second range predicate

(9 * (100,000 * 1%)) = 9,000 for the nine values.

The total estimation, therefore, is 29,000 rows.

The estimation for one range predicate is always 20%; the number of distinct range values does not change the estimation.

The estimation for two range predicates increases to 40%, and again the number of distinct range values does not change the estimation.

If our query has 3 or more range predicates, the estimation for the first 2 ranges is 20% (10% plus 10%), and each distinct value across all ranges increases the estimation by 1%.
We now come to the question: what is the impact on heuristic estimations if we collect statistics on one or more of the WHERE condition predicate columns?

In this case, the optimizer considers all WHERE conditions with collected statistics and chooses the estimation of the one with the highest selectivity as the starting point for further estimations.

In other words: The estimations will not start with the usual heuristic estimation (10%, 20%, etc.), but with a much better estimation, based on collected statistics. Here is an example:


In the above example, we first collect statistics on the column Age:

COLLECT STATISTICS COLUMN(Age) ON Customer;

We assume that the estimation for customers of age 25 is 5,000 data rows.

We execute the select statement:

SELECT * FROM Customer WHERE Segment = 1 AND Age = 25;

Both columns are nonindexed.

As we collected statistics on exactly one column, these will be the starting point for the estimation, namely 5,000 data rows (the selectivity of Age = 25).

As in our previous example of AND-combined conditions, the optimizer applies heuristics for the second column:

5,000 * 75% = 3,750 data rows.

Whenever at least one column with collected statistics is available, this can drastically increase the quality of the estimation.
I have another example for you: this time we combine three columns with "AND", two of them with collected statistics and one without.

Of course, all columns, like always, are nonindexed columns. But I assume this is clear by now.

We collect statistics on two columns:
COLLECT STATISTICS COLUMN(CustomerAge) ON Customer;
COLLECT STATISTICS COLUMN(Gender) ON Customer;

The estimation from collected statistics for CustomerAge = 25 is 5,000 rows; the estimation for Gender = 'U' (the value "unknown") is just 100 rows.

As in the previous example, the optimizer picks the column with the highest selectivity for its initial estimation: in our example, this is 100 rows.

Although we have a second column with collected statistics, it is of no use to the optimizer: the estimations for the second and third columns are again calculated with the 75% rule described above:

(100 * 75%) * 75% = 56.25 rows (which will be rounded to 57 rows).

SELECT * FROM Customer WHERE CustomerId = 1 AND CustomerAge = 25 AND Gender = 'U';

Here is the last example, demonstrating a combination of OR-combined columns with and without statistics (sorry, no graphic available):

We collect statistics on CustomerAge.

COLLECT STATISTICS COLUMN(CustomerAge) ON Customer;

SELECT * FROM Customer WHERE CustomerId = 1 OR CustomerAge = 25;

The optimizer simply sums up the estimates taken from collected statistics and the heuristic estimates for columns without collected statistics:

Again, the estimation for CustomerAge = 25 is 5,000 rows. The heuristic for CustomerId = 1 gives 10,000 rows (100,000 * 10%).

The total estimation is 15,000 rows.

We could go on forever with examples of heuristic estimations, but we will stop here.

Questions?
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.

See also:
Teradata Collect Statistics Using Sample
Collect Statistics in Teradata – Part 2
Teradata Statistics Basics

 

Collect Statistics in Teradata


As experienced Teradata tuning specialists, we are convinced that the two most important factors to obtain good system performance are the choice of a good primary index and the timely and appropriate collection of statistics.

Teradata uses a cost based optimizer.  If we collect stats in Teradata, we provide the Optimizer with the data demographics it needs to generate good execution plans.

All cost based optimizers require statistical information about the data and hardware, such as CPU, Disk, Memory, and Processors.

The optimizer will create several plans to choose from and use the one with the lowest usage of resources. It will not evaluate each possible plan (which could be billions) but uses algorithms to identify the best candidates and chooses the best out of them.

The costs for each plan are derived from the row counts of the physical tables and of the spools generated by operations such as selections, joins, and projections.

Accurate estimations are crucial for getting optimal plans, and providing this statistical information is therefore critical for query performance.

Teradata distinguishes between statistics about tables and columns. Table statistics are for example the number of rows, the average row size and the number of rows per data block.

The column statistics go into more detail and store information about each column value (the column histograms) or about the whole column (summary column statistics).

Summary column statistics are, for example, the number of distinct values and the lowest and the highest value.

Detailed column statistics are, for example, the number of rows per value and information about the most common column values, which is used by the optimizer for skew detection.
The steps of interest for a Teradata Tuning Specialist are selections, joins, and aggregations.

We can think of retrieve steps as the equivalent of the WHERE condition predicates of our query. "Aggregation steps" are generated by GROUP BY statements.

Statistics influence the order in which Teradata executes joins, and they determine which join strategy is used.

When we talk about the join plan, we have to consider join types and data relocation steps:

Each time we join rows from two tables together, they have to be located on the same AMP. In the best case, both tables have the same primary index, and no data relocation is needed. In the worst case, both tables have to be rehashed by the join columns.

The Optimizer uses statistics to answer the following questions:

Selections

Is it cheaper to use an index or a full table scan? The Optimizer will always use the primary index or a Unique Secondary Index if one is available; no statistics are considered here, as this is in any case the cheapest way to access the table rows.

But statistics help the optimizer to decide whether or not to do a full table scan or use a nonunique secondary index, a join index, or a hash index.

Join cost estimation for partitioned tables relies on statistics for the partitioning columns and the so-called "dummy" PARTITION column. We will explain this in detail later when we talk about best practices for statistics collection.
Join Steps
The optimizer has to find out which join type should be utilized (such as merge join, product join, hash join) and what’s the cheapest way to bring the rows from two tables to the same AMP.

Furthermore, it has to find the optimal join order. These three variables, which have to be optimized at the same time, leave plenty of room to create a bad execution plan.

Only current statistics ensure that the optimizer can create the best execution plan possible.

As a join preparation step, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Some join types require the tables to be sorted by row hash. Each of these preparation steps causes costs, which have to be weighted against each other.

Statistics allow the optimizer to take a higher risk when creating the execution plan and give it greater confidence in the estimation it uses.

Here is one example:

The hash join performs great if the smaller table of a join can be kept entirely in memory, but it performs badly if the memory is not sufficient. Without statistics, the optimizer may avoid the hash join. Information about table cardinalities and the available main memory is used to optimize the execution plan.

To summarize: Without statistics, query execution times would be unpredictable. They are the heart of the system!

See also: Teradata Statistics Basics

Questions?
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.


Teradata Statistics Basics

What are Teradata Statistics?

Statistics provide the Optimizer with the data demographics it needs to generate good execution plans. It will create several plans to choose from and use the one with the lowest cost of resources.

All cost based optimizers use statistical information about data and hardware.

Teradata uses statistical information about tables (summary statistics) and columns. Table metrics are for example the number of rows, the average row size,  and the number of rows per data block.

Statistics about columns are more detailed than summary statistics: the number of rows per value, the number of distinct values, the lowest and highest value, and information about the most common column values, which the optimizer uses for join planning.

Statistics are used to improve the execution plans.

An execution plan consists of join, retrieve, and aggregation steps. Retrieve steps are caused by WHERE conditions, aggregation steps by GROUP BY statements.

Statistics impact in which order the Optimizer executes the plan steps and which technique it uses.

The Optimizer uses statistics to answer the following questions:

Retrieve Steps:

Is it cheaper to use an index or a full table scan? The Optimizer will quickly choose any primary index, and it will also readily pick a Unique Secondary Index, but statistics help it decide whether to do a full table scan or use a nonunique secondary index, and whether nonunique secondary index bit mapping should be used.

Join Steps:

Which join type should be used (such as merge join, product join, hash join, etc.)? When two rows are joined, they must be located on the same AMP (which is automatically the case if the join columns are the primary index of both tables). The Optimizer has to find out the cheapest way to bring the rows together: for example, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Last but not least, the Optimizer has to decide on the ideal join order.

Above, you saw a few examples of how statistics influence the execution plan.

Statistics allow the Optimizer to create more risky execution plans by increasing the confidence in its estimations.

As an example, let's consider the hash join, which performs great if the smaller table can be kept entirely in memory but performs badly if the memory is not sufficient.

Without collected statistics, the hash join might be avoided to mitigate the risk of bad performance.

Collected statistics are only a starting point for estimations. They are adjusted throughout the optimization process as more information from database constraints, such as CHECK constraints and referential integrity constraints, and from query predicates becomes available. The adjusted estimations are called "derived statistics".

The Optimizer further refines the derived statistics each time it gets new details from:

– A join step
– Single table predicates (a WHERE condition)
– An Aggregation step

Here is one example:

Assume we have a WHERE condition like this one: "WHERE column IN (1,2,3)". The Optimizer derives that the maximum number of distinct values in the result set will be 3 and uses this additional piece of information from now on to improve the estimations of the subsequent execution steps.

Without collected statistics, the Optimizer relies solely on random-AMP sampling when it builds the execution plan. This approach gives good estimations if the data is evenly distributed (unique or nearly unique); for skewed data, random-AMP sampling results in wrong estimates.


Don’t get fooled by the Statistic Estimations

Biased value histogram entries store exact cardinalities. Up-to-date statistics should therefore exactly reflect the number of rows returned by a retrieve step on a biased value.

In other words: the Optimizer can determine the number of rows being retrieved simply by looking up the biased value in the histogram.

Unfortunately,  this is not always the case.

I created a test scenario, populating a test table with 50,000 rows containing 100 distinct dates:

CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT 'YYYY-MM-DD' NOT NULL
) PRIMARY INDEX (TheDate);

INSERT INTO Statistics_Estimation1
SELECT DATE '2010-01-01' + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;
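
A quick check of the resulting data demography (a sketch) confirms the 100 distinct dates before we collect statistics:

SELECT COUNT(*) AS TotalRows, COUNT(DISTINCT TheDate) AS DistinctDates
FROM Statistics_Estimation1;   -- expected: 50,000 rows and 100 distinct dates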

To have fresh statistics, I executed the statement below. The system can pack all 50,000 rows into 100 histogram entries, and all of them will be stored as biased histogram values.

This is important for our test setup, as it ensures that the estimated number of rows should always match the retrieved number of rows exactly:

COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);

We can now check the histograms by issuing the statement below:

SHOW STATISTICS VALUES ON Statistics_Estimation1;

/** Biased: Value, Frequency **/
/* 1 */ DATE ‘2010-01-01’, 515,
/* 2 */ DATE ‘2010-01-02’, 498,
/* 3 */ DATE ‘2010-01-03’, 510,
/* 4 */ DATE ‘2010-01-04’, 482,
/* 5 */ DATE ‘2010-01-05’, 490,
/* 6 */ DATE ‘2010-01-06’, 518,
/* 7 */ DATE ‘2010-01-07’, 510,
/* 8 */ DATE ‘2010-01-08’, 512,
/* 9 */ DATE ‘2010-01-09’, 493,
/* 10 */ DATE ‘2010-01-10’, 513,
/* 11 */ DATE ‘2010-01-11’, 498,
/* 12 */ DATE ‘2010-01-12’, 513,
/* 13 */ DATE ‘2010-01-13’, 509,
/* 14 */ DATE ‘2010-01-14’, 500,
/* 15 */ DATE ‘2010-01-15’, 491,
/* 16 */ DATE ‘2010-01-16’, 489,
/* 17 */ DATE ‘2010-01-17’, 481,
/* 18 */ DATE ‘2010-01-18’, 508,
/* 19 */ DATE ‘2010-01-19’, 496,
/* 20 */ DATE ‘2010-01-20’, 505,
/* 21 */ DATE ‘2010-01-21’, 498,
/* 22 */ DATE ‘2010-01-22’, 496,
/* 23 */ DATE ‘2010-01-23’, 505,
/* 24 */ DATE ‘2010-01-24’, 504,
/* 25 */ DATE ‘2010-01-25’, 485,
/* 26 */ DATE ‘2010-01-26’, 508,
/* 27 */ DATE ‘2010-01-27’, 494,
/* 28 */ DATE ‘2010-01-28’, 507,
/* 29 */ DATE ‘2010-01-29’, 495,
/* 30 */ DATE ‘2010-01-30’, 501,
/* 31 */ DATE ‘2010-01-31’, 486,
/* 32 */ DATE ‘2010-02-01’, 517,
/* 33 */ DATE ‘2010-02-02’, 501,
/* 34 */ DATE ‘2010-02-03’, 505,
/* 35 */ DATE ‘2010-02-04’, 496,
/* 36 */ DATE ‘2010-02-05’, 514,
/* 37 */ DATE ‘2010-02-06’, 492,
/* 38 */ DATE ‘2010-02-07’, 496,
/* 39 */ DATE ‘2010-02-08’, 504,
/* 40 */ DATE ‘2010-02-09’, 486,
/* 41 */ DATE ‘2010-02-10’, 509,
/* 42 */ DATE ‘2010-02-11’, 490,
/* 43 */ DATE ‘2010-02-12’, 499,
/* 44 */ DATE ‘2010-02-13’, 487,
/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,
/* 60 */ DATE ‘2010-03-01’, 518,
/* 61 */ DATE ‘2010-03-02’, 491,
/* 62 */ DATE ‘2010-03-03’, 484,
/* 63 */ DATE ‘2010-03-04’, 510,
/* 64 */ DATE ‘2010-03-05’, 511,
/* 65 */ DATE ‘2010-03-06’, 494,
/* 66 */ DATE ‘2010-03-07’, 493,
/* 67 */ DATE ‘2010-03-08’, 500,
/* 68 */ DATE ‘2010-03-09’, 531,
/* 69 */ DATE ‘2010-03-10’, 497,
/* 70 */ DATE ‘2010-03-11’, 512,
/* 71 */ DATE ‘2010-03-12’, 481,
/* 72 */ DATE ‘2010-03-13’, 500,
/* 73 */ DATE ‘2010-03-14’, 510,
/* 74 */ DATE ‘2010-03-15’, 487,
/* 75 */ DATE ‘2010-03-16’, 511,
/* 76 */ DATE ‘2010-03-17’, 515,
/* 77 */ DATE ‘2010-03-18’, 513,
/* 78 */ DATE ‘2010-03-19’, 517,
/* 79 */ DATE ‘2010-03-20’, 516,
/* 80 */ DATE ‘2010-03-21’, 516,
/* 81 */ DATE ‘2010-03-22’, 504,
/* 82 */ DATE ‘2010-03-23’, 493,
/* 83 */ DATE ‘2010-03-24’, 510,
/* 84 */ DATE ‘2010-03-25’, 486,
/* 85 */ DATE ‘2010-03-26’, 495,
/* 86 */ DATE ‘2010-03-27’, 497,
/* 87 */ DATE ‘2010-03-28’, 481,
/* 88 */ DATE ‘2010-03-29’, 501,
/* 89 */ DATE ‘2010-03-30’, 486,
/* 90 */ DATE ‘2010-03-31’, 512,
/* 91 */ DATE ‘2010-04-01’, 487,
/* 92 */ DATE ‘2010-04-02’, 501,
/* 93 */ DATE ‘2010-04-03’, 490,
/* 94 */ DATE ‘2010-04-04’, 508,
/* 95 */ DATE ‘2010-04-05’, 518,
/* 96 */ DATE ‘2010-04-06’, 478,
/* 97 */ DATE ‘2010-04-07’, 494,
/* 98 */ DATE ‘2010-04-08’, 503,
/* 99 */ DATE ‘2010-04-09’, 494,
/* 100 */ DATE ‘2010-04-10′, 512,

Now we run EXPLAIN on the query below:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';

1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-14′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-15′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-16′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.

The Optimizer retrieves the requested primary index values via single-AMP primary index accesses, as the number of selected rows is small compared with the total number of table rows.

Did you notice that the estimation is 1,530 rows? Isn't that strange? If we sum up the three histogram values for our three dates (2010-02-14, 2010-02-15, and 2010-02-16), we can see that it should be 1,528 rows!

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,

513+517+498 = 1528

Furthermore, the Optimizer estimates 510 rows for each single-AMP access. Did we detect a bug in the Optimizer?

I would say no.

What happens is that an internal optimization takes place, which may produce a slightly wrong estimation but makes the estimation process faster. The Optimizer simply adds up the values of the matching histogram entries and divides the sum by their number. In our example:

1528 / 3 = 509.33

The ceiling of this number is used to distribute the estimation evenly across the single-AMP retrieve steps, i.e., 510, 510, and 510.
As a result, we end up with an estimation that is slightly wrong.

We can easily prove this by increasing the number of retrieved rows, forcing the Optimizer to do a full table scan instead of single-AMP lookups:

SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

By extending the date range to 2010-02-28, the plan changes:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
“(DWHPRO.Statistics_Estimation1.TheDate >= DATE ‘2010-02-14’) AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE ‘2010-02-28’)”)
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.

As you can check by summing up the histogram values below, the estimation of 7,438 rows now matches the number of rows being retrieved almost exactly.

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,

Whenever you read an execution plan, try to figure out how the estimations are built. Over time, you will become better and better at understanding estimations and will be able to solve statistics problems.

See also:
Teradata Statistics – A Case Study
Teradata Statistics Extrapolation for Equality Predicates
