# Cracking The Teradata Join Estimation Code

## What is Teradata Join Estimation?

This article shows how **Teradata Join Estimation** works when no statistics have been collected. We show which heuristics are used to estimate the number of rows, and why it is important to always collect statistics on all join columns.

## Teradata Join Estimation Heuristics

The worst case is the following: two tables are joined, neither of which has collected statistics. To make sure that there are not even summary statistics at the table level, we create both tables from scratch and insert rows. Both tables are filled from a dummy table that serves as the basis for all our examples:

```sql
CREATE SET TABLE DWHPRO.STATS1, FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO,
    MAP = TD_MAP1
(
    PI   INTEGER NOT NULL,
    COL1 INTEGER NOT NULL,
    COL2 INTEGER NOT NULL
)
PRIMARY INDEX (PI);
```

We insert 50 rows into the table STATS1.

```sql
CREATE SET TABLE DWHPRO.STATS2, FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO,
    MAP = TD_MAP1
(
    PI   INTEGER NOT NULL,
    COL1 INTEGER NOT NULL,
    COL2 INTEGER NOT NULL
)
PRIMARY INDEX (PI);
```

We insert 150 rows into the table STATS2.

As the following SQL shows, there are no statistics on either table. The Optimizer must, therefore, estimate the cardinality of the two tables on the basis of a random AMP sample over the primary index:

```sql
SHOW STATISTICS VALUES ON STATS1;

[3624] There are no statistics defined for the table
```

```sql
SHOW STATISTICS VALUES ON STATS2;

[3624] There are no statistics defined for the table
```

We will perform a simple INNER JOIN of the two tables on a column that is not indexed:

```sql
EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved RowHash
   to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash
   to prevent global deadlock.
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock DWHPRO.STATS2
   in TD_MAP1 for read.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by way
   of an all-rows scan with no residual conditions into Spool 2
   (all_amps), which is duplicated on all AMPs in TD_Map1. The size of
   Spool 2 is estimated with low confidence to be 96 rows (2,400 bytes).
   The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way
   of an all-rows scan, which is joined to DWHPRO.STATS2 by way of an
   all-rows scan with no residual conditions. Spool 2 and DWHPRO.STATS2
   are joined using a dynamic hash join, with a join condition of
   ("COL1 = DWHPRO.STATS2.COL1"). The result goes into Spool 1
   (group_amps), which is built locally on the AMPs. The size of Spool 1
   is estimated with no confidence to be 600 rows (27,000 bytes). The
   estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in
   processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 0.01 seconds.
```

From the execution plan, we can see that the estimated number of rows in the result set is 600. Furthermore, we can see that the estimated size of the spool built from table STATS1 is 96 rows. However, we have to halve this number because the spool is duplicated to all AMPs of our 2-AMP test system. In other words, the optimizer estimates the number of rows in table STATS1 at 48, which is quite close to the real number of 50 rows.
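The arithmetic for backing out the per-table estimate from the duplicated spool is simple; here is a minimal sketch in Python (illustrative only, assuming the 2-AMP test system used in this article):

```python
# Back out the optimizer's table estimate from a duplicated spool.
# Assumption: the test system has 2 AMPs, as in this article's example.
AMPS = 2

spool2_rows = 96                      # Spool 2 estimate (duplicated on all AMPs)
table_estimate = spool2_rows / AMPS   # per-table estimate for STATS1

print(table_estimate)                 # 48.0 -- close to the real 50 rows
```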

To get the number of estimated rows in table Stats2 we can use the following trick:

```sql
EXPLAIN SELECT * FROM STATS2;

1) First, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash
   to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read.
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS2 by way
   of an all-rows scan with no residual conditions into Spool 1
   (group_amps), which is built locally on the AMPs. The size of Spool 1
   is estimated with low confidence to be 156 rows (5,148 bytes). The
   estimated time for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in
   processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 0.01 seconds.
```

We can see from the execution plan that the number of estimated rows is 156.

The question now arises: how is the estimate of 600 rows for the result set derived? The answer is that the following heuristic is applied:

**SQRT(number of rows of the bigger table) * (number of rows of the smaller table), rounded up to the nearest whole number.** (Many thanks to Diether Noeth, who gave me this information in the Teradata forum.)

Let's do the math for our example:

SQRT(156) ≈ 12.489996; 12.489996 * 48 ≈ 599.519808, which is rounded up to 600.
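The heuristic is easy to express as code. Here is a minimal Python sketch (the function name is mine, not a Teradata API), applied to the plan's table estimates of 156 and 48 rows:

```python
import math

def join_estimate(rows_bigger: int, rows_smaller: int) -> int:
    """Heuristic join estimate when no join-column statistics exist:
    SQRT(rows of bigger table) * rows of smaller table, rounded up."""
    return math.ceil(math.sqrt(rows_bigger) * rows_smaller)

# Table estimates taken from the execution plan: STATS2 ~ 156, STATS1 ~ 48
print(join_estimate(156, 48))   # 600
```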

If at least current statistics on the cardinality of the tables exist, these are used for the heuristic:

```sql
COLLECT STATS COLUMN(PI) ON STATS1;
COLLECT STATS COLUMN(PI) ON STATS2;

EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved RowHash
   to prevent global deadlock.
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved RowHash
   to prevent global deadlock.
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock DWHPRO.STATS2
   in TD_MAP1 for read.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by way
   of an all-rows scan with no residual conditions into Spool 2
   (all_amps), which is duplicated on all AMPs in TD_Map1. The size of
   Spool 2 is estimated with high confidence to be 100 rows (2,500
   bytes). The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way
   of an all-rows scan, which is joined to DWHPRO.STATS2 by way of an
   all-rows scan with no residual conditions. Spool 2 and DWHPRO.STATS2
   are joined using a dynamic hash join, with a join condition of
   ("COL1 = DWHPRO.STATS2.COL1"). The result goes into Spool 1
   (group_amps), which is built locally on the AMPs. The size of Spool 1
   is estimated with no confidence to be 613 rows (27,585 bytes). The
   estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in
   processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 0.01 seconds.
```
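The same heuristic, now fed with the exact cardinalities that the summary statistics provide (150 and 50 rows), reproduces the new estimate. A quick check in Python (illustrative only, not Teradata code):

```python
import math

# With summary statistics collected, the optimizer knows the
# exact cardinalities instead of random-AMP-sample estimates:
stats2_rows = 150   # bigger table
stats1_rows = 50    # smaller table

estimate = math.ceil(math.sqrt(stats2_rows) * stats1_rows)
print(estimate)     # 613 -- matches the new Spool 1 estimate
```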

The estimate for the result set has increased to 613 rows. But has it improved? Let's just run the query:

```sql
SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

-> SELECT completed. 0 rows returned.
```

As we can see, the query does not return any rows at all. So the estimation has actually become slightly worse due to the statistics at the table level.

Looking at both execution plans again can be a bit confusing: the retrieve step in the first plan (without statistics) has only low confidence, while the retrieve step in the second plan (with statistics) has high confidence, and still the low-confidence estimate is closer to reality.

This was the first part about join estimation. In the next articles, we will add statistics step by step and see how this affects the optimizer's estimates.