Cracking The Teradata Join Estimation Code

Roland Wenzlofsky

September 27, 2019

This article shows how Teradata join estimation works when no statistics have been collected. We show which heuristics the Optimizer uses to estimate the number of rows and why collecting statistics on all join columns is essential.

The worst case is the following: two tables without any collected statistics are joined. To ensure that there are no summary statistics at the table level, we create both tables from scratch and then insert rows. Both tables are filled from a dummy table that serves as the basis for all our examples:

```
CREATE SET TABLE DWHPRO.STATS1 ,FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO,
MAP = TD_MAP1
(
PI INTEGER NOT NULL,
COL1 INTEGER NOT NULL,
COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );
```

We insert 50 rows into the table STATS1.

```
CREATE SET TABLE DWHPRO.STATS2 ,FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO,
MAP = TD_MAP1
(
PI INTEGER NOT NULL,
COL1 INTEGER NOT NULL,
COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );
```

We insert 150 rows into the table STATS2.

As the following SQL shows, there are no statistics on either table. The Optimizer must, therefore, estimate the cardinality of the two tables based on a random AMP sample over the primary index:

```
SHOW STATISTICS VALUES ON STATS1;
[3624] There are no statistics defined for the table
```

```
SHOW STATISTICS VALUES ON STATS2;
[3624] There are no statistics defined for the table
```

We will perform a simple INNER JOIN of the two tables on a column that is not indexed:

```
EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
way of an all-rows scan with no residual conditions into Spool 2
(all_amps), which is duplicated on all AMPs in TD_Map1.  The size
of Spool 2 is estimated with low confidence to be 96 rows (2,400
bytes).  The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
of an all-rows scan with no residual conditions.  Spool 2 and
DWHPRO.STATS2 are joined using a dynamic hash join, with a join
condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
Spool 1 (group_amps), which is built locally on the AMPs.  The
size of Spool 1 is estimated with no confidence to be 600 rows (
27,000 bytes).  The estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 0.01 seconds.
```

The execution plan shows that the estimated number of rows in the result set is 600. It also shows that Spool 2, which holds the duplicated copy of table STATS1, is estimated at 96 rows. Since the table is duplicated to all AMPs of this 2-AMP test system, we must halve this number: the Optimizer estimates the cardinality of STATS1 at 48 rows, which is quite close to the actual 50 rows.
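This back-calculation can be sketched in a few lines of Python (the function name is ours, for illustration; the AMP count of 2 is specific to this demo system):

```python
# Back-calculate a table's estimated cardinality from a duplicated spool.
# When a spool is "duplicated on all AMPs", the spool row estimate is the
# table's row estimate multiplied by the number of AMPs.
def table_estimate_from_duplicated_spool(spool_rows: int, amps: int) -> float:
    return spool_rows / amps

# 96 spool rows on a 2-AMP system -> 48 estimated rows for STATS1
print(table_estimate_from_duplicated_spool(96, 2))  # 48.0
```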

To get the number of estimated rows in table Stats2, we can use the following trick:

```
EXPLAIN SELECT * FROM STATS2;
1) First, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read.
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS2 by
way of an all-rows scan with no residual conditions into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 156 rows (5,148
bytes).  The estimated time for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 0.01 seconds.
```

We can see from the execution plan that the estimated number of rows for STATS2 is 156.

The question now arises: how is the estimate of 600 rows for the result set derived? The answer is that the following heuristic is applied:

SQRT(number of rows of the bigger table) * (number of rows of the smaller table), with the result rounded up to the nearest whole number. (Many thanks to Dieter Noeth, who gave me this information in the Teradata forum.)

Let’s do the math for our example:

SQRT(156) ≈ 12.489996, and 12.489996 * 48 ≈ 599.52, which is rounded up to 600.
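The heuristic can be verified with a few lines of Python (the function name is ours, for illustration):

```python
import math

# Heuristic applied when no statistics exist on the join columns:
# SQRT(rows of the bigger table) * rows of the smaller table, rounded up.
def join_estimate(big_table_rows: int, small_table_rows: int) -> int:
    return math.ceil(math.sqrt(big_table_rows) * small_table_rows)

# Estimated cardinalities from the plan: STATS2 = 156, STATS1 = 48
print(join_estimate(156, 48))  # 600
```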

If current statistics exist at least on the tables' cardinality (here: collected on the primary index), the heuristic uses these instead:

```
COLLECT STATS COLUMN(PI) ON STATS1;
COLLECT STATS COLUMN(PI) ON STATS2;

EXPLAIN SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;
1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
way of an all-rows scan with no residual conditions into Spool 2
(all_amps), which is duplicated on all AMPs in TD_Map1.  The size
of Spool 2 is estimated with high confidence to be 100 rows (
2,500 bytes).  The estimated time for this step is 0.00 seconds.
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
of an all-rows scan with no residual conditions.  Spool 2 and
DWHPRO.STATS2 are joined using a dynamic hash join, with a join
condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
Spool 1 (group_amps), which is built locally on the AMPs.  The
size of Spool 1 is estimated with no confidence to be 613 rows (
27,585 bytes).  The estimated time for this step is 0.01 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 0.01 seconds.
```
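The new estimate of 613 rows is consistent with the same heuristic, now fed with the exact cardinalities from the collected statistics (STATS2 = 150 rows, STATS1 = 50 rows). A quick check in Python (the function name is ours, for illustration):

```python
import math

# Same heuristic as before: SQRT(rows of the bigger table) * rows of the
# smaller table, rounded up to the nearest whole number.
def join_estimate(big_table_rows: int, small_table_rows: int) -> int:
    return math.ceil(math.sqrt(big_table_rows) * small_table_rows)

# Exact cardinalities from the collected statistics: 150 and 50 rows
print(join_estimate(150, 50))  # 613
```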

The estimate for the result set has increased to 613 rows. But has it improved? Let’s just run the query:

```
SELECT * FROM STATS2 INNER JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;
-> SELECT completed. 0 rows returned.
```

As we can see, the query does not return any rows. So with the table-level statistics in place, the estimate has actually gotten slightly worse.

Comparing the two execution plans again can therefore be confusing: the retrieve step in the first plan (without statistics) is labeled with low confidence, while the retrieve step in the second plan (with statistics) is labeled with high confidence. Yet the join estimate derived from the low-confidence numbers is the better one.

This was the first part about join estimations. The following articles will show how collected statistics affect the Optimizer's estimates.