Join Estimations without any Statistics

0
420

The worst case is the following: 2 tables are joined which both have no statistics. To make sure that there are not even summary statistics on table level, we create both tables completely new and insert rows. The two tables are filled from a dummy table which serves as the basis in all our examples:

CREATE SET  TABLE DWHPRO.STATS1 ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      PI INTEGER NOT NULL,
      COL1 INTEGER NOT NULL,
      COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );

We insert 50 rows into the table stats1

CREATE SET  TABLE DWHPRO.STATS2 ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      PI INTEGER NOT NULL,
      COL1 INTEGER NOT NULL,
      COL2 INTEGER NOT NULL)
PRIMARY INDEX ( PI );

We insert 150 rows into the table stats2

As the following SQL shows, there are no statistics on either table. The Optimizer must, therefore, estimate the cardinality of the two tables on the basis of a random AMP sample over the primary index:

SHOW STATISTICS VALUES ON STATS1;
[3624] There are no statistics defined for the table  
SHOW STATISTICS VALUES ON STATS2;
[3624] There are no statistics defined for the table  

We will perform a simple INNER JOIN of the two tables on a column that is not indexed:

EXPLAIN SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;

 1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
      DWHPRO.STATS2 in TD_MAP1 for read.
   4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
      way of an all-rows scan with no residual conditions into Spool 2
      (all_amps), which is duplicated on all AMPs in TD_Map1.  The size
      of Spool 2 is estimated with low confidence to be 96 rows (2,400
      bytes).  The estimated time for this step is 0.00 seconds.
   5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
      way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
      of an all-rows scan with no residual conditions.  Spool 2 and
      DWHPRO.STATS2 are joined using a dynamic hash join, with a join
      condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
      Spool 1 (group_amps), which is built locally on the AMPs.  The
      size of Spool 1 is estimated with no confidence to be 600 rows (
      27,000 bytes).  The estimated time for this step is 0.01 seconds.
   6) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

From the execution plan, we can see that the estimated number of rows in the result set is 600. Furthermore, we can see that the estimated number of rows in table Stats1 is 96. However, we have to halve this number because the table is copied to all 2 AMPs. I.e. the optimizer estimates the number of rows in table Stats1 at 48, which is quite close to the real number of 50 rows.

To get the number of estimated rows in table Stats2 we can use the following trick:

EXPLAIN SELECT * FROM STATS2;
1) First, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read.
   3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS2 by
      way of an all-rows scan with no residual conditions into Spool 1
      (group_amps), which is built locally on the AMPs.  The size of
      Spool 1 is estimated with low confidence to be 156 rows (5,148
      bytes).  The estimated time for this step is 0.01 seconds.
   4) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

We can see from the execution plan that the number of estimated rows is 156.

Does the question arise how the estimate of the result set of 600 rows is made? The answer is that the following heuristic is applied:

SQRT(number of rows of the bigger table) * (Number of rows from the smaller table). The result is rounded up to the nearest whole number. (Many thanks to Diether Noeth who gave me this information in the Teradata forum):

Let’s do the math for our example:

SQRT(156) = 12,489996 * 48 => 599,519808, which is rounded to 600.

If there are at least current statistics on the cardinality of the tables, these are used for the Heuristic:

COLLECT STATS COLUMN(PI) ON STATS1;
COLLECT STATS COLUMN(PI) ON STATS2;

EXPLAIN SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1

EXPLAIN SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1;
 1) First, we lock DWHPRO.STATS1 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   2) Next, we lock DWHPRO.STATS2 in TD_MAP1 for read on a reserved
      RowHash to prevent global deadlock.
   3) We lock DWHPRO.STATS1 in TD_MAP1 for read, and we lock
      DWHPRO.STATS2 in TD_MAP1 for read.
   4) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.STATS1 by
      way of an all-rows scan with no residual conditions into Spool 2
      (all_amps), which is duplicated on all AMPs in TD_Map1.  The size
      of Spool 2 is estimated with high confidence to be 100 rows (
      2,500 bytes).  The estimated time for this step is 0.00 seconds.
   5) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
      way of an all-rows scan, which is joined to DWHPRO.STATS2 by way
      of an all-rows scan with no residual conditions.  Spool 2 and
      DWHPRO.STATS2 are joined using a dynamic hash join, with a join
      condition of ("COL1 = DWHPRO.STATS2.COL1").  The result goes into
      Spool 1 (group_amps), which is built locally on the AMPs.  The
      size of Spool 1 is estimated with no confidence to be 613 rows (
      27,585 bytes).  The estimated time for this step is 0.01 seconds.
   6) Finally, we send out an END TRANSACTION step to all AMPs involved
      in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.  The total estimated time is 0.01 seconds.

The estimate for the result set has increased to 613 rows. But has it improved? Let’s just run the query:

SELECT * FROM STATS2 inner JOIN STATS1 ON STATS1.COL1 = STATS2.COL1
-> SELECT completed. 0 rows returned.

As we can see, the query does not return any rows at all. So the estimation has even got a bit worse due to the statistics at table level.

So it can be a bit confusing to look at both execution plans again. The retrieve step in the first plan (without statistics) has only low confidence, the retrieve step in the second plan (with statistics) has high confidence and still the estimate with low confidence is better.

This was the first part about Join Estimations. In the next articles, we will take statistics step by step and see how this affects the optimizer’s estimates.

LEAVE A REPLY

Please enter your comment!
Please enter your name here