Using Teradata Check Constraints for SQL Tuning

Using Teradata Check Constraints for SQL Tuning

Without Teradata CHECK Constraint

If a non-indexed column is used in a retrieving step (i.e. WHERE condition) and without Teradata Check Constraints, the Optimizer uses heuristics for missing statistics to estimate the number of rows in the answer set:

The above estimation is 7,342 rows, which corresponds to 10% of the number of table rows. This is the heuristic the Teradata optimizer always uses in the absence of statistics on the non-indexed column.

CREATE TABLE Heuristics_Without_Check
(
	PK INTEGER NOT NULL,
	GENDER BYTEINT NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO Heuristics_Without_Check
SELECT
	ROW_NUMBER() OVER (ORDER BY 1) AS PK,
	RANDOM(0,1)
FROM SYS_CALENDAR.CALENDAR;
-> 73,414 Rows inserted

--> Collect PI Statistics to have the real row number:

COLLECT STATISTICS COLUMN(PK) ON Heuristics_Without_Check;

SELECT * FROM Heuristics_Without_Check WHERE GENDER = 0;

Explain SELECT * FROM Heuristics_Without_Check WHERE GENDER = 0;

  1) First, we lock DWHPRO.Heuristics_Without_Check in TD_MAP1 for read
     on a reserved RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Heuristics_Without_Check in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     DWHPRO.Heuristics_Without_Check by way of an all-rows scan with a
     condition of ("DWHPRO.Heuristics_Without_Check.GENDER = 0") into
     Spool 1 (group_amps), which is built locally on the AMPs.  The
     size of Spool 1 is estimated with no confidence to be 7,342 rows (
     190,892 bytes).  The estimated time for this step is 0.03 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.03 seconds.

But with a simple trick, the estimates can be massively improved even if the statistics on the non-indexed column are missing. The optimizer uses existing CHECK constraints to optimize the estimates.

With CHECK Constraint

The following forms are used for this:

Estimated Rows = Total Rows x (1 / Possible Unique Values)

Since the RANDOM function we use to load our test table only returns the values 0 and 1 we can add the following CHECK constraint to our test table:

CREATE TABLE Heuristics_With_Check
(
	PK INTEGER NOT NULL,
	GENDER BYTEINT NOT NULL CHECK (GENDER IN (0,1))
) PRIMARY INDEX (PK);

INSERT INTO Heuristics_With_Check
SELECT
	ROW_NUMBER() OVER (ORDER BY 1) AS PK,
	RANDOM(0,1)
FROM SYS_CALENDAR.CALENDAR;
-> 73,414 Rows inserted

--> Collect PI Statistics to have the real row number:

COLLECT STATISTICS COLUMN(PK) ON Heuristics_With_Check;

If we apply the above formula, we expect the number of estimated rows in the ResultSet as:

Estimated rows = 73,414 * (1 / 2) = 36,707 rows

Let's take a look at the Execution Plan:

SELECT * FROM Heuristics_With_Check WHERE GENDER = 0;

Explain SELECT * FROM Heuristics_With_Check WHERE GENDER = 0;

  1) First, we lock DWHPRO.Heuristics_With_Check in TD_MAP1 for read on
     a reserved RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Heuristics_With_Check in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     DWHPRO.Heuristics_With_Check by way of an all-rows scan with a
     condition of ("DWHPRO.Heuristics_With_Check.GENDER = 0") into
     Spool 1 (group_amps), which is built locally on the AMPs.  The
     size of Spool 1 is estimated with low confidence to be 36,707 rows
     (954,382 bytes).  The estimated time for this step is 0.07 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.07 seconds.


Finally, we run the query using the table with the CHECK constraint, and actually the estimate is now 36,707 rows and therefore much better than the 10% heuristic. The real number of rows is almost the same as the estimation as the RANDOM function creates about the same number of zeros and ones. In general, this approach can improve estimations significantly:


SELECT COUNT(*) FROM Heuristics_With_Check WHERE GENDER = 0;
--> 36,711 rows

A Warning when using CHECK constraints

You should not start now to define check constraints on columns just to improve the estimates. Don't forget that checking the rules involves additional effort (and creating secondary indexes).

More information about Teradata CHECK constraints you can find here:
Check Constraints explained in Detail

DWH Pro Admin
 

>