Improving Row Estimates with Teradata Check Constraints

Roland Wenzlofsky

April 23, 2023

minutes reading time


Without Teradata CHECK Constraint

When a non-indexed column is utilized in a retrieval operation (i.e. WHERE statement) and lacks Teradata Check Constraints, the Optimizer employs heuristics to approximate the number of rows in the resulting set by inferring missing statistics.

The estimated number of rows is 7,342, equivalent to 10% of the table’s total rows. This heuristic is utilized by the Teradata optimizer when statistics on the non-indexed column are not available.

CREATE TABLE Heuristics_Without_Check
(
	PK INTEGER NOT NULL,
	GENDER BYTEINT NOT NULL
) PRIMARY INDEX (PK);

INSERT INTO Heuristics_Without_Check
SELECT
	ROW_NUMBER() OVER (ORDER BY 1) AS PK,
	RANDOM(0,1)
FROM SYS_CALENDAR.CALENDAR;
-> 73,414 Rows inserted

--> Collect PI Statistics to have the real row number:

COLLECT STATISTICS COLUMN(PK) ON Heuristics_Without_Check;

SELECT * FROM Heuristics_Without_Check WHERE GENDER = 0;

Explain SELECT * FROM Heuristics_Without_Check WHERE GENDER = 0;

  1) First, we lock DWHPRO.Heuristics_Without_Check in TD_MAP1 for read
     on a reserved RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Heuristics_Without_Check in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     DWHPRO.Heuristics_Without_Check by way of an all-rows scan with a
     condition of ("DWHPRO.Heuristics_Without_Check.GENDER = 0") into
     Spool 1 (group_amps), which is built locally on the AMPs.  The
     size of Spool 1 is estimated with no confidence to be 7,342 rows (
     190,892 bytes).  The estimated time for this step is 0.03 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.03 seconds.

But with a simple trick, the estimates can be massively improved even if the non-indexed column statistics are missing. The Optimizer uses existing CHECK constraints to optimize the estimates.

With CHECK Constraint

The following forms are used for this:

Estimated Rows = Total Rows x (1 / Possible Unique Values)

Since the RANDOM function we use to load our test table only returns the values 0 and 1, we can add the following CHECK constraint to our test table:

CREATE TABLE Heuristics_With_Check
(
	PK INTEGER NOT NULL,
	GENDER BYTEINT NOT NULL CHECK (GENDER IN (0,1))
) PRIMARY INDEX (PK);

INSERT INTO Heuristics_With_Check
SELECT
	ROW_NUMBER() OVER (ORDER BY 1) AS PK,
	RANDOM(0,1)
FROM SYS_CALENDAR.CALENDAR;
-> 73,414 Rows inserted

--> Collect PI Statistics to have the real row number:

COLLECT STATISTICS COLUMN(PK) ON Heuristics_With_Check;

Applying the aforementioned formula produces an anticipated quantity of rows in the ResultSet.

Estimated rows = 73,414 * (1 / 2) = 36,707 rows

Examining the Execution Plan:

SELECT * FROM Heuristics_With_Check WHERE GENDER = 0;

Explain SELECT * FROM Heuristics_With_Check WHERE GENDER = 0;

  1) First, we lock DWHPRO.Heuristics_With_Check in TD_MAP1 for read on
     a reserved RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Heuristics_With_Check in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     DWHPRO.Heuristics_With_Check by way of an all-rows scan with a
     condition of ("DWHPRO.Heuristics_With_Check.GENDER = 0") into
     Spool 1 (group_amps), which is built locally on the AMPs.  The
     size of Spool 1 is estimated with low confidence to be 36,707 rows
     (954,382 bytes).  The estimated time for this step is 0.07 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.07 seconds.


We execute the query on the table with the CHECK constraint and obtain a more accurate estimate of 36,707 rows. This estimate is vastly superior to the 10% heuristic. The number of rows closely aligns with the estimation because the RANDOM function generates a roughly equal number of zeros and ones. Adopting this approach can considerably enhance estimate precision.


SELECT COUNT(*) FROM Heuristics_With_Check WHERE GENDER = 0;
--> 36,711 rows

A Warning when using CHECK constraints

Avoid defining check constraints on columns at this time to enhance the estimates. Keep in mind that verifying the regulations necessitates extra exertion and the establishment of secondary indexes.

More information about Teradata CHECK constraints you can find here:
Check Constraints explained in Detail

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>