Optimizing Teradata Queries: From No Index to Hashed NUSI

DWH Pro Admin

April 28, 2023

minutes reading time


The initial situation without any index

In this blog, I will demonstrate how to optimize a query using Teradata’s tools.

We will begin with the following test scenario:

CREATE TABLE Orders
(
	OrderId  BIGINT NOT NULL,
	CustomerId	BIGINT NOT NULL,
	Amount		BIGINT
) PRIMARY INDEX (OrderId);
INSERT INTO Orders 
SELECT     
  ROW_NUMBER() OVER (ORDER BY 1),     
  RANDOM(1,10000) AS CustomerId,     
  RANDOM(1,100)  
FROM SYS_CALENDAR.CALENDAR;  

The data is evenly distributed. To demonstrate the query’s selectivity for the tested indexes we will define later, I assigned a significant portion of rows the same Primary Index value that will be used for the index.

UPDATE ORDERS SET CustomerId = 9999 WHERE CustomerId > 5000;
SELECT CustomerId,COUNT(*) FROM ORDERS GROUP BY 1
ORDER BY 2 DESC;
CustomerIdCount(*)
99997383
783430
500318

We will generate a test table with sufficient dummy data and subsequently gather statistics on all columns to enable the optimizer to produce optimal estimates.

COLLECT STATS ON ORDERS COLUMN(OrderId); 
COLLECT STATS ON ORDERS COLUMN(CustomerId); 
COLLECT STATS ON ORDERS COLUMN(Amount);

Without Index Usage

We aim to optimize this query by running it without an index at the moment.

SELECT * FROM Orders WHERE CustomerId = 3000;

This is the Execution Plan:

Explain 
SELECT * FROM Orders WHERE CustomerId = 3000;

  1) First, we lock DWHPRO.Orders in TD_MAP1 for read on a reserved
     RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Orders in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.Orders by
     way of an all-rows scan with a condition of (
     "DWHPRO.Orders.CustomerId = 3000") into Spool 1 (group_amps),
     which is built locally on the AMPs.  The size of Spool 1 is
     estimated with high confidence to be 8 rows (360 bytes).  The
     estimated time for this step is 0.02 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.02 seconds.

The optimizer has no choice but to make a full table scan because the WHERE condition does not correspond to the table’s primary index.

Teradata NUSI Usage

Given the high selectivity of the query, implementing a NUSI is appropriate. Consequently, we will generate a NUSI corresponding to the column specified in the WHERE condition.

CREATE INDEX (CustomerId) ON Orders;

We rerun our query.

SELECT * FROM Orders WHERE CustomerId = 3000;

This is the Execution Plan:

Explain SELECT * FROM Orders WHERE CustomerId = 3000;

  1) First, we lock DWHPRO.Orders in TD_MAP1 for read on a reserved
     RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Orders in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.Orders by
     way of index # 4 "DWHPRO.Orders.CustomerId = 3000" with no
     residual conditions into Spool 1 (group_amps), which is built
     locally on the AMPs.  The size of Spool 1 is estimated with high
     confidence to be 8 rows (360 bytes).  The estimated time for this
     step is 0.01 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.01 seconds.

Since the NUSI is selective enough, the optimizer uses it. This is already an improvement over the previous scenario without an index. But is there still better?

The “Hashed NUSI”

Implementation of the NUSI has led to significant enhancements in performance.

Although NUSI has benefits, it also has a drawback; it operates as an ALL-AMP system, with NUSI index rows residing on the same AMP as the base table rows and not distributed by hashing.

Let’s create a hash-distributed NUSI. I will demonstrate how to do this in the following example.

CREATE JOIN INDEX Orders_JI
AS
SELECT CustomerId,ROWID FROM ORDERS PRIMARY INDEX (CustomerId);
COLLECT STATS ON Orders_JI COLUMN(CustomerId);

COLLECT STATS ON Orders_JI COLUMN(CustomerId);

The join index provides the same benefits as NUSI but only necessitates a single AMP to locate the ROWIDs of the base table rows. It can subsequently retrieve them in the same manner as NUSI.

We re-run our query.

SELECT * FROM Orders WHERE CustomerId = 3000;

This is the Execution Plan:

Explain SELECT * FROM Orders WHERE CustomerId = 3000;

  1) First, we lock DWHPRO.Orders in TD_MAP1 for read on a reserved
     RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Orders in TD_MAP1 for read.
  3) We do a single-AMP RETRIEVE step from DWHPRO.ORDERS_JI by way of
     the primary index "DWHPRO.ORDERS_JI.CustomerId = 3000" with no
     residual conditions into Spool 2 (all_amps), which is
     redistributed by the hash code of (DWHPRO.ORDERS_JI.Field_1026) to
     few or all AMPs in TD_Map1.  Then we do a SORT to order Spool 2 by
     the sort key in spool field1.  The size of Spool 2 is estimated
     with high confidence to be 8 rows (248 bytes).  The estimated time
     for this step is 0.00 seconds.
  4) We do an all-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by
     way of an all-rows scan, which is joined to DWHPRO.Orders by way
     of an all-rows scan with no residual conditions.  Spool 2 and
     DWHPRO.Orders are joined using a row id join, with a join
     condition of ("Field_1 = DWHPRO.Orders.ROWID").  The result goes
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The size of Spool 1 is estimated with low confidence to be 8 rows
     (360 bytes).  The estimated time for this step is 0.01 seconds.
  5) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.01 seconds.

The join index does not require uniqueness and can be located through a single AMP.

The Improved “Hashed NUSI”

A final enhancement is available for our Join Index, which maintains the Index Rows, illustrated in the graphic below:

Teradata Join Index
Uncompressed Join Index

The NUSI stores all ROWIDs associated with a single index value cohesively.

This format is easily achieved with a Join Index, using the following syntax (noting the brackets).

The second set of parentheses condenses the column combinations, which in our scenario, solely refers to the ROWID that directs to the base table.

CREATE JOIN INDEX Orders_JI
AS
SELECT (CustomerId),(ROWID) FROM ORDERS PRIMARY INDEX (CustomerId);
COLLECT STATS ON Orders_JI COLUMN(CustomerId);

COLLECT STATS ON Orders_JI COLUMN(CustomerId);

The syntax stores identical ROWIDs from NUSI in an index row. If a single index row is insufficient, multiple index rows are used.

Why is the compression of the Teradata join index so important?

We aim to maximize its potential. Initially, we began without an index but have now reached a join index via the NUSI that outperforms it by solely requiring one AMP.

Compressing data not only saves storage space but also reduces the number of required I/Os.

SELECT TABLENAME,SUM(CURRENTPERM) AS Perm 
FROM DBC.TABLESIZE 
WHERE DATABASENAME = 'DWHPRO' 
AND TABLENAME LIKE 'Orders_JI%'
GROUP BY 1;
TablenamePerm
Orders_JI1 875 968
Orders_JI_Comp1 114 112

The condensed index is merely 60% the size of the uncompressed index.

Final Words

This example demonstrates the multitude of query optimization possibilities available with Teradata.

The Join Index is not always the superior option, just as the NUSI is not always the optimal choice.

Crafting an optimal index landscape necessitates significant expertise and understanding.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>