How To Speed Up Your PPI Tables With A Secondary Index

Roland Wenzlofsky

January 2, 2020

minutes reading time


When designing Partitioned Primary Index Tables in Teradata, you have several options. Depending on the expected workload, you have to decide which design is the best.

In this article, I’ll show you the effect of the different types of partitioning on the access path chosen by the Teradata Optimizer. This, in turn, affects performance.

First, we need a test scenario. I have therefore created two identical tables. The only difference is that one is partitioned and the other is not:

CREATE TABLE Schedule_PPI
(
 ScheduleId BIGINT,
 StartDate DATE FORMAt 'YYYY-MM-DD'
) PRIMARY INDEX (ScheduleId)
PARTITION BY RANGE_N (
StartDate BETWEEN DATE '1900-01-01' AND DATE '2100-12-31' 
EACH INTERVAL '1' MONTH, NO RANGE);
CREATE TABLE Schedule_NPPI
(
	ScheduleId	BIGINT,
	StartDate	DATE FORMAt 'YYYY-MM-DD'
) PRIMARY INDEX (ScheduleId)
;

We load these tables with an identical set of test data:

INSERT INTO Schedule_PPI
SELECT 
 ROW_NUMBER() OVER (ORDER BY 1) AS ScheduleId,
 CALENDAR_DATE
FROM 
(
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
) t01;

COLLECT STATISTICS COLUMN(ScheduleId) ON Schedule_PPI; 
COLLECT STATISTICS COLUMN(StartDate) ON Schedule_PPI; 
COLLECT STATISTICS COLUMN(PARTITION) ON Schedule_PPI;
INSERT INTO Schedule_NPPI
SELECT 
 ROW_NUMBER() OVER (ORDER BY 1) AS ScheduleId,
 CALENDAR_DATE
FROM 
(
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
 UNION ALL
 SELECT * FROM SYS_CALENDAR.CALENDAR
) t01;

COLLECT STATISTICS COLUMN(ScheduleId) ON Schedule_NPPI; 
COLLECT STATISTICS COLUMN(StartDate) ON Schedule_NPPI;

Query Only Via The Partition Columns

First, we will execute a query on both tables (PPI and NPPI) in which the primary index is contained in the WHERE condition. We start with the partitioned table:

SELECT *
FROM Schedule_PPI
WHERE StartDate BETWEEN DATE '2019-10-31' AND DATE'2019-10-31';

Let us now look at the resulting execution plan in detail:

Explain 
SELECT *
FROM Schedule_PPI
WHERE StartDate BETWEEN DATE '2019-10-31' AND DATE'2019-10-31';

  1) First, we lock DWHPRO.Schedule_PPI in TD_MAP1 for read on a
     reserved RowHash in a single partition to prevent global deadlock.
  2) Next, we lock DWHPRO.Schedule_PPI in TD_MAP1 for read on a single
     partition.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from a single partition
     of DWHPRO.Schedule_PPI with a condition of (
     "DWHPRO.Schedule_PPI.StartDate = DATE '2019-10-31'") with a
     residual condition of ("DWHPRO.Schedule_PPI.StartDate = DATE
     '2019-10-31'") into Spool 1 (group_amps), which is built locally
     on the AMPs.  The size of Spool 1 is estimated with high
     confidence to be 10 rows (390 bytes).  The estimated time for this
     step is 0.00 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.00 seconds.

An All-AMP retrieve step is required, but only a single partition is read ((all days of October 2019 are in the same partition).

Teradata PPI
All-AMP – One Partition Access

Next, we execute the same query on the NPPI table:

SELECT *
FROM Schedule_NPPI
WHERE StartDate BETWEEN DATE '2019-10-31' AND DATE'2019-10-31';

This is the resulting execution plan:

Explain SELECT *
FROM Schedule_NPPI
WHERE StartDate  BETWEEN DATE '2019-10-31' AND DATE'2019-10-31';

  1) First, we lock DWHPRO.Schedule_NPPI in TD_MAP1 for read on a
     reserved RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Schedule_NPPI in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     DWHPRO.Schedule_NPPI by way of an all-rows scan with a condition
     of ("DWHPRO.Schedule_NPPI.StartDate = DATE '2019-10-31'") into
     Spool 1 (group_amps), which is built locally on the AMPs.  The
     size of Spool 1 is estimated with high confidence to be 10 rows (
     390 bytes).  The estimated time for this step is 0.15 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.15 seconds.

As you can see, for a Teradata NPPI Table, it is again All-AMP access, but this time, a full table scan must be performed:

Teradata Partitioned Primary Index
All-AMP Full Table Scan

As we have seen, the PPI table brings us a performance advantage in this case because the rows of the table do not have to be searched in all data blocks but are located locally in a partition.

The Optimizer’s estimation with 0.00 seconds and 0.15 seconds expresses this exceptionally well.

Query Only Via The Primary Index

Next, let’s look at what happens when we only query the primary index but not the partition columns.

First, we execute the query on the PPI table again:

SELECT *
FROM Schedule_PPI
WHERE ScheduleId = 100;

The Optimizer provides us with the following execution plan:

Explain SELECT *
FROM Schedule_PPI
WHERE ScheduleId = 100;

  1) First, we do a single-AMP RETRIEVE step from all partitions of
     DWHPRO.Schedule_PPI by way of the primary index
     "DWHPRO.Schedule_PPI.ScheduleId = 100" with no residual conditions
     into Spool 1 (one-amp), which is built locally on that AMP.  The
     size of Spool 1 is estimated with high confidence to be 1 row (39
     bytes).  The estimated time for this step is 0.02 seconds.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.02 seconds.

Because the Primary Index is known, a single-AMP Retrieve Step is executed. However, Teradata must probe all partitions to see if a row with the primary index you are looking for exists.

Teradata Partitioning
One AMP – All Partitions

Next, we execute the query about the Primary Index on the NPPI table:

SELECT *
FROM Schedule_NPPI
WHERE ScheduleId = 100;

The Explain Statement shows us the following execution plan:

Explain SELECT *
FROM Schedule_NPPI
WHERE ScheduleId = 100;

  1) First, we do a single-AMP RETRIEVE step from DWHPRO.Schedule_NPPI
     by way of the primary index "DWHPRO.Schedule_NPPI.ScheduleId = 100"
     with no residual conditions into Spool 1 (one-amp), which is built
     locally on that AMP.  The size of Spool 1 is estimated with high
     confidence to be 1 row (39 bytes).  The estimated time for this
     step is 0.00 seconds.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.00 seconds.

In this case, the searched row can be found directly with the rowhash on the AMP.

Teradata Primary Index
One-AMP Primary Index Access

Improvement Of The Access Via Teradata Secondary Index

As you know, the Primary Index cannot be unique if not all partition columns are included.

As we saw in the example where we queried a Teradata PPI table using only the Primary Index, it was necessary to check each partition to see if it contained the Primary Index value we were looking for.

But there is a simple trick to prevent this: By defining a Unique secondary index (USI) on the primary index columns of the partitioned table:

CREATE UNIQUE INDEX (ScheduleId) ON Schedule_PPI;

Let’s see how this affects the execution plan of the query:

Explain 
SELECT *
FROM Schedule_PPI
WHERE ScheduleId = 100;

  1) First, we do a two-AMP RETRIEVE step in TD_MAP1 from
     DWHPRO.Schedule_PPI by way of unique index # 4
     "DWHPRO.Schedule_PPI.ScheduleId = 100" with no residual conditions.
     The estimated time for this step is 0.01 seconds.
  -> The row is sent directly back to the user as the result of
     statement 1.  The total estimated time is 0.01 seconds.

This has resulted in the following advantages:

  • No probing of the partitions is necessary
  • A Rowhash lock is used instead of a table lock

But what can I do if the Primary Index is not unique, but I still want to increase the performance?

Then there is still the possibility of defining a NUSI on the Primary Index Columns:

CREATE INDEX (ScheduleId) ON Schedule_PPI;
SELECT *
FROM Schedule_PPI
WHERE ScheduleId = 100;
Explain 
SELECT *
FROM Schedule_PPI
WHERE ScheduleId = 100;

  1) First, we do a single-AMP RETRIEVE step from all partitions of
     DWHPRO.Schedule_PPI by way of the primary index
     "DWHPRO.Schedule_PPI.ScheduleId = 100" with no residual conditions
     into Spool 1 (one-amp), which is built locally on that AMP.  The
     size of Spool 1 is estimated with high confidence to be 1 row (39
     bytes).  The estimated time for this step is 0.02 seconds.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.02 seconds.

Like the Teradata USI, the NUSI also has some advantages if a Teradata PPI table is queried only via the primary index:

  • No probing of Partitions needed
  • Rowhash locking instead of table lock
  • Multiload into a table with NUSI is possible

How Teradata Queries PPI/NPPI Tables?

How is a Teradata PPI table queried if only the PPI columns are used in the WHERE condition?

With an All-AMP retrieve step accessing relevant partitions only.

How is a Teradata NPPI table queried if the columns used in the WHERE condition are not indexed?

With a full table scan

How is a Teradata PPI table queried if only the Primary Index columns are used in the WHERE condition?

With a Single-AMP Retrieve step probing all partitions

How is a Teradata NPPI table queried if the Primary Index columns are used in the WHERE condition?

With a Single-AMP Primary Index Direct Access

How can a Teradata PPI table query be improved if it only takes place via the primary index but not the partition columns?

Using a secondary index (USI, NUSI) on the primary index columns.
The index prevents a table-level lock and allows the use of a Rowhash Lock. It also eliminates the need for partition probing.

Here are a few more articles about the Teradata Partitioned Primary Index Table:

Teradata Partitioning Strategies Anyone Would Be Proud Of(Opens in a new browser tab)

Teradata Partitioning Strategies Anyone Would Be Proud Of(Opens in a new browser tab)

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>