GUEST POST by David Wellman © Ward Analytics Ltd 2017. All Rights Reserved

All Rows on One AMP – By Design (Part 2)

In an earlier article (link to All Rows on One AMP – By Design) I showed an alternative design for small reference tables which ensures that all rows in the table are stored on a single AMP. The design allows for single-AMP retrieval of the entire table, usually before joining, which uses fewer AWTs.

During a recent Teradata education class that I was teaching a student commented that most if not all of their reference tables were too large to use that approach because they all contain history rows (effectively they are ‘slowly changing dimensions’). I was asked if my proposed design could be extended for slightly larger reference tables.

This article discusses four design options that I’ve investigated as a result of those discussions.

During this investigation and testing, I also started working on Teradata 16.10 which includes the “Multiple Hash Maps” (MHM) feature which provides some options for the designs included here.

Three Scenarios– Small, Medium and Large

During the investigation and testing of this requirement, I quickly realized that the design needs to cater for different sizes of reference table. For this article I’ve described these as ‘Small’, ‘Medium’ and ‘Large’, with the definitions of these terms shown below:

 

Small The entire table fits into 1 or 2 data blocks (typically 1 – 2 MB of PERM space).

I suspect that such tables will often not contain any history of changes, the table only holds ‘current rows’.

Medium All rows (both ‘current’ and ‘history’) fit into a few data blocks, probably less than one cylinder (@11MB).

The ‘current rows’ fit into 1 or 2 data blocks (typically 1 – 2 MB of PERM space).

Large The ‘current rows’ fit into 1 or 1 data blocks (typically 1 – 2 MB of PERM space) but the ‘history rows’ can grow as large as required (to meet business requirements).

 

What is the requirement?

This design is based on the following requirements which I have come across at some customer sites. They may not exactly match yours, but I hope are very similar.

  • The table either contains ‘current’ rows and optionally ‘history’ rows.
  • The table contains ‘start date’ and ‘end date’ columns which define when the data in the row is effective.
  • The current row is indicated by a specific value in the ’end date’ column. In a Teradata system this is often “NULL”, “3500-12-31” or “9999-12-31”.
  • Typical queries using this table only require ‘current’ rows.

What are the new design options?

With the requirements stated above my suggested design depends on two factors:

  1. Is the reference table ‘Small’, ‘Medium’ or ‘Large’.
  2. What version of the Teradata DBMS is being used? The important factor being “< TD 16.10” or “>= TD 16.10”.

Having determined where your table fits regarding the two factors shown above, use the following table to determine which design is best for you. The designs are explained in detail after this table and examples are given.

 

Teradata version < TD 16.10

Teradata version >= TD 16.10

Small

Add an extra column to the table definition which is the PI for the table. All rows have the same value for this column. (Design#1) Create the table with a ‘natural’ definition (UPI on the code column) and assign it to a 1-AMP Hash Map. (Design#2)

Medium

(See below – ‘Large table’) Create the table using a NUPI on the code column, add Row Partitioning using the ‘current row’ column (“End date” for this article) as the PPI column. Assign the table to a 1 AMP Hash Map.

Define the partitioning scheme so that all ‘current’ rows go into a single partition.

(Design#3)

Large

Create the table with a ‘natural’ definition (UPI on the code column).

Create a sparse Single Table Join Index (STJI) which has a WHERE clause that includes the ‘current’ row data value.

Note: It is critical that the WHERE clause is not an equality constraint against this ‘current’ row data value, but includes it.

Note: It is critical that the ‘current’ row data value is non-null.

(Design#4)

 

The following section shows examples of each of the four designs.

Design#1: Small table, < TD 16.10

The table definition is as shown in my previous article on this subject (link to All Rows on One AMP – By Design):

CREATE SET TABLE small_ref_table
(picol INTEGER NOT NULL
,ref_code CHAR(2) NOT NULL
,ref_desc VARCHAR(200))
PRIMARY INDEX(picol);

With the above definition in place, a typical query which joins to this table now includes (in the view definition) selection against the Primary Index column and so the relevant part of the plan becomes:

2) Next, we do a single-AMP RETRIEVE step from small_ref_table by way of
the primary index “small_ref_table.picol = 1” with no residual
conditions into Spool 2 (all_amps), which is duplicated on all
AMPs. The size of Spool 2 is estimated …

Design#2: Small table,>= TD 16.10

This table definition uses a ‘natural’ Primary Index but uses the TD16.10 MHM feature to ensure that all primary data rows are assigned to a single AMP.

Note: the actual map name (“TD_1AmpSparseMap_1Node” in the example below) may be different on your system).

CREATE SET TABLE small_ref_table
,MAP = TD_1AmpSparseMap_1Node
(ref_code CHAR(2) NOT NULL
,ref_desc VARCHAR(200))
UNIQUE PRIMARY INDEX(ref_code);

With the above definition in place, a typical query which joins to this table now includes (in the view definition) selection against the Primary Index column, and so the relevant part of the plan becomes:

4) We do a single-AMP RETRIEVE step from small_ref_table by way of an
all-rows scan with no residual conditions into Spool 3 (all_amps),
which is duplicated on all AMPs in TD_Map1. The size of Spool 3
is estimated …

Design#3: Medium table,>= TD 16.10

This table definition keeps the ‘code’ column as the Primary Index but adds in row partitioning on the ‘end_date’ column (the one used to identify the ’current’ row). The table now contains ‘history’ rows, and so the PI is defined as non-unique.

Note: the actual map name (“TD_1AmpSparseMap_1Node” in the example below) may be different on your system).

CREATE TABLE medium_ref_table
,MAP = TD_1AmpSparseMap_1Node
(ref_code CHAR(2) NOT NULL
,ref_desc VARCHAR(200)
,start_date DATE NOT NULL
,end_date DATE NOT NULL — uses 2099-12-31 as ‘current row indicator value’
PRIMARY INDEX(ref_code)
PARTITION BY RANGE_N(end_date BETWEEN DATE ‘2010-01-01’ AND DATE ‘2020-12-31’
EACH INTERVAL ‘1’ DAY,NO RANGE);

With the above definition in place, a typical query which joins to this table now includes (in the view definition) selection against the partitioning column so that only ‘current’ rows are returned, only requiring 1 partition to be accessed. The relevant part of the plan becomes:

4) We do a single-AMP RETRIEVE step from a single partition of
medium_ref_table with a condition of (“medium_ref_table.end_date
= DATE ‘2099-12-31′”) with a residual condition of
(“medium_ref_table.end_date = DATE ‘2099-12-31′”) into Spool 3
(all_amps), which is duplicated on all AMPs in TD_Map1. The size of
Spool 3 is estimated …

Note: a slight alternative to the above design is to use a NULL value in column ‘end_date’ to identify the ‘current’ row. If this is your requirement then change the design to include the special ‘UNKNOWN’ partition (instead of ‘NO RANGE’) and change the selection criteria to use “WHERE end_date IS NULL”.

Design#4: Large table (all TD releases), Medium table < TD 16.10

This design uses a sparse, single table join index (STJI) to give us single AMP access to the required data for ’current’ rows. The optimizer uses the STJI to partially cover the query.

CREATE TABLE large_ref_table
(ref_code CHAR(2) NOT NULL
,ref_desc VARCHAR(200)
,start_date DATE NOT NULL
,end_date DATE NOT NULL — uses 2099-12-31 as ‘current row indicator value’
UNIQUE PRIMARY INDEX(ref_code);

CREATE JOIN INDEX large_ref_table_stji
AS
SELECT end_date
,ref_code
,ref_desc
,start_date
FROM large_ref_table
WHERE end_date BETWEEN DATE ‘2099-12-30’ AND DATE ‘2099-12-31’
PRIMARY INDEX (end_date);

Note: It is critical that the WHERE clause in STJI definition is not an equality constraint against the ‘current’ row data value, but includes it.

Note: It is critical that the ‘current’ row data value is non-null. If you use NULL as the ’current’ row data value you will not get a single AMP retrieve from the STJI. This is because of a specific query re-write feature in Teradata.

With the above definition in place, a typical query which joins to this table now includes (in the view definition) selection against the ‘end_date’ column so that only ‘current’ rows are returned. The optimizer uses the STJI instead of the table, the relevant part of the plan becomes:

3) We do a single-AMP RETRIEVE step from LARGE_REF_TABLE_STJI
by way of the primary index “LARGE_REF_TABLE_STJI.end_date
= DATE ‘2099-12-31′” with no residual conditions into Spool 3
(all_amps), which is duplicated on all AMPs in TD_Map1. The size
of Spool 3 is estimated …

An additional benefit of this design which may prove useful to you is that your base table (“large_ref_table” in my example) can contain additional columns. For instance, you could have other columns which record:

  • The name of the user that initially created each row (CREATE_USERNAME).
  • The date and time when a row was created (CREATE_TS).
  • The name of the user that last updated each row (LASTALTER_USERNAME).
  • The date and time when a row last update (LASTALTER_TS).

If you include these columns in your base table design, do not include them in the STJI definition unless the majority of queries also need them (remember that the intent is to have all real data rows for the STJI fit into 1 or 2 data blocks).

When Can I (And When Can I Not) Use This Design?

Unlike the original design that I described (‘Design#1’ above), I think this approach can be used against any size table reference. You are no longer constrained by overall table size, only the size of the ‘current’ rows.

Here are some guidelines to help you when thinking about how and where to use the designs from this article:

  • Add the appropriate WHERE clause into the base view that references the table.
  • The aim is to reduce the resources used for processing; this will not noticeably change performance although you may find that a frequently referenced table (or at least the ‘current’ rows) will simply sit in memory on the Node.
  • If you decide to use Design#1 for some tables, do not use the same data value for column PICOL on all tables. Doing so would potentially overload one AMP. Use different values (perhaps the table name?) to ensure that the table rows are assigned to different AMPs.
  • If you decide to use Design#2 and Design#3 for some tables, do not use the same value for the COLOCATE USING clause on all tables. Doing so would potentially overload one AMP. I suggest that you do not code the COLOCATE USING clause to ensure that the table rows are assigned to different AMPs.

Summary

As before, this design approach is NOT going to make SQL requests run noticeably faster. However, it will make them run more efficiently because they use fewer AWT’s, CPU cycles and IO operations to achieve the same result.

 

Our Reader Score
[Total: 4    Average: 4/5]
All Rows on One AMP – By Design (Part 2) written by David Wellman on August 30, 2017 average rating 4/5 - 4 user ratings

1 COMMENT

LEAVE A REPLY

Please enter your comment!
Please enter your name here