Teradata Join Index – Teradata Tuning – DWH Pro – The Teradata Experts Forum

Forum

Teradata Join Index
 

Teradata Join Index  

  RSS

Roland Wenzlofsky
(@rwenzlofsky)
Estimable Member
Joined: 6 years ago
Posts: 204
12/05/2014 9:00 am  

What is the Join Index?

The Teradata Join index stores pre-joined tables, the result of aggregations, or simply the content of one table in a different structure. Join indexes are comparable to materialized views.

Although the name indicates that join indexes are an index structure, they have to be considered more as an added layer of pre-joined, pre-aggregated or permanent tables with its own Primary Index, Secondary Indexes, and row-level partitioning.

What makes the join index comparable to a secondary index is that it provides the optimizer with additional access paths to the data.

Unlike secondary indexes, which are stored as internal subtables, Teradata Join indexes store as separate physical database objects.

Therefore, a join index can also be stored in any database.

Because join indexes are stored like base tables, their maintenance costs are visible in table DBC.DBQLOGTBL (Teradata does not store information about the maintenance costs of other index types).

The use of a Join Index by the Optimizer is not subject to the decision of the user.

Only the Teradata Optimizer decides when it makes sense from a cost perspective to use the join index and when it is cheaper to use the underlying table structures.

Secondary indexes typically offer exactly one additional access path to the data in the base table (hash lookup, in the case of a NUSI also full index scan).

Join indexes can be modeled in a variety of ways and themselves serve as the basis for secondary indexes or partitioning.

When to Use a Join Index?

  • Join together large tables with a significant number of rows
  • Join large tables selecting only a few columns
  • Queries with complex expressions in its predicate
  • Denormalization of PDM for to increase performance
  • Create alternative row-level partitioning
  • Create alternative Primary Index as an additional access path
  • Movement of time-consuming joins and aggregations into the batch window

A join index provides additional access paths to data and therefore is used for performance tuning.

Higher performance usually is achieved by designing a join index with a different primary index or row-level partitioning.

However, even if the primary index or the partitions of the join index cannot be used for querying, the optimizer will use the join index with a full index scan if the index is considerably smaller than the base table and therefore fewer data blocks have to be read.

The Primary Index of a Join Index

A unique Primary Index can be defined if the index is not compressed (this is not related to multivalue compression, but the compression of repeating groups in the index) and only on a single table join index.

This will give all the advantages of a UPI on a base table, such as ensuring that no duplicates can be inserted or created with an update statement.

Usually, the Primary Index is used for data distribution, and the rows are sorted by rowhash on the AMPs. But the Join Index allows as well to chose another sort order by using the "ORDER BY VALUE" clause. Value order gives better performance for range queries:

CREATE JOIN INDEX JI1 AS
    SELECT (a, b), (c, d, e)
    FROM TheTable
    LEFT JOIN Customer ON a= c
    ORDER BY a
    PRIMARY INDEX (b);

Value order is only allowed for 4 Byte numeric values, as this is historically the maximum value allowed for row hashes being stored in the array of data pointers in each data block (DATE can be used as it's internally stored as INTEGER).

Join Indexes with Built-In Functions

If the functions DATE, CURRENT_DATE, or CURRENT_TIMESTAMP are used in the Expression partition of a join index, a one-time evaluation takes place when the index is created and not dynamically when the index is used. This also applies to the use of these functions in a WHERE condition of the join index.

However, "ALTER TABLE TO CURRENT" can be used on the index to update the value in the Expression partition to the current value.

Why you should use OUTER JOINS

Since with an INNER JOIN, the rows that cannot be matched are lost, it is better to define Join Indexes with OUTER JOIN. This allows more queries to be covered.

Join Index and Partitioning

  • A Join Index can be Row or Column Partitioned
  • Partitioning adds an additional access path
  • A Join Index can be partitioned on top of the non-partitioned table
  • A non-partitioned Join Index on top of a partitioned base table is not recommended for performance reasons
  • A Row compressed Join Index can't be row-level partitioned
  • The Partitioning Expression of a join index can be numeric or of a character data type
  • DATE, CURRENT_DATE, or CURRENT_TIMESTAMP functions can be used in the partition expression of a join index

Join Index Coverage

If the join index does not contain all selected columns ("not covering"), it can access these missing columns using the ROWID of the base table. To do this, the dummy column "ROWID" usually must be included (exceptions see below) in the SELECT list.

The cost saving is that the relevant rowids are first extracted into a spool file using the join index. In a subsequent step, the rows of the base table belonging to the rowids are queried.

If the join index contains all columns of the SELECT list ("covering"), no access to the base table is required.

There is no coverage possible for SELECT * FROM table_name queries:

Avoid the practice of writing SQL queries without specifying a column list or join index coverage is not given!

When will the Optimizer use a non-covering Join Index?

  • One of the columns in the join index definition is the dummy column "ROWID". If this is the case, the base table ROWID is staying with each index row, and a lookup of the base table row can be done as described above.
  • The UPI columns of the base table are available in the SELECT list of the Join Index. Teradata can hash this column combination in order to identify the AMP containing a row. Again, the rowids are used to access the base table. A message with the rowid is sent to the AMP in order to receive the required row.
  • The column set of a NUPI of the base table is contained in the select list of the join index (condition 1) and one of the two additional conditions is met:

    (2a) The Join Index contains the dummy column "ROWID" or
    (2b) There is a USI on the base table available which matches the column set of the NUPI.

If condition (2a) is met, the base table rows can be accessed via the column "ROWID" by rehashing the NUPI columns and accessing the base table rows via the rowids extracted into a spool file.

Condition (2b) is similar to the UPI case described above. The AMPs owning the requested rows are identified by rehashing the USI columns, everything else is the same as in condition (2a)

Coverage does not guarantee the use of a Join Index. The cost of using it must be less than the cost of accessing the base table with the usage of another index or a full table scan.

What are the Limitations of a Join Index?

  • The Usage of FULL OUTER JOIN is not allowed
  • LEFT or RIGHT JOINS: on the inner side at least one non-nullable column must be selected
  • OUTER JOIN is preferable for Join Index usage likelihood, but not allowed for Aggregate Join Index
  • HAVING and QUALIFY not allowed
  • No Set Operations are allowed (UNION, INTERSECT, MINUS)
  • No Subqueries are allowed
  • No inequality conditions in ON clauses of join index definitions. are allowed. Exception: If they are ANDed to at least one equality join condition
  • Only <, <=, >=, > as comparison operators allowed
  • TOP n and SAMPLE not allowed
  • During the restoration of a base table or database, the join index is marked as invalid
  • A maximum of 64 columns per table per Join Index is allowed
  • A maximum of 32 Indexes can be defined on a table including join indexes
  • UPI only allowed for single table join index
  • Value ordered NUSI and Partition Expression on the same Join Index columns are not allowed

Multi-Table Join Index

Multi-Table Join Indexes allow you to move resource-intensive joins from the online to the batch window.

Shifting the workload does not reduce the total workload, but shifts it to a point in time that is beneficial for the overall performance of the system.

The syntax for creating a join index is not much different than the syntax of a CREATE TABLE AS statement:

CREATE JOIN INDEX  AS
SELECT t01.PK, t01.COL_A,t02.COL_B
FROM
 t01
INNER JOIN
t02 ON t01.PK = t02.PK PRIMARY INDEX (COL_B);

The statement above shows several important characteristics of a join index:

Join indexes are a good alternative to temporary tables. However, unlike a temporary table, the use of the join index is not guaranteed. Whether the optimizer uses an existing join index also depends on the available statistics.

A join index should only be created if it can be reused by the optimizer in different scenarios. The additional space required must be taken into account accordingly. If the use of the join index cannot be achieved consistently, a temporary table is probably better.

If we want to overcome any possible deviation between indexed columns and the queried columns, it is feasible to store the pseudo-column ROWID in the join index which allows the usage of this index even if not all needed columns are available in the query.

Teradata will get the missing columns from the base table (as it has the ROWID available, this is the same as primary index access).

Limitations of the Multi-Table Join Index

The bulk load utilities can only be used if there is no join index defined.

However, a multi-table join index is usually used for pre-joining tables to avoid expensive online joins, for example, when creating ad-hoc reports. Joins are one of the most expensive tasks. Often large amounts of data have to be redistributed.

The big advantage of the Join Index is that the maintenance is completely done by Teradata and optimized methods are used to e.g. keep an Aggregate Join Index up to date without having to re-aggregate the whole index table with every small change.

However, since Fastload and Multiload are not possible, there is a contradiction here:

If the join index has to be removed before loading and then created again, the advantages mentioned above (for example, optimized aggregation) are not available.

Restrictions for outer-joined tables in the join index are another problem:

CREATE JOIN INDEX  AS
SELECT t01.PK, t01.COL_A,t02.COL_B
FROM
t01 LEFT OUTER JOIN
t02 ON t01.PK = t02.PK PRIMARY INDEX (COL_B);

Each outer join  in a multi-table join index has to meet the following restrictions:

  • All columns from the left table have to be selected
  • At least one column from the right table requires must be defined as NOT NULL.

Compressed Join Index

Multi-Table Join Indexes can be compressed by putting brackets around a group of columns.  Compression reduces occupied disk space as each group of compressed values is only stored once (and not in each row).

Frequent combinations of column values are therefore good candidates.

CREATE JOIN INDEX  AS
SELECT t01.PK,(t01.COL_A,t02.COL_B)
FROM
t01 LEFT OUTER JOIN
t02 ON t01.PK = t02.PK PRIMARY INDEX (COL_B);

Single Table Join Index 

The name is somehow misleading as no join is involved. Single-Table join indexes are created from exactly one base table.

Their main purpose is to have the same table available with a different primary index, partitioning, or to have a smaller table (the join index table) with fewer columns to be spooled.

Let's assume we have the following table below:

CREATE TABLE  (PK INTEGER, COL_A INTEGER) PRIMARY INDEX (PK);

By creating the below join index, we now have physically stored the same table twice, with two different primary indexes:

CREATE JOIN INDEX  AS
SELECT t01.PK, t01.COL_A
FROM
 t01
PRIMARY INDEX (COL_A);

The join index on with a different primary index gives the Teradata Optimizer an additional data access path.

Aggregate Join Index

Aggregate join indexes are pre-aggregations which are automatically maintained in an optimized way by Teradata (only the required rows are re-aggregated). An Aggregate Join Index can be based on one or multiple tables.

The aggregate join index allows defining a summary table without violating the normalization of the data model.

Below is an example of a single-table aggregate join index:

CREATE JOIN INDEX  AS
SELECT t01.PK, SUM(t01.COL_A)
FROM
 t01
GROUP BY 1
PRIMARY INDEX (PK);

Aggregate join indexes are limited in the aggregation functions which can be applied: SUM, COUNT, MIN, or MAX.

An aggregate join index is only used if the following conditions are fulfilled:

  • The grouping clause must include all columns in the grouping clause of the query
  • WHERE clause columns that belong to tables that are not defined in the aggregate join index must be part of the join index select list.
  • For an aggregate join index with a PPI, an aggregated column cannot be a partitioning column

The Teradata Optimizer will consider rewriting a query to make use of an existing Aggregate Join Index.

It will consider the usage of an existing Aggregate Join Index in the following scenarios:

  • For the  coverage of queries containing a subquery (derived table spool)
  • For internally created distinct and grouping steps
  • To partly cover outer join queries
  • For partial/early grouping steps

Sparse Join Index

Adding a WHERE condition to the join index turns a given index into a sparse join index.

Only those rows are stored in the join index that fulfills the WHERE condition. This saves space. However, a sparse index only makes sense if the WHERE condition is used in the queries. If not, the join index cannot be used.

A sparse join index consumes less permanent space as only the rows matching the WHERE conditions are stored.

A further advantage is given because the maintenance of the index is optimized:

If a row is to be inserted, it is first checked whether it meets the WHERE condition. Only if the row is to be inserted does meet the WHERE condition criteria, Teradata also checks whether it satisfies the partition expression.

Join Index Summary

Some knowledge and preparation are needed to allow the optimizer to use a join index.

A join index makes sense when queries are stable. In systems with many ad-hoc queries, it is usually difficult to find a set of useful join indexes that are also used by the optimizer.

Be warned when experimenting with join indexes in a production environment as they may have unexpected side effects. I will give you one real-life example:

A daily batch load had a SQL step deleting a huge 6+ billion records table:

DELETE FROM the_huge_table;

It's usually a big deal…. The cylinder index marks the data blocks belonging to the table as free, and that's it.

I overlooked this DELETE step in the load process and added a multi-table join index to the table.

The result was that the subsequent batch load took 20 hours to maintain the join index triggered by the DELETE.

I had to cancel the DELETE, and so did the subsequent ROLLBACK. This is always critical because the table is then in an inconsistent state. But that didn't matter in this case, because the table should be deleted completely.

Translated with www.DeepL.com/Translator

You can use the query below helps to find all join indexes available in a Teradata System:

SELECT * FROM dbc.Indices WHERE indextype='J';

See also:
The Secondary Index in Teradata - The NUSI compendium

This topic was modified 1 month ago 5 times by DWH Pro Admin
This topic was modified 3 weeks ago 97 times by DWH Pro Admin

Quote
Khurram Waziri
 Khurram Waziri
(@Khurram Waziri)
Guest
Joined: 5 years ago
Posts: 3
19/02/2015 5:25 pm  

Hi,

We have a scenario where a table has fields A and B, both Decimal (15,3). An AJI for this table has SUM(A) and SUM(B). When we use SEL SUM(A), SUM(B) FROM explain plan chooses AJI as access path which is ok because index is fully covering, but data is incorrect!
We verified by running SEL SUM(A) as Z1, SUM(B) AS Z2, SUM(Z1 - Z2) as Z3. Now the optimizer accesses base table directly and we get correct values. What could be the reason?


ReplyQuote
Khurram Waziri
 Khurram Waziri
(@Khurram Waziri)
Guest
Joined: 5 years ago
Posts: 3
19/02/2015 5:28 pm  

EDIT: The verification SQL is:

SEL SUM(A) as "Z1", SUM(B) as "Z2", SUM(A - B) AS "Z3" ....


ReplyQuote
Virendra
 Virendra
(@Virendra)
Guest
Joined: 4 years ago
Posts: 7
24/12/2015 2:20 pm  

Hi Roland,

Teradata reads and writes on disk in terms of blocks, that is the lowest measure.
Assuming our data blocks have more than one rows each, consider that we have to delete just one row from a table.
How does Teradata proceed. Does it mark the entire block as free? What about remaining rows of the block.
Does it create another fresh block with only remaining rows and then marks the old block as free?

Regards,
Virendra


ReplyQuote
DWH Pro Admin
(@oshun123)
Member Admin
Joined: 5 years ago
Posts: 94
09/01/2016 10:15 am  

Two data blocks can be merged if the occupied space drops below a certain limit. I am not 100% sure but I think it is the MERGEBLOCKRATIO from the table DDL which can be used to change the default value for block merges.


ReplyQuote
CK
 CK
(@CK)
Guest
Joined: 3 years ago
Posts: 1
06/12/2016 10:37 pm  

DELETE FROM the_huge_table;
Instead of Deleting 6+ billion records every day , i think it is wise to drop the table and recreate the table.


ReplyQuote
Sachin
 Sachin
(@Sachin)
Guest
Joined: 3 years ago
Posts: 1
29/05/2017 10:32 pm  

Thanks a lot for this amazing article.

Can you please elaborate the following statements:

Move vast and time-consuming joins or aggregations into the batch windows.
Join indexes even allow direct access via the primary index of the JI.


ReplyQuote
Paramananda
 Paramananda
(@Paramananda)
Guest
Joined: 1 year ago
Posts: 1
08/09/2018 1:26 pm  

if we have a single table join index(st_jn_idx) on dept_no for emp table.
How to join the dept table and the join single table join index on emp table.

Can we write,
select emp.*
from dept as d join st_jn_idx as e
on d.dept_no=e.dept_no;

Kindly suggest.

Thanks


ReplyQuote
Siyanda B
 Siyanda B
(@Siyanda B)
Guest
Joined: 10 months ago
Posts: 1
04/02/2019 11:58 am  

Failed [5464 : HY000] Error in Join Index DDL, Only satisfiable single-table conditions and equality join conditions are allowed in the ON clause. What causes this error, I am struggling to resolve the because I do not understand the problem. When I execute the query (without CREATE JOIN INDEX statement), the query executes with no issues but when I encapsulate it in a JI it refuses.


ReplyQuote
Share:
>

Please Login or Register