How Join Indexes Can Optimize Performance in a Normalized Data Model

DWH Pro Admin

May 2, 2023

minutes reading time

A normalized data model can increase the complexity of creating performant queries due to the higher number of tables that must be linked compared to a denormalized data model. It is essential to select a primary index precisely to optimize queries and joins, enabling them to have a direct access path. However, relationship tables often encounter the issue that only one side of the relationship has an optimized primary index access path, leading to the other side conducting a full table scan if no alternatives are available. A join index can be advantageous in optimizing performance with relationship tables. For instance, the table below shows an example where the LeftId column serves as the primary index:

CREATE MULTISET TABLE DWHPRO.TheRelationshipTable
(
    LeftId BIGINT NOT NULL,
    RightId BIGINT NOT NULL,
    RelationTypeCd CHAR(10)
PRIMARY INDEX ( LeftId )
PARTITION BY 
;

Singe-AMP Primary Index can be accessed through Column LeftId, not Column RightId. A query with a WHERE condition on Column RightId will prompt a complete table scan.

SELECT * FROM DWHPRO.TheRelationshipTable WHERE RightId = 1000;

EXPLAIN SELECT * FROM DWHPRO.TheRelationshipTable WHERE RightId = 1000;

1) First, we lock DWHPRO.TheRelationshipTable in TD_MAP1 for
read on a reserved RowHash in all partitions to prevent global deadlock.
2) Next, we lock DWHPRO.TheRelationshipTable in TD_MAP1 for read. 
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from 
DWHPRO.TheRelationshipTable by way of an all-rows scan with a condition of ("DWHPRO.TheRelationshipTable.RightId = 10000") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 9 rows (1,611 bytes).  The estimated time for this step is 0.08 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.08 seconds.

To optimize performance, consider using a join index on the RightId columns. Instead of copying the entire table, it is recommended to include only the ROWID of the base table in the join index. This approach is particularly useful for wide tables with numerous columns. The join index enables the optimizer to join back to the base table and retrieve additional required columns. Additionally, it occupies minimal storage space.

CREATE JOIN INDEX DWHPRO.JI_RELATIONSHIP
AS
SELECT
    (RightId,RelationTypeCd), -- Compressed
    (ROWID)
FROM
    DWHPRO.TheRelationshipTable
PRIMARY INDEX (RightId);

After creating the join index, we rerun the previous query and observe that the optimizer utilizes the join index, querying it via the primary index, resulting in the desired primary index access. In our example, as we query the RelationTypeCd column, the optimizer must rejoin the base table using ROWID to retrieve it. However, if the query seeks the RightId column solely, the optimizer could skip this step as the join index would provide coverage.

Explain SELECT * FROM DWHPRO.TheRelationshipTable
WHERE RightId = 10000;
 
1) First, we do a single-AMP RETRIEVE step from
DWHPRO.JI_RELATIONSHIP by way of the primary index
"DWHPRO.JI_RELATIONSHIP.RightId = 10000" with no
residual conditions into Spool 2 (group_amps), which is
redistributed by the rowkey of (DWHPRO.JI_RELATIONSHIP.Field_1038) to few AMPs in TD_Map1.
Then we do a SORT to order Spool 2 by the sort key in spool field1. The size of Spool 2 is estimated with high confidence to be 9 rows (450 bytes). The estimated time for this step is 0.00 seconds.
2) Next, we do a group-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way of an all-rows scan, which is joined to
DWHPRO.TheRelationshipTable by way of an all-rows scan with
no residual conditions.  Spool and DWHPRO.TheRelationshipTable are joined using a rowkey-based
row id join, with a join condition of ("Field_1 =   DWHPRO.TheRelationshipTable.ROWID"). The result goes into
Spool 1 (group_amps), which is built locally on the AMPs. The
size of Spool 1 is estimated with low confidence to be 9 rows (
1,611 bytes).  The estimated time for this step is 0.05 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. 
-> The contents of Spool 1 are sent back to the user as the result of statement 1.  The total estimated time is 0.05 seconds.

We compressed the join index by enclosing columns in brackets, which is particularly useful for relationship tables where one side of the relationship is often skewed.

LeftId	RightId
1	1
2	1
3	1
4	1
5	1

Skewed Relation Table

The compressed join index stores RightId as a value of 1 in a single physical row without creating a skewed index.

The compressed join index stores the value 1 for compressed RightId to prevent query skew. In Teradata, all rows with RightId = 1 are sourced from the same or a small number of physical rows in the join index.