A normalized data model usually makes it more complex to create performant queries because there are more tables to link than in a denormalized data model. We have to choose precisely one primary index to optimize our queries and joins, i.e., give them a direct access path. Therefore, with relationship tables, we usually have the problem that only one side of the relationship has optimized primary index access. Unfortunately, the other side usually has to perform a full table scan if we can’t provide alternatives. With relationship tables, a join index can be beneficial in optimizing performance. Below is an example table. We have chosen the column LeftId as the primary index:
CREATE MULTISET TABLE DWHPRO.TheRelationshipTable
(
LeftId BIGINT NOT NULL,
RightId BIGINT NOT NULL,
RelationTypeCd CHAR(10)
PRIMARY INDEX ( LeftId )
PARTITION BY
;
Singe-AMP Primary Index access is possible via the Column LeftId, but not for the Column RightId. A query with a WHERE condition on column RightId triggers a full table scan:
SELECT * FROM DWHPRO.TheRelationshipTable WHERE RightId = 1000;
EXPLAIN SELECT * FROM DWHPRO.TheRelationshipTable WHERE RightId = 1000;
1) First, we lock DWHPRO.TheRelationshipTable in TD_MAP1 for
read on a reserved RowHash in all partitions to prevent global deadlock.
2) Next, we lock DWHPRO.TheRelationshipTable in TD_MAP1 for read.
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
DWHPRO.TheRelationshipTable by way of an all-rows scan with a condition of ("DWHPRO.TheRelationshipTable.RightId = 10000") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 9 rows (1,611 bytes). The estimated time for this step is 0.08 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.08 seconds.
A possible solution to optimize performance is to use a join index on the columns RightId. You could make a copy of the whole table, and in our example table, this would probably be ok due to the few columns, but for wide tables with many columns, the better method is to include only the ROWID of the base table. Such a join index allows the optimizer to join back to the base table and fetch more needed columns. At the same time, the join index is small and doesn’t take much space.:
CREATE JOIN INDEX DWHPRO.JI_RELATIONSHIP
AS
SELECT
(RightId,RelationTypeCd), -- Compressed
(ROWID)
FROM
DWHPRO.TheRelationshipTable
PRIMARY INDEX (RightId);
After creating the join index, we run the query from before again. As we can see, the optimizer uses the join index and queries it using the primary index. Primary index access is what we wanted to achieve. Since we are also querying the column RelationTypeCd in our example, the optimizer must join back to the base table using ROWID to retrieve it. If the query only needed the column RightId the Optimizer could omit this step because the join index would be covering:
Explain SELECT * FROM DWHPRO.TheRelationshipTable
WHERE RightId = 10000;
1) First, we do a single-AMP RETRIEVE step from
DWHPRO.JI_RELATIONSHIP by way of the primary index
"DWHPRO.JI_RELATIONSHIP.RightId = 10000" with no
residual conditions into Spool 2 (group_amps), which is
redistributed by the rowkey of (DWHPRO.JI_RELATIONSHIP.Field_1038) to few AMPs in TD_Map1.
Then we do a SORT to order Spool 2 by the sort key in spool field1. The size of Spool 2 is estimated with high confidence to be 9 rows (450 bytes). The estimated time for this step is 0.00 seconds.
2) Next, we do a group-AMPs JOIN step in TD_Map1 from Spool 2 (Last Use) by way of an all-rows scan, which is joined to
DWHPRO.TheRelationshipTable by way of an all-rows scan with
no residual conditions. Spool and DWHPRO.TheRelationshipTable are joined using a rowkey-based
row id join, with a join condition of ("Field_1 = DWHPRO.TheRelationshipTable.ROWID"). The result goes into
Spool 1 (group_amps), which is built locally on the AMPs. The
size of Spool 1 is estimated with low confidence to be 9 rows (
1,611 bytes). The estimated time for this step is 0.05 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.05 seconds.
As you may have noticed, we have compressed the join index by putting columns into brackets. Especially with relationship tables, it often makes sense since one side of the relationship is often skewed:
LeftId | RightId |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 1 |
The compressed join index can store the value 1 for RightId compressed in one physical row. No skewed index is created.
The compressed join index can then store the value 1 for RightId compressed. This way, the query is not skewed. All rows with RightId = 1 Teradata serve from the same or a few physical rows of the join index.