The Vulnerability of Outer Joins in Business Intelligence Tools: A Simple Test Scenario

Roland Wenzlofsky

April 22, 2023

minutes reading time

Business intelligence tools frequently utilize queries with multiple outer joins. This article presents evidence supporting the vulnerability of outer joins to skewing. We will illustrate that, regardless of whether both tables are skewed, there exists a possibility that the plan step in the initial joins may be biased.

Let me carve out the details based on a simple test scenario:

CREATE TABLE table1 (pk INTEGER, a INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, b INTEGER, c INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, d INTEGER, e INTEGER) PRIMARY INDEX (pk);

SELECT a,b,c,d,e FROM table1 t01 LEFT JOIN table2 t02 ON t01.a = t02.b LEFT JOIN table3 t03 ON t02.c = t03.d;

The Optimizer can generate an execution plan for the aforementioned setup (tables and query) that consists of the subsequent steps:

1. Redistribute table1 on column a, and redistribute table2 on column b (in parallel).

2. Merge join (outer join) the resulting spools (from step 1). The columns of the resulting spool will be padded with NULL values for columns not matching rows.

3. Let’s assume that the resulting spool (i.e., the spool which holds the joined result of table1 and table2) is still not skewed. We will call this spool s(table1,table2) for simplicity.

4. The result spool s(table1,table2) is redistributed on column c.

5. Table3 is redistributed on column d.

The skewing issues will commence as the join of table1 and table2 may result in numerous non-matching rows, causing multiple NULL values to populate column c of the s(table1,table2).

The rowhash value of a NULL value is deterministic and consistent – every NULL value will always hash to the same AMP.

If only a few rows match during the join of table1 and table2, redistributing s(table1,table2) based on column c (necessary for joining with table3) will relocate any unmatched rows to the AMP holding NULL values.

Initially, our situation was ideal with three unskewed tables. However, executing the second outer join resulted in a skewed spool.

Business intelligence tools often generate queries containing many outer joins, which can significantly increase the likelihood of skewing.

Teradata 14.10 employs techniques such as PDPR to prevent join skewing.

Statistics are essential to ensure the availability of PDPR as the Optimizer may not recognize all applicable situations.

When manually coding queries, only use outer joins when business logic dictates.

Avoid using outer joins unnecessarily to prevent losing rows, which has become a prevalent practice.

Take a brief survey and receive the book for free Get instant access

Share0

Tweet0

Share0

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

The Vulnerability of Outer Joins in Business Intelligence Tools: A Simple Test Scenario

Roland Wenzlofsky

Fast multi-file export of Teradata query results using only Teradata SQL Assistant

The Teradata AMP Worker Task

Boost Your Teradata Performance – The Critical Role of NOT NULL Declarations

Optimizing Teradata SQL Queries by Avoiding Full Table Scans and Utilizing Secondary Indexes