Introduction to Teradata Performance and NOT NULL
Welcome to our latest Teradata performance blog post, a series designed to provide valuable insights into SQL queries. This article spotlights ‘NOT IN’.
To delve deeper into ‘NOT IN’, it is crucial to comprehend a frequently neglected SQL database design principle: properly defining columns that cannot hold NULL values as NOT NULL. Despite seeming insignificant, this is a fundamental aspect of efficient database design that significantly affects performance.
Defining columns as NOT NULL guides the Teradata Optimizer, a key system component that devises efficient plans for executing SQL statements. When a column is defined as NOT NULL, the SQL Optimizer can safely exclude the possibility of NULL values, leading to more efficient execution plans and better database performance.
Our Test Scenario
We will take a pragmatic approach as we delve further into this captivating subject. We will create a test scenario that exemplifies the consequences of mishandling NULL values in SQL queries and how it can substantially impact performance.
Consider this experiment a concrete illustration that will showcase the outcomes of bad database design. This principle is not just theoretical or abstract; it can have noteworthy and occasionally adverse effects on the performance and efficiency of SQL queries.
In this exercise, we will emphasize the crucial role of the Teradata Optimizer and its reliance on precise information regarding NULL values to produce optimal execution plans.
By utilizing a hands-on approach, you will gain a concrete understanding of the significance of accurately defining NOT NULL columns. Additionally, you will gain insights into the potential performance issues that may arise from not adhering to this crucial database design principle.
Below is the structure of our test table. It is important to note that this table’s ‘job’ column can contain NULL values.
Teradata Performance – NULL column ‘job’
Initially, we'll establish a new table. In this table, we will specifically define the 'job' column to allow NULL values:
CREATE SET TABLE DWHPRO.PERSONS
(
person_id INTEGER,
first_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
last_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
salary INTEGER,
job VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
description VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC,
gender VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
dept VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
birthdate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( person_id ,job );
First, let’s execute this SQL query on the table:
EXPLAIN
SELECT * FROM DWHPRO.PERSONS
WHERE JOB NOT IN (SELECT JOB FROM DWHPRO.PERSONS);
After analyzing the execution plan, we notice that it adapts based on the existence or non-existence of NULL values in either table. The execution plan may follow two different pathways depending on the presence of NULL values.
1) First, we lock DWHPRO.PERSONS in TD_MAP1 for read on a reserved
RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.PERSONS in TD_MAP1 for read.
3) We do an all-AMPs SUM step in TD_MAP1 to aggregate from
DWHPRO.PERSONS by way of an all-rows scan with no residual
conditions, grouping by field1 (DWHPRO.PERSONS.job). Aggregate
intermediate results are computed globally, then placed in Spool
10 in TD_Map1. The input table will not be cached in memory, but
it is eligible for synchronized scanning. The size of Spool 10 is
estimated with low confidence to be 3,751,314 rows (562,697,100
bytes). The estimated time for this step is 1 minute and 43
seconds.
4) We do an all-AMPs SUM step in TD_MAP1 to aggregate from
DWHPRO.PERSONS by way of an all-rows scan with no residual
conditions. Aggregate intermediate results are computed globally,
then placed in Spool 5 in TD_Map1. The input table will not be
cached in memory, but it is eligible for synchronized scanning.
5) We do an all-AMPs SUM step in TD_Map1 to aggregate from Spool 10
by way of an all-rows scan. Aggregate intermediate results are
computed globally, then placed in Spool 12 in TD_Map1.
6) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 12
(Last Use) by way of an all-rows scan into Spool 3 (all_amps),
which is duplicated on all AMPs in TD_Map1.
2) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 5 (Last
Use) by way of an all-rows scan into Spool 4 (all_amps),
which is duplicated on all AMPs in TD_Map1.
7) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step in TD_MAP1 from
DWHPRO.PERSONS by way of an all-rows scan with no residual
conditions into Spool 7 (all_amps), which is redistributed by
the hash code of (DWHPRO.PERSONS.job) to all AMPs in TD_Map1.
Then we do a SORT to order Spool 7 by row hash. The input
table will not be cached in memory, but it is eligible for
synchronized scanning. The result spool file will not be
cached in memory, and null value information in Spool 4 and
Spool 3. Skip this retrieve step if null exists. The size
of Spool 7 is estimated with high confidence to be 5,001,752
rows (2,525,884,760 bytes). The estimated time for this step
is 4 minutes and 11 seconds.
2) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 10 by
way of an all-rows scan into Spool 9 (all_amps), which is
redistributed by the hash code of (DWHPRO.PERSONS.job) to all
AMPs in TD_Map1. Then we do a SORT to order Spool 9 by row
hash, and null value information in Spool 4 and Spool 3.
Skip this retrieve step if null exists. The size of Spool 9
is estimated with low confidence to be 5,001,752 rows (
240,084,096 bytes).
8) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step in TD_Map1 from Spool 7 (Last
Use) by way of an all-rows scan, which is joined to Spool 9
(Last Use) by way of an all-rows scan. Spool 7 and Spool 9
are joined using an exclusion merge join, with a join
condition of ("job = job"), and null value information in
Spool 4 and Spool 3. Skip this join step if null exists.
The result goes into Spool 1 (group_amps), which is built
locally on the AMPs. The size of Spool 1 is estimated with
index join confidence to be 5,001,752 rows (2,565,898,776
bytes). The estimated time for this step is 1 minute and 27
seconds.
2) We do an all-AMPs RETRIEVE step in TD_MAP1 from
DWHPRO.PERSONS by way of an all-rows scan with no residual
conditions into Spool 14 (all_amps), which is redistributed
by the hash code of (DWHPRO.PERSONS.job) to all AMPs in
TD_Map1. Then we do a SORT to order Spool 14 by row hash.
The input table will not be cached in memory, but it is
eligible for synchronized scanning. The result spool file
will not be cached in memory, and null value information in
Spool 4 and Spool 3. Skip this retrieve step if there is no
null. The size of Spool 14 is estimated with high confidence
to be 5,001,752 rows (2,525,884,760 bytes). The estimated
time for this step is 4 minutes and 11 seconds.
3) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 10
(Last Use) by way of an all-rows scan into Spool 15
(all_amps), which is redistributed by the hash code of (
DWHPRO.PERSONS.job) to all AMPs in TD_Map1. Then we do a
SORT to order Spool 15 by row hash, and null value
information in Spool 4 and Spool 3. Skip this retrieve step
if there is no null. The size of Spool 15 is estimated with
low confidence to be 5,001,752 rows (240,084,096 bytes). The
estimated time for this step is 1 minute and 49 seconds.
9) We do an all-AMPs JOIN step in TD_Map1 from Spool 14 (Last Use) by
way of an all-rows scan, which is joined to Spool 15 (Last Use) by
way of an all-rows scan. Spool 14 and Spool 15 are joined using
an exclusion merge join, with a join condition of ("job = job"),
and null value information in Spool 4 (Last Use) and Spool 3 (Last
Use). Skip this join step if there is no null. The result goes
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with index join confidence to be
5,001,752 rows (2,565,898,776 bytes). The estimated time for this
step is 1 minute and 27 seconds.
10) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
Teradata Performance – NOT NULL column ‘job’
Moving forward, we’ll duplicate the original table containing identical data. However, this time, we will explicitly define the ‘job’ column as NOT NULL. This modification ensures that this column will not contain any NULL values:
CREATE SET TABLE DWHPRO.PERSONS_NOT_NULL
(
person_id INTEGER,
first_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
last_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
salary INTEGER,
job VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
description VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC,
gender VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
dept VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
birthdate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( person_id ,job );
We will execute the SQL query on our newly created table and analyze the execution plan. This is the same query we used previously.
1) First, we lock DWHPRO.PERSONS_NOT_NULL in TD_MAP1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.PERSONS_NOT_NULL in TD_MAP1 for read.
3) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step in TD_MAP1 from
DWHPRO.PERSONS_NOT_NULL by way of an all-rows scan with no
residual conditions into Spool 3 (all_amps), which is
redistributed by the hash code of (
DWHPRO.PERSONS_NOT_NULL.job) to all AMPs in TD_Map1. Then we
do a SORT to order Spool 3 by row hash. The result spool
file will not be cached in memory. The size of Spool 3 is
estimated with high confidence to be 5,003,864 rows (
2,526,951,320 bytes). The estimated time for this step is 4
minutes and 4 seconds.
2) We do an all-AMPs SUM step in TD_MAP1 to aggregate from
DWHPRO.PERSONS_NOT_NULL by way of an all-rows scan with no
residual conditions, grouping by field1 (
DWHPRO.PERSONS_NOT_NULL.job). Aggregate intermediate results
are computed globally, then placed in Spool 6 in TD_Map1.
The size of Spool 6 is estimated with no confidence to be
3,752,898 rows (562,934,700 bytes). The estimated time for
this step is 3 minutes and 19 seconds.
4) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 6 (Last Use)
by way of an all-rows scan into Spool 5 (all_amps), which is
redistributed by the hash code of (DWHPRO.PERSONS_NOT_NULL.job) to
all AMPs in TD_Map1. Then we do a SORT to order Spool 5 by row
hash. The size of Spool 5 is estimated with no confidence to be
5,003,864 rows (240,185,472 bytes).
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 3 (Last Use) by
way of an all-rows scan, which is joined to Spool 5 (Last Use) by
way of an all-rows scan. Spool 3 and Spool 5 are joined using an
exclusion merge join, with a join condition of ("job = job"). The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs. The size of Spool 1 is estimated with index join
confidence to be 5,003,864 rows (2,566,982,232 bytes). The
estimated time for this step is 1 minute and 8 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
Conclusion
As we wrap up this enlightening exploration of SQL queries and NOT NULL column specifications, we can make significant observations based on our test scenario.
The execution plan for the table that permits NULL values in the ‘job’ column is notably more intricate. As the query becomes more complex, there is a greater chance of generating an inefficient execution plan.
We recommend explicitly defining a column as NOT NULL when it cannot contain NULL values. This practice enhances the efficiency and effectiveness of the SQL Optimizer, resulting in streamlined query performance.
In our test scenario, we observed the impact of omitting the NOT NULL column definition. The query lacking this definition substantially increased IOs and CPU usage. This comparison emphasizes how a seemingly minor detail in database design can significantly affect resource usage and Teradata performance:
Query | IO | CPU seconds |
---|---|---|
Job column NULL | 10677.0 | 16.832 |
Job column NOT NULL | 5447.0 | 10.08 |