Introduction to Teradata Performance and NOT NULL
Welcome to our latest Teradata performance blog post, part of a series designed to provide valuable insights into SQL queries. This article spotlights the ‘NOT NULL’ column attribute and its effect on ‘NOT IN’ queries.
To understand that effect, it is crucial to grasp a frequently neglected SQL database design principle: columns that can never hold NULL values should be explicitly defined as NOT NULL. Despite seeming insignificant, this is a fundamental aspect of efficient database design that significantly affects performance.
Defining columns as NOT NULL guides the Teradata Optimizer, a key system component that devises efficient plans for executing SQL statements. When a column is defined as NOT NULL, the SQL Optimizer can safely exclude the possibility of NULL values, leading to more efficient execution plans and better database performance.
Our Test Scenario
We will take a pragmatic approach as we delve further into this captivating subject. We will create a test scenario that exemplifies the consequences of mishandling NULL values in SQL queries and how it can substantially impact performance.
Consider this experiment a concrete illustration that will showcase the outcomes of bad database design. This principle is not just theoretical or abstract; it can have noteworthy and occasionally adverse effects on the performance and efficiency of SQL queries.
In this exercise, we will emphasize the crucial role of the Teradata Optimizer and its reliance on precise information regarding NULL values to produce optimal execution plans.
By utilizing a hands-on approach, you will gain a concrete understanding of the significance of accurately defining NOT NULL columns. Additionally, you will gain insights into the potential performance issues that may arise from not adhering to this crucial database design principle.
Below is the structure of our test table. It is important to note that this table’s ‘job’ column can contain NULL values.
Teradata Performance – NULL column ‘job’
Initially, we’ll establish a new table. In this table, we will specifically define the ‘job’ column to allow NULL values:

CREATE SET TABLE DWHPRO.PERSONS (
    person_id INTEGER,
    first_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    last_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    salary INTEGER,
    job VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    description VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC,
    gender VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
    dept VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    birthdate DATE FORMAT 'YYYY-MM-DD'
) PRIMARY INDEX (person_id, job);

First, let’s execute this SQL query on the table:
EXPLAIN
SELECT *
FROM DWHPRO.PERSONS
WHERE job NOT IN (SELECT job FROM DWHPRO.PERSONS);

Analyzing the execution plan, we notice that it adapts to the possible presence of NULL values: because the optimizer cannot know at plan time whether the ‘job’ column actually contains NULLs, it prepares two alternative pathways and decides at run time which one to execute.
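The reason for this branching lies in SQL’s three-valued logic: if the subquery result contains even a single NULL, ‘job NOT IN (…)’ evaluates to UNKNOWN for every row, and the query returns nothing. Here is a minimal sketch of that semantic, using SQLite from Python as a stand-in (the table and data are invented for illustration; the NULL handling of NOT IN is the same in Teradata):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (person_id INTEGER, job TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [(1, "Engineer"), (2, "Analyst"), (3, None)],  # one row has a NULL job
)

# The subquery returns a NULL, so 'job NOT IN (..., NULL)' can never
# evaluate to TRUE -- every comparison against NULL yields UNKNOWN.
rows = con.execute(
    "SELECT * FROM persons WHERE job NOT IN (SELECT job FROM persons)"
).fetchall()
print(rows)  # []
```

Because the result flips between “normal exclusion join” and “empty result” depending on the data, the optimizer must keep both code paths alive in the plan.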
1) First, we lock DWHPRO.PERSONS in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.PERSONS in TD_MAP1 for read.
3) We do an all-AMPs SUM step in TD_MAP1 to aggregate from DWHPRO.PERSONS by way of an all-rows scan with no residual conditions, grouping by field1 (DWHPRO.PERSONS.job). Aggregate intermediate results are computed globally, then placed in Spool 10 in TD_Map1. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 10 is estimated with low confidence to be 3,751,314 rows (562,697,100 bytes). The estimated time for this step is 1 minute and 43 seconds.
4) We do an all-AMPs SUM step in TD_MAP1 to aggregate from DWHPRO.PERSONS by way of an all-rows scan with no residual conditions. Aggregate intermediate results are computed globally, then placed in Spool 5 in TD_Map1. The input table will not be cached in memory, but it is eligible for synchronized scanning.
5) We do an all-AMPs SUM step in TD_Map1 to aggregate from Spool 10 by way of an all-rows scan. Aggregate intermediate results are computed globally, then placed in Spool 12 in TD_Map1.
6) We execute the following steps in parallel.
   1) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 12 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is duplicated on all AMPs in TD_Map1.
   2) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 5 (Last Use) by way of an all-rows scan into Spool 4 (all_amps), which is duplicated on all AMPs in TD_Map1.
7) We execute the following steps in parallel.
   1) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.PERSONS by way of an all-rows scan with no residual conditions into Spool 7 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 7 by row hash. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory, and null value information in Spool 4 and Spool 3. Skip this retrieve step if null exists. The size of Spool 7 is estimated with high confidence to be 5,001,752 rows (2,525,884,760 bytes). The estimated time for this step is 4 minutes and 11 seconds.
   2) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 10 by way of an all-rows scan into Spool 9 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 9 by row hash, and null value information in Spool 4 and Spool 3. Skip this retrieve step if null exists. The size of Spool 9 is estimated with low confidence to be 5,001,752 rows (240,084,096 bytes).
8) We execute the following steps in parallel.
   1) We do an all-AMPs JOIN step in TD_Map1 from Spool 7 (Last Use) by way of an all-rows scan, which is joined to Spool 9 (Last Use) by way of an all-rows scan. Spool 7 and Spool 9 are joined using an exclusion merge join, with a join condition of ("job = job"), and null value information in Spool 4 and Spool 3. Skip this join step if null exists. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 5,001,752 rows (2,565,898,776 bytes). The estimated time for this step is 1 minute and 27 seconds.
   2) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.PERSONS by way of an all-rows scan with no residual conditions into Spool 14 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 14 by row hash. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory, and null value information in Spool 4 and Spool 3. Skip this retrieve step if there is no null. The size of Spool 14 is estimated with high confidence to be 5,001,752 rows (2,525,884,760 bytes). The estimated time for this step is 4 minutes and 11 seconds.
   3) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 10 (Last Use) by way of an all-rows scan into Spool 15 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 15 by row hash, and null value information in Spool 4 and Spool 3. Skip this retrieve step if there is no null. The size of Spool 15 is estimated with low confidence to be 5,001,752 rows (240,084,096 bytes). The estimated time for this step is 1 minute and 49 seconds.
9) We do an all-AMPs JOIN step in TD_Map1 from Spool 14 (Last Use) by way of an all-rows scan, which is joined to Spool 15 (Last Use) by way of an all-rows scan. Spool 14 and Spool 15 are joined using an exclusion merge join, with a join condition of ("job = job"), and null value information in Spool 4 (Last Use) and Spool 3 (Last Use). Skip this join step if there is no null. The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 5,001,752 rows (2,565,898,776 bytes). The estimated time for this step is 1 minute and 27 seconds.
10) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1.

Teradata Performance – NOT NULL column ‘job’
Moving forward, we’ll duplicate the original table containing identical data. However, this time, we will explicitly define the ‘job’ column as NOT NULL. This modification ensures that this column will not contain any NULL values:
CREATE SET TABLE DWHPRO.PERSONS_NOT_NULL (
    person_id INTEGER,
    first_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    last_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    salary INTEGER,
    job VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
    description VARCHAR(1000) CHARACTER SET LATIN NOT CASESPECIFIC,
    gender VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
    dept VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
    birthdate DATE FORMAT 'YYYY-MM-DD'
) PRIMARY INDEX (person_id, job);

We will execute the same SQL query we used previously on our newly created table and analyze the execution plan:
1) First, we lock DWHPRO.PERSONS_NOT_NULL in TD_MAP1 for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.PERSONS_NOT_NULL in TD_MAP1 for read.
3) We execute the following steps in parallel.
   1) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.PERSONS_NOT_NULL by way of an all-rows scan with no residual conditions into Spool 3 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS_NOT_NULL.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 3 by row hash. The result spool file will not be cached in memory. The size of Spool 3 is estimated with high confidence to be 5,003,864 rows (2,526,951,320 bytes). The estimated time for this step is 4 minutes and 4 seconds.
   2) We do an all-AMPs SUM step in TD_MAP1 to aggregate from DWHPRO.PERSONS_NOT_NULL by way of an all-rows scan with no residual conditions, grouping by field1 (DWHPRO.PERSONS_NOT_NULL.job). Aggregate intermediate results are computed globally, then placed in Spool 6 in TD_Map1. The size of Spool 6 is estimated with no confidence to be 3,752,898 rows (562,934,700 bytes). The estimated time for this step is 3 minutes and 19 seconds.
4) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 6 (Last Use) by way of an all-rows scan into Spool 5 (all_amps), which is redistributed by the hash code of (DWHPRO.PERSONS_NOT_NULL.job) to all AMPs in TD_Map1. Then we do a SORT to order Spool 5 by row hash. The size of Spool 5 is estimated with no confidence to be 5,003,864 rows (240,185,472 bytes).
5) We do an all-AMPs JOIN step in TD_Map1 from Spool 3 (Last Use) by way of an all-rows scan, which is joined to Spool 5 (Last Use) by way of an all-rows scan. Spool 3 and Spool 5 are joined using an exclusion merge join, with a join condition of ("job = job"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with index join confidence to be 5,003,864 rows (2,566,982,232 bytes). The estimated time for this step is 1 minute and 8 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1.

Conclusion
As we wrap up this enlightening exploration of SQL queries and NOT NULL column specifications, we can make significant observations based on our test scenario.
The execution plan for the table that permits NULL values in the ‘job’ column is notably more intricate: the optimizer must prepare two alternative join paths and choose between them at run time. The more complex a plan becomes, the greater the chance of an inefficient execution.
We recommend explicitly defining a column as NOT NULL when it cannot contain NULL values. This practice enhances the efficiency and effectiveness of the SQL Optimizer, resulting in streamlined query performance.
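When the table definition cannot be changed, a common NULL-safe alternative is to rewrite NOT IN as NOT EXISTS, which is not tripped up by NULLs in the subquery. The sketch below uses SQLite from Python with invented tables (‘persons’, ‘open_jobs’) purely for illustration; the rewrite itself is standard SQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (person_id INTEGER, job TEXT)")
con.execute("CREATE TABLE open_jobs (job TEXT)")
con.executemany("INSERT INTO persons VALUES (?, ?)",
                [(1, "Engineer"), (2, "Analyst")])
con.executemany("INSERT INTO open_jobs VALUES (?)",
                [("Engineer",), (None,)])  # the NULL poisons NOT IN

# NOT IN: the NULL in open_jobs makes the predicate UNKNOWN for every row.
not_in = con.execute(
    "SELECT person_id FROM persons "
    "WHERE job NOT IN (SELECT job FROM open_jobs)"
).fetchall()

# NOT EXISTS: a NULL in the subquery simply never matches, so the
# intended row is returned.
not_exists = con.execute(
    "SELECT person_id FROM persons p "
    "WHERE NOT EXISTS (SELECT 1 FROM open_jobs o WHERE o.job = p.job)"
).fetchall()

print(not_in)      # []
print(not_exists)  # [(2,)]
```

Note that the two forms are only equivalent when the subquery column is NULL-free, which is exactly what a NOT NULL constraint guarantees to the optimizer.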
In our test scenario, we observed the impact of omitting the NOT NULL column definition: the query against the table lacking it consumed substantially more IOs and CPU time. This comparison emphasizes how a seemingly minor detail in database design can significantly affect resource usage and Teradata performance:
| Query | IO | CPU seconds |
|---|---|---|
| Job column NULL | 10677.0 | 16.832 |
| Job column NOT NULL | 5447.0 | 10.08 |
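The relative savings can be computed directly from the measurements in the table above:

```python
# Measurements from the test scenario (IOs and CPU seconds per query).
io_null, io_not_null = 10677.0, 5447.0
cpu_null, cpu_not_null = 16.832, 10.08

io_saving = (io_null - io_not_null) / io_null
cpu_saving = (cpu_null - cpu_not_null) / cpu_null

print(f"IO reduced by {io_saving:.0%}")    # IO reduced by 49%
print(f"CPU reduced by {cpu_saving:.0%}")  # CPU reduced by 40%
```

In other words, the single NOT NULL attribute roughly halved the IO and cut CPU consumption by about 40% for this query.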