The Teradata flavor of SQL is still, in principle, a declarative language. Hence, there can be multiple ways to describe an SQL query and achieve the same result. While the answer is the same, Teradata may use a completely different execution plan based on how the query is expressed. One approach is investing in heavy COLLECT STATISTICS to reduce resource consumption. However, many times a simple rewrite of the query can achieve enormous performance improvements. In this article, we discuss two such cases revolving around duplicates and calendars.

Show duplicate records

Sooner or later, we will face the case that a table or a query produces duplicate values. While the root causes are beyond the scope here, more often than not, we need to examine these records and see what could be wrong. A common approach is using a GROUP BY construct:

SELECT b.*FROM badTable bWHERE (col1, col2, col3, ...) IN (SELECT col1, col2, col3, ...FROM badTable cGROUP BY col1, col2, col3, ...HAVING count(*) > 1)

This approach requires parsing the table twice, once to count occurrences and another one to return the matching data. From an expressiveness point of view, it is not directly evident what the query intention is. Contrast with the following approach:

SELECT b.*FROM badTable bQUALIFY count(*) OVER (PARTITION BY col1, col2, col3, ...) > 1 

The second query should require a single table parse, and its aim is more evident: Select all records that (col1, col2, col3, …) occur more than once in the set. Hence, they are duplicates. We take advantage of the extra processing layer of QUALIFY that is available “for free” and with less code, we get results with less processing.

Business calendar dates

Many companies use a so-called “business calendar” instead of the Gregorian one, marking specific dates as working or non-working (e.g. because the company observes a national or religious holiday). Then, concepts like “last working day of a month” or “first working day of next month” deviate from one of the conventional calendars. Assume the question is formulated as “starting with a given (reference) date, report the next working day after it“. One approach we have seen goes like this:


SELECT b.CAL_DT, Min(n.CAL_DT) AS NEXT_WORKING_DTFROM businessCalendar bCROSS JOIN businessCalendar nWHERE n.IS_WORKING_DAY = 'Y'AND n.CAL_DT > b.CAL_DTGROUP BY b.CAL_DT

This approach involves a product join and a grouping instead of accessing the same table twice. Can we do better? Yes! We can just let Teradata handle all this logic like internally:

SELECT b.CAL_DT, LEAD(CASE WHEN b.IS_WORKING_DAY = 'Y' THEN b.CAL_DT END IGNORE NULLS) AS NEXT_WORKING_DT
FROM businessCalendar b

The second version is more readable (subjective opinion) and, thus, more maintainable. The execution plan is objectively more performant. It takes a single table pass, and the intermediate spool remains in the size of the underlying table (about 75,000 rows covering 1900-2100 compared to 5.4 billion needed for the product join approach).

Conclusion

QUALIFY offers an additional query processing layer, which often goes unexplored. Still, it is not a panacea for all problems. The execution plan and available data demographics should always be checked to avoid unpleasant No more spool errors.

The way the question is posed can lead to SQL queries that are unnecessarily complex and result in inefficient execution plans. Rather than investing effort to tune a sub-optimal query, it is worth taking a step back and examining how the original question can be reformulated. Maybe the SQL can be rewritten more cleanly and compactly, saving us a ton.

Do you have an example to share? We’ll be glad to read your ideas and comments below!

  • Nice, I like this 😉 Congrats!

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >