Teradata Development – Keep it simple, stupid!
Today’s post describes a problem that occurs with ever-increasing frequency, which seems to be related to the over-specialization we face. In my experience, the number of data warehouse allrounders is decreasing rapidly.
While a decade ago, data warehouse teams consisted almost exclusively of people with a deep background in the information technology field, this is no longer valid. Nowadays, companies tend to create very specialized jobs. While this may help reduce costs (I doubt this), it leads to plenty of new problems.
Surprisingly, one of these problems is the development of huge SQL statements. I observed this trend in several of my most recent projects.
The required skill set of Teradata data warehouse developers is often reduced to the knowledge of SQL (and probably some ETL-Tool).
A simplified development chain (or however you would like to call it) usually looks like this:
Business Analyst -> Technical Analyst -> Developer
Usually, the technical analyst packs the business logic into the mapping document, which is passed to the developers.
Unfortunately, as development skills are minimal, many developers tend to create one huge SQL statement out of each mapping without considering, for example, redundancy. I have seen developers blindly copying the same code over and over again.
While this may seem tempting from a maintenance point of view, this approach is far from producing high-performance SQL, often resulting in wrong, redundant code and, in the end, even additional maintenance overhead.
As we all know, the Teradata optimizer’s task is to create the execution plan from the submitted SQL query. Although the optimizer becomes better with each new release of Teradata (at least, I hope so), we have to remember how Teradata decomposes each SQL statement into a sequence of 2-table (2-spool) joins.
While the join of 3 tables allows for 3*2*1 = 6 different possibilities of consecutive 2-table joins, we already have more than 3 million possible 2-table joins (10*9*8*7*6*5*4*3*2*1) in case we have to join ten tables together!
As you can imagine, there is plenty of risk for the optimizer to get things wrong.
The chance of ending up with a good execution plan decreases rapidly with the number of joins!
How can we reduce the risk of a bad execution plan?
There are three straightforward and efficient ways of reducing the risk of bad performance:
- Keep the number of joins small (the obvious solution)
- Split your SQL statements into parts by working with temporary tables (volatile tables, global temporary tables, or even real tables)
- Identify common business logic and maintain it in one place within your code.
You may think about arguing against this idea, mainly because you will lose the intelligent solution of having one SQL statement per mapping. Still, the tips mentioned above pay off quickly (you can be sure, by not following the above rules, somebody else will come at some point in time and take over your bad-performing implementation to improve it)