Teradata Development – Keep it simple, stupid!
Today’s post describes a problem which occurs with ever increasing frequency, and it seems to be related to the over-specialization we are facing. In my experience, the number of data warehouse allrounders is decreasing rapidly.
While a decade ago, data warehouse teams consisted almost exclusively of people with a deep background in the field of information technology, this is not valid anymore. Nowadays, companies tend to create very specialized jobs. While this may help in reducing costs (I doubt this), it leads to a plenty of new problems.
Surprisingly, one of this problems is the development of huge SQL statements. I was able to observe this trend in several of my most recent projects.
The required skill set of Teradata data warehouse developers is often being reduced to the knowledge of SQL (and probably some ETL-Tool).
A simplified development chain (or however you would like to call it) usually looks like this:
Business Analyst -> Technical Analyst -> Developer
Usually, the technical analyst packs the business logic into the mapping document, which is passed forward to the developers.
Unfortunately, as development skills are very limited, many developers tend to create one huge SQL statement out of each mapping, without considering, for example, redundancy. I have seen developers blindly copying the same code over and over again.
While this may seem tempting from a maintenance point of view, this approach is far away from producing high-performance SQL, most times resulting in wrong, redundant code and at the end even additional maintenance overhead.
As we all know, the Teradata optimizer’s task is to create the execution plan from the submitted SQL query. Although the optimizer becomes better with each new release of Teradata (at least I hope so), we have to keep in mind how Teradata decomposes each SQL statement into a sequence of 2-table (2-spool) joins.
While the join of 3 tables allows for 3*2*1 = 6 different possibilities of consecutive 2-table joins, we already have more than 3 million possible 2-table joins (10*9*8*7*6*5*4*3*2*1) in case we have to join ten tables together!
As you can imagine, there is plenty of risk for the optimizer to get things wrong.
The chance to end up with a good execution plan decreases rapidly with the number of joins!
How can we reduce the risk to end up with a bad execution plan?
There are three straightforward and efficient ways of reducing the risk of bad performance:
- Keep the number of joins small (the obvious solution)
- Split your SQL statements into parts by working with temporary tables (volatile tables, global temporary tables or even real tables)
- Identify common business logic and maintain it at one place within your code.
You may think about arguing against this idea, mainly because you will lose the smart solution of having one SQL statement per mapping. Still, tips mentioned above pay off quickly (you can be sure, by not following above rules, somebody else will come at some point in time and take over your bad performing implementation to improve it)