Many data warehouse projects fail because an inappropriate data model is chosen at the start of the project.
I am a proponent of a normalized data model. Unfortunately, storing data in third normal form or higher has fallen out of fashion in recent years. One reason may be that, in the age of Big Data, unstructured data is gaining the upper hand, so even data that is structured by definition often ends up stored in the data warehouse in denormalized form. There are situations in which denormalization in Teradata makes sense. Often, however, denormalization has to serve as a stopgap for wrong decisions made when the physical data model was created.
In this article, however, we will look at denormalization techniques for the cases where it makes sense.
We should always try to create a normalized data model first and start denormalizing only when performance requires it. Denormalization should always be well considered and should never be done intuitively.
The effects of denormalization are hard to predict. We must not forget that denormalization is designed to improve the performance of specific applications; at the same time, we often do not see the impact on all other applications until it is too late. It is unforgivable, however, if even the logical data model is not normalized.
Effects of denormalization
- Denormalization can improve the performance of the target application but degrade the performance of other applications. So we do precisely the opposite of what we want to achieve.
- A denormalized data model makes it challenging to develop new applications because a specialized model for a particular application usually does not fit others.
- Denormalization leads to DML (insert, update) anomalies. This is exactly what we wanted to prevent with a 3NF data model.
- Denormalization leads to higher costs of new applications due to increased complexity for all the reasons mentioned earlier.
Now that we know all the disadvantages, let's discuss the techniques used for denormalization.
Repeating Groups
While we avoid repeating groups in a normalized data model (as they violate 1NF), they can help save space and avoid joins. Assume we are designing a table to compare our customers' end-of-month account balances, i.e., each month the balance has to be compared with the previous month's balance.
The preferred method would be to store the information in a normalized table, keeping one row per month and customer.
To bring the required rows together, we either do a self-join or use an ordered analytic function. If performance turns out to be insufficient, we might improve it with repeating groups: we would store exactly one row per customer, holding each month's balance in its own column (BalanceJanuary, BalanceFebruary, …, BalanceDecember).
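A minimal sketch of the two designs in Teradata SQL (table and column names are illustrative assumptions, not from any real model):

```sql
-- Normalized design (preferred): one row per customer and month
CREATE TABLE AccountBalance (
    CustomerId   INTEGER NOT NULL,
    BalanceMonth DATE    NOT NULL,
    Balance      DECIMAL(18,2)
) PRIMARY INDEX (CustomerId);

-- Month-over-month comparison via an ordered analytic function
SELECT CustomerId, BalanceMonth, Balance,
       Balance - MIN(Balance) OVER (PARTITION BY CustomerId
                                    ORDER BY BalanceMonth
                                    ROWS BETWEEN 1 PRECEDING
                                             AND 1 PRECEDING) AS DeltaToPreviousMonth
FROM AccountBalance;

-- Denormalized design with a repeating group: one row per customer
CREATE TABLE AccountBalanceDenorm (
    CustomerId      INTEGER NOT NULL,
    BalanceJanuary  DECIMAL(18,2),
    BalanceFebruary DECIMAL(18,2),
    -- ... one column per remaining month ...
    BalanceDecember DECIMAL(18,2)
) PRIMARY INDEX (CustomerId);
```

In the denormalized table, comparing two months becomes simple column arithmetic within a single row; no self-join or window function is needed.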
Advantages of Repeating Groups
- Improved performance
- Reduced disk space usage
- No need for ordered analytic functions
Disadvantages of Repeating Groups
- The DDL may need to be extended if not enough columns were provided; as a consequence, the applications also have to be adapted
- Reporting options are limited, as the structure fits only one specific application
Prejoins of Tables or Join Indexes
Prejoining of tables can be achieved in two different ways.
While we could do manual prejoins in our ETL/ELT process and store the result in a table, Teradata offers another method: join indexes.
Both kinds of prejoins come with their advantages and disadvantages. The problem with join indexes is that it often requires a lot of experience to persuade the Teradata optimizer to use them. Still, as join indexes are maintained by the system, they can be an excellent choice to support our applications ("set and forget").
I recommend designing manual prejoins for my ETL/ELT tasks (as the manual joins ensure a stable performance) and using join indexes to support reporting tools (such as Microstrategy, Business Objects, etc.).
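As an illustration, a prejoin join index might look like this (the Customer and Account tables and their columns are assumptions for the example):

```sql
-- System-maintained prejoin of customer and account data.
-- The optimizer may rewrite queries against the base tables
-- to read this join index instead of performing the join,
-- provided the join index covers the query.
CREATE JOIN INDEX JI_CustomerAccount AS
SELECT c.CustomerId, c.CustomerName, a.AccountId, a.Balance
FROM Customer c
JOIN Account a
  ON c.CustomerId = a.CustomerId
PRIMARY INDEX (CustomerId);
```

Teradata maintains the join index automatically whenever the base tables change; whether the optimizer actually uses it depends on query coverage and up-to-date statistics.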
Derived Data
Another common denormalization approach is to store pre-calculated (derived) data.
We would choose this approach if calculating the required information is expensive and done frequently. For example, we may store the average account balance calculated over the last 12 months. This would eliminate an expensive ordered analytic function otherwise needed to calculate this information.
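A sketch of this approach, assuming a hypothetical AvgBalance12M column on the Customer table and the AccountBalance table from earlier:

```sql
-- Hypothetical denormalized column holding the pre-calculated
-- 12-month average balance
ALTER TABLE Customer ADD AvgBalance12M DECIMAL(18,2);

-- Refreshed periodically by the load process
-- (Teradata's extended UPDATE ... FROM syntax)
UPDATE Customer
FROM (SELECT CustomerId, AVG(Balance) AS AvgBal
      FROM AccountBalance
      WHERE BalanceMonth >= ADD_MONTHS(CURRENT_DATE, -12)
      GROUP BY CustomerId) src
SET AvgBalance12M = src.AvgBal
WHERE Customer.CustomerId = src.CustomerId;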
As with prejoins, there are alternatives for derived data that do not denormalize the tables: aggregate join indexes, global temporary tables, and views in which the values are derived. These methods are preferred if we can achieve satisfactory performance with them.
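An aggregate join index for the same average-balance case might be sketched as follows (table and column names are assumptions; note that in older Teradata releases AVG itself is not permitted in an aggregate join index, so SUM and COUNT are stored and the optimizer derives the average from them):

```sql
-- System-maintained aggregate join index
CREATE JOIN INDEX JI_AvgBalance AS
SELECT CustomerId,
       SUM(Balance)   AS SumBalance,
       COUNT(Balance) AS CntBalance
FROM AccountBalance
GROUP BY CustomerId
PRIMARY INDEX (CustomerId);
```

Unlike the manually maintained column, this aggregate stays consistent with the base table automatically, at the cost of maintenance overhead on every load.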
Volatile Tables and Global Temporary Tables instead of Denormalization
If we cannot avoid denormalization techniques to achieve the required performance, Global Temporary Tables and Volatile Tables are often good alternatives to denormalizing the PDM. We can perform the denormalization in these temporary database objects while the PDM itself remains untouched. This saves us from repeating groups, derived data, and prejoins in the PDM.
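For example, the prejoin from earlier could live in a volatile table instead of the PDM (table and column names are assumptions):

```sql
-- Session-local denormalized work table; it disappears when the
-- session ends, so the PDM itself stays normalized.
CREATE VOLATILE TABLE VT_CustomerAccount AS (
    SELECT c.CustomerId, c.CustomerName, a.AccountId, a.Balance
    FROM Customer c
    JOIN Account a
      ON c.CustomerId = a.CustomerId
) WITH DATA
PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;
```

A Global Temporary Table works similarly, except its definition is permanent in the data dictionary while its contents remain private to each session.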
While the above techniques can help solve performance issues, they should never be our first instrument of choice. Teradata offers many excellent tuning tools, such as row partitioning, secondary indexes, and join indexes. We should use these tools first to eliminate performance problems. Classical denormalization should be considered only if these techniques can't deliver the required result.
We should never denormalize our data because it is the most straightforward and convenient approach from an implementation point of view. The disadvantages, such as data redundancy and the risk of inconsistencies, will cause enormous costs in a later stage of the project.
Many of the performance issues I had to solve were caused by “intuitive denormalization” (i.e., denormalization as the “path of least resistance”).