Teradata Denormalization – Don’t trust your intuition!
From my experience, I can tell that many data warehouse implementations fail because of a wrong design choice at the begin of the project.
I myself am an impassioned advocate of the normalized data warehouse. Unfortunately I have to admit, that keeping a normalized core data model is not always a simple task. Bad performance may force us to denormalize (and this is usually the only reason for me to consider denormalization)
In my opinion it is always a good idea to start with a normalized core data model and deviate from this approach only later in the modeling process for above mentioned reason. Denormalization should always take place based on a good design, and should never be kind of an intuitive approach.
In the following section, we will discuss some common denormalization techniques.
While you would avoid repeating groups in a normalized data model (as they violate 1NF), they can be helpful in saving space and to avoid joins. Assume you are designing a table which is used to compare the end of month account balances of your customers, i.e. each month the balance has to be compared with the previous month’s balance.
Of course, you could store (the prefered method) the information in a normalized table, keeping one row per month and customer.
As you can imagine, to bring the required pair of rows together, you either need to do a self-join or have to use an ordered analytic function. Bad performance might force you to introduce repeating groups. You would store exactly one row per customer holding the balances of each month in a different column.
Pre-Join of Tables
Pre-Joining of tables can be achieved in two different ways.
While you could simply do manual pre-join in your ETL/ELT process, and store the result in a table, Teradata offers another method, the Join Indexes.
Both kind of pre-joins come with their own advantages and disadvantages. The problem with Join Indexes is, that it often requires a lot of time and patience to convince the Teradata optimizer to use the join index; Still, as Join Indexes are system maintained they might be an excellent choice to support your applications (“set and forget”).
Personally, I prefer to design manual pre-joins for my ETL/ELT tasks (as the manual joins ensure a stable performance), and to use Join Indexes to support the reporting tools (such as Microstrategy, Business Objects, etc.).
Another common denormalization approach is to store pre-calculated data.
You would choose this approach if the calculation of the required information is expensive and used often. For example, you may store the average account balance calculated over the last 12 month together. This would get rid of an expensive ordered analytic function otherwise needed to calculate this information.
While above techniques can be helpful to solve performance issues, they never should be your first instrument of choice. Teradata offers many great tuning tools, such as row partitioning, secondary indexes, join indexes, etc. This tools should be used primarily to get rid of performance problems. Only if they can’t deliver the required result, classical denormalization should be considered.
You should never denormalize your data, just because it is the easiest approach and convenient. The associated problems, such as data redundancy and the risk of inconsistencies, can cause enormous costs in a later stage of the project – and usually, this will happen.
Many of the performance issues I had to solve during the last couple of years were caused by “intuitive denormalization” (i.e. denormalization as the “path of least resistance”).
I don’t judge about which approach is the preferred one when it comes to Inmon or Kimball.
Personally, I love normalized data models as a starting point, as in my experience they help to avoid a lot of errors which can be done in the early design phase of a data warehouse.