First the PDM, then Compression
The importance of the PDM
Compression is a popular way of saving IOs, by allowing to pack more rows into each data block. Data blocks are the smallest unit transferred between hard drives and main memory, the most expensive operation in any database system.
Compression can be done automatically by the system (block level compression) or is applied by the user (MVC). Another option is to use compression software, such as the one offered by Prise Ltd. (check it out at https://www.prisetools.com/products/compress-wizard).
While developers tend to use Multivalue Compression (MVC), other opportunities to cut resource usage are most times overseen or left out for several reasons. Very often the PDM i.e. the proper implementation of data types and character sets is neglected.
First the PDM, then Compression
Compression depends on the underlying data types. Therefore, before starting any compression activity, we have to make sure that the correct data types and character sets are used in the PDM. Otherwise, any redesign of the PDM will force us to rework compressions.
While the correct choice of data types and compression both help to cut resource usage, there are even some advantages of the right data type/ character set choice which compression doesn’t have.
The right data types will reduce resource usage independently from the data content, while compression has no space or resource usage benefit for unique columns.
The choice of correct data types always helps to cut resource usage (by allowing to pack more rows into each data block).
The selection of the right data types & character sets is part of physical data modeling. Unfortunately, often this step is skipped because it causes costs early in the implementation process.
Many times, data types are directly taken over from the stage area into the PDM of your core model, and from there they are directly moved into the data marts.
While this approach overcomes costs at the begin of the implementation process, we don’t recommend it. The costs will only be transferred to a later stage when it is much harder to do changes on the PDM.
Only because a column definition is a DECIMAL(38,n) in the source system doesn’t mean it will ever hold such huge numbers. Remember, space usage of decimals is in the range of 1 to 8 bytes. The same is valid for integer values (byte int -> 1 byte, big int -> 8 bytes)
In my experience, many decimals could be stored in barely 4 bytes, but because of laziness, time pressure & cost reasons, developers often save them in the largest decimal number available(which consumes a total of 8 bytes).
A single decimal, if improved in the way described above, need 4 bytes less per table row. Consider the possible space reduction for a table containing billions of rows, and don’t forget that this improvement is before applying compressions!
Similarly, using character set LATIN instead of UNICODE helps to cut space usage of a character field by 50% per character. Often I even see simple codes, such as “ACTIVE,” “SUSPENDED,” or “CANCELED” stored as UNICODE characters!
As a Teradata Consultant, you never should leave out this important optimization step. Yes, it takes time when we create the PDM, but it will pay off at the end.
I think I gave you enough arguments to insist on this activities. Don’t forget about it in your next project!
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.