The importance of the PDM
Compression is a widely used method for reducing I/Os by fitting more rows into each data block. Data blocks are the smallest unit exchanged between disk and main memory, and this exchange is the most costly operation in any database system.
Compression can be applied automatically through block-level compression or manually through Multi-Value Compression (MVC). Third-party compression tools, such as the Compress Wizard from Prise Ltd. (available at https://www.prisetools.com/products/compress-wizard), are another alternative.
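As a sketch of how MVC is declared, the hypothetical table below compresses a status column. The listed values are stored once in the table header rather than repeated in every row; table and column names are illustrative only:

```sql
-- Hypothetical example: Multi-Value Compression (MVC) on a code column.
-- Rows whose status_code matches a listed value carry only a small
-- presence bit instead of the full 9-byte string.
CREATE TABLE customer_account
(
    account_id  BIGINT NOT NULL,
    status_code CHAR(9) CHARACTER SET LATIN
        COMPRESS ('ACTIVE', 'SUSPENDED', 'CANCELED')
)
PRIMARY INDEX (account_id);
```

MVC works best on columns with a small set of frequently recurring values, which is exactly the kind of column shown here.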
Developers frequently overlook resource-saving techniques beyond Multi-Value Compression (MVC). In particular, the proper choice of data types and character sets in the physical data model (PDM) is often neglected.
First, the PDM, then Compression
Compression depends on the specific data types used. Hence, before starting any compression effort, it is essential to confirm that the correct data types and character sets are employed in the PDM. Otherwise, any later modification to the PDM will require recompressing.
Choosing the appropriate data types and character sets reduces resource usage and offers benefits that compression cannot provide: optimal data types save space regardless of the data content, whereas compression offers no space or resource advantage for columns whose values are mostly distinct.
Choosing appropriate data types can reduce resource usage by allowing more rows to be packed into each data block.
Choosing appropriate data types and character sets is integral to the process of physical data modeling. Regrettably, this crucial step is frequently overlooked due to the upfront expenses incurred during implementation.
Data types are commonly carried over unchanged from the staging area into the PDM of the core model and subsequently into the data marts.
Although this method initially reduces expenses during the implementation process, we advise against it. The expenses will merely be deferred to a later phase, when modifying the PDM becomes significantly more complex.
A column defined as DECIMAL(38,n) in the source system does not necessarily contain large numbers. Depending on the declared precision, Teradata stores decimals in 1, 2, 4, 8, or 16 bytes. The same applies to integer types, where BYTEINT requires only 1 byte while BIGINT needs 8 bytes.
In my experience, most decimal columns fit into just 4 bytes (DECIMAL(9,n) or smaller). However, due to time constraints, cost considerations, and a lack of motivation, developers often store them in a larger decimal type that occupies 8 bytes.
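To illustrate, the hypothetical DDL below contrasts the two choices; the byte sizes in the comments follow Teradata's precision-based DECIMAL storage rules:

```sql
-- DECIMAL storage in Teradata depends on the declared precision,
-- not on the values actually stored:
--   precision  1- 2  ->  1 byte
--   precision  3- 4  ->  2 bytes
--   precision  5- 9  ->  4 bytes
--   precision 10-18  ->  8 bytes
--   precision 19-38  -> 16 bytes
CREATE TABLE sales_optimized
(
    sale_id BIGINT NOT NULL,  -- 8 bytes
    amount  DECIMAL(9,2)      -- 4 bytes; DECIMAL(18,2) would occupy 8
)
PRIMARY INDEX (sale_id);
```

If the business data never exceeds nine digits of precision, DECIMAL(9,2) halves the column's storage compared to DECIMAL(18,2).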
When implemented as suggested above, an optimized decimal saves 4 bytes per row. In a table with billions of rows, that amounts to several gigabytes, and this saving occurs before any compression technique is applied.
Using the LATIN character set instead of UNICODE halves the space needed per character in a character column. Surprisingly, even simple codes like “ACTIVE,” “SUSPENDED,” or “CANCELED” are often stored as UNICODE.
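A minimal sketch of the fix, using a hypothetical table: declaring the code column as LATIN stores one byte per character instead of the two bytes UNICODE requires.

```sql
-- LATIN stores 1 byte per character; UNICODE stores 2.
-- For this CHAR(9) column the fixed storage drops from 18 to 9 bytes per row.
CREATE TABLE subscription
(
    subscription_id BIGINT NOT NULL,
    status_code     CHAR(9) CHARACTER SET LATIN NOT CASESPECIFIC
)
PRIMARY INDEX (subscription_id);
```

UNICODE remains the right choice for genuinely multilingual free-text columns; the saving applies to code and identifier columns that only ever hold Latin characters.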
As a Teradata consultant, I advise against omitting this vital optimization step. Although creating the PDM carefully is time-consuming, it will yield significant benefits in the long run.
I believe I have presented sufficient justifications to prioritize these activities. Please keep them in mind for your upcoming project.
Feel free to inquire in the comments if you have any questions regarding the information. I will be closely monitoring and responding to as many inquiries as possible. Thank you for taking the time to read this. My gratitude towards you is immeasurable for contributing to the success of this blog.