The importance of the PDM
Compression is a widely used method for reducing I/Os by fitting more rows into each data block. Data blocks are the smallest unit exchanged between disk and main memory, and this exchange is the most costly operation in any database system.
Compression can be applied automatically through block-level compression or manually through Multi-Value Compression (MVC). Third-party compression tools, such as the Compress Wizard from Prise Ltd. (available at https://www.prisetools.com/products/compress-wizard), are another alternative.
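As a sketch of how MVC is declared, the hypothetical table below compresses a status column. The listed values are stored once in the table header rather than repeated in every row; table and column names are illustrative only:

```sql
-- Hypothetical example: Multi-Value Compression (MVC) on a code column.
-- Rows whose status_code matches a listed value carry only a small
-- presence bit instead of the full 9-byte string.
CREATE TABLE customer_account
(
    account_id  BIGINT NOT NULL,
    status_code CHAR(9) CHARACTER SET LATIN
        COMPRESS ('ACTIVE', 'SUSPENDED', 'CANCELED')
)
PRIMARY INDEX (account_id);
```

MVC works best on columns with a small set of frequently recurring values, which is exactly the kind of column shown here.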
Developers frequently overlook resource-saving techniques beyond Multi-Value Compression (MVC). In particular, the proper choice of data types and character sets in the physical data model (PDM) is often neglected.
First, the PDM, then Compression
Compression depends on the specific data types used. Hence, before starting any compression effort, it is essential to confirm that the correct data types and character sets are employed in the PDM. Otherwise, any later modification to the PDM will require recompressing.
Choosing the appropriate data types and character sets reduces resource usage and offers benefits that compression cannot provide: optimal data types save space regardless of the data content, whereas compression offers no space or resource advantage for columns whose values are mostly distinct.
Choosing appropriate data types can reduce resource usage by allowing more rows to be packed into each data block.
Choosing appropriate data types and character sets is integral to the process of physical data modeling. Regrettably, this crucial step is frequently overlooked due to the upfront expenses incurred during implementation.
Data types are commonly carried over unchanged from the staging area into the PDM of the core model and subsequently into the data marts.
Although this method initially reduces expenses during the implementation process, we advise against it. The expenses will merely be deferred to a later phase, when modifying the PDM becomes significantly more complex.
A column defined as DECIMAL(38,n) in the source system does not necessarily contain large numbers. Depending on the declared precision, Teradata stores decimals in 1, 2, 4, 8, or 16 bytes. The same applies to integer types, where BYTEINT requires only 1 byte while BIGINT needs 8 bytes.
In my experience, most decimal columns fit into just 4 bytes (DECIMAL(9,n) or smaller). However, due to time constraints, cost considerations, and a lack of motivation, developers often store them in a larger decimal type that occupies 8 bytes.
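To illustrate, the hypothetical DDL below contrasts the two choices; the byte sizes in the comments follow Teradata's precision-based DECIMAL storage rules:

```sql
-- DECIMAL storage in Teradata depends on the declared precision,
-- not on the values actually stored:
--   precision  1- 2  ->  1 byte
--   precision  3- 4  ->  2 bytes
--   precision  5- 9  ->  4 bytes
--   precision 10-18  ->  8 bytes
--   precision 19-38  -> 16 bytes
CREATE TABLE sales_optimized
(
    sale_id BIGINT NOT NULL,  -- 8 bytes
    amount  DECIMAL(9,2)      -- 4 bytes; DECIMAL(18,2) would occupy 8
)
PRIMARY INDEX (sale_id);
```

If the business data never exceeds nine digits of precision, DECIMAL(9,2) halves the column's storage compared to DECIMAL(18,2).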
When implemented as suggested above, an optimized decimal saves 4 bytes per row. In a table with billions of rows, that amounts to several gigabytes, and this saving occurs before any compression technique is applied.
Using the LATIN character set instead of UNICODE halves the space needed per character in a character column. Surprisingly, even simple codes like “ACTIVE,” “SUSPENDED,” or “CANCELED” are often stored as UNICODE.
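A minimal sketch of the fix, using a hypothetical table: declaring the code column as LATIN stores one byte per character instead of the two bytes UNICODE requires.

```sql
-- LATIN stores 1 byte per character; UNICODE stores 2.
-- For this CHAR(9) column the fixed storage drops from 18 to 9 bytes per row.
CREATE TABLE subscription
(
    subscription_id BIGINT NOT NULL,
    status_code     CHAR(9) CHARACTER SET LATIN NOT CASESPECIFIC
)
PRIMARY INDEX (subscription_id);
```

UNICODE remains the right choice for genuinely multilingual free-text columns; the saving applies to code and identifier columns that only ever hold Latin characters.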
As a Teradata consultant, I advise against omitting this vital optimization step. Although creating the PDM carefully is time-consuming, it will yield significant benefits in the long run.
I believe I have presented sufficient justifications to prioritize these activities. Please keep them in mind for your upcoming project.
Feel free to inquire in the comments if you have any questions regarding the information. I will be closely monitoring and responding to as many inquiries as possible. Thank you for taking the time to read this. My gratitude towards you is immeasurable for contributing to the success of this blog.