The importance of the PDM
Compression is a widely used method for reducing IOs by enabling the accommodation of more rows in each data block. Data blocks are the smallest unit exchanged between hard drives and main memory, which is the most costly operation in any database system.
Compression can be automated through block-level compression or executed manually through the MVC approach. Additionally, compression software, like the one provided by Prise Ltd. (accessible at https://www.prisetools.com/products/compress-wizard), is also an alternative.
Developers frequently overlook various resource-saving techniques besides Multivalue Compression (MVC). The proper implementation of data types and character sets, known as PDM, is frequently neglected.
First, the PDM, then Compression
Compression relies on the specific data types utilized. Hence, before commencing any compression process, confirming that the correct data types and character sets are used in the PDM is essential. Otherwise, any modifications made to the PDM will require recompressing.
Choosing the appropriate data types and character sets can effectively reduce resource usage and offer benefits that Compression cannot provide.
Optimal data types reduce resource usage regardless of the data content, whereas compression offers no space or resource usage advantage for columns with distinct values.
Choosing appropriate data types can reduce resource usage by allowing more rows to be packed into each data block.
Choosing appropriate data types and character sets is integral to the process of physical data modeling. Regrettably, this crucial step is frequently overlooked due to the upfront expenses incurred during implementation.
Data types are commonly transferred from the staging area to the PDM of the core model and subsequently to the data marts.
Although this method initially reduces expenses during the implementation process, we advise against it. The expenses will merely be deferred to a later phase, when modifying the PDM becomes significantly more complex.
A column’s definition as a DECIMAL(38,n) in the source system does not necessarily indicate that it will always contain massive numbers. It is worth noting that decimals typically occupy between 1 and 8 bytes of space. The same is true for integer values, where byte int requires only 1 byte and big int necessitates 8 bytes.
In my experience, it is possible to store multiple decimals in just 4 bytes. However, due to time constraints, cost considerations, and a lack of motivation, developers often opt to save them as the largest available decimal type, which occupies 8 bytes.
When implemented as suggested above, an optimized decimal requires 4 fewer bytes per row in a table. Imagine the amount of space saved in a table with billions of rows, and remember that this enhancement occurs before compression techniques are applied.
Using the LATIN character set instead of UNICODE can reduce character field space usage by 50% per character. Surprisingly, even basic codes like “ACTIVE,” “SUSPENDED,” or “CANCELED” are stored as UNICODE characters.
As a Teradata Consultant, omitting this vital optimization step is not advisable. Although creating the PDM is time-consuming, it will yield significant benefits in the long run.
I believe I have presented sufficient justifications to prioritize these activities. Please keep them in mind for your upcoming project.
Feel free to ask any questions in the comments. I will be monitoring and responding to as many as possible. Thank you for taking the time to read this.
Related Services
⚡ Need Help Optimizing Your Data Platform?
We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.
Explore Our Services →📋 Considering a Move From Teradata?
Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.
Free Migration Assessment →
Great article, Thank you!