Introduction to Teradata Data Types
Beyond the basics, several further data type considerations deserve occasional attention.
Strive for nullability settings that adhere to the data model and are consistent throughout all tables. Allowing mandatory columns to be left open for null entries incurs costs for both storage and optimization. An additional presence bit per column must be allocated, and a null option must be considered when creating the SQL execution plan.
Verify thoroughly when transferring data from a source table attribute to a target table attribute to avoid discrepancies and inconsistencies in values. Consider that an attribute like ORDER_NUMBER can have varying structures based on its origin: for example, "AAA.123.456O" from one application and "BBB.123.456" from another. Assume further that the positions of "456" represent a customer-type code, while the "O" stands for "order"; the application simply creates it that way. If you merely extract the number using a substring starting from position 5, your result will not be entirely numeric (which you might not even realize as you load into a character field!), and invalid customer-type codes slip in. Surprisingly wrong reporting results for customer-type aggregations, despite no load failures and passed quantitative checks, will keep you busy later.
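A minimal sketch of this pitfall, in Python rather than SQL for easy testing; the order numbers and positions are the hypothetical ones from the example above:

```python
def naive_extract(order_number):
    """SUBSTR(order_number, 5) in SQL terms: drop the 4-character prefix."""
    return order_number[4:]

def naive_customer_type(order_number):
    """Assume the last three positions hold the customer-type code."""
    return naive_extract(order_number)[-3:]

def robust_customer_type(order_number):
    """Keep digits only before extracting, so a trailing letter cannot leak in."""
    digits = "".join(ch for ch in order_number if ch.isdigit())
    return digits[-3:]

for o in ["AAA.123.456O", "BBB.123.456"]:  # "O" appended by the application
    extracted = naive_extract(o)
    print(o, "->", extracted,
          "| numeric:", extracted.replace(".", "").isdigit(),
          "| naive type:", naive_customer_type(o),
          "| robust type:", robust_customer_type(o))
```

The naive extraction quietly yields "56O" as a customer-type code for the first format: exactly the kind of value that passes a load into a character field but corrupts aggregations later.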
Achieving clean data is a commendable task for any warehouse, despite the difficulty posed by manual data entry. Nonetheless, it is a goal worth striving for, much like regularly tidying our living spaces rather than aiming for a one-time, ultimate state of cleanliness.
In contrast to cleanliness, which refers to adequately processed data, I like to call sobriety the state in which attribute content is redundancy-free and simplified to the point where it cannot be pruned further without loss of information.
First, values should be either all written out or all coded and abbreviated. A threefold categorization of "A," "B," and "DEFAULT" could be converted to "A," "B," and "D" without loss of information, while the size requirement drops to a single character byte instead of several. Taking this further: could dozens of full-text entries be converted to official codes so that your table does not take on extra calories?
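The byte arithmetic behind such a conversion can be sketched as follows (the categories are the hypothetical ones above; a single-byte Latin character set is assumed):

```python
# Mapping from full-text categories to one-character codes.
code_for = {"A": "A", "B": "B", "DEFAULT": "D"}

sample = ["DEFAULT", "A", "DEFAULT", "B", "DEFAULT"]
coded = [code_for[v] for v in sample]

bytes_before = sum(len(v) for v in sample)  # full-text storage
bytes_after = sum(len(v) for v in coded)    # one byte per value
print(f"{bytes_before} bytes -> {bytes_after} bytes, no information lost")
```

The mapping is reversible, so nothing is lost; only the storage footprint shrinks.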
An anonymized real-world example:
An ENTITY_NAME, let's say a branch name, is stored like this, thousands of times:
“ABCD – ABCD Division of ABCD Industries Fantasia”
Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?
I could cut the ENTITY_NAME from VARCHAR(200) to CHAR(4). The savings can be estimated at (average length of the current entries - 4 + 2) bytes per record, since each VARCHAR value carries a 2-byte length overhead while CHAR(4) needs a fixed 4 bytes. The table I derived this example from had 177 million records when I wrote these lines.
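Filling in the estimate with numbers: the 177 million record count is from the table above, while the 30-character average entry length is an assumption for illustration only:

```python
records = 177_000_000
avg_len = 30             # assumed average length of the current entries
varchar_overhead = 2     # each VARCHAR value stores a 2-byte length field
char_len = 4             # target CHAR(4)

saved_per_record = avg_len + varchar_overhead - char_len
total_gb = records * saved_per_record / 1024**3
print(f"{saved_per_record} bytes per record, roughly {total_gb:.1f} GB in total")
```

Even under this modest assumption, a single column change frees several gigabytes on one table alone.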
Second, summarize and prune. Although this changes the table content somewhat, it does not alter its meaning, while fewer distinct values emerge. These then have a higher likelihood of being subject to compression, and, if relevant, you will need fewer partitions. Address and name fields are always an excellent place to start.
Derived from what I just saw on a production table: imagine a street field mapped 1:1 from the source, where "Main Street," "Main Str.," and "Mainstreet" are each stored thousands of times. Once a standard form of street-name representation is defined, the number of distinct values can probably be cut in half.
Even if you believe that the blanks or abbreviations have to be preserved because they carry some subtle difference in meaning, a hundred different ways of saying "no input" do not. Cut all kinds of meaningless input down to no more than two values: one meaning NULL, and one meaning that the original input is unusable, be it because of refusal, pranks, or severe typing errors that cannot be reverse-engineered.
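Both pruning steps can be sketched in Python; the spelling variants, junk values, and sentinel below are made-up assumptions for illustration, not an actual cleansing rule set:

```python
import re

def standardize_street(raw):
    """Map spelling variants of the same street to one canonical form."""
    s = raw.strip().rstrip(".").lower()
    s = re.sub(r"\bstr\b", "street", s)              # "main str"   -> "main street"
    s = re.sub(r"(?<=[a-z])street\b", " street", s)  # "mainstreet" -> "main street"
    return " ".join(word.capitalize() for word in s.split())

NO_INPUT = {"", "-", ".", "n/a", "none", "null"}  # assumed "no input" variants
UNUSABLE = {"xxx", "asdf", "qwerty"}              # assumed prank/refusal variants

def clean_street(raw):
    """Collapse meaningless input to two sentinel values, standardize the rest."""
    key = (raw or "").strip().lower()
    if key in NO_INPUT:
        return None           # one value meaning NULL
    if key in UNUSABLE:
        return "<UNUSABLE>"   # one value meaning: cannot be reverse-engineered
    return standardize_street(raw)

for raw in ["Main Street", "Main Str.", "Mainstreet", "n/a", "xxx"]:
    print(repr(raw), "->", repr(clean_street(raw)))
```

All three spelling variants collapse into one distinct value, and every flavor of junk input lands on one of exactly two sentinels.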
Propose and plan this, and chances are that your colleagues in the reporting and analytics department will add you as a new friend.
A Success Story to Inspire You
Recently, I tested a data type resizing across dozens of tables. These tables were already compressed and mostly well partitioned and indexed. Unfortunately, they had been implemented under significant uncertainty about the actual content, or with a lack of due diligence. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.
First the PDM, then Compression – Measures to Reduce Resource Usage
These two parts of the data type considerations are really nice.
It is not only the wasted space on our tables (and thus more time for backups and so on) but, as you said, it also shows that we do not know our data.
One thing that I have to mention here is how our DDL affects our performance and statistics.
Defining columns as Unicode for simplicity's sake affects not only the space of the table (since Unicode characters take two bytes each) but also the length that statistics can keep.
Until TD 13.10, this length was 16 characters, and a lot of wrong statistics were defined because of it. From TD 14 on, this has been partially solved by setting the length to 25, manually changeable up to 50. (Even when collecting stats on multiple columns: until TD 13, the acceptable length summed over all columns was 16, but from TD 14 on, every column in the multicolumn statistics can get 25 to 50.)
Furthermore, as you mentioned, never define a VARCHAR column with a length much greater than the expected one. TD has an issue here with spool space during transactions: if the column participates in a GROUP BY / ORDER BY / UNION or in MultiLoad / FastLoad operations, TD does not take the actual length of the VARCHAR column but the defined one. This leads TD to consume more CPU / IO even for a simple query. Imagine, in your example, having 1 billion rows and a column defined as VARCHAR(200) while the actual maximum length is 50: you waste 150 bytes × 1 billion rows (and twice that if the column is defined as Unicode).
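Putting the comment's figures into a quick calculation (1 billion rows, the defined versus actual lengths from the example above; the numbers are illustrative):

```python
rows = 1_000_000_000
defined_len = 200   # VARCHAR(200) as declared in the DDL
actual_max = 50     # longest value actually stored

# In spool (GROUP BY / ORDER BY / UNION, MultiLoad / FastLoad),
# the defined length is used, not the actual one.
wasted_latin = rows * (defined_len - actual_max)  # 1 byte per character
wasted_unicode = 2 * wasted_latin                 # 2 bytes per character

print(f"Latin:   {wasted_latin / 1024**3:.0f} GB of spool wasted")
print(f"Unicode: {wasted_unicode / 1024**3:.0f} GB of spool wasted")
```

A single oversized column can thus burn well over a hundred gigabytes of spool on one query.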
Thank you for your refinement of the length and character set issue!
You inspire me to investigate and test the relationship between character set choice, length, and statistics a little deeper.