Introduction to Teradata Data Types
Beyond choosing the right type and size, a few further data type considerations deserve occasional attention.
Strive for nullability settings that adhere to the data model and are consistent across all tables. Allowing mandatory columns to accept null entries incurs costs for both storage and optimization: an additional presence bit must be allocated per column, and the optimizer must consider the null case when creating the SQL execution plan.
Verify thoroughly when transferring data from a source table attribute to a target table attribute to avoid discrepancies and inconsistent values. Consider that an attribute like ORDER_NUMBER can have varying structures depending on its origin.
Suppose it arrives as AAA.123.456O or BBB.123.456, and assume further that the digits 456 represent a customer-type code. The “O” stands for order; the source application creates it that way. If you merely extract the code with a fixed-position substring, your result will not always be numeric – which you might not even realize as you load it into a character field! – and you will end up with invalid customer-type codes. Surprisingly wrong reporting results for customer-type aggregations, despite zero load failures and passed quantitative checks, will keep you busy later.
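A minimal Python sketch of this pitfall, using the formats from the example above (the trailing “O” marker and the dot-separated layout are taken from the text; the function names are mine):

```python
def extract_naive(order_number: str) -> str:
    """Blindly take the third dot-separated group as the customer-type code."""
    return order_number.split(".")[2]

def extract_checked(order_number: str) -> str:
    """Strip the application's trailing 'O' (order) marker, then validate."""
    code = order_number.split(".")[2].rstrip("O")
    if not code.isdigit():
        raise ValueError(f"invalid customer-type code in {order_number!r}")
    return code

# The naive version lets "456O" slip through unnoticed into a character field:
print(extract_naive("AAA.123.456O"))    # 456O -- not numeric!
print(extract_checked("AAA.123.456O"))  # 456
print(extract_checked("BBB.123.456"))   # 456
```

The point is not the specific parsing rule but the validation step: a load that raises on an unexpected value is far cheaper than an aggregation that silently reports wrong numbers.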
Achieving this goal is a commendable task for any warehouse, despite the difficulty posed by manual data entry. Nonetheless, it is worth striving for as an ongoing practice – much as we regularly tidy our living spaces rather than aim for a one-time, ultimate state of cleanliness.
Whereas cleanliness refers to the adequate processing of data, I like to call sobriety the state in which attribute content is redundancy-free and simplified to the point where it cannot be pruned further without loss of information.
First, values should be either all written out or all coded and abbreviated. A threefold categorization of “A,” “B” and “DEFAULT” could be converted to “A,” “B” and “D” without loss of information, but the size need drops to a single character byte instead of several. Taking this further: could dozens of full-text entries be converted to official codes, so your table does not take in extra calories?
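The recoding from the paragraph above can be sketched in a few lines of Python; the round trip through a lookup table demonstrates that the conversion is lossless:

```python
# One-byte codes for the threefold categorization from the text.
ENCODE = {"A": "A", "B": "B", "DEFAULT": "D"}
DECODE = {code: text for text, code in ENCODE.items()}

# The round trip recovers every original value, so no information is lost --
# only the stored representation shrinks to one character.
for original in ENCODE:
    assert DECODE[ENCODE[original]] == original

print(ENCODE["DEFAULT"])  # D
```

In the warehouse itself, the DECODE side of this pair becomes a tiny lookup table joined in only when the full text is actually needed.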
An anonymized real-world example:
An ENTITY_NAME, let’s say a branch name, is stored thousands of times like this:
“ABCD – ABCD Division of ABCD Industries Fantasia”
Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?
I could cut the ENTITY_NAME from VARCHAR(200) to CHAR(4). The savings can be estimated at (average length of current entries − 4 + 2) bytes per record: a VARCHAR stores its content plus a two-byte length indicator, while CHAR(4) needs a flat four bytes. The table I derived this example from held 177 million records when I wrote these lines.
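A back-of-envelope calculation of those savings, assuming a hypothetical average stored length of 45 characters (the record count is the one quoted above; the average length is my assumption for illustration):

```python
RECORDS = 177_000_000   # table size quoted in the text
AVG_LEN = 45            # assumed average VARCHAR content length (hypothetical)

# VARCHAR stores the content plus a 2-byte length field; CHAR(4) stores 4 bytes.
saved_per_record = (AVG_LEN + 2) - 4
total_saved_gb = saved_per_record * RECORDS / 10**9

print(f"{saved_per_record} bytes per record, about {total_saved_gb:.1f} GB in total")
```

Even with a modest average length, the change frees gigabytes on a table of this size, before any compression effects are counted.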
Second, summarize and prune. Although this changes the table content somewhat, it does not alter the meaning, while fewer distinct values emerge. Fewer distinct values have a higher likelihood of being subject to compression, and, if relevant, you will need fewer partitions. Address and name fields are always an excellent place to start.
Derived from what I just saw on a production table, imagine a street field mapped one-to-one from the source, where “Main Street,” “Main Str.” as well as “Mainstreet” are each stored thousands of times. Once a standard form of street name representation is defined, the number of distinct values can probably be cut in half.
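A minimal normalization sketch for exactly these three variants; real street data would need a much larger rule set, and the rules below are illustrative assumptions, not a production standard:

```python
import re

def normalize_street(raw: str) -> str:
    """Map spelling variants of the same street to one standard form (sketch)."""
    s = " ".join(raw.split())                            # collapse stray whitespace
    s = re.sub(r"(?i)\bstr\.?$", "Street", s)            # "Main Str."  -> "Main Street"
    s = re.sub(r"(?i)(?<=[a-z])street$", " Street", s)   # "Mainstreet" -> "Main Street"
    return s

variants = ["Main Street", "Main Str.", "Mainstreet"]
print({normalize_street(v) for v in variants})  # one distinct value instead of three
```

Collapsing three spellings into one both shrinks the distinct-value count (helping compression) and makes every downstream GROUP BY on the field correct for free.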
Even if you believe that the blanks or abbreviations have to be preserved because they carry some subtle difference in meaning, a hundred different ways of saying “no input” do not. Cut all kinds of meaningless input down to no more than two values: one meaning NULL, and one meaning that the original input is unusable, be it because of refusal, pranks, or typing errors too severe to reverse-engineer.
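The two-sentinel rule can be sketched as follows; the list of null-like junk variants is a hypothetical sample, and in practice it would be compiled from the field's actual distinct values:

```python
# Hypothetical junk variants observed in a free-text field.
NULL_LIKE = {"", "n/a", "na", "none", "-", ".", "unknown"}

def collapse_junk(raw):
    """Reduce meaningless input to at most two sentinel values (sketch)."""
    value = raw.strip().lower() if raw is not None else ""
    if value in NULL_LIKE:
        return None           # sentinel 1: no input was given
    if not any(c.isalnum() for c in value):
        return "UNUSABLE"     # sentinel 2: input present but unrecoverable
    return raw                # keep potentially meaningful input as-is

print(collapse_junk("N/A"))          # None
print(collapse_junk("???"))          # UNUSABLE
print(collapse_junk("Main Street"))  # Main Street
```

Everything null-like folds into NULL (which costs only the presence bit), and the remaining garbage folds into one compressible constant instead of a hundred distinct values.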
Propose and plan this, and chances are your colleagues in the reporting and analytics department will add you as a new friend.
A Success Story to Inspire You
Recently I tested a data type resizing over dozens of tables. These tables were already compressed and mostly well partitioned and indexed. Unfortunately, they had been implemented either under significant uncertainty about the actual content or without due diligence. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.