Continuing the Data Type considerations, here are more issues that demand attention from time to time.
Aim for nullability settings that are compliant with the data model and unified across all tables. The cost of a mandatory column left open for null entries is both on the storage side and on the optimizer side. An extra presence bit per column has to be reserved and a null option to be considered when making the SQL execution plan.
Double-check whenever you extract parts of one source table attribute into a target table attribute so that you do not create phantom variance and incongruence of values. Assume a source attribute ORDER_NUMBER can be structured differently depending on where the order originated from:
AAA.123.456O or BBB.123.456. Assume further that the positions of 456 represent a customer type code. The “O” stands for order. The application creates it that way. If you merely extract the number using a substring starting from position 5, your result will not be entirely numeric – which you might not even realize as you load into a character field! – And there are invalid customer type codes. Surprisingly wrong reporting results for customer type aggregations despite no load fails and quantitative checks passed will keep you busy at a later stage.
This is one of the noblest yet hardest-to-achieve goals of a warehouse, at least as long as we humans are allowed to type in values in source system applications. Nevertheless, we need to reach out for it, just as we tidy up our apartments, not in the expectation of a one-time final orderliness.
In contrast to cleanliness, which refers to the adequate processing of data, I like to call sobriety a state of redundancy-free and simplified attribute content down to the point where it cannot be pruned further without loss of information.
First, values should be all written out or all coded and abbreviated. A threefold categorization of “A,” “B” and “DEFAULT” could be converted to A,” “B” and “D” without loss of information, but size needs down to one character byte instead of several. Taking this further, could dozens of full-text entries be converted to official codes so that your table does not take in extra calories?
An anonymized real-world example:
An ENTITY_NAME, let’s say a branch name, is stored like this, thousands of times each:
“ABCD – ABCD Division of ABCD Industries Fantasia”
Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?
By this, I could cut down the ENTITY_NAME from VARCHAR(200) to CHAR(4). The savings can be estimated to (average length of current entries-4+2) bytes per record. The table I derived this example from had 177 million records at the time I wrote these lines.
Second, summarize and prune. Although this involves changing the table content somewhat, it will not alter the meaning of the content while fewer distinct values emerge. They then have a higher likeliness to be subject to compression. If relevant, you will need fewer partitions. Address and name fields are always a good place to start.
Derived from what I just saw on a production table, imagine a street field mapped 1 to 1 from the source where we have stored “Main Street,” “Main Str.” as well as “Mainstreet” thousands of times each. Once a standard form of street name representation is defined, the distinct values can probably be cut in half.
Even if you are of the opinion that the blanks or abbreviations have to be preserved because they carry a, however, a subtle difference in meaning, a hundred different ways of saying “no input” don’t have to. Cut down all kinds of meaningless input to no more than two values, one meaning NULL and one meaning that the original input is unusable, be it because of refusal, pranks or severe typing errors that cannot be reverse-engineered.
Propose and plan this and the chances are that your reporting and analytics department colleague will add you as a new friend.
A Success Story to Inspire You
Recently I tested a data type resizing over dozens of tables. These tables were already compressed and mostly partitioned and indexed well. Unfortunately, they were implemented under either significant uncertainty concerning the actual content or lack of due diligence in this respect. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.