Teradata table skew: if you’re reading this page, you have most likely encountered skew while working with the Teradata database.
When you search online for Teradata table skew or skew factor, most of the documentation points directly to DBC.TableSizeV for the calculation.
To analyze table skew, the widely used query is:
SELECT DATABASENAME,
       TABLENAME,
       SUM(CURRENTPERM) AS CURRENTPERM,
       CAST((100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS DECIMAL(5,2)) AS SKEW_PCT
FROM DBC.TABLESIZEV
WHERE DATABASENAME = 'db1'
GROUP BY DATABASENAME, TABLENAME;
Except that the result can be biased and mislead you!
How it can be misleading
What you get from this query is a skew based on how much space the data takes up per AMP.
There are two different things to consider:
- natural skew: what I call the skew produced by how data is distributed among the AMPs based on the table’s Primary Index
- artificial skew: what I call the skew generated by differences in the size of the data stored on each AMP
With the query above, you may find a table with a skew of 60%, but you cannot conclude right away that this table has inadequate data distribution and needs a new PI…
…simply because the concept of skew here (non-uniform distribution in a dataset) is based not on the distribution of rows but on the size of the data per AMP.
A closer look
Suppose you have a 100-AMP system.
A table contains 1 million rows, perfectly distributed across all 100 AMPs (10,000 rows per AMP).
These 10,000 rows take up 1 MB on each AMP.
Based on the query above, there is 0 skew.
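The arithmetic behind that SKEW_PCT column can be sketched in a few lines of Python (Python here only to illustrate the formula; the function name and sample sizes are mine, not Teradata’s):

```python
# Same formula as the DBC.TableSizeV query: 100 - (AVG / MAX * 100),
# applied to a hypothetical list of CurrentPerm values, one per AMP.
def skew_pct(sizes_mb):
    avg = sum(sizes_mb) / len(sizes_mb)
    return round(100 - (avg / max(sizes_mb) * 100), 2)

# 100 AMPs, 1 MB on each: perfectly even sizes, so the reported skew is 0.
uniform = [1.0] * 100
print(skew_pct(uniform))  # 0.0
```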
When using a particular compression type, the size of the data stored on the AMPs can change…!
I am talking about Temperature-Based Block Level Compression (TBBLC).
Same table as above, but this time compressed with TBBLC. The skew is now 63%.
Not to get into the details of TBBLC and TVS, but this compression (TBBLC) works based on how frequently data is accessed:
– WARM/HOT data (frequently accessed data) is not compressed
– while COLD data (not accessed or less frequently accessed data) is compressed
Data on AMPs 0-39, 50-57, and 71-99 is compressed (from 1 MB down to 0.2 MB) – because it is tagged as COLD data.
On the other hand, data on AMPs 40-49 and 58-70 remains unchanged, still uncompressed at 1 MB – because it is accessed often and tagged as WARM/HOT data.
Using the previous query, the skew went from 0% to 63% for the same table!
Not because the data distribution from the Primary Index (natural skew) has changed, but because of how the data is compressed on the AMPs (what I referred to as the artificial skew).
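Plugging the per-AMP sizes from this scenario into the same formula reproduces the effect. With the assumed 0.2 MB compressed size, the computed value comes out around 62%, in the same ballpark as the figure quoted above (the exact number depends on the actual compressed size per AMP):

```python
# Same skew formula as before, applied to the TBBLC scenario.
def skew_pct(sizes_mb):
    avg = sum(sizes_mb) / len(sizes_mb)
    return round(100 - (avg / max(sizes_mb) * 100), 2)

# COLD AMPs compressed from 1 MB down to an assumed 0.2 MB;
# WARM/HOT AMPs 40-49 and 58-70 stay uncompressed at 1 MB.
sizes = [0.2] * 100
for amp in list(range(40, 50)) + list(range(58, 71)):
    sizes[amp] = 1.0

print(skew_pct(sizes))  # ~61.6, even though every AMP holds 10,000 rows
```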
This is why identifying skewed tables based on this method is misleading.
When investigating skewed tables based on DBC.TableSizeV, you have to double-check whether the skew you are looking at is caused by the distribution of the PI or by how the data is compressed.
The Teradata functions: HASHAMP, HASHBUCKET, and HASHROW can help you find this natural data skew:
HASHROW will give you the 4-byte row hash
HASHBUCKET will provide you with the bucket number in the hash map that holds this row hash
HASHAMP will provide you with the corresponding AMP that the hash bucket falls in
/* add TRANSACTIONTIME and/or VALIDTIME qualifiers when analyzing temporal tables */
SELECT HASHAMP(HASHBUCKET(HASHROW( /*primary index columns*/ ))),
       COUNT(*)
FROM databasename.tablename
GROUP BY 1
ORDER BY 1 DESC;
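To measure natural skew, you can feed the per-AMP row counts returned by this query into the same formula used earlier for sizes, just on COUNT(*) instead of bytes. A small sketch (the counts are hypothetical, standing in for the query’s output):

```python
# Natural-skew check: apply the skew formula to row counts per AMP
# (the COUNT(*) column of the HASHAMP/HASHBUCKET/HASHROW query), not bytes.
def skew_pct(values):
    avg = sum(values) / len(values)
    return round(100 - (avg / max(values) * 100), 2)

# The example table: 10,000 rows on each of 100 AMPs. Its natural skew
# stays at 0% even after TBBLC changes the size it occupies per AMP.
row_counts = [10_000] * 100
print(skew_pct(row_counts))  # 0.0
```

This is the number that actually tells you whether the Primary Index needs revisiting.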
As for TBBLC, as incredible as it may sound, you will need to know in advance how your data will be accessed to avoid huge differences in table size per AMP.
I hope this article has brought you some new and helpful insight on the topic of table skew!
Thanks for reading!
You may want to check out these articles related to skew!
The Impact of Skewing on Teradata demonstrated
One Of our Biggest Enemies: Skew In Teradata Joins