Teradata table skew: if you are reading this page, you have most likely encountered the topic of skew while working with the Teradata database.
When searching online for Teradata table skew or skew factor, most, if not all, documentation refers directly to DBC.TableSizeV for this calculation.
To analyze table skew, the commonly used query is:
```sql
SELECT DatabaseName,
       TableName,
       SUM(CurrentPerm) AS CurrentPerm,
       CAST(100 - (AVG(CurrentPerm) / MAX(CurrentPerm) * 100) AS DECIMAL(5,2)) AS skew_pct
FROM DBC.TableSizeV
WHERE DatabaseName = 'db1'
GROUP BY DatabaseName, TableName;
```
Except that the result can be biased and mislead you!
How it can be misleading
What you actually get from this query is a skew based on how much space the data takes up per AMP.
There are two different things to consider:
- the natural skew: what I call the skew produced by how rows are distributed among the AMPs based on the table’s Primary Index
- the artificial skew: what I call the skew produced by the size of the data stored on each AMP
If the query above reports a table with a skew of 60%, we cannot conclude right away that this table has bad data distribution and needs a new PI…
…simply because our concept of skew (non-uniform distribution in a dataset) here is based not on the distribution of rows but on the size of the data per AMP.
A closer look
Suppose you have a 100 AMP system.
A table containing 1 million rows, perfectly distributed on all 100 AMPs (10,000 rows per AMP).
The size of these 10,000 rows is 1 MB on each AMP.
Based on the query above, there is 0% skew.
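Plugging these numbers into the formula from the query above makes the result obvious: every AMP holds 1 MB, so AVG(CurrentPerm) = MAX(CurrentPerm). A minimal sketch of the arithmetic (a constant SELECT, no table needed):

```sql
/* Illustrative arithmetic only: all 100 AMPs hold 1 MB each,
   so AVG(CurrentPerm) = MAX(CurrentPerm) = 1 MB */
SELECT CAST(100 - (1.0 / 1.0) * 100 AS DECIMAL(5,2)) AS skew_pct;  /* 0.00 */
```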
When using a certain type of compression, the size of the data stored on the AMPs can change…!
The compression I am talking about is the Temperature-Based Block Level Compression (TBBLC).
Same table as above, but this time compressed in TBBLC. The skew is now at 63%.
Without getting into the details of TBBLC and TVS (Teradata Virtual Storage), this compression (TBBLC) works based on how frequently data is accessed:
– WARM/HOT data (frequently accessed data) is not compressed
– while COLD data (not accessed or less frequently accessed data) is compressed
Data on AMPs 0-39, 50-57, and 71-99 is compressed (from 1 MB down to 0.2 MB) because it is tagged as COLD data.
On the other hand, data on AMPs 40-49 and 58-70 remains unchanged, still uncompressed at 1 MB, because it is frequently accessed and tagged as WARM/HOT data.
Using the query mentioned previously, the skew of the same table went from 0% to 63%!
Not because the data distribution from the Primary Index (the natural skew) has changed, but because of how the data is compressed on the AMPs (which I referred to as the artificial skew).
This is why identifying skewed tables based on this method alone is misleading.
When investigating skewed tables via DBC.TableSizeV, you will have to double-check whether the skew you are looking at is caused by the distribution of the PI or by how the data is compressed.
The Teradata functions HASHAMP, HASHBUCKET, and HASHROW can help you find the natural data skew:
– HASHROW gives you the 4-byte row hash
– HASHBUCKET gives you the number of the hash-map bucket that holds this row hash
– HASHAMP gives you the AMP that this hash bucket falls on
```sql
/* add a TRANSACTIONTIME and/or VALIDTIME qualifier when analyzing temporal tables */
SELECT HASHAMP(HASHBUCKET(HASHROW( /*primary index columns*/ ))) AS amp_no,
       COUNT(*) AS row_cnt
FROM databasename.tablename
GROUP BY 1
ORDER BY 1 DESC;
```
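If you want a single row-count skew number comparable to the size-based one, the per-AMP counts from the hash query can be fed into the same formula. A sketch (fill in the PI column list and table name as in the query above):

```sql
/* A sketch: row-count skew, same formula as the size-based query
   but over rows per AMP instead of CurrentPerm.
   Note: AMPs holding no rows at all do not appear in the derived table. */
SELECT CAST(100 - (AVG(row_cnt) / MAX(row_cnt)) * 100 AS DECIMAL(5,2)) AS row_skew_pct
FROM (
    SELECT HASHAMP(HASHBUCKET(HASHROW( /*primary index columns*/ ))) AS amp_no,
           CAST(COUNT(*) AS FLOAT) AS row_cnt
    FROM databasename.tablename
    GROUP BY 1
) t;
```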
As for TBBLC, as great as it may sound, you will really need to know in advance how your data will be accessed to avoid huge differences in table size per AMP.
I hope this article has brought you some new and useful insight on the topic of table skew!
Thanks for reading!
You may want to check out these articles related to skew!
The Impact of Skewing on Teradata demonstrated
One Of our Biggest Enemies: Skew In Teradata Joins