Teradata Random AMP sampling is used if no collected statistics are available or they are stale.
When Teradata moves the table header into each AMP's FSG cache, the random AMP sample information of the related table is stored in the table header of each AMP.
Although the random AMP sample of a table is stored in the FSG cache on each AMP (to be precise, in the cached table header), by default only one AMP is doing the sampling.
For each table, this will be a different AMP. The sampling AMP is determined by calculating the row hash of the TABLEID (internally, each table is identified by the TABLEID). Afterward, the row hash of the TABLEID is searched in the hash map, which presents the final mapping between each row hash and AMP:
CALCULATE ROWHASH(TABLEID) -> HASHMAP(ROWHASH) -> AMP
Random AMP sampling can't be a substitute for collected statistics in each case. Random AMP sampling only covers indexed columns and table level statistics, such as row counts and distinct values.
In previous releases of Teradata, random AMP sampling was only used in the absence of collected statistics. Nowadays, Teradata always stores random AMP samples in the table header (cached in FSG). They are used if the collected statistics are stale.
Random AMP sampling works like this:
- If the table header is not available in the FSG cache, a random AMP sample of the indexes is taken and distributed to all AMPs.
- The related DBC tables are checked, to see if collected statistics are available
- If no collected statistics are available, the random AMP sample is used for estimations
- If collected statistics are available, the random AMP sample is compared against the collected statistics.
- If the difference between collected statistics and random AMP samples is above a certain threshold, the random AMP sample will
be used. Otherwise, the collected statistics will be utilized. The parsing engine (PE) will be able to extrapolate statistics by including object use counts (starting with Teradata 14.10) into its considerations.
We mentioned that Teradata chooses one AMP to do the sampling. While this is probably enough for evenly distributed tables, it results in wrong estimations for skewed ones. Therefore, Teradata allows using more than one AMP for index sampling.
The number of AMPs used for sampling can be increased in the DBC control area by changing the field “RandomAmpSampling.” The following settings are possible:
D…use the default 1 AMP sampling
L.. use 2 AMPs for sampling
M…use 5 AMPs for sampling
N…use all AMPs belonging to one node for sampling
A…use all available AMPs in the system for sampling
By increasing the number of sampling AMPs, better estimations are possible. At the same time, this comes with a parsing overhead. Very short running queries might experience an adverse performance impact.
Let us take a closer look at the real sampling process used to calculate table level statistics. Depending on settings mentioned above of field “RandomAMPSampling,” these steps will be done by one or several AMPs at the same time:
- Each sampling AMP reads 1 – 2 cylinders from disk into the FSG cache.
Usually, it's two cylinders (the first and the last belonging to the table). But if the table fits entirely into one cylinder, only one cylinder has to be read.
- The average number of rows of the cylinder's data blocks is calculated
- The average number of rows is multiplied by the number of data blocks
- Above result is multiplied by the number of cylinders belonging the table
- Above result is multiplied by the number of AMPs available in the entire system
The result of above calculations is the estimation of the table's row count.
Teradata uses a similar approach when doing random AMP sampling for an NUSI:
- Each sampling AMP reads 1 – 2 cylinders from disk into the FSG cache
- The number of NUSI values in the cylinder(s) is counted
- The table row count is divided by the NUSI row number calculated in the previous step: The result is the rows per NUSI value.
The random AMP sample of an NUSI can be used by the Teradata Optimizer to decide if, in a full table scan, the base table should be utilized or the NUSI.
Teradata is not using random AMP sampling for a USI. The Optimizer always assumes that the number of distinct values of the USI equals the number of table rows, and it's not using statistics in its considerations.
Random AMP sampling can be an alternative to collected statistics for evenly distributed tables. Nevertheless, keep in mind that only table level statistics and statistics on indexes are sampled. Unreliable heuristics will be used for non-indexed columns. For example, let's consider the following query:
SELECT * FROM TheTable WHERE TheColumn = 10;
If TheColumn is not indexed, the Optimizer will assume that 10% of the table rows will be selected. In this case, only collected statistics will help to improve the estimations.