Filter efficiently with Teradata NOS

Roland Wenzlofsky

May 5, 2023



Teradata NOS makes it easy to query data in an S3 object store. To attain maximum performance, however, the external data must be partitioned so it can be read selectively. This article outlines the key considerations for reading data from the object store efficiently.

To begin, we establish S3 access by creating an AUTHORIZATION object. In this example, we use Amazon AWS, which requires two credentials: a user (the access key ID) and a password (the secret access key). The credentials shown below are not real:

CREATE AUTHORIZATION authorization_dwhpro AS DEFINER TRUSTED
USER 'AKIAU56DUTZWABDFJ7AGC'
PASSWORD 'LtOkSzse74ndk9388uL6y8WBnAAP9gGuNlr1/';

In the next step, we create a FUNCTION MAPPING (for WRITE_NOS, since we want to write parquet files) and link it to our AUTHORIZATION object:

CREATE FUNCTION MAPPING WRITE_NOS_DWHPRO
FOR WRITE_NOS EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro
USING LOCATION, STOREDAS, MAXOBJECTSIZE, COMPRESSION, NAMING,
      INCLUDE_ORDERING, INCLUDE_HASHBY, MANIFESTFILE, MANIFESTONLY,
      OVERWRITE, ANY IN TABLE;

With these preparations complete, we can write a test table to our Amazon S3 bucket as parquet files. The test table consists of a date column, an INTEGER column, and a DECIMAL column. To enable filtering later, we partition the parquet files by the year of the date column and also include this derived column in the files.
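
For reference, here is a minimal sketch of what the test table's DDL might look like. The column names and types follow the parquet schema metadata shown further below; the NOT NULL constraint and the PRIMARY INDEX choice are assumptions:

-- Minimal sketch of the test table; names/types follow the schema metadata
-- shown later in this article. NOT NULL and PRIMARY INDEX are assumptions.
CREATE TABLE DWHPRO.Table1
(
    PK       INTEGER NOT NULL,
    A_Number DECIMAL(38,18),
    TheDate  DATE
)
PRIMARY INDEX (PK);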

SELECT NodeId, AmpId, Sequence, ObjectName, ObjectSize, RecordCount
FROM WRITE_NOS_DWHPRO (
    ON (SELECT t01.*, EXTRACT(YEAR FROM TheDate) AS TheYear
        FROM DWHPRO.Table1 t01)
    PARTITION BY TheYear ORDER BY TheYear
    USING
        LOCATION('/s3/dwhpro.s3.amazonaws.com/data/')
        STOREDAS('PARQUET')
        COMPRESSION('SNAPPY') -- the statement was truncated here; SNAPPY is an assumed codec
) AS d;

This is what our S3 bucket looks like after we exported the test table. Under “data/” there is one directory per year in which the corresponding parquet file is located.
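
The resulting object layout therefore looks roughly like this (the parquet object names are illustrative, as WRITE_NOS generates them automatically):

/s3/dwhpro.s3.amazonaws.com/data/2020/object_1.parquet
/s3/dwhpro.s3.amazonaws.com/data/2021/object_1.parquet
/s3/dwhpro.s3.amazonaws.com/data/2022/object_1.parquet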

Next, we create a FUNCTION MAPPING (for READ_NOS, since we want to read parquet files) and link it to our AUTHORIZATION object. With it, we can read our test table from the parquet files in our Amazon S3 bucket. Note, however, that reading parquet files requires a foreign table (this is not the case with CSV and JSON).

CREATE FUNCTION MAPPING DWHPRO.READ_NOS_DWHPRO
FOR READ_NOS EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro
USING BUFFERSIZE, SAMPLE_PERC, ROWFORMAT, RETURNTYPE, HEADER,
      MANIFEST, LOCATION, STOREDAS, FULLSCAN, ANY IN TABLE;
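
As a side note to the remark about CSV and JSON above: such files could be queried directly through the function mapping, without a foreign table. A minimal sketch, assuming a hypothetical folder of CSV files:

-- Sketch only: direct READ_NOS access works for CSV/JSON without a foreign table.
-- The location '/s3/dwhpro.s3.amazonaws.com/csvdata/' is a hypothetical example.
SELECT TOP 10 *
FROM READ_NOS_DWHPRO (
    USING
        LOCATION('/s3/dwhpro.s3.amazonaws.com/csvdata/')
        STOREDAS('TEXTFILE')         -- CSV/JSON are read as TEXTFILE
        RETURNTYPE('NOSREAD_RECORD') -- default: return the records themselves
) AS d;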

Before creating the foreign table, we retrieve the metadata from the parquet files (with RETURNTYPE 'NOSREAD_PARQUET_SCHEMA') to determine what the foreign table should look like. We pick the file for the year 2022:

SELECT *
FROM READ_NOS_DWHPRO (
    USING
        STOREDAS('PARQUET')
        FULLSCAN('TRUE')
        RETURNTYPE('NOSREAD_PARQUET_SCHEMA')
        LOCATION('/s3/dwhpro.s3.amazonaws.com/data/2022')
) AS d;

Once the query is executed, metadata becomes available, providing insight into our external table’s required Teradata data types.

  ColName   Teradata Datatype  Physical Type  Logical Type  Precision  Scale
1 PK        INTEGER            INT32          NONE          0          0
2 A_Number  DECIMAL            BYTE_ARRAY     DECIMAL       38         18
3 TheDate   DATE               INT32          DATE          0          0
4 TheYear   INTEGER            INTEGER        NONE          0          0

Creating a FOREIGN TABLE is now straightforward with the following statement. Note that a foreign table over parquet files must be column-partitioned (NO PRIMARY INDEX, PARTITION BY COLUMN). Alternatively, you could replicate the foreign table's data into a native Teradata table and choose the DDL best suited to your workload. The PATHPATTERN clause assigns variables to the folder levels of the S3 bucket, which can later be used to navigate the folders when reading.

CREATE FOREIGN TABLE Table1_Parquet,
EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro
(
    Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
    PK INTEGER,
    A_Number DECIMAL(38,18),
    TheDate DATE,
    TheYear INTEGER
)
USING (
    LOCATION('/s3/dwhpro.s3.amazonaws.com/data')
    PATHPATTERN('$data/$year1/$year')
    STOREDAS('PARQUET')
)
NO PRIMARY INDEX,
PARTITION BY COLUMN;
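
As mentioned above, the foreign table's data can also be replicated into a native Teradata table if a different physical design suits the workload better. A minimal sketch (the table name and PRIMARY INDEX choice are assumptions):

-- Sketch only: materialize the foreign table into a native Teradata table.
-- Table name and PRIMARY INDEX are assumptions.
CREATE TABLE DWHPRO.Table1_Native AS
(
    SELECT PK, A_Number, TheDate, TheYear
    FROM Table1_Parquet
) WITH DATA
PRIMARY INDEX (PK);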

Column Filtering

Column filtering corresponds to a conventional filter in the WHERE condition. However, all folders and parquet files under /data must be read and transferred into Teradata before the filter is applied. Depending on how much data has been offloaded to the object store, this can have a huge impact on query performance.

SELECT * FROM Table1_Parquet WHERE TheYear = 2022;

Path Filtering

We can pre-determine the paths to be read from the S3 bucket by utilizing path filtering. As the folders have been organized by year, we can optimize the SQL query by using PATHPATTERN to navigate through the folders. This approach yields the same result as column filtering but with increased efficiency.

SELECT * FROM Table1_Parquet WHERE $PATH.$year = 2022;

Performance Results

After executing the queries, we examine the query log for runtime.
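
The test queries can be identified in DBQL because each was tagged with a query band beforehand; a sketch of how this was presumably done (the exact name/value pairs are assumptions derived from the LIKE '%Filter%' predicate below):

-- Sketch only: tag each test query so it can be found in DBC.DBQLOGTBL.
SET QUERY_BAND = 'Filter=Column;' FOR SESSION;
SELECT * FROM Table1_Parquet WHERE TheYear = 2022;

SET QUERY_BAND = 'Filter=Path;' FOR SESSION;
SELECT * FROM Table1_Parquet WHERE $PATH.$year = 2022;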

SELECT STARTTIME, FIRSTRESPTIME, QUERYBAND
FROM DBC.DBQLOGTBL
WHERE QUERYBAND LIKE '%Filter%'
AND STATEMENTTYPE = 'SELECT'
ORDER BY STARTTIME DESC;

STARTTIME               FIRSTRESPTIME           QUERYBAND
2022-10-20 15:03:34.22  2022-10-20 15:07:34.35  Column
2022-10-20 14:53:06.03  2022-10-20 14:53:09.33  Path

Path filtering returns its result in seconds, whereas column filtering takes several minutes.

EXPLAIN plan for column filtering: all sub-folders must be transferred to Teradata before the WHERE condition is applied:

1) First, we lock DWHPRO.Table1_Parquet for read.
2) Next, we do a single-AMP RETRIEVE step from DWHPRO.Table1_Parquet metadata by way of an all-rows scan with no residual conditions into Spool 2 (one-amp), which is built locally on that AMP. Then we do a SORT to order Spool 2 by the sort key. The size of Spool 2 is estimated with low confidence to be 202 rows (142,410 bytes). The estimated time for this step is 0.50 seconds.
3) We do a single-AMP RETRIEVE step from Spool 2 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is bin packed and redistributed by size to all AMPs in TD_Map1. The size of Spool 3 is estimated with low confidence to be 202 rows (144,026 bytes). The estimated time for this step is 0.00 seconds.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from 6-column partitions of DWHPRO.Table1_Parquet by way of external metadata in Spool 3 (Last Use) with a condition of ("DWHPRO.Table1_Parquet.TheYear = 2022") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with no confidence to be 867,775 rows (636,079,075 bytes). The estimated time for this step is 14 minutes and 9 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 14 minutes and 10 seconds.

EXPLAIN plan for path filtering: the year predicate is applied to the object paths already in the metadata step, so only data from the /data/2022 folder is extracted:

1) First, we lock DWHPRO.Table1_Parquet for read.
2) Next, we do a single-AMP RETRIEVE step from DWHPRO.Table1_Parquet metadata by way of an all-rows scan with a condition of ("(TD_SYSFNLIB.NosExtractVarFromPath (DWHPRO.Table1_Parquet.Location, '/s3/dwhpro.s3.amazonaws.com', 3)(FLOAT, FORMAT '-9.99999999999999E-999')) = 2.02200000000000E 003") into Spool 2 (one-amp), which is built locally on that AMP. Then we do a SORT to order Spool 2 by the sort key. The size of Spool 2 is estimated with no confidence to be 21 rows (14,805 bytes). The estimated time for this step is 0.50 seconds.
3) We do a single-AMP RETRIEVE step from Spool 2 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is bin packed and redistributed by size to all AMPs in TD_Map1. The size of Spool 3 is estimated with no confidence to be 21 rows (14,973 bytes). The estimated time for this step is 0.00 seconds.
4) We do an all-AMPs RETRIEVE step in TD_MAP1 from 6-column partitions of DWHPRO.Table1_Parquet by way of external metadata in Spool 3 (Last Use) into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 902,142 rows (661,270,086 bytes). The estimated time for this step is 14 minutes and 10 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
Comments

el mahdi el medaghri el alaoui says:

Hello,
when writing data with WRITE_NOS, we always get "/s3/dwhpro.s3.amazonaws.com/data/2022/file1.parquet", while other technologies produce "/s3/dwhpro.s3.amazonaws.com/data=2022/file1.parquet".
So a parquet file generated by another technology in that layout cannot be read directly with READ_NOS, and some tweaking is needed to read it.
Would you know a way to write with WRITE_NOS and get the partition format "column=value"?

Thank you in advance for your answer.

Regards

Surya Sripada says:

This is an awesome explanation.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >