Often we are faced with the challenge of splitting the content of a column. Teradata offers several ways to do this, each with its own advantages and disadvantages, which we will look at in detail in this article. First, we will create a table with enough random data to determine the performance and resource consumption of each option:
CREATE TABLE DWHPRO.Split
(
PK INTEGER,
ToBeSplit VARCHAR(200)
) PRIMARY INDEX (PK);
INSERT INTO DWHPRO.Split
SELECT
ROW_NUMBER() OVER (ORDER BY 1) AS PK,
'Row' || TRIM(PK) || 'Col1,' || 'Row' || TRIM(PK) || 'Col2,' || 'Row' || TRIM(PK) || 'Col3'
FROM
(
SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
) t01;
-->
The Teradata CSVLD Function
With the CSVLD function, we must consider that all defined columns must be present. If not all columns can be created, an error message will be displayed:
SET QUERY_BAND = 'Method=CSVLD;' FOR SESSION;
SELECT * FROM TABLE (CSVLD(Split.ToBeSplit, ',', '')
RETURNS (col1 varchar(200), col2 varchar(200), col3 varchar(200))) as T1;
The Teradata Unpack Function
In contrast to the CSVLD function, the Unpack function also allows you to omit rows that cannot be split into all defined columns. The parameter “IgnoreInvalid” is used for this purpose:
SET QUERY_BAND = 'Method=Unpack;' FOR SESSION;
SELECT
PK,
CAST(Col1 AS VARCHAR(255)),
CAST(Col2 AS VARCHAR(255)),
CAST(Col3 AS VARCHAR(255))
FROM Unpack (
ON Split
USING
TargetColumn ('ToBeSplit')
OutputColumns ('Col1', 'Col2', 'Col3')
OutputDataTypes ('varchar', 'varchar', 'varchar')
Delimiter (',')
Regex ('(.*)')
RegexSet (1)
IgnoreInvalid ('true')
) AS t01
;
The Teradata strtok_split_to_table Function
Teradata strtok_split_to_table not only splits into columns, but also creates a new row for each column, but marks it with a column number so that it can be identified:
SET QUERY_BAND = 'Method=strtok_split_to_table;' FOR SESSION;
WITH T1 as
(
SELECT d.* FROM TABLE (strtok_split_to_table( Split.PK, Split.ToBeSplit, ',')
RETURNS (PK integer, tokennum integer, Col varchar(100)) ) as d
)
SELECT T1.* FROM t1
;
SELECT QUERYBAND,Spoolusage,TotalIOCount,AMPCPUTIme
FROM DBC.DBQLOGTBL
WHERE QUERYBAND LIKE '%Method%'
Performance Considerations
Of course, only the Teradata functions CSVLD and Unpack are directly comparable. They create exactly one output row per input row. strtok_split_to_table multiplies the number of rows and therefore needs more spool space.
According to our empirical measurements (we ran each query 3 times), the Unpack method requires about 43% less spool space and 59% fewer I/Os.
Queryband | SpoolUsage | TotalIOCount | AMPCPUTime |
=S> Method=CSVLD; | 110 559 232 | 332,00 | 1,89 |
=S> Method=CSVLD; | 110 559 232 | 332,00 | 1,89 |
=S> Method=CSVLD; | 110 559 232 | 332,00 | 1,90 |
=S> Method=strtok_split_to_table; | 245 719 040 | 608,00 | 4,71 |
=S> Method=strtok_split_to_table; | 245 719 040 | 608,00 | 4,72 |
=S> Method=strtok_split_to_table; | 245 719 040 | 608,00 | 4,98 |
=S> Method=Unpack; | 63 406 080 | 136,00 | 1,90 |
=S> Method=Unpack; | 63 406 080 | 136,00 | 1,79 |
=S> Method=Unpack; | 63 406 080 | 136,00 | 1,87 |