Boosting Retail Banking operations with improved JOIN performance and QUALIFY clause


GUEST POST BY ARTEMIY KOZYR

Executive summary

Today I shed some light on how Data Warehousing lies at the core of Retail Banking operations. We will walk through a real case in which a vital marketing process broke down, and dive below the surface to understand the technical issues behind it.

You will learn how to approach such issues: see the big picture, identify the flaws and address them in the best possible way. That means auditing the execution plan, reading the performance indicators, understanding data distribution, refactoring and benchmarking join operations, reducing unnecessary IO, and applying analytical functions.

As a result, the fix made the critical Data Mart calculation about 60 times faster, so our business users now get their data on time every day. That is what delivering value really means in practice. As usual, feel free to post any comments and questions below. I will be glad to answer and discuss them with you.

Data Warehousing vs. Retail Banking

Hi guys! Today I am going to talk about improving and accelerating vital Marketing operations in Retail Banking.

With all the disruptive threats coming from FinTech startups, modern Banks strive to become Data Driven. This means utilizing the whole potential of existing data in order to make meaningful and reliable business decisions.

This moves us to the core need of IT and Business working closely, understanding each other’s needs and capabilities and thus creating a synergetic effect.

The key process steps here are:

  • EXTRACT raw data
  • TRANSFORM and ENRICH data in order to CREATE knowledge
  • DELIVER value to Business Management
  • so they can derive meaningful INSIGHTS and make useful DECISIONS
  • and do it on a TIMELY basis

We will only scratch the surface of the overall process of preparing cross-channel marketing campaigns in Retail Banking (its core activities, departments, data and the Information Systems involved), but we will dive deep into the technical implementation of one particular step that resides in the Teradata Data Warehouse.

Sometimes one small part can cause malfunction of the whole system. Descending into how it works beneath the surface level, analyzing technical details, system capabilities, and bottlenecks might be extremely interesting and rewarding.

The Big Picture and the overall process scheme looks like this:

  • Gather source data/customer interaction experience
  • Extract lots of customer markers/flags
  • Map with client’s master data (identity, contacts, etc.)
  • Enrich with CRM / Transactions / Scoring historical data
  • Segment database / calculate metrics
  • Prepare cross-channel marketing campaigns

Give me a thumbs up or any sort of feedback if you would like to hear more real case studies of applied DWH (data warehousing) and Data Cloud Platforms inside Banking from a business perspective.

Things have gone wrong – Bottlenecks and Malfunction

A couple of days ago business users faced a TLA (Time-Level Agreement) violation: the necessary business data was not ready by 9 AM. That created further bottlenecks and delays downstream and resulted in a malfunction of the overall process.

Spheres affected by the issue:

  • Business Impact: TLA violation – creates a bottleneck for subsequent operations
  • Tech impact: Excessive System Resource Usage (thus cannibalizing resource from other routines)

The problem had to be addressed as soon as possible, so the IT team started an investigation. A closer look pointed to one query causing a serious issue inside the Teradata Data Warehouse. The amount of data had been growing gradually, but at some point the average time to process it increased dramatically.

In terms of speed, this one particular query alone took around 50-60 minutes to run every day.

The basic question we want to ask is: what is the purpose of this query? What does it do?

Here is a brief description in business terms:

  1. Transfer customer interaction experience data from staging area.
  2. Map customer to CRM and Payment Card Processing systems through MDM (Master Data Management).
  3. Remove duplicates; take the customer’s actual ID only.

Technically speaking, it is a number of joins with some grouping and filtering applied.

Clearly, this non-optimal query needs serious treatment. We will have to find and analyze the root causes, implement a solution and test it in order to improve performance.

A top-down approach to address the problem

First of all, we need to understand how to approach this kind of problem in general.

There should be an effective and comprehensive plan to take everything into account, see the big picture, identify the flaws and address them in the best possible way. What would this approach look like?

In order to find the particular reasons, fix them and improve the situation we need to take several steps:

  • Analyze the source code
  • Gather historical performance metrics
  • Examine the execution plan and identify core problems
  • Investigate the actual data, its demographics, and distribution
  • Propose a viable solution to the issue
  • Implement and test it, assess results

Understanding the query purpose

First of all, we need to make sure we understand the purpose of every query from a business perspective. There should be no excessive or irrelevant data participating in the query unless it complies with business needs.

Let us take a deeper look at the source code of this particular query. I have left some comments down below so you can grasp the general purpose of every stage.

INSERT INTO SANDBOX.gtt_cl_nps_trg_1 (
client_mdm_id,
event_type_id,
event_date,
load_date,
id_target,
load_dt,
tag_v21,
name_product,
tag_v22,
tag_v23)
SELECT
cpty_2.src_pty_cd AS client_mdm_id,
t1.event_type_id,
t1.date_trggr AS event_date,
t1.load_date AS load_date,
t1.id_target AS id_target,
v_curr_dt AS load_dt,
t1.tag_v21,
t1.name_product,
t1.tag_v22,
t1.tag_v23
FROM
(
SELECT
cc.load_date,
cc.trggr_dt AS date_trggr,
cc.client_dk AS id_target,
cc.load_dt AS date_load,
/* 3.2. taking latest party key according to MDM system: */
Max(cpty_1.pty_id) AS pty_id_max,
event_type_id,
cc.tag_v21,
cc.name_product,
cc.tag_v22,
cc.tag_v23
FROM
/* 1. Gathering raw data: */
(SELECT
ncd.load_date,
sbx.load_dt,
sbx.trggr_dt,
sbx.client_dk,
eti.event_type_id,
CASE
WHEN eti.event_type_id IN ('22', '23', '24', '25', '26', '27', '28', '29') THEN sbx.tag_v21
ELSE Cast(NULL AS VARCHAR(200))
end AS tag_v21,
sbx.name_product,
sbx.tag_v22,
sbx.tag_v23
FROM SANDBOX.v_snb_xdl_dm_fvs_nps_sbx sbx
INNER JOIN SANDBOX.gtt_nps_calc_date ncd
ON ( sbx.load_dt = ncd.load_date )
INNER JOIN (
/* … additional data joins … */
) eti
) cc
/* 2. Enriching data with client master key – party external reference: */
LEFT JOIN SANDBOX.v_prd017_REF_PTY_EXTRNL_PTY ref
ON (Cast(cc.client_dk AS VARCHAR(40)) = ref.pty_host_id
AND ref.mdm_system_id=1201)
LEFT JOIN SANDBOX.v_prd017_pty cpty_1 ON ref.pty_id=cpty_1.pty_id
LEFT JOIN SANDBOX.v_prd017_pty_stts stts ON cpty_1.pty_id=stts.pty_id
WHERE stts.pty_stts_id=1401
AND stts.pty_stts_end_dt = DATE '9999-12-31'
/* 3. taking latest party key according to MDM system: */
GROUP BY cc.load_date,
cc.trggr_dt,
cc.client_dk,
cc.load_dt,
cc.event_type_id,
cc.tag_v21,
cc.name_product,
cc.tag_v22,
cc.tag_v23
) t1
/* 3.1. leave one row only with max(party_id) */
LEFT JOIN SANDBOX.v_prd017_pty cpty_2
ON t1.pty_id_max=cpty_2.pty_id
;

All in all, the following steps are performed:

  1. Getting raw data from the view v_snb_xdl_dm_fvs_nps_sbx
  2. Enriching the data with the client master key – party external reference
  3. Removing duplicates – leaving the latest entry, max(party_id)
  4. Inserting the result set to be processed in the next steps

Audit Execution Plan and Historical Performance

The best way to analyze historical data on query performance is to use PDCR (Performance Data Collection and Reporting) database and its visually engaging version – Teradata Viewpoint.

Every query executed on the RDBMS is logged there with crucial metadata such as the query text, user, start time, end time, resources consumed, and any errors that occurred.

So let us dig into some details from PDCR info database as well as Teradata Viewpoint Portlet.
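For reference, the numbers below can be pulled from the query log history with a simple query. This is only a sketch: it assumes the standard PDCR history table PDCRINFO.DBQLogTbl_Hst and its usual columns (AMPCPUTime, TotalIOCount, SpoolUsage); adjust the names to your installation.

/* Sketch: resource usage of the suspect query over the last 30 days. */
SELECT LogDate,
       UserName,
       StartTime,
       FirstRespTime,
       AMPCPUTime,                            /* CPU seconds summed over all AMPs */
       TotalIOCount,                          /* logical IO operations */
       SpoolUsage / (1024*1024*1024) AS spool_gb
FROM PDCRINFO.DBQLogTbl_Hst
WHERE LogDate BETWEEN Current_Date - 30 AND Current_Date
AND QueryText LIKE '%gtt_cl_nps_trg_1%'       /* the INSERT we are investigating */
ORDER BY LogDate, StartTime;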

What do the performance indicators say?

First of all, according to the PDCR database, the query utilizes an enormous amount of resources compared to the other queries of this particular Data Mart:

  1. CPU Time utilization ~ 160k CPU sec
  2. Spool Usage ~ 15 TB
  3. Total IO count ~ 55 million
  4. It takes around 55 minutes to process on average

A tremendous amount of memory and CPU seconds are used to store and compute intermediate query results. Further steps take far fewer resources to process. So there are certainly several problems that need to be addressed.

Secondly, here are some pictures from Teradata Viewpoint Query Spotlight portlet:

We can see here the same performance metrics in a more visually appealing way.

The next step would be an examination of the actual EXECUTION plan and discovering which steps (operations) result in the heaviest load. According to the Explain tab in the Teradata Viewpoint Portlet, the most expensive step is one particular JOIN, which took about 37 minutes (~90% of the time):

In this step the following actions are performed:

  1. Joining the table of master keys T_REF_PTY_EXTRNL_PTY, which has around 319 billion actual rows.
  2. Duplicating these 319 billion rows to every AMP on the system.
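For completeness, the per-step timings can also be pulled from the step-level query log. A sketch, assuming the standard PDCR step history table PDCRINFO.DBQLStepTbl_Hst and a hypothetical run date and QueryID:

/* Sketch: elapsed time per execution plan step of one logged query. */
SELECT StepLev1Num,
       StepName,                              /* e.g. JIN for a join step, RET for a retrieve */
       StepStartTime,
       StepStopTime
FROM PDCRINFO.DBQLStepTbl_Hst
WHERE LogDate = DATE '2018-11-20'             /* hypothetical run date */
AND QueryID = 123456789012345678              /* hypothetical QueryID of the slow request */
ORDER BY StepLev1Num;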

Comprehend Data Demographics

Now that we have identified the heaviest operation in the query, let us find out more about the data being joined.

The table T_REF_PTY_EXTRNL_PTY accumulates customers' keys from different banking systems, so that they can be matched in the following steps.

That helps find out more about a customer's profile, loans, expenses, preferences, and behavior patterns. This kind of knowledge helps the Bank offer personalized products and services at the right time and in the right situation.

We need to answer a number of simple questions:

  • How many rows are there in a table?
  • How many of them do we really need?
  • How selective is this data by specific columns?
  • Which data types are used and why?
  • How is the JOIN performed?

1. The table definition goes like this:

CREATE MULTISET TABLE SANDBOX.T_REF_PTY_EXTRNL_PTY , NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
mdm_system_id BIGINT NOT NULL ,
pty_host_id VARCHAR(40) CHARACTER SET Unicode NOT CaseSpecific NOT NULL,
pty_id BIGINT NOT NULL,
deleted_flag CHAR(1) CHARACTER SET Unicode NOT CaseSpecific NOT NULL DEFAULT 'N',
action_cd CHAR(1) CHARACTER SET Unicode NOT CaseSpecific NOT NULL,
workflow_run_id BIGINT ,
session_inst_id INTEGER ,
input_file_id BIGINT ,
info_system_id SMALLINT,
pty_extrnl_pty_id BIGINT NOT NULL,
pty_host_id_master_cd CHAR(1) CHARACTER SET Unicode NOT CaseSpecific
)
PRIMARY INDEX ( pty_id )
INDEX t_ref_pty_extrnl_pty_nusi_1 ( pty_extrnl_pty_id );

So basically, we see that the pty_host_id column (the customer's id in different source systems) is of type VARCHAR(40). That is because some systems store the customer's identity with additional symbols like '_', '-', '#', etc., so the column as a whole cannot be stored as a numeric data type.

2. The data distribution by source systems goes like this:

Most importantly, we only need a portion of this data! Note that in the source query only one specific source system is mapped, with the following filtering clause:

WHERE    mdm_system_id=1201 

So we don’t need the whole data from the table. We only need around 20% of the data.
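A simple profiling query is enough to see this distribution on your own system (table name as in the DDL above):

/* Row counts per source system: only mdm_system_id = 1201 is needed by the query. */
SELECT mdm_system_id,
       Count(*) AS row_cnt,
       Count(DISTINCT pty_host_id) AS distinct_host_ids
FROM SANDBOX.T_REF_PTY_EXTRNL_PTY
GROUP BY 1
ORDER BY row_cnt DESC;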

Moreover, this portion of the data is easily cast to a numeric data type, which is far cheaper to join on than a string field. In one of the next sections, I will show some benchmarking and a comparison of different join types.

3. Statistics of data for the optimizer

Make sure statistics are present and up-to-date on the tables and columns participating in a query. Statistics are one of the most crucial inputs on which the execution plan is based. Missing or stale statistics may result in Teradata choosing a wrong execution plan and even throwing an error, for example 'No more spool space'.

Check which statistics are collected on a certain table with the SHOW command:

SHOW STATS ON SANDBOX.v_prd017_REF_PTY_EXTRNL_PTY ;

COLLECT STATISTICS
-- default SYSTEM SAMPLE PERCENT
-- default SYSTEM THRESHOLD PERCENT
COLUMN ( deleted_flag,pty_host_id_master_cd,pty_id ) ,
COLUMN ( mdm_system_id,deleted_flag,pty_host_id ) ,
COLUMN ( deleted_flag,pty_host_id_master_cd ) ,
COLUMN ( pty_id ) ,
COLUMN ( deleted_flag ) ,
COLUMN ( pty_host_id_master_cd ) ,
COLUMN ( mdm_system_id ) ,
COLUMN ( pty_host_id ) ,
COLUMN ( mdm_system_id,pty_id ) ,
COLUMN ( pty_extrnl_pty_id )
ON SANDBOX.T_REF_PTY_EXTRNL_PTY ;

To see the actual statistics values and the date they were gathered, use the HELP command:

HELP STATS ON SANDBOX.v_prd017_REF_PTY_EXTRNL_PTY;
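If the statistics on the join and filter columns turn out to be missing or stale, refreshing them is a one-liner. A sketch (keep the column list aligned with how the table is actually filtered and joined):

/* Refresh statistics on the columns used for filtering and joining. */
COLLECT STATISTICS
COLUMN ( mdm_system_id ) ,
COLUMN ( pty_host_id ) ,
COLUMN ( mdm_system_id, pty_host_id ) ,
COLUMN ( pty_id )
ON SANDBOX.T_REF_PTY_EXTRNL_PTY;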

Addressing core problems

 

Taking everything into account, in order to improve the query performance we have to achieve several goals:

  1. Retrieve only the necessary portion of data from the largest table prior to joining.
  2. JOIN on numeric data wherever possible.
  3. Eliminate the duplication of a colossal amount of data on all AMPs.
  4. Use the QUALIFY clause to eliminate the unnecessary JOIN, GROUP BY and WHERE (filtering) steps.

Step #1. Joining the right and easy way

First of all, let us place a WHERE clause to filter the rows we want to JOIN before the join itself, as we only need around 20% of the whole table.

Secondly, let us cast this data to a numeric data type (BIGINT), as it will boost the performance significantly.
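Before relying on the cast, it is worth checking that every pty_host_id belonging to system 1201 really is purely numeric. A sketch of such a sanity check (it assumes REGEXP_SIMILAR is available on your Teradata release; TRYCAST would serve the same purpose on 16.20+):

/* Rows that would break Cast(Trim(pty_host_id) AS BIGINT); we expect zero of them. */
SELECT Count(*) AS non_numeric_cnt
FROM SANDBOX.T_REF_PTY_EXTRNL_PTY
WHERE mdm_system_id = 1201
AND REGEXP_SIMILAR(Trim(pty_host_id), '[0-9]+') = 0;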

Below are the BEFORE and AFTER versions of the source code:

/* BEFORE STATEMENT */

LEFT JOIN SANDBOX.v_prd017_REF_PTY_EXTRNL_PTY ref
ON (Cast(cc.client_dk AS VARCHAR(40)) = ref.pty_host_id
AND ref.mdm_system_id=1201)

/* AFTER STATEMENT */

LEFT JOIN
(SELECT
PTY_HOST_ID
, Cast(Trim(PTY_HOST_ID) AS BIGINT) AS PTY_HOST_ID_CAST
, pty_id
FROM SANDBOX.T_REF_PTY_EXTRNL_PTY
WHERE mdm_system_id=1201
) ref ON cc.id_target = ref.pty_host_id_cast

As a result, these simple steps improve the overall query performance drastically and make the Teradata optimizer choose a redistribution strategy for this data instead of duplication (which takes a tremendous amount of time).
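An easy way to confirm the new plan is to prefix a simplified version of the rewritten join with EXPLAIN and read the wording of the join step. A sketch (the quoted phrases are the typical Teradata EXPLAIN wording):

/* Run EXPLAIN on the rewritten join and inspect the plan text. */
EXPLAIN
SELECT sbx.client_dk, ref.pty_id
FROM SANDBOX.v_snb_xdl_dm_fvs_nps_sbx sbx
LEFT JOIN
(SELECT Cast(Trim(pty_host_id) AS BIGINT) AS pty_host_id_cast,
        pty_id
 FROM SANDBOX.T_REF_PTY_EXTRNL_PTY
 WHERE mdm_system_id = 1201
) ref
ON sbx.client_dk = ref.pty_host_id_cast;
/* In the plan, the pre-filtered spool should now be "redistributed by the hash code of"
   the join column instead of "duplicated on all AMPs". */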

Furthermore, I have done some detailed testing to benchmark different kinds of JOIN operation:

  • BIGINT = BIGINT
  • CAST(VARCHAR as BIGINT) = BIGINT
  • CAST(BIGINT as VARCHAR) = VARCHAR

Table with performance metrics is below:

As you can see, the best type of JOIN is on equality of numeric data types (e.g. BIGINT). This kind of join consumes 4 times fewer CPU seconds and almost 2 times fewer IO operations, because casting data types and comparing strings character by character requires far more CPU overhead. Joining on string fields also requires 25-30% more Spool space.
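If you want to reproduce such a benchmark, the idea is to run the same join three times with different key data types and then compare AMPCPUTime, TotalIOCount and SpoolUsage in the query log. A sketch against two hypothetical test tables, t_key_big (BIGINT column key_id) and t_key_str (the same key stored as VARCHAR column key_txt):

/* Variant 1: BIGINT = BIGINT, the cheapest join. */
SELECT Count(*)
FROM SANDBOX.t_key_big a
JOIN SANDBOX.t_key_big b ON a.key_id = b.key_id;

/* Variant 2: CAST(VARCHAR AS BIGINT) = BIGINT. */
SELECT Count(*)
FROM SANDBOX.t_key_str a
JOIN SANDBOX.t_key_big b ON Cast(Trim(a.key_txt) AS BIGINT) = b.key_id;

/* Variant 3: CAST(BIGINT AS VARCHAR) = VARCHAR, the most expensive join. */
SELECT Count(*)
FROM SANDBOX.t_key_big a
JOIN SANDBOX.t_key_str b ON Cast(a.key_id AS VARCHAR(40)) = b.key_txt;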

Step #2. QUALIFYing to remove duplicates

Instead of the excessive join, grouping and aggregate function, we may use a Teradata-specific clause to qualify the rows we need. This syntax lets us execute the query in a more efficient and elegant way. Let us see how we can do this:

/* BEFORE STATEMENT */

/* 3.3. taking latest party key according to MDM system: */
Max(cpty_1.pty_id) AS pty_id_max,

/* 3.1. taking latest party key according to MDM system: */
GROUP BY cc.load_date,
cc.trggr_dt,
cc.client_dk,
cc.load_dt,
cc.event_type_id,
cc.tag_v21,
cc.name_product,
cc.tag_v22,
cc.tag_v23

/* 3.2. leave one row only with max(party_id) */
LEFT JOIN SANDBOX.v_prd017_pty cpty_2
ON t1.pty_id_max=cpty_2.pty_id

/* AFTER STATEMENT */

QUALIFY Row_Number() Over (PARTITION BY cc.load_date,
cc.trggr_dt,
cc.client_dk,
cc.load_dt,
event_type_id,
cc.tag_v21,
cc.name_product,
cc.tag_v22,
cc.tag_v23
ORDER BY cpty_1.pty_id DESC
) = 1

Applying the QUALIFY clause enables us to keep only one actual data entry per client. To learn more about the QUALIFY clause, Row_Number() and other ordered analytical functions, please refer to the Teradata documentation.
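As a minimal standalone illustration of the pattern, here is a hypothetical example that keeps, for each client, only the row with the highest party id:

/* Hypothetical example: one row per client_id, the one with the highest pty_id. */
SELECT client_id,
       pty_id,
       event_dt
FROM SANDBOX.some_client_events
QUALIFY Row_Number() Over (PARTITION BY client_id
                           ORDER BY pty_id DESC) = 1;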

Assessing the results

As a result of this analysis, the next step is to implement this new source code, deploy it on testing and production environments and assess the results.

First of all, how has the execution plan changed?

  • The number of actual rows processed has decreased by a factor of about 2,000: from 319 billion to 157 million!
  • The actual time of this step has fallen to 26 seconds, compared to 70 minutes before (roughly 160 times faster)
  • We have got rid of the duplication of all rows to all AMPs, using redistribution of this chunk of data instead

See the actual performance metrics below:

  • Total CPU time required for processing has fallen by 60 times, because we now join and access the data in the best possible way
  • We have saved around 50 million IO operations on data blocks by not processing unnecessary data
  • Consequently, the amount of Spool Space used to store intermediate results has decreased by 55 times, from 14 TB to 250 GB
  • Most importantly, our query now takes only around 1 minute to process

Notice how we managed to improve the crucial performance indicators, system resource utilization and, most importantly, the overall time required to process the data. Fixing this particular query sped up our Data Mart calculation by almost 60 minutes, so our business users now get their data on time every day.

Artemiy Kozyr is a Data Engineer at Sberbank, Moscow, with a Master's degree in Computer Science.
He has 5 years of experience in Data Warehousing, ETL and Visualization for financial institutions.

Contacts:

[email protected]
http://linkedin.com/in/artemiykozyr/
