The modern era is one of global competition, and to survive, a business must address every threat to its success. Too often, we tend to ignore minor issues that later turn into show-stopping problems. These days, most organizations across the globe depend on IT services to support their decision making. The purpose of this document is to highlight one such area, ETL testing. ETL testing comes under Data Warehouse Testing. A data warehouse is a repository of transaction data that has been extracted from source systems, transformed into the intended format for query and analysis, and finally made available for reporting. I am thankful to Infosys, which was my primary inspiration for undertaking this document as the topic for BOK.
Data Warehouse
A data warehouse is a data repository that can contain data from different sources. The data loaded into a data warehouse may come from databases, application files (such as MS Office files) or flat files. All this data must be extracted from these sources, transformed to a common format, and loaded into the data warehouse. Extraction, transformation and loading (ETL) is a critical step in any data warehouse implementation.
Example
Take the simple example of a sales-based organization. The organization can have branches in multiple locations, various products for sale, and different sales volumes and prices in different geographies. So, the integration of data from the source system tables into the target system tables is an essential element of a data warehouse.
Phases in Data Warehouse Testing-
· Business Understanding
· Test plan creation
· Test case creation
· Test data set up planning
· Test data set up creation
· Test case execution
· Deployment in production environment
Types of Data Warehouse Testing-
· Business Flow
· One-Shot / Prospective
· Data Quality
· Incremental Load
· Performance
· Version
· Production Checkout
Goals of Data Warehouse Testing-
· No data losses
· Correct transformation rules
· Data validation
· Data validation for each and every record in the table
· Regression testing
· Sampling
· Post implementation
ETL JOBS
ETL stands for Extract, Transform and Load. As mentioned above, the data warehouse contains data from various sources. The ETL process extracts data from all these different types of systems and reorganizes it into the required format so that it can be queried using SQL and used for reporting. Once the formatting is done, the data is ready to be loaded into the destination tables. Only after this procedure is the ETL process complete.
ETL includes running ETL jobs so that data gets populated into the required tables. Every table is associated with its respective ETL jobs, and running them populates the data in the tables. ETL jobs can be scheduled at some frequency, such as hourly, daily or monthly. These jobs are run on a platform, which starts the process of extracting data from the source. ETL jobs have SQL queries embedded in them, so once the jobs are run, data is pulled from the source based on the SQL query and loaded into the destination tables: Source ---> ETL ---> Destination tables. It is essentially an automated process: write the SQL query, embed it into a job and run it whenever we want.
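As a minimal sketch, the kind of query embedded in such a job might look like the following; the table and column names (SRC_SALES, DW_SALES and so on) are assumptions used purely for illustration.
-- Hypothetical ETL step: pull records from a source table, apply simple
-- transformations and load them into the destination (warehouse) table.
INSERT INTO DW_SALES (SALE_ID, PRODUCT_CODE, GEOGRAPHY_CODE, SALE_AMOUNT, SALE_DATE)
SELECT
    S.SALE_ID,
    UPPER(S.PRODUCT_CODE),        -- normalise product codes to a common case
    S.REGION_CD,                  -- source region code mapped to the target geography code
    S.QUANTITY * S.UNIT_PRICE,    -- derive the sale amount
    S.SALE_DATE
FROM SRC_SALES S
WHERE S.SALE_DATE >= DATE '2014-01-01';   -- load window; the date literal syntax varies by database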
Launch of Jobs
To run ETL jobs, we have two options: either through $U or through Informatica. In $U, ETL jobs are termed sessions, and in Informatica they are termed workflows.
• $U – We run the respective sessions of the jobs. Each session is associated with its own uprocs.
• Informatica – We run the respective workflows of the jobs.
Launch of jobs through $U:
· Connect to the melange server and type the following:
sudo su - aeadm
export DISPLAY=<your system IP address>:0.0
loaduni
· The $U screen pops up.
· From the $U main screen, select the Launches screen.
Expected Launches Window
· The 'Expected Launches' window appears. Right-click on the Expected Launches tab and select 'Create'. The 'Create a launch' window will appear.
· Select the session name, set the Management unit to '-AEMU0000', set 'By-pass condition check' to 'Yes' and click Confirm.
Job Monitor Window
· Once the user confirms the job run, the status can be viewed in the Job Monitor. If the status of the job in the Job Monitor is 'Completed', the job has run successfully.
Launch of jobs through Informatica:
Informatica is a widely used ETL tool for extracting the source data and loading it into the target after applying the required transformations. In Informatica, the ETL jobs are called workflows, and each workflow has a job associated with it. Informatica has several components:
· Informatica PowerCenter Designer
· Informatica PowerCenter Repository Manager
· Informatica PowerCenter Workflow Manager
· Informatica PowerCenter Workflow Monitor
We use the Informatica PowerCenter Workflow Manager to run the workflows. We log in with the proper credentials, and the names of all the workflows are listed on the left-hand side. We then open the 'Workflow Designer' tool on the right-hand side, which shows the details of the currently selected workflow or mapping. Select the workflow, right-click and click 'Start Workflow', and the workflow begins running.
To view the status of the currently running workflows, we log into the Informatica PowerCenter Workflow Monitor, which displays all the running workflows and their status.
EXTRACT
This is the first stage in the ETL process and involves extracting the data available from the source systems. Source systems may be flat files, Excel files, relational databases, or other applications. Prior to extraction, as a part of the ETL process, a mapping between the source and target objects is defined. The entities used in the source and target systems will differ and need a relationship created between them; defining this relationship is called mapping. In short, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing.
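For illustration, a simple source-to-target mapping might be captured in an extraction query like the one below; the table and column names are assumptions, not taken from any real system.
-- Hypothetical extraction query: select only the mapped source columns and
-- rename them to the names used in the target model.
SELECT
    CUST_NO    AS CUSTOMER_ID,
    CUST_NAME  AS CUSTOMER_NAME,
    REGION_CD  AS GEOGRAPHY_CODE,
    CREATED_DT AS RECORD_CREATED_DATE
FROM SRC_CUSTOMER;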
TRANSFORM
The next step in the ETL process is the transformation, or cleansing, of the extracted data. This stage is also called the staging area. Here, several processes are carried out to modify the data according to the business requirements, including validation of the given data from top to bottom. Testers should leave no loopholes that would let the development team discover an issue at a later point in time.
The data retrieved during extraction is cleansed in this stage. The data in the source system tables should match the data loaded into the destination tables, i.e. the data warehouse. This is done by making use of source and target queries: get the source SQL and target SQL, extract the data from the respective tables, and then validate the source data against the target data. A number of challenges may be faced during this validation; some sample checks are sketched after the challenges below.
Count Mismatch: The record count for the source and target tables should be the same for the conditions set on both. If the record count is not the same, the data does not match. This might be because of one of the following reasons:
· Records present in the source but not in the target
· Records present in the target but not in the source
· Duplicate records present in the source or target
Data Mismatch: There might also be mismatches in the data present in the source and target. A data mismatch can occur at the level of column names, field-level data, formatting issues, etc. Representing data in the same way in both source and target is an essential property that has to be checked in ETL testing. This may require manual formatting: the data needs to be converted to a common, accepted format so that no error is thrown when it is validated.
No primary key: Sometimes the source or the target table may not have a unique key combination. Validating such tables takes more effort than the usual validation with primary keys. The tester needs to be clear about the data used for both source and target and the columns and data they contain.
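As a minimal sketch of the count and missing-record checks described above, queries along the following lines could be used; the table names (SRC_SALES, DW_SALES) and the key column (SALE_ID) are assumptions, and MINUS would be EXCEPT on some databases.
-- Record count comparison between source and target.
SELECT COUNT(*) AS SRC_COUNT FROM SRC_SALES;
SELECT COUNT(*) AS TGT_COUNT FROM DW_SALES;
-- Keys present in the source but missing in the target.
SELECT SALE_ID FROM SRC_SALES
MINUS
SELECT SALE_ID FROM DW_SALES;
-- Keys present in the target but missing in the source.
SELECT SALE_ID FROM DW_SALES
MINUS
SELECT SALE_ID FROM SRC_SALES;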
The validation and transformation of data can be carried out in multiple ways. This involves processes such as find and replace, case conversion, merging data, NULL value handling, data type conversion, splitting files into multiple files, removing duplicates, reindexing existing data, and streamlining the formatting of information that has been merged from different parent databases.
There may be situations where there are field-level differences in a large number of target records when compared to the source. In such cases, it is impractical for the tester to change the value of the affected field record by record when the record count is huge. Field-level difference validation can be performed in the following ways (a combined SQL sketch follows these items):
Find and Replace – We can use find and replace when a large number of record values for a particular field have to be changed to a specified value.
Convert Case – This validation is done when the target records are in a different case, either upper or lower, compared to the source records.
NULL value handling – This happens when a particular field in the target data is loaded as NULL even though the source records have proper data values for that field. The tester has to give prime importance to tracing out these kinds of issues, which often go unnoticed.
Data Type Conversion – This validation is done when the data types in the source and target differ. The difference may be a complete data type mismatch, a difference in the range or size of the data types, etc.
Duplicate Removal – When the target table data is loaded and validation is performed, the target data sometimes shows duplicate records, which results in count and data mismatches between the source and target. This might occur due to data refresh or job run issues. We need to identify the duplicate records, remove them, and continue with the remaining data validation.
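The following is a minimal sketch of how these field-level checks might be expressed in SQL; all table and column names are hypothetical, and functions such as REPLACE, UPPER and CAST vary slightly between databases.
-- Find and replace: change a known bad value of a field across many records.
UPDATE DW_SALES
SET PRODUCT_CODE = REPLACE(PRODUCT_CODE, 'OLD_CODE', 'NEW_CODE');
-- Convert case: find records that differ from the source only by letter case.
SELECT T.SALE_ID
FROM DW_SALES T
JOIN SRC_SALES S ON S.SALE_ID = T.SALE_ID
WHERE UPPER(T.PRODUCT_CODE) = UPPER(S.PRODUCT_CODE)
  AND T.PRODUCT_CODE <> S.PRODUCT_CODE;
-- NULL value handling: target fields loaded as NULL although the source has data.
SELECT T.SALE_ID
FROM DW_SALES T
JOIN SRC_SALES S ON S.SALE_ID = T.SALE_ID
WHERE T.PRODUCT_CODE IS NULL
  AND S.PRODUCT_CODE IS NOT NULL;
-- Data type conversion: compare values after casting both sides to a common type.
SELECT T.SALE_ID
FROM DW_SALES T
JOIN SRC_SALES S ON S.SALE_ID = T.SALE_ID
WHERE CAST(T.SALE_AMOUNT AS DECIMAL(12,2)) <> CAST(S.QUANTITY * S.UNIT_PRICE AS DECIMAL(12,2));
-- Duplicate removal: list keys that occur more than once in the target.
SELECT SALE_ID, COUNT(*) AS OCCURRENCES
FROM DW_SALES
GROUP BY SALE_ID
HAVING COUNT(*) > 1;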
The requirement may be that only certain columns of the source table need to be loaded into the target system, while the target system may or may not contain additional columns. This kind of validation is performed during the transformation stage. Some validations also demand sorting on the primary keys, or, if there is no primary key, on the combination of all key columns; unless the data is sorted, the validation may not return the right result. After cleansing, the data becomes consistent with the required data set. Once the data is cleansed, it has to be integrated, and data integration is a vital part of data warehouse testing.
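A sorted, column-restricted comparison of this kind might look like the sketch below, again using hypothetical table and column names.
-- Extract only the mapped columns from each side, ordered on the key
-- combination so the two result sets can be compared row by row.
SELECT SALE_ID, PRODUCT_CODE, QUANTITY * UNIT_PRICE AS SALE_AMOUNT
FROM SRC_SALES
ORDER BY SALE_ID, PRODUCT_CODE;
SELECT SALE_ID, PRODUCT_CODE, SALE_AMOUNT
FROM DW_SALES
ORDER BY SALE_ID, PRODUCT_CODE;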
The next stage is the aggregation of data. This comes into play when we need data summarized according to different criteria. If the user wants data that is spread across multiple rows to be summarized, this is termed aggregation. For example, in the case of a sales organization, a manager may need the sales of a particular product in a particular geography. This enables easy reporting, as we are able to generate separate data sets for the specific criteria provided by the user.
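A minimal aggregation of that kind could be expressed as follows; the table and column names are again hypothetical.
-- Total sales of each product per geography, the kind of summary a sales
-- manager would ask for.
SELECT GEOGRAPHY_CODE,
       PRODUCT_CODE,
       SUM(SALE_AMOUNT) AS TOTAL_SALES
FROM DW_SALES
GROUP BY GEOGRAPHY_CODE, PRODUCT_CODE;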
LOAD
This is the final stage of the ETL process, in which the data is loaded into the target system or data warehouse. The load process interacts with a database, so the constraints and conditions defined in the database schema are enforced, which adds to the quality of the data.
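A sketch of such a load step, assuming a hypothetical staging table STG_SALES feeding the warehouse table DW_SALES:
-- Load the transformed, staged records into the warehouse table; any constraints
-- defined on DW_SALES act as a final quality check during the load.
INSERT INTO DW_SALES (SALE_ID, PRODUCT_CODE, GEOGRAPHY_CODE, SALE_AMOUNT, SALE_DATE)
SELECT SALE_ID, PRODUCT_CODE, GEOGRAPHY_CODE, SALE_AMOUNT, SALE_DATE
FROM STG_SALES;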
Different ETL tools:
There are many ETL tools available today. Some of them are:
Informatica
Ab Initio
Cognos
DataStage
Validation check-points in ETL testing:
ETL testing is carried out at different stages, each of which validates data completeness, transformation logic, data quality and performance scalability. During ETL application testing, we test for the following:
1. Record count verification – We check whether there is any variance in the record count of a particular table between the source and the target system. We write SQL queries to check the record count in both the source and the target tables.
2. Database verification – Here we check the changes made to existing tables and also check the tables added in different schemas. This means we have to manually verify all the fields in the different tables across all the schemas. It also includes data type validation on both the source and target tables.
3. History data verification – This is the complete field-to-field validation of all the data flowing from the source to the other schemas. A comparison is done for all the fields mapped between the source and target tables.
4. Verification of calculated fields – In ETL testing, we may need to perform some calculations. There will be many fields that are obtained from calculations, e.g. Total Cost = Sum(all costs), Available Qty = Ordered Qty - Shipped Qty, etc. These are derived fields and hence need to be tested separately; a sketch of such a check follows this list.
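A minimal sketch of the calculated-field checks in point 4, assuming hypothetical order tables DW_ORDERS and DW_ORDER_LINES:
-- The stored total cost of each order should equal the sum of its line-item costs.
SELECT O.ORDER_ID, O.TOTAL_COST, SUM(L.LINE_COST) AS RECALCULATED_COST
FROM DW_ORDERS O
JOIN DW_ORDER_LINES L ON L.ORDER_ID = O.ORDER_ID
GROUP BY O.ORDER_ID, O.TOTAL_COST
HAVING O.TOTAL_COST <> SUM(L.LINE_COST);
-- Available quantity should equal ordered quantity minus shipped quantity.
SELECT ORDER_ID
FROM DW_ORDERS
WHERE AVAILABLE_QTY <> ORDERED_QTY - SHIPPED_QTY;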
Why is Data Warehouse testing important?
A single data warehouse contains many tables and schemas, in which millions of records are transformed as per the business requirements.
There is every possibility that a minute change in the value of one field can affect thousands of records.
The high complexity of the business rules and the business transformations applied to each field requires thorough testing.
A data warehouse consists of many thousands of records spread across many tables and schemas. According to the user requirements, users may change the records and apply transformations to them. Since the data inside a data warehouse is interdependent, modifying one record may cause changes to many other records too.
Concept of ETL in Data warehouse:
A properly designed and implemented ETL process transforms data into information and is the most essential and critical part of the entire data warehouse process. Once this step has been accomplished, the data warehouse should be able to provide fast and useful information to assist in optimizing the business and its processes.
Challenges in ETL testing:
· Voluminous data to be handled
· Complexity of the data
· Inconsistent and redundant data
· Increased number of source tables
· Non-availability of history data
· The data used for predictions is different and has to be mapped to the actual data, based on different business rules.
Benefits of ETL testing:
· Provides accurate data
· Resulting data will be consistent and reliable
· Ensures data security
· Easily reduces the risk of data loss
· Detects errors at an early stage
· Efficient and user friendly
· Clear and precise data
· Validation of upstream and downstream data