Saturday 16 August 2014

Data Warehouse Testing




The modern era is one of global competition, and to survive, a business has to clear every threat that stands in the way of its success. All too often, we tend to ignore minor issues that later turn out to be show-stoppers. These days, most organizations across the globe depend on IT services for making decisions. The purpose of this document is to throw light on one such approach, ETL testing, which comes under Data Warehouse Testing. A data warehouse is a repository of transaction data that has been extracted from source systems and transformed into the intended format for query and analysis, and that can finally be used for reporting. I am thankful to Infosys, which was my primary inspiration for undertaking this document as the topic for my BOK.



Data Warehouse
A data warehouse is a data repository that can contain data from different sources. The data loaded into a data warehouse may come from various sources, such as databases, application files (for example, MS Office files) or flat files. All this data must be extracted from these sources, transformed to a common format, and loaded into the data warehouse. Extraction, transformation and loading (ETL) is a critical step in any data warehouse implementation.
Example
Consider a simple example of a sales organization. The organization may have branches in multiple locations, various products for sale, and different sales and price orders in different geographies. The integration of data from the source system tables into the target system tables is therefore an essential element of a data warehouse.
Phases in Data Warehouse Testing-
·         Business Understanding
·         Test plan creation
·         Test case creation
·         Test data set up planning
·         Test data set up creation
·         Test case execution
·         Deployment in production environment
Types of Data Warehouse Testing-
·         Business Flow
·         One-Shot / Prospective
·         Data Quality
·         Incremental Load
·         Performance
·         Version
·         Production Checkout






Goals of Data Warehouse Testing-
·         No Data losses
·         Correct transformation rules
·         Data validation
·         Data validation for each and every record in the table
·         Regression Testing
·         Sampling
·         Post implementation

ETL JOBS
ETL stands for Extract, Transform and Load. As mentioned above, the data warehouse contains data from various sources. The ETL process extracts data from all these different types of systems and reorganizes it into the needed format so that it can be queried using SQL and used for reporting. Once the formatting is done, the data is fit to be loaded into the destination tables. Only after this procedure is the ETL process complete.
ETL includes running ETL jobs so that data gets populated into the needed tables. Every table is associated with its respective ETL jobs, and by running them we get the data populated in the tables. ETL jobs can be scheduled at some frequency, such as hourly, daily or monthly. Running these jobs on a platform starts the process of extracting data from the source. ETL jobs have SQL queries embedded in them, so once a job runs, data is pulled from the source and loaded into the destination tables based on the SQL query. The flow is Source ---> ETL ---> Destination tables. It works like an automated process: write the SQL query, embed it into a job and run it whenever needed.
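As a rough illustration, the kind of SQL a job might embed could look like the sketch below. The table and column names (SRC_SALES, DW_SALES_FACT and so on) are assumed purely for illustration and do not come from any particular system.

    -- Pull records from the source system and load them into the destination table.
    -- SRC_SALES and DW_SALES_FACT are assumed, illustrative names.
    INSERT INTO DW_SALES_FACT (SALE_ID, PRODUCT_CODE, SALE_AMOUNT, SALE_DATE)
    SELECT S.SALE_ID,
           S.PROD_CD,                      -- source column mapped to PRODUCT_CODE
           S.QTY * S.UNIT_PRICE,           -- simple transformation applied while loading
           S.SALE_DT
    FROM   SRC_SALES S
    WHERE  S.SALE_DT >= DATE '2014-08-01'; -- incremental window picked up by the scheduled job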








Launch of Jobs
To run ETL jobs, we have two options: either through $U or through Informatica.
In $U, ETLs are termed sessions, and in Informatica, they are termed workflows.
          $U – We run the respective session names of the jobs. Each session is associated with its own uprocs.
          Informatica – We run the respective workflows of the job to be run.

Launch of jobs through $U: -
·         Connect to the melange server and type the following commands:
sudo su - aeadm
export DISPLAY=<your system IP address>:0.0
loaduni
·         The $U screen pops up.
·         $U: - From the main screen, select the Launches screen.
Expected Launches Window
·         The ‘Expected Launches’ window appears. Right click on the Expected Launches tab and select ‘Create’. The ‘Create a launch’ window will appear.
·         The user should select the session name, set the Management unit as ‘-AEMU0000’, set ‘By-pass condition check’ to ‘Yes’ and click Confirm.

Job Monitor Window
·         Once the user confirms the job run, the status of the run can be viewed in the Job Monitor. If the status of the job in the Job Monitor is ‘Completed’, it has run successfully.

Launch of jobs through Informatica : -
Informatica is a widely used ETL tool for extracting the source data and loading it into the target after applying the required transformations. In Informatica, the ETLs are called workflows. Each workflow has a job associated with it. There are several components in Informatica:
·         Informatica PowerCenter Designer
·         Informatica PowerCenter Repository Manager
·         Informatica PowerCenter Workflow Manager
·         Informatica PowerCenter Workflow Monitor

We use the Informatica PowerCenter Workflow Manager to run the workflows. We log in with the proper credentials. The names of all the workflows are listed on the left-hand side. We then open the ‘Workflow Designer’ tool on the right-hand side, where we can see the details of the currently selected workflow or mapping. Select the workflow, right click and click on ‘Start Workflow’. The workflow begins running.

To view the status of the currently running workflows, we log into the Informatica PowerCenter Workflow Monitor, which displays all the workflows currently running and their status.

EXTRACT
This is the first stage of the ETL process. It involves extracting the data available from the source systems. Source systems may be flat files, Excel files, relational databases or other applications. Prior to extraction, as part of the ETL process, a mapping between the source and target objects is defined. The entities used in the source and target systems will differ and need a relationship to be created between them; defining this relationship is called mapping. In short, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing.
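For instance, a mapping might translate into an extraction query of the following shape. The names used here (CRM_CUSTOMER and its columns) and the Oracle-style TO_DATE function are assumptions made only to illustrate the idea of mapping source columns to target columns.

    -- Illustrative extraction query reflecting a source-to-target mapping
    -- (all table and column names are assumed, not taken from any real system).
    SELECT C.CUST_ID                        AS CUSTOMER_ID,    -- mapped one-to-one
           C.CUST_NM                        AS CUSTOMER_NAME,  -- renamed as per the mapping
           TO_DATE(C.DOB_TXT, 'DD-MM-YYYY') AS DATE_OF_BIRTH   -- converted to a single agreed format
    FROM   CRM_CUSTOMER C;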
TRANSFORM
The next step in the ETL process is the transformation, or cleansing, of the extracted data. This is also called the staging area. At this stage, several processes are carried out to modify the data according to the business requirements. This includes validating the given data from top to bottom. Testers should leave no loopholes for the development team to discover as issues later on.
The data retrieved on extraction is cleansed in this stage. The tables in the source system should match the data loaded into the destination tables, i.e. the data warehouse. This is done by making use of the source and target queries: get the source SQL and the target SQL, extract the data from the respective tables, and then validate the source data against the target data. A number of challenges may be faced while validating data between source and target.
Ø  Count Mismatch: - The record count for the source and target tables should be the same for the condition set on both. If the record count is not the same, the data in the two does not match. This might be because of the following reasons (see the queries sketched after this list):
·         The records in the source not present in the target
·         The records in the target not present in the source.
·         Duplicate records present in the source or the target.
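As a sketch, a count mismatch is usually narrowed down with queries of the following kind. The table names are assumed, and MINUS is Oracle-style syntax (other databases use EXCEPT).

    -- Compare record counts between source and target for the same condition.
    SELECT COUNT(*) FROM SRC_ORDERS WHERE ORDER_DT >= DATE '2014-08-01';
    SELECT COUNT(*) FROM TGT_ORDERS WHERE ORDER_DT >= DATE '2014-08-01';

    -- Records present in the source but missing in the target.
    SELECT ORDER_ID FROM SRC_ORDERS
    MINUS
    SELECT ORDER_ID FROM TGT_ORDERS;

    -- Duplicate records in the target.
    SELECT ORDER_ID, COUNT(*)
    FROM   TGT_ORDERS
    GROUP  BY ORDER_ID
    HAVING COUNT(*) > 1;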
Ø  Data Mismatch: - There might also be mismatches in the data present in the source and target. A data mismatch can be at the level of column names, field-level data, formatting issues and so on. Representing data in the same way in both source and target is an essential property that has to be checked in ETL testing. This may require manual formatting: the data needs to be brought to a common, accepted format so that no error is thrown when validating it.
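One simple way to surface field-level mismatches for matching keys is a comparison query along these lines (again, the table and column names are only illustrative):

    -- Rows whose amount differs between source and target for the same key.
    SELECT S.ORDER_ID,
           S.ORDER_AMOUNT AS SOURCE_AMOUNT,
           T.ORDER_AMOUNT AS TARGET_AMOUNT
    FROM   SRC_ORDERS S
    JOIN   TGT_ORDERS T ON T.ORDER_ID = S.ORDER_ID
    WHERE  S.ORDER_AMOUNT <> T.ORDER_AMOUNT;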
Ø  No primary key: - Sometimes the source or the target table may not have a unique key combination. This involves more effort in validation than the usual validation with primary keys. The tester needs to be clear about the data used in both source and target and the columns and data they contain.
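In that situation, one common approach, sketched here with assumed names, is to compare the full column combination together with its occurrence count on both sides:

    -- Compare complete rows and their occurrence counts when no unique key exists.
    SELECT ORDER_DT, PRODUCT_CODE, ORDER_AMOUNT, COUNT(*) AS OCCURRENCES
    FROM   SRC_ORDERS
    GROUP  BY ORDER_DT, PRODUCT_CODE, ORDER_AMOUNT
    MINUS
    SELECT ORDER_DT, PRODUCT_CODE, ORDER_AMOUNT, COUNT(*)
    FROM   TGT_ORDERS
    GROUP  BY ORDER_DT, PRODUCT_CODE, ORDER_AMOUNT;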
The validation and transformation of data can be carried out in multiple ways. This involves processes such as find and replace, case conversion, merging data, NULL value handling, data type conversion, splitting files into multiple files, removing duplicates, re-indexing existing data, and streamlining the formatting of information that has been merged from different parent databases.
There may be situations where there are field-level differences for a large number of records in the target when compared to the source. In this case, it is impractical for the tester to go and change the value of each affected field separately when the record count is huge. Field-level difference validation can be performed in the following ways (a combined SQL sketch follows this list):
Find and Replace – We can use the Find and Replace option when the values of a particular field have to be changed to a specified value for a large number of records.
Convert Case – This validation is done when the target records are in a different case, either upper or lower, compared to the source records.
Null value handling – This happens when a particular field in the target data is replaced by ‘NULL’ although the source records have proper data values for that field. The tester has to give prime importance to tracing out these kinds of issues, which often go unnoticed.
Data Type Conversion – This validation is done when the data types of the data in the source and target differ. The difference may be an outright data type mismatch, a difference in the range or size of the data types, and so on.
Duplicate Removal – When the target table data is loaded and validation is performed, the target data sometimes shows duplicate records, which results in count and data mismatches between the source and target. This might have occurred due to a data refresh issue or a job run issue. We need to identify the duplicate records, remove them, and then continue with the remaining data validation.
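The following sketch shows how several of these checks could be expressed in SQL while normalizing source data before comparison. The names and the specific functions are assumptions for illustration only.

    -- Normalize source data before comparing it with the target.
    SELECT UPPER(TRIM(C.CUST_NM))              AS CUSTOMER_NAME, -- convert case
           REPLACE(C.PHONE_NO, '-', '')        AS PHONE_NUMBER,  -- find and replace
           COALESCE(C.CITY, 'UNKNOWN')         AS CITY,          -- NULL value handling
           CAST(C.BALANCE_TXT AS DECIMAL(12,2)) AS BALANCE       -- data type conversion
    FROM   SRC_CUSTOMER C;

    -- Identify duplicate rows in the target before removing them.
    SELECT CUSTOMER_ID, COUNT(*)
    FROM   TGT_CUSTOMER
    GROUP  BY CUSTOMER_ID
    HAVING COUNT(*) > 1;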
The requirement may be that only certain columns of the source table need to be loaded into the target system, while the target system may or may not contain additional columns, so this kind of validation is performed during the transformation stage. Some validations also demand sorting on the basis of the primary keys, or, if there is no primary key, on a group of all key column combinations; unless the data is sorted, the validation may not return the right result. After cleansing, the data becomes consistent with the required data set. Once the data is cleansed, it has to be integrated; data integration is a vital part of data warehouse testing.
The next stage is aggregation of data. This usually comes into play when we need data segregated according to different criteria. If the user wants data that is spread over multiple rows to be summarized by some criterion, this is termed aggregation. For example, in the case of a sales organization, a manager may need the sales of a particular product in a particular geography. Aggregation enables easy reporting, as we are able to generate separate sets of data for the specific criteria provided by the user.
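A minimal aggregation sketch for the sales example might look like this (DW_SALES_FACT and its columns are assumed names):

    -- Total sales of each product per geography.
    SELECT GEOGRAPHY, PRODUCT_CODE, SUM(SALE_AMOUNT) AS TOTAL_SALES
    FROM   DW_SALES_FACT
    GROUP  BY GEOGRAPHY, PRODUCT_CODE;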
LOAD
This stage loads data into the target system, or data warehouse, and is the final stage of the ETL process. The load process interacts with a database, so the filters and constraints defined in the database schema are verified, which adds to the quality of the data.
Different ETL tools:

There are many ETL tools available today. Some of them are:

Informatica
Ab Initio
Cognos
DataStage, etc.

Validation check-points in ETL testing:
ETL testing is carried out at different stages, each of which is performed to validate data completeness, transformation logic, data quality and performance scalability. During ETL application testing, we test for the following:
1.    Record count verification – We check whether there is any variance in the record count for a particular table between the source and the target system. We write SQL queries to check the record count from both the source and the target tables.
2.    Database verification – In this, we check for changes made to the existing tables and also check the tables added in different schemas. This means we have to manually verify all the fields in the different tables across all the schemas. It also includes data type validation on both the source and target tables.
3.    History data verification – This is the entire field-to-field validation of all the data flowing from the source to the other schemas. Comparison is done for all the fields mapped between the source and target tables.
4.    Verification of calculated fields – In ETL testing, we may need to perform some calculations. There will be many fields which are obtained based on certain calculations, for example Total Cost = Sum (all the costs) or Available Qty = Ordered Qty – Shipped Qty. These are derived fields, and hence we need to test them separately (see the sketch below).
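As a rough sketch, a derived field such as Available Qty could be recomputed from the source and compared with the loaded value; the table and column names below are assumed for illustration.

    -- Flag rows where the loaded derived field differs from the recomputed value.
    SELECT T.ORDER_ID,
           T.AVAILABLE_QTY               AS LOADED_VALUE,
           S.ORDERED_QTY - S.SHIPPED_QTY AS RECOMPUTED_VALUE
    FROM   TGT_ORDER_SUMMARY T
    JOIN   SRC_ORDERS S ON S.ORDER_ID = T.ORDER_ID
    WHERE  T.AVAILABLE_QTY <> S.ORDERED_QTY - S.SHIPPED_QTY;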

Why is data warehouse testing important?
In a single data warehouse there are multiple tables and schemas, in which millions of records are transformed as per the business requirements.
There is every possibility that a minute change in the value of one field can affect thousands of records.
The high complexity of business rules and the business transformation of values in each field require thorough testing.

A data warehouse consists of many thousands of records inside many tables and schemas. According to the business requirements, users may go ahead and change records and apply transformations to them. Since data inside a data warehouse can be interdependent, the modification of one record may cause changes in many other records too.

Concept of ETL in Data warehouse:
A properly designed and implemented ETL process, which transforms data into information, is the most essential and critical part of the entire data warehouse effort. Once this step has been accomplished, the data warehouse should be able to provide fast and useful information to assist in optimizing the business and its processes.
Challenges in ETL testing:
·         Voluminous data to be handled
·         Complexity of the data
·         Inconsistent and redundant data
·         Increased number of source tables
·         Non availability of history data
·         The data used for predictions is different and has to be mapped to the actual data based on different business rules.

Benefits of ETL testing:
·         Provides accurate data
·         Resulting data is consistent and reliable
·         Ensures data security
·         Reduces the risk of data loss
·         Detects errors at an early stage
·         Efficient and user friendly
·         Clear and precise data
·         Validation of upstream and downstream data













