Saturday 16 August 2014

ETL Testing With SAS Eg Tool




1.    Introduction:
SAS Enterprise Guide is used for publishing result by analyzing and manipulating the data.  It is a point and click tool which enable business user to perform task with the help of menu and wizard. It also has programming capability which generates code to speed up the productivity for analyses and forecasting the in real time data. It has other feature as centralized and role-based security. It allows easy access to enterprise data sources and any external file.

Based on your need and requirement, features of SAS EG can be explored. We have explored the features of the tool to perform ETL Testing.



a.    SAS can validate data from various heterogeneous sources such as Oracle, MS SQL, DB2, Teradata and Flat File.
b.    SAS provides record level comparison report.
c.    Delta load can be validated as we store previous snapshot of the target database in Dataset.
d.    SAS has in-build Application automation framework (AOF) which can be used to automate and schedule execution.
e.    SAS scripts are portable. Base SAS script written can be re-used across different environment.


Extract, Transform and Load (ETL) involves below steps:
1) Extract data from outside sources
2) Transform extracted data to fit business/operational needs
3) Load transformed data to the end target (Eg. An operational data store or a data mart or a extract or data warehouse)

ETL Testing aims at validating the ETL process. So in ETL Testing involves:
1)    Extracting source data from outside sources
2)    Transforming it to fit operational needs (which can include quality levels)
3)    Importing the Target data after success ETL.
4)    Comparison between Transformed source data and Target Data to find any discrepancies.





SAS EG is basically used as:
1)     Analytic and Reporting tool.
2)    ETL tool and
3)    Forecasting tool.
We are using SAS tool for testing ETL process.
In Enterprise Guide we have a mechanism to organize the work and can also group the sequence of tasks, data items and results into Projects. All the code and tasks that are executed within an Enterprise Guide session are available in the Active Project (the project in which the code is being executed) and each of them are displayed as separate nodes within Process Flow view and Project Tree view.


Useful Terms:
The files which are understandable by SAS are called as SAS files. All SAS file will reside in a SAS data library. There are three types of SAS files – SAS dataset, SAS catalogue and SAS stored program.
                          i.    SAS Dataset:
  A SAS Dataset contains the data in a structured format that SAS can process.
                         ii.    SAS Stored Program:
  SAS Stored program contains the compiled code that a user creates and saves for repeated use.

SAS files can be External File (ASCII or text file and files generated by other application apart from SAS) or Internal File (Created by SAS program). All internal SAS file is stored in a SAS library, which is a collection of SAS files. Based on operating environments, a library can be physical or logical collection of files.
By default, SAS defines several libraries:

                        i.      SAS helpis a permanent library that contains sample data and other files that control how SAS works at your site, this is a read-only library

                       ii.      SAS useris a permanent library that contains SAS files in the Profile catalog that store user’s personal settings. This is also a convenient place to store your own files. Permanent SAS libraries are available to you during subsequent SAS sessions as well. To store files permanently in a SAS data library, you specify a library name other than the default library name Work.

                      iii.      Work is a temporary library for files that do not need to be saved from session to session. Temporary SAS libraries last only for the current SAS session. If you don't specify a library name when you create a file (or if you specify the library name Work), the file is stored in the temporary SAS data library.


                        i.        Data Step:
SAS Dataset can be created using DATA step. Dataset is a collection of data and data dictionary. Data dictionary defines the variable and their properties. SAS procedures can read data which is in the form of a SAS dataset.

                       ii.        PROC Step:
PROC is abbreviation of PROCEDURE. A PROC step calls an in-build SAS procedure for analyzing or processing a SAS dataset.
Features:
·         Predefined routines to analyze and process the data in a SAS data set.
·         Present data in the form of report
o    create a report that lists the data
o    produce descriptive statistics
o    create a summary report
o    produce plots and charts
·         Create a new data set for storing the result of procedure.
·         OPERATIONS: list, sort, summarize data

For example: PROC SQL; PROC PRINT; PROC MEANS; PROC CONTENT;

External Files are those files whose structure of data is not recognized by SAS. Raw data, SAS program statements or procedure output can be the present in External Files.    
The below are few useful System options provided by SAS. These system options can be used in DATA Step.

PAGESIZE option specifies the maximum number of lines to print on each page of the output.
a.    SKIP:
SKIP option skips a specified number of lines in the SAS log.
b.    CENTER:
CENTER option verifies whether the output is centered. By default titles and procedure output are centered by SAS.
c.    LABEL:
This system option is used to assign descriptive portion to the SAS procedures.
d.    TITLE:
This system option prints titles at the top of each output page. By default, SAS prints the titles: “The SAS system”. TITLE system option can be used to replace the default title or specify other descriptive titles for SAS programs. NULL TITLE system option can be used to suppress a TITLE statement.



a.      PROC PRINT:
This Proc prints out a data set .The title statement is optional.
b.      PROC IMPORT:
PROC IMPORT is a very useful tool for converting external files (text, excel, xml etc.) to SAS data sets.                
c.      PROC SQL:
It is the SAS implementation of the Standard Query Language (SQL) standard to work with databases and is used in Teradata, Oracle, Microsoft SQL Server, Microsoft Access and other database systems. PROC SQL is a powerful language that allows sophisticated merges. When you merge datasets with PROC SQL you do not need to sort them. QUIT statement is used to end PROC SQL.  The SQL query is between PROC SQL and QUIT (as a very long statement).
d.      PROC Sort:
This Proc statement is used to sort the observations in the datasets.
e.      PROC Compare:
This Proc statement is used to compare two datasets, one of the two datasets involved in the comparison will be a Source dataset and the other dataset will be the target dataset.

f.       PROC PLOT:
This proc plots two variables. The plot command species the variables to be plotted.
g.      PROC UNIVARIATE and PROC MEANS:
Both these compute descriptive statistics (means, variances, other quantities). UNIVARIATE computes more descriptive statistics. MEANS gives more compact output and is easier to use to produce a new data set. BY command is used for descriptive statistics.



                              i.        To reference a permanent SAS data set in your SAS programs, you use a two-level name : libref.filename
                             ii.        To reference a temporary SAS data set in your SAS programs, you use a one-level name, the default libref Work is assumed.


                              i.        For many procedures and for some DATA step statements, data must be in the form of a SAS data set to be processed.
                             ii.        Conceptually, a SAS data set is a file that consists of two parts
§  A descriptor portion: The descriptor portion of a SAS data set contains information about the data set, which include name of the data set, date and time of data set creation, the number of observations(rows), the number of variables
§  A data portion: The data portion contains the actual data arranged in a rectangular tabular format.
                              i.        Sometimes a SAS data set also points to one or more indexes

Each time a step is executed, SAS generates a log of the processing activities and the results of the processing. The SAS log collects messages about the processing of SAS programs and about any errors that occur. You get separate sets of messages for each step in the program. All SAS programs produce log messages, but some SAS programs produce only log messages.
You can specify result formats to create your output as:
a.     SAS Report
b.     HTML document
c.     RTF document
d.     PDF
e.     Text Output
f.      

Below framework can be followed to validate target using ETL tool:


1.    Importing data from External file as SAS Dataset: Any external file such as delimited flat file (flat file, fixed length file, XML file) can be imported into SAS environment by using the “IMPORT” option form the FILE Menu. Below figure shows the process of importing a file into SAS EG.

2.    Importing data from relational database asSAS Dataset:
Importing target data from relational database can be done in two methods.
Method 1(By using “CONNECT TO” SAS keyword):
Below are the steps to import data into SAS EG by connecting to a relational database by using “CONNECT TO” keyword provided by SAS EG.
STEP1: Mention the database name like Oracle/Teradata/DB2 as Source.
STEP2: Enter the login Credentials of the database that needs to be connected.
STEP3: Provide the SERVER name and the DATABASE name of the database, from where the data is to be fetched.
STEP4: Embed the above written SQL query within the “PROC SQL” and “Quit” statements.

Method 2(By using “LIBNAME” statement of SAS EG):
Below are the steps to import data into SAS EG by connecting to a relational database by using “LIBNAME” keyword provided by SAS EG.

STEP1: Mention the database name like Oracle/Teradata/DB2.
STEP2: Enter the login Credentials of the database that needs to be connected.
STEP3: Provide the SERVER name and the DATABASE name of the database, from where the data is to be fetched.
STEP4: Assign a reference variable to this LIBNAME statement. This reference variable is used by SAS EG for connecting to a database.

3.    Creating scripts for Applying Transformation & Mapping rules: Using SAS Language element (Procedures and DATA Steps) scripts can be created to implement the mapping logics and the transformation rules provided in the mapping document.

4.    Execution of Scripts: Scripts written in above step is executed to generate Transformed dataset from source dataset. Refer the log to see the execution result. Remove any syntax/logical error while execution.

5.    Comparing the Transformed Dataset with target Dataset: Compare the above generated datasets by making using ‘PROC COMPARE’ procedure statement.

6.    Stored the comparison result in ODS: Comparison result can be stored into ODS for future processing and analysis.







This execution can be applicable for both Inbound and Outbound process in data Warehouse:
     Building Block:
§  Relative SAS Server paths
§  Environment variables
§  Schema details of source and target(if applicable)
Follow the below process for each extract:
     Import source and target data in SAS data sets
§  Create Load sas script to fetch source/target data in SAS (if source/target is extract file) (E.g. Asset_Load_ExtractName)
              OR
     Create setup sas script to fetch source/target data in SAS (If source/target is table) (E.g. Asset_Setup_ExtractName)

     Control and balance
§  Row Count:  Write Check Row Count script for checking Control and balance. (E.g.  Asset_CheckRowCount )

      Straight Move:
§  Write One To One Move script for Straight Move. (E.g. Asset_OneToOneMove)

     Transformed Move:
§  Create Mapping script for each transformation move. As per the test case document execute script corresponding to each transformation move. (E.g. Asset_ExtractName_Mapping)

     Column Comparison
§  Create Compare Column script for column to column comparison in source and target data.(E.g. Asset_CompareColumn )


SAS EG is a versatile tool with lot of capabilities. SAS EG is available to a wide range of users with a variety of skill levels. It includes easy-to-use interface and programmable capabilities. It has basic methods of data exploration with descriptive statistics. SAS Enterprise Guide enables you to manage and manipulate data effectively and efficiently to prepare it for analysis.




9 comments :

I must confess that it is very promising,if you are interested visit below link

http://www.tekclasses.com/

Nice list of Citations.

http://www.tekclasses.com/

Well,Certainly a treasured Article.if you are interested visit below link

http://www.tekclasses.com/

Your post about technology was very helpful to me. Very clear step-by-step instructions. I appreciate your hard work and thanks for sharing.
SAS Training in Chennai
SAS Course in Chennai

Your post is extremely awesome. This blog is extremely instructive and valuable. A debt of gratitude is in order for sharing this data.

Article Submission sites | Latest Updates | Technology

Much appreciated for clarifying for all intents and purposes. Incredible Post! Get more data.
Education | Article Submission sites | Technology

So glad I stumbled on this blog post! I'm going to post a link to you on my blog because I think this is such a great tutorial JAVA Training in Chennai |
JAVA Course in Chennai |
Best JAVA Training in Chennai

Thanks for sharing valuable information and very well explained. keep posting.

etl testing online
etl testing online training

Post a Comment