Saturday 16 August 2014

Awk - UNIX Command and Its Use in ETL Testing


 

Introduction             

This document describes the possible ways of using the awk command in UNIX, its advantages and different usages with examples and an ETL case study

Usage

awk –F\<<delimiter>> ‘{print $0}’ <<file name>>
Ex:  awk –F\| ‘{print $0}’ Unix_File.dat
This command will display all the records in the file Unix_File.dat
‘|’ is the delimiter
‘$0’ refers to the entire set of records
This is as good as cat <<filename>>

Purpose

The purpose of awk command is to know exactly what value is present in particular column of a flat file on UNIX and also to display the records that match a particular criteria. If cat command is used, user may have to manually search for the value in bunch of records and it will be very tedious. To make it easy grep command also could be used but filtering criteria can be better applied using awk
Using awk we can:
1. Display entire contents of the file
2. Display data in particular fields of the file
3. Display the number of fields in each record
4. Display the rows matching a particular criterion




Example to display entire file content:
Consider a flat file on UNIX with name Sample_file.txt
The contents of the file are as below:
cat Sample_file.txt      
0^AWER^1A
1^DEcY^B9
2^    ^c1O
User wants to display all the records. The command to be used is:
awk -F\^ '{print $0}' Sample_file.txt
Output:
0^AWER^1A
1^DEcY^B9
2^    ^c1O
Note:
‘^’ is the delimiter      
‘$0’ refers to the record set

Example to display data in a particular field:
Taking the file used in above example, user wants to display only the values in second column of the file.
The command to be used is:
awk -F\^ '{print $2}' Sample_file.txt
Output:
AWER
DEcY
Note:
    ‘^’ is the delimiter
‘$2’ refers to 2nd column of the file

Example to display the number of fields in every record of the file:
User wants to know the number of fields in every record for the file Sample_Vendorfile.txt with below content:
cat Sample_Vendorfile.txt
H|VenZ
D|2|AW23
D|3|Zwe3
D|4|$5T1
T|3

Command to be used:
awk -F\| '{print NF "fields  found in row#" NR}' Sample_Vendorfile.txt
Output:
2fields found in row#1
3fields found in row#2
3fields found in row#3
3fields found in row#4
2fields found in row#5     

Note:
 ‘|’ is the delimiter
‘NF’ refers to number of fields
‘NR’ refers to record number
Example to display rows matching particular criteria:
Taking the file used in above example, user would like to display all the records whose 1st field is D
Command to be used:
awk -F\| '($1=="D") {print $0}' Sample_Vendorfile.txt
Output:
D|2|AW23
D|3|Zwe3
D|4|$5T1

Note:
$1 refers to 1st field, $2 refers to 2nd field…….$X refers to the Xth field etc.



            Sample ETL testing case study


UNIX Server
Flat file
Database
ETL
                                                Fig 1.0

Consider the ETL scenario in Fig 1.0 where data is being loaded into database tables from flat files on UNIX
Tester has written few test cases covering the below scenarios
1. Verify the length of each field in the flat file
2. Verify the number of fields per record
3. Perform data validations between file data and table data
4. Verify that a particular field is having a particular value only (if it is hardcoded as per the design)
For example, consider an ETL process between an external flat UNIX file and a database table
File name:  Unix_Source_File.dat

File Content:
1|Sashi|Tester|Infosys|TE   
2|Koushik|Delivery Manager|Infosys|DM
3|Prabhu|Process Lead|Infosys|PL
Verifying field length:
To test that the length of the 5th field in the file is not more than 2 characters, this is the command to be used:
awk –F\| ‘{print  length($5)}’  Unix_Source_File.dat
Output:
2
2
2
Or this can be done like this
awk  -F\| ‘(length($5)!=2) { print “Field length is not 2 for record#” NR}’  Unix_Source_File.dat
Output:
This will display the message “Field length is not 2 for record#X” only for that particular record X where length of the 5th field is not 2 characters  

Verifying the no. of fields per record:
To verify that the number of fields per record is 5, tester can use the awk command like this:
awk -F\| '(NF!=5) {print "Field number mismatch for record#"NR}' Unix_Source_File.dat


Output:
This will display the message “Field number mismatch for record#X” only for that particular record X where no. of fields is not 5

Data validation:
Source UNIX flat file
1|Sashi|Tester|Infosys|TE
2|Koushik|Delivery Manager|Infosys|DM
3|Prabhu|Process Lead|Infosys|PL
ETL
Target database table
EMP_Table
                                                          Fig 2.0

As shown in fig 2.0, only the first 4 field’s data is being loaded into target database. We can export both target and source data onto a spread sheet and compare the data. The below steps have to be followed.
Step1: Execute below awk command on UNIX
awk -F\| '{print $1"|",$2"|",$3"|",$4}' Unix_Source_File.dat >Temp.txt
The data in Temp.txt would look like this
1|Sashi|Tester|Infosys
2|Koushik|Delivery Manager|Infosys
3|Prabhu|Process Lead|Infosys
Step2: Place the file Temp.txt onto local desktop using Secure FX tool
Step3: Copy the data present in Temp.txt in MS excel
Step4: On database fire the below SQL query
Select Emp_Id,Emp_name,Portfolio,Organization from EMP_Table;
Step5: Export the data resultant of above SQL query onto the same spreadsheet of step3
Step6: Compare both source and target data
Verifying data in a particular field:        
To verify that the field #4 is having the value as only “Infosys”, use awk command as below:
awk -F\| '($4!="Infosys") {print "Data mismatch for record#"NR}' Unix_Source_File.dat

Output:
The above command will display the message “Data mismatch for record#X” only for that particular record X where field# 4 data is not “Infosys”

Advantages of awk command

1.    Helps to apply filters over files on UNIX
2.    Easy to locate and display data from huge chunk of records in a file
3.    Data validation on UNIX flat files will be easy with awk command
4.    Helps to reformat the data while displaying
5.    Helps in effective testing of ETL scenarios

Limitations

The awk command can be used only on serial ASCII files which have data in delimited format.