Saturday 16 August 2014

Awk - UNIX Command and Its Use in ETL Testing


 

Introduction             

This document describes the possible ways of using the awk command in UNIX, its advantages and different usages with examples and an ETL case study

Usage

awk –F\<<delimiter>> ‘{print $0}’ <<file name>>
Ex:  awk –F\| ‘{print $0}’ Unix_File.dat
This command will display all the records in the file Unix_File.dat
‘|’ is the delimiter
‘$0’ refers to the entire set of records
This is as good as cat <<filename>>

Purpose

The purpose of awk command is to know exactly what value is present in particular column of a flat file on UNIX and also to display the records that match a particular criteria. If cat command is used, user may have to manually search for the value in bunch of records and it will be very tedious. To make it easy grep command also could be used but filtering criteria can be better applied using awk
Using awk we can:
1. Display entire contents of the file
2. Display data in particular fields of the file
3. Display the number of fields in each record
4. Display the rows matching a particular criterion




Example to display entire file content:
Consider a flat file on UNIX with name Sample_file.txt
The contents of the file are as below:
cat Sample_file.txt      
0^AWER^1A
1^DEcY^B9
2^    ^c1O
User wants to display all the records. The command to be used is:
awk -F\^ '{print $0}' Sample_file.txt
Output:
0^AWER^1A
1^DEcY^B9
2^    ^c1O
Note:
‘^’ is the delimiter      
‘$0’ refers to the record set

Example to display data in a particular field:
Taking the file used in above example, user wants to display only the values in second column of the file.
The command to be used is:
awk -F\^ '{print $2}' Sample_file.txt
Output:
AWER
DEcY
Note:
    ‘^’ is the delimiter
‘$2’ refers to 2nd column of the file

Example to display the number of fields in every record of the file:
User wants to know the number of fields in every record for the file Sample_Vendorfile.txt with below content:
cat Sample_Vendorfile.txt
H|VenZ
D|2|AW23
D|3|Zwe3
D|4|$5T1
T|3

Command to be used:
awk -F\| '{print NF "fields  found in row#" NR}' Sample_Vendorfile.txt
Output:
2fields found in row#1
3fields found in row#2
3fields found in row#3
3fields found in row#4
2fields found in row#5     

Note:
 ‘|’ is the delimiter
‘NF’ refers to number of fields
‘NR’ refers to record number
Example to display rows matching particular criteria:
Taking the file used in above example, user would like to display all the records whose 1st field is D
Command to be used:
awk -F\| '($1=="D") {print $0}' Sample_Vendorfile.txt
Output:
D|2|AW23
D|3|Zwe3
D|4|$5T1

Note:
$1 refers to 1st field, $2 refers to 2nd field…….$X refers to the Xth field etc.



            Sample ETL testing case study


UNIX Server
Flat file
Database
ETL
                                                Fig 1.0

Consider the ETL scenario in Fig 1.0 where data is being loaded into database tables from flat files on UNIX
Tester has written few test cases covering the below scenarios
1. Verify the length of each field in the flat file
2. Verify the number of fields per record
3. Perform data validations between file data and table data
4. Verify that a particular field is having a particular value only (if it is hardcoded as per the design)
For example, consider an ETL process between an external flat UNIX file and a database table
File name:  Unix_Source_File.dat

File Content:
1|Sashi|Tester|Infosys|TE   
2|Koushik|Delivery Manager|Infosys|DM
3|Prabhu|Process Lead|Infosys|PL
Verifying field length:
To test that the length of the 5th field in the file is not more than 2 characters, this is the command to be used:
awk –F\| ‘{print  length($5)}’  Unix_Source_File.dat
Output:
2
2
2
Or this can be done like this
awk  -F\| ‘(length($5)!=2) { print “Field length is not 2 for record#” NR}’  Unix_Source_File.dat
Output:
This will display the message “Field length is not 2 for record#X” only for that particular record X where length of the 5th field is not 2 characters  

Verifying the no. of fields per record:
To verify that the number of fields per record is 5, tester can use the awk command like this:
awk -F\| '(NF!=5) {print "Field number mismatch for record#"NR}' Unix_Source_File.dat


Output:
This will display the message “Field number mismatch for record#X” only for that particular record X where no. of fields is not 5

Data validation:
Source UNIX flat file
1|Sashi|Tester|Infosys|TE
2|Koushik|Delivery Manager|Infosys|DM
3|Prabhu|Process Lead|Infosys|PL
ETL
Target database table
EMP_Table
                                                          Fig 2.0

As shown in fig 2.0, only the first 4 field’s data is being loaded into target database. We can export both target and source data onto a spread sheet and compare the data. The below steps have to be followed.
Step1: Execute below awk command on UNIX
awk -F\| '{print $1"|",$2"|",$3"|",$4}' Unix_Source_File.dat >Temp.txt
The data in Temp.txt would look like this
1|Sashi|Tester|Infosys
2|Koushik|Delivery Manager|Infosys
3|Prabhu|Process Lead|Infosys
Step2: Place the file Temp.txt onto local desktop using Secure FX tool
Step3: Copy the data present in Temp.txt in MS excel
Step4: On database fire the below SQL query
Select Emp_Id,Emp_name,Portfolio,Organization from EMP_Table;
Step5: Export the data resultant of above SQL query onto the same spreadsheet of step3
Step6: Compare both source and target data
Verifying data in a particular field:        
To verify that the field #4 is having the value as only “Infosys”, use awk command as below:
awk -F\| '($4!="Infosys") {print "Data mismatch for record#"NR}' Unix_Source_File.dat

Output:
The above command will display the message “Data mismatch for record#X” only for that particular record X where field# 4 data is not “Infosys”

Advantages of awk command

1.    Helps to apply filters over files on UNIX
2.    Easy to locate and display data from huge chunk of records in a file
3.    Data validation on UNIX flat files will be easy with awk command
4.    Helps to reformat the data while displaying
5.    Helps in effective testing of ETL scenarios

Limitations

The awk command can be used only on serial ASCII files which have data in delimited format.


11 comments :

brillant piece of information,this blog is fantastic...for more details

http://www.tekclasses.com/

Appreciation for nice Updates,nice blog article.for more details

http://www.tekclasses.com/

Very nice and informative blog.Nice piece of knowledge!For more details

http://www.tekclasses.com/

Nice post. By reading your blog, i get inspired and this provides some useful information. Thank you for posting this exclusive post for our vision. 
Oracle training in btm
Oracle Online Training
Oracle training in chennai
Oracle training in Bangalore

A very nice guide. I will definitely follow these tips. Thank you for sharing such detailed article. I am learning a lot from you.
Oracle training in pune
Oracle Online Training
Oracle training in Bangalore
Oracle training in Sholingaanallur
Oracle training in marathahalli

I admired your helpful words. Top class contribution. I really hope you’ll write more. I'll continue looking for.
Oracle apps training |
Oracle Training in Chennai

This comment has been removed by the author.


Very Nice Blog. Thank you for sharing such a wonderful details.
Swaran Homeopathic Interstitial Lung Disease homeopathy medicine Treatment offers a personalized and holistic approach to managing ILD symptoms. While further research is needed to establish its effectiveness, individuals interested in exploring alternative therapies for ILD can consider consulting with a qualified homeopathic practitioner Dr. Amit Mehta a well-known name in homeopathy to discuss their options and develop a comprehensive treatment plan..



For More Information Contact us Now - 9310212429

https://swaranhomoeopathic.com/interstitial-lung-disease-treatment/

H-649/650 Palam extn, Sector-7, Dwarka, Delhi-110075

Post a Comment