Introduction
This document describes the possible ways of using the
awk command in UNIX, its advantages and different usages with examples and an
ETL case study
Usage
awk –F\<<delimiter>> ‘{print $0}’ <<file
name>>
Ex: awk –F\|
‘{print $0}’ Unix_File.dat
This command will display all the records in the file
Unix_File.dat
‘|’ is the delimiter
‘$0’ refers to the entire set of records
This is as good as cat
<<filename>>
Purpose
The purpose of awk command is to know exactly what
value is present in particular column of a flat file on UNIX and also to
display the records that match a particular criteria. If cat command is used,
user may have to manually search for the value in bunch of records and it will
be very tedious. To make it easy grep command also could be used but filtering
criteria can be better applied using awk
Using awk we can:
1. Display entire contents of the file
2. Display data in particular fields of the file
3. Display the number of fields in each record
4. Display the rows matching a particular criterion
Example
to display entire file content:
Consider a flat file on UNIX with name Sample_file.txt
The contents of the file are as below:
cat Sample_file.txt
0^AWER^1A
1^DEcY^B9
2^ ^c1O
User wants to display all the records. The command to
be used is:
awk -F\^ '{print $0}' Sample_file.txt
Output:
0^AWER^1A
1^DEcY^B9
2^ ^c1O
Note:
‘^’ is the delimiter
‘$0’ refers to the record
set
Example
to display data in a particular field:
Taking the file used in
above example, user wants to display only the values in second column of the
file.
The command to be used is:
awk -F\^ '{print $2}'
Sample_file.txt
Output:
AWER
DEcY
Note:
‘^’ is the delimiter
‘$2’ refers to 2nd
column of the file
Example
to display the number of fields in every record of the file:
User wants to know the
number of fields in every record for the file Sample_Vendorfile.txt with below
content:
cat Sample_Vendorfile.txt
H|VenZ
D|2|AW23
D|3|Zwe3
D|4|$5T1
T|3
Command to be used:
awk -F\| '{print NF
"fields found in row#" NR}'
Sample_Vendorfile.txt
Output:
2fields found in row#1
3fields found in row#2
3fields found in row#3
3fields found in row#4
2fields found in
row#5
Note:
‘|’ is the delimiter
‘NF’ refers to number of
fields
‘NR’ refers to record
number
Example
to display rows matching particular criteria:
Taking the file used in
above example, user would like to display all the records whose 1st
field is D
Command to be used:
awk -F\| '($1=="D") {print $0}'
Sample_Vendorfile.txt
Output:
D|2|AW23
D|3|Zwe3
D|4|$5T1
Note:
$1 refers to 1st field, $2 refers to 2nd
field…….$X refers to the Xth field etc.
Sample ETL testing case study
UNIX
Server
|
Flat
file
|
Database
|
ETL
|
Fig 1.0
Consider the ETL scenario
in Fig 1.0 where data is being loaded into database tables from flat files on
UNIX
Tester has written few
test cases covering the below scenarios
1. Verify the length of
each field in the flat file
2. Verify the number of
fields per record
3. Perform data
validations between file data and table data
4. Verify that a
particular field is having a particular value only (if it is hardcoded as per
the design)
For example, consider an
ETL process between an external flat UNIX file and a database table
File
name:
Unix_Source_File.dat
File
Content:
1|Sashi|Tester|Infosys|TE
2|Koushik|Delivery Manager|Infosys|DM
3|Prabhu|Process Lead|Infosys|PL
Verifying
field length:
To test that the length of
the 5th field in the file is not more than 2 characters, this is the
command to be used:
awk –F\| ‘{print length($5)}’
Unix_Source_File.dat
Output:
2
2
2
Or this can be done like
this
awk -F\| ‘(length($5)!=2) { print “Field length
is not 2 for record#” NR}’
Unix_Source_File.dat
Output:
This will display the
message “Field length is not 2 for record#X” only for that particular record X
where length of the 5th field is not 2 characters
Verifying
the no. of fields per record:
To verify that the number
of fields per record is 5, tester can use the awk command like this:
awk -F\| '(NF!=5) {print
"Field number mismatch for record#"NR}' Unix_Source_File.dat
Output:
This will display the
message “Field number mismatch for record#X” only for that particular record X
where no. of fields is not 5
Data
validation:
Source
UNIX flat file
1|Sashi|Tester|Infosys|TE
2|Koushik|Delivery
Manager|Infosys|DM
3|Prabhu|Process
Lead|Infosys|PL
|
ETL
|
Target
database table
EMP_Table
|
Fig
2.0
As shown in fig 2.0, only the first 4 field’s data is
being loaded into target database. We can export both target and source data
onto a spread sheet and compare the data. The below steps have to be followed.
Step1: Execute below awk command on UNIX
awk -F\| '{print
$1"|",$2"|",$3"|",$4}' Unix_Source_File.dat
>Temp.txt
The data in Temp.txt would look like this
1|Sashi|Tester|Infosys
2|Koushik|Delivery Manager|Infosys
3|Prabhu|Process Lead|Infosys
Step2: Place the file Temp.txt onto local desktop using
Secure FX tool
Step3: Copy the data present in Temp.txt in MS excel
Step4: On database fire the below SQL query
Select
Emp_Id,Emp_name,Portfolio,Organization from EMP_Table;
Step5: Export the data resultant of above SQL query
onto the same spreadsheet of step3
Step6: Compare both source and target data
Verifying
data in a particular field:
To verify that the field #4 is having the value as only
“Infosys”, use awk command as below:
awk -F\| '($4!="Infosys") {print "Data
mismatch for record#"NR}' Unix_Source_File.dat
Output:
The above command will display the message “Data
mismatch for record#X” only for that particular record X where field# 4 data is
not “Infosys”
Advantages of awk command
1.
Helps to apply filters over files on UNIX
2.
Easy to locate and display data from huge
chunk of records in a file
3.
Data validation on UNIX flat files will be
easy with awk command
4.
Helps to reformat the data while displaying
5.
Helps in effective testing of ETL scenarios
Limitations
The awk command can be used only on serial ASCII files
which have data in delimited format.