0. Introduction

This file collects code snippets and a small tutorial for working with AWK.

AWK is an excellent tool to filter and manipulate data that comes in rows and columns, which is very common with biological data. Awk requires no compiling and allows the user to use variables, numeric functions, string functions, and logical operators. Importantly, it deals quickly and effectively with very large data files.

Examples of things AWK can do:

  • Scan a file line by line
  • Split each input line into fields
  • Compare input lines/fields to a pattern
  • Perform action(s) on matched lines

Working with AWK requires a basic knowledge of how to use the command line. We have a small tutorial for this here.

Here, we test basic awk operations with several input files that can be found in the 1_Inputfiles folder. For example, we work with a file that has

  • 4 rows
  • 5 columns
  • The columns contain information for one experiment with control and nitrogen measurements (two things were measured)

In this tutorial we learn how to:

  • Subset dataframes by columns
  • Subset dataframes based on patterns
  • Merge dataframes
  • etc.

General info on this tutorial

  • grey box : The code that we want to execute
  • red box : Urgent comments ;-)
  • Code : In some cases the code might be hidden and you only see a little box at the right-hand side. Hidden code means that you should have enough knowledge from the previous sessions to do this step, or that it is a test of your skills. If you run into issues, you can click on the Code button to reveal the code and check how to run your command
  • hyperlink : If the text is highlighted like this during a text section, then you can click on it and a hyperlink gives you more info (either the publication or an online tutorial)
  • Exercise : This will appear if you should have the required background to finish all the steps yourself. If you run into trouble, you can find the answers by clicking the Code button, which will reveal the answer
  • Sometimes we also just ask questions for you to think about the output of certain programs. The answer is then hidden and you have to hover over the “spoiler” to see it

Basic introduction

A basic awk command looks as follows:

awk 'instructions' files

The input is read one line at a time. The instructions are enclosed in single quotes to protect them from the shell, since curly brackets or dollar signs would otherwise not be interpreted correctly.
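
To see why the quotes matter, here is a minimal sketch: without the single quotes, the shell would try to expand $1 itself before awk ever sees it.

#the single quotes protect $1 from the shell; awk prints the first field
echo "one two three" | awk '{print $1}'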

Same as with sed, we can issue multiple commands by using a semicolon.

awk 'pattern { action }' file

  • pattern = when the action is performed
  • action = what we want to do (see the sketch below)
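
As a small self-contained sketch (the real input files follow below), a pattern and an action combine like this:

#pattern: lines containing a b; action: print the whole line
printf "apple\nbanana\n" | awk '/b/ {print $0}'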

In case we want to run an awk script with more complex commands, we run:

awk -f script file
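
For illustration, a minimal sketch with a hypothetical script file named commands.awk could look like this (using one of the input files introduced below):

#store the instructions in a file (the file name is just an example)
echo '{print $1}' > commands.awk

#run the stored instructions on an input file
awk -f commands.awk 1_Inputfiles/Experiment1.txt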

Subsetting dataframes

  1. To test things you can use some files generated for this purpose in the 1_Inputfiles folder.

  2. By default the examples below will print to the screen. In case you want to save the output into a new file, you can redirect it with a >. Example: awk command 1_Inputfiles/Experiment1.txt > File_new.txt

  3. The files 1_Inputfiles/Experiment*.txt contain five columns: the experiment ID, the conditions, two measurements and a column with some comments. The file 1_Inputfiles/list contains a list of addresses, which uses the comma as a separator.

  4. If you want to follow this tutorial, do the following steps:

  • Make a new folder inside the workshop directory, name it Awk_tutorial
  • Go inside that folder
  • Copy the folder with the test data from /export/lv3/scratch/workshop_2021/Sxx_Practice/Awk/1_Inputfiles/
  • Familiarize yourself with the content of the individual files
#make folder
mkdir Awk_tutorial

#go inside that new folder
cd Awk_tutorial

#get the test data
cp -r /export/lv3/scratch/workshop_2021/Sxx_Practice/Awk/1_Inputfiles/ .

#check files
head 1_Inputfiles/list
head 1_Inputfiles/Experiment1.txt
head 1_Inputfiles/Metadata
head 1_Inputfiles/nameState

Splitting data by categories

Now imagine we want a separate table with the names of people for each different state. With awk that is relatively easy to do:

#split the list into one file per state (column 3 is used as the file name)
awk -F, '{print > $3}'  1_Inputfiles/list2

#use ls -l to see what happened
ls -l

#check the content of a file
head VA

To make this a bit nicer, we can even add an extension to the new files that are generated

#split the list again, this time adding a .txt extension to the new file names
awk -F, '{print > $3".txt"}'  1_Inputfiles/list2

#use ls -l to see what happened
ls -l

#check the content of a file
head VA.txt

We can also just print certain data, i.e. just the names

#only print the names (column 1) into the per-state files
awk -F, '{print $1 > $3".txt"}'  1_Inputfiles/list2

#use ls -l to see what happened
ls -l

#check the content of a file
head VA.txt

Combining several actions

We can combine multiple actions by using the semicolon (same as in sed)

#print the first three fields of each line, each on its own line
awk -F, '{print $1; print $2; print $3}' 1_Inputfiles/list

Options used:

  • -F: defines the input field delimiter; here we tell awk to only treat the comma as a delimiter, not the space

Add a column

We can use this if we want to add some extra information to our tables. Later, we can also do this in a more sophisticated way using if/else statements.

#add a new column in the beginning
awk 'BEGIN{FS=OFS="\t"}{print "redo",$0}' 1_Inputfiles/Experiment1.txt

If we run this, we see that we have a new first column followed by the original dataframe.

We can exactly control where we add new information. For example, with the code below, we add the extra column between the 2nd and 3rd column of the original dataframe.

#add a new column as new third column
awk 'BEGIN{FS=OFS="\t"}{print $1,$2,"redo",$3,$4,$5}' 1_Inputfiles/Experiment1.txt

Exercise:

Imagine we want to add a column with the T of the experiment. Add a new column at the end with 36C

#add a new column at the end
awk 'BEGIN{FS=OFS="\t"}{print $0,"36C"}' 1_Inputfiles/Experiment1.txt

Change the content of an existing column

Let’s say we do not like to use the word Experiment in our table, but would rather say Repetition. We can change the content of the first column like this:

#change the content in column 1 to Repetition1
awk 'BEGIN{FS=OFS="\t"}{$1="Repetition1"} {print $0 }' 1_Inputfiles/Experiment1.txt

Exercise

After plotting the data, we realized that we made a mistake with the measurements we initially labelled as control and N[0.4uM]: we never actually changed the N levels. Rename everything in column 2 to control.

#qst1
awk 'BEGIN{FS=OFS="\t"}{$2="control"} {print $0 }' 1_Inputfiles/Experiment1.txt

Split a column

We can also split columns. For example, take the Ex1 in the first column: maybe we want to split it before and after the x and print the results into new columns. This is a toy example, but when dealing with proteins you often have something like GenomeName-ProteinID, and for some operations we might want to have that information in different columns.

Some new syntax for AWK:

  • split() = the function ‘split’ divides a string into pieces
  • $1 = the column we want to split
  • “x” = the pattern we want to use for splitting
  • a = the name we give the ARRAY we generate. An array is similar to a variable, except that it can store more than one value; here it stores the different pieces that were split
  • a[1] = retrieve the first part of the split array (= E)
  • a[2] = retrieve the second part of the split array (= 1)

Here, we see the output and what our arrays are:

#split the 1st column at the x and print the two array elements generated in that step
awk 'BEGIN{FS=OFS="\t"}{split($1,a,"x")} {print $1,a[1],a[2],$2,$3}' 1_Inputfiles/Experiment1.txt

The output of this line of code should look like this:

Exercise

  • In column 5, split at the ’_’. Print the first two columns and the first and second part of the array
#qst1
awk 'BEGIN{FS=OFS="\t"}{split($5,a,"_")} {print $1,$2,a[1],a[2]}' 1_Inputfiles/Experiment1.txt
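
Coming back to the GenomeName-ProteinID example from above, a minimal sketch with a made-up ID could look like this:

#split a hypothetical protein ID at the dash and print genome and protein separately
echo "Genome1-Protein7" | awk '{split($1,a,"-"); print a[1], a[2]}'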

Dealing with more than one column

In this example we work with the split function we used above but the syntax stays the same for other commands.

More specifically, we want to do the following:

  1. Split the third column by the dot
  2. Split the fourth column by the dot
  3. Print the 1st and 2nd column, the first element of the first split (= a) and the first element of the second split (= b)

The curly brackets allow us to separate two different commands (version 1). We can also do this by using a semicolon (version 2).

#version 1
awk -F'\t' -v OFS='\t'  '{split($3,a,".")} {split($4,b,".")} {print $1,$2,a[1],b[1]}' 1_Inputfiles/Experiment1.txt

#version 2
awk -F'\t' -v OFS='\t'  '{split($3,a,"."); split($4,b,"."); print $1,$2,a[1],b[1]}' 1_Inputfiles/Experiment1.txt

Use search and replace

We can search for patterns, and if a pattern is found we can replace it with something else. For example, we can search for Ex and replace it with Experiment.

We can also use wildcards here. The default behaviour of wildcards is explained in more detail in the Unix tutorial.

#default
awk -F'\t' -v OFS='\t'  '{gsub(/Ex/,"Experiment")}{print $0}' 1_Inputfiles/Experiment1.txt

#use gsub together with wildcards
#Here, we replace control1 + control2 with blank
awk -F'\t' -v OFS='\t'  '{gsub(/control[12]/,"blank")}{print $0}' 1_Inputfiles/Experiment1.txt

The second example looks like this:

We can see that since we use the wildcard ([12]) both control1 and control2 are replaced with blank.

Exercise

  • replace the N[0.4uM] with nitrogen. Do not forget that brackets mean something else for awk and we need to tell it to use the brackets literally. If you do not know how to do this, check the general Unix tutorial
  • replace the Ex1 with repetition and the control[12] with blank
#question1
awk -F'\t' -v OFS='\t'  '{gsub(/N\[0.4uM\]/,"nitrogen")}{print $0}' 1_Inputfiles/Experiment1.txt

#question2
awk -F'\t' -v OFS='\t'  '{gsub(/Ex1/,"repetition"); gsub(/control[12]/,"blank")}{print $0}' 1_Inputfiles/Experiment1.txt

Only print rows that have a pattern of interest

We can also only print rows that we find relevant. I.e. we might only want to print the rows with our control measurements. Or we might want to only print the information from specific treatments, the metadata for certain organisms, etc…

New syntax that we use here:

  • ? = the preceding character is matched zero or one time (note: in awk regular expressions, ? does not stand for an arbitrary character as it does in the shell)
  • == = a logical operator/Boolean expression; this one means find a pattern that is equal to the pattern we gave
  • != = means unequal to
  • ~ = means find a pattern AND allow for wildcards (regular expressions)
  • || = the OR operator

#only print rows with control treatments (search across all columns)
awk -F'\t' -v OFS='\t'  '/control?/{print $0}' 1_Inputfiles/Experiment1.txt

#only print rows if we find the term control2 in column2
awk -F'\t' -v OFS='\t'  '$2=="control2"{print $0}' 1_Inputfiles/Experiment1.txt

#only print rows with the control in column2 (AND use wildcards)
awk -F'\t' -v OFS='\t'  '$2~"control?"{print $0}' 1_Inputfiles/Experiment1.txt

#print everything EXCEPT control2
awk -F'\t' -v OFS='\t'  '$2!="control2"' 1_Inputfiles/Experiment1.txt

#we can search for several patterns across different columns like this
awk -F'\t' -v OFS='\t' '$2=="N" || $3>6{print $0}' 1_Inputfiles/Experiment1.txt

Exercise

  • Find the rows in which the experiment was ok and where we did not have issues with contamination (the extra info is stored in column 5)
awk -F'\t' -v OFS='\t' '$5=="#all_good"{print $0}' 1_Inputfiles/Experiment1.txt
#or
awk -F'\t' -v OFS='\t' '$5~"good"{print $0}' 1_Inputfiles/Experiment1.txt

Only print data if a condition is met

Here: Only print rows if the second column has the pattern N

#search pattern across the whole file 
awk -F'\t' -v OFS='\t'  '/N/ {print $0}' 1_Inputfiles/Experiment1.txt
  
#search pattern in specific column
awk -F'\t' -v OFS='\t'  '$2 ~ "N" { print $0 }' 1_Inputfiles/Experiment1.txt

Subset dataframes by using cutoffs and relational operators

Relational operators

awk 'BEGIN {a = 10; b = 10; if (a == b) print "a==b"}'

awk 'BEGIN {a = 10; b = 20; if (a != b) print "a = "a,"b = "b,"therefore", "a!=b"}'
    
awk 'BEGIN {a = 10; b = 20; if (a <= b) print "a<=b"}'

awk 'BEGIN {a = 10; b = 10; if (a < b) print "a<b"; else print "false"}'

We might also want to subset data by numeric values. I.e. we might only want to retain rows where values are larger than 10, or larger than a certain e-value, etc.

This can be very helpful when subsetting BLAST tables.

We can combine two statements using ||, which means OR. Alternatively, we can use &&, which means AND.

#only print rows if the third column has values larger than 5
awk -F'\t' -v OFS='\t'  '$3>5' 1_Inputfiles/Experiment1.txt

#if the value in the third column is equal to 5.2 do NOT print
awk -F'\t' -v OFS='\t'  '$3!=5.2' 1_Inputfiles/Experiment1.txt

#if the third column is larger than 5 OR the fourth column is greater than 0.2 then print
awk -F'\t' -v OFS='\t'  '($3>5)||($4>0.2)' 1_Inputfiles/Experiment1.txt

#only print if the values in the third column are equal to or greater than 5.2
awk -F'\t' -v OFS='\t' '$3>=5.2' 1_Inputfiles/Experiment1.txt

#only print if the third column is equal/larger than 5.2 AND the fourth column is larger than 4.5
awk -F'\t' -v OFS='\t' '($3>=5.2)&&($4>4.5)' 1_Inputfiles/Experiment1.txt

Exercise

  • search for data, where we measured the N treatments and where the value in the third column is greater than 10
awk -F'\t' -v OFS='\t' '($2~"N")&&($3>10){print $0}' 1_Inputfiles/Experiment1.txt

Subsetting data and storing the results into different tables

Remember when we split our list into separate files depending on what states people lived in? We can also do this with numbers, and for example store numbers smaller than or equal to one in a new file.

awk -F'\t' -v OFS='\t' '{if($3<=1)print > "small.txt";else print > "large.txt"}' 1_Inputfiles/Experiment1.txt

#check if that worked
head small.txt 

Using a so-called if-else statement we say: if the value is less than or equal to 1, the record goes to “small.txt”, else to “large.txt”.
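
If we need more than two bins, we can chain conditions with else if. A sketch, assuming the same tab-separated layout and made-up cutoffs:

#route rows into three files depending on the value in column 3
awk -F'\t' -v OFS='\t' '{if($3<=1) print > "small.txt"; else if($3<=10) print > "medium.txt"; else print > "large.txt"}' 1_Inputfiles/Experiment1.txt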

Changing specific cells

This specific command can be quite useful when we for example want to change headers.

Here, we learn several new statements that can be used in AWK:

  • if = Do the action only IF a specific condition is met
  • NR = Number of records (i.e. the current row number)
  • NR==3 = We are in the 3rd row
  • if(NR==3) = if we are in row 3, do sth
  • if(NR==3) $2=“P” = if we are in row 3, replace the value in the 2nd column with P
#change the value in the 2nd column to P if we are in the 3rd ROW
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==3) $2="P"} {print $0 }' 1_Inputfiles/Experiment1.txt

Exercise

We made an error in our comment section (column 5). Instead of #all_good for our second measurements, we wanted to write #stunted_growth. Correct this error

#change the value in column 5 to #stunted_growth if we are in the 2nd ROW
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==2) $5="#stunted_growth"} {print $0 }' 1_Inputfiles/Experiment1.txt

AWK and variables

awk provides the “-v” option to pass arguments. Using this, we can pass a shell variable to it:

#define variable 
x=3

#add 3 to our third column
awk -v val=$x '{print $0"-->"$3+val}' 1_Inputfiles/Experiment1.txt

Math with awk

Using arithmetic operators

awk 'BEGIN {a = 50; b = 20; print "(a + b) = ", (a + b)}'

awk 'BEGIN {a = 50; b = 20; print "(a - b) = ", (a - b)}'
    
awk 'BEGIN {a = 50; b = 20; print "(a * b) = ", (a * b)}'

awk 'BEGIN {a = 50; b = 20; print "(a / b) = ", (a / b)}'

We can also use this on our data tables like this:

#add the values of the 3rd and 4th column
awk '{print $3+$4}' 1_Inputfiles/Experiment1.txt

#subtract
awk '{print $3-$4}' 1_Inputfiles/Experiment1.txt

#multiply
awk '{print $3*$4}' 1_Inputfiles/Experiment1.txt

Using variables

Using one variable

Here, we create a variable named SUM that stores the sum of the values of the 3rd and 4th column. We can then use this variable for printing.

awk '{SUM=$3+$4}{print $1,$2,SUM}' 1_Inputfiles/Experiment1.txt

Exercise

  • Subtract column 3 from column 4 and store the result in a variable with the name ‘diff’. Print the second column and diff
awk '{diff=$4-$3}{print $2,diff}' 1_Inputfiles/Experiment1.txt

Using several variables

Here, we again create the variable SUM. Then, we run a second command (using the semicolon), add 1 to the value stored in SUM and store the result in a new variable named SUM2.

awk '{SUM=$3+$4;SUM2=SUM+1}{print $1,$2,SUM,SUM2}' 1_Inputfiles/Experiment1.txt

In a more useful way, we can use this to create basic functions, for example to calculate the average and print it:

awk '{SUM = $3+$4; avg = SUM/2}{print $1,$2,avg}' 1_Inputfiles/Experiment1.txt

Summarizing data across columns

Now, we want to add all the values of the third column together:

  • SUM+=$3 = SUM is the variable we define; the ‘+=’ operator adds the value of column 3 of every row to it
  • 2x{} = 2 action blocks we are executing
  • END = marks the 2nd action block, which is only executed once the end of the file is reached
#using the variable SUM
awk '{SUM+=$3}END{print SUM}' 1_Inputfiles/Experiment1.txt

#using the variable new_name
awk '{New_name+=$3}END{print New_name}' 1_Inputfiles/Experiment1.txt

#add sum text to the results
awk '{SUM+=$3}END{print "The result is",SUM}' 1_Inputfiles/Experiment1.txt

#also here we can combine two actions into one command, i.e. summarize the 3rd and 4th column
awk '{SUM_Col3+=$3;SUMCol4+=$4}END{print SUM_Col3,SUMCol4}' 1_Inputfiles/Experiment1.txt
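
As a sketch of where this is heading: if we combine the sum with the built-in NR variable (introduced in more detail further below), we can compute the mean of a column in one go.

#divide the sum of column 3 by the number of rows
awk '{SUM+=$3}END{print "Mean =", SUM/NR}' 1_Inputfiles/Experiment1.txt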

If we do not use END this gets messed up, so be careful. Without this keyword, we print the running total for every row instead of only the final sum. You can check this unwanted behaviour with the code below.

#not using END
awk '{New_name+=$3}{print New_name}' 1_Inputfiles/Experiment1.txt

Summarizing across columns by using categories

In this example, we want to only summarize the values across our three categories: control1, control2 and N

New elements to keep in mind:

  • {SUM[$2]+=$3} = Sum column 3 based on the categories in column 2. SUM is stored in an array, where we have 3 indices (i): control1, control2 and N[0.4uM]
  • for(i in SUM) = a for loop that iterates over each of the 3 indices stored in our array
  • print i = print each index stored in SUM
  • print SUM[i] = print each calculated sum stored in the array
#using the default categories
awk '{SUM[$2]+=$3}END{for(i in SUM) print i,SUM[i]}' 1_Inputfiles/Experiment1.txt

#combining control1 and control2 into one category (notice that we can use pipes to combine different programs)
sed 's/control[12]/control/g' 1_Inputfiles/Experiment1.txt | awk '{SUM[$2]+=$4}END{for(i in SUM)print i,SUM[i]}'
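
Following the same logic, here is a sketch that also counts the rows per category in a second array, so that we can print a per-category average instead of just the sum:

#count rows per category and divide the sum by the count
awk '{SUM[$2]+=$3; CNT[$2]++}END{for(i in SUM) print i, SUM[i]/CNT[i]}' 1_Inputfiles/Experiment1.txt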

Using the built-in variables

The NF variable

The NF variable keeps a count of the fields within the current input record.

awk '{print  $0"-->"NF;}' 1_Inputfiles/Experiment1.txt

Question

Why does the 3rd row have 6 fields?

Remember that awk by default treats both the space and the tab as field separators. So, without specifying the tab as the only separator, #contamination yellow_discoloration? is seen as two separate fields.
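
You can verify this yourself: if we set the tab as the only separator, every row should report 5 fields.

#with -F'\t' the comment column stays one field
awk -F'\t' '{print $0"-->"NF;}' 1_Inputfiles/Experiment1.txt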

The NR variable

NR gives you the total number of records processed so far, i.e. the line number.

awk '{print $0"-->"NR;}' 1_Inputfiles/Experiment1.txt

The FILENAME variable

While this is not math, this is incredibly useful if you have thousands of files that you want to merge (as long as you can distinguish them by the file name). For example, if we have 100 different files with measurements from different experiments, we can combine them into one document, as long as they have the same layout in terms of nr. of columns and a useful header.

awk '{print FILENAME,$0}' 1_Inputfiles/Experiment1.txt

With our example we would get something like this:

Since the .txt is not pretty, we can also remove it by using gsub and a pipe. Pipes are explained in the general notebook but basically allow us to combine two commands into one. Notice that we escape the dot, so that gsub only matches the literal .txt.

awk '{print FILENAME,$0}' 1_Inputfiles/Experiment1.txt | awk '{gsub(/\.txt/,"")}{print $0}'
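
As a sketch of the merging use case: if several Experiment*.txt files with the same layout exist (here we only prepared Experiment1.txt and Experiment2.txt), a shell wildcard feeds them all into one awk call.

#label every row with the file it came from
awk '{print FILENAME,$0}' 1_Inputfiles/Experiment*.txt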

The FNR variable

When awk reads from multiple input files, the NR variable gives the total number of records across all input files, while FNR gives you the record number within each input file.

awk '{print FILENAME, FNR;}' 1_Inputfiles/Experiment1.txt 1_Inputfiles/Metadata 

Dealing with headers

Sometimes dealing with headers can be challenging, since they are an extra row we might want to exclude, for example when we want to summarize data. Now that we know about the special variables, we can deal with this, for example in the Experiment2.txt file that comes with a header.

#check how the file looks like
head 1_Inputfiles/Experiment2.txt

#remove the header
awk 'NR!=1{print $0}' 1_Inputfiles/Experiment2.txt
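
Note for multiple files: NR keeps counting across files, so NR!=1 would only remove the header of the first file. Using FNR instead removes the header of every file. A sketch (feeding the same file twice just to demonstrate the effect):

#FNR restarts at 1 for each input file
awk 'FNR!=1{print $0}' 1_Inputfiles/Experiment2.txt 1_Inputfiles/Experiment2.txt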

Exercise

  • For Experiment2.txt calculate the sum of the third plus the fourth column. Do this with and without removing the header
awk 'NR!=1{SUM=$3+$4}{print $1,$2,SUM}' 1_Inputfiles/Experiment2.txt

awk '{SUM=$3+$4}{print $1,$2,SUM}' 1_Inputfiles/Experiment2.txt

Using increment and decrement operators

  • ++a : increments the value of the operand by ‘1’; it first increments the value, then returns the incremented value
  • a++ : first returns the value of the operand, then it increments its value
  • printf = allows more options for formatted printing
awk 'BEGIN {a = 10; b = ++a; print b}'

awk 'BEGIN {a = 10; b = ++a; printf "a = %d, b = %d\n", a, b}'

awk 'BEGIN {a = 10; b = --a; printf "a = %d, b = %d\n", a, b}'
    
awk 'BEGIN {a = 10; b = a--; printf "a = %d, b = %d\n", a, b}'

We could, for example, tell awk to count the number of lines in which a specified pattern is matched and then output that ‘count’. For example, let’s count how often we did control measurements.

awk '/control/{++cnt} END {print "Count = ", cnt}'  1_Inputfiles/Experiment1.txt

Exercise

  • Count how often the experiment was good
  • Count how many control measurements and how many nitrogen measurements we have. Do not forget about your semicolons.
#qst1
awk '/good/{++cnt} END {print "Count = ", cnt}'  1_Inputfiles/Experiment1.txt

#qst2
awk '/control/{++cnt_c};/N/{++cnt_n} END {print "Count control = ", cnt_c, "Count nitrogen = ", cnt_n}'  1_Inputfiles/Experiment1.txt

Merging tables

Inside the file 1_Inputfiles/Metadata we have some additional information on how we did our different experiments, i.e. the date, temperature and how long they were run. With awk we can add this info to our 1_Inputfiles/Experiment1.txt table.

New things to learn:

  • FNR = record number (typically the line number per file)
  • NR = total record number across files
  • == = comparison operator that means equal to
  • a[$1] = We use the 1st column of 1_Inputfiles/Metadata as our KEY: do we find the ‘Ex1’ from 1_Inputfiles/Metadata also in 1_Inputfiles/Experiment1.txt?
  • a[$1]=$0 = in 1_Inputfiles/Metadata, use the first column as key AND store the info from all the columns. If we only wanted to have the fourth column, we could use a[$1]=$4 (see the sketch below)
awk 'FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' 1_Inputfiles/Metadata 1_Inputfiles/Experiment1.txt
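
As mentioned in the last bullet, here is a sketch of the variant that only carries over the fourth column of the metadata (assuming 1_Inputfiles/Metadata has at least four columns):

#store only column 4 of the metadata under the key
awk 'FNR==NR{a[$1]=$4;next}{print $0,a[$1]}' 1_Inputfiles/Metadata 1_Inputfiles/Experiment1.txt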

If you are interested in how the first condition, FNR==NR, works, we can break it down like this:

awk '{print "NR:", NR " FNR:", FNR}' 1_Inputfiles/Metadata 1_Inputfiles/Experiment1.txt

Which will give this:

The next statement means: while we are reading 1_Inputfiles/Metadata (identified via FNR==NR), we only store the data and skip the print. If we do not use next, the lines of 1_Inputfiles/Metadata themselves are printed as well.

Below you can test what happens if we do not use next

#example without using next
awk 'FNR==NR{a[$1]=$0}{print $0,a[$1]}' 1_Inputfiles/Metadata 1_Inputfiles/Experiment1.txt

The KEY a[$1] is important, as it tells awk where to look for a common pattern that can be used for merging. In this example, both files have Ex1 in the first column. If we used the second column, which has no common pattern, nothing would be merged.

AWK and sed

AWK and sed generally work nicely together and we can combine these two programs for our purposes using the | symbol.

As an example, let’s assume we already have a sed script (1_Inputfiles/nameState) ready that replaces the abbreviated state names, and we can pipe its output into awk. Before running the command, make sure to have a look at the sed script to understand what is happening.

#check the sed script
head 1_Inputfiles/nameState

We can see that we defined several lines of search and replace patterns, i.e. we want sed to

  • replace MA with Massachusetts
  • replace PA with Pennsylvania
  • and so on…

So we prepared a file with several search and replace patterns, which we can feed into sed with -f

#clean up the state names
sed -f 1_Inputfiles/nameState 1_Inputfiles/list

If we use our pipes we can then use awk to only print the state names

#print the state names only
sed -f 1_Inputfiles/nameState 1_Inputfiles/list | awk -F, '{print $4}'

In this example, awk is processing the output produced by the sed script.

AWK scripts

Same as for sed, we can write awk scripts if your commands get more complicated.

For example, we want to make a list for each state of who lives in that state.

For example, we can create a script that we open with nano byState, and inside the script we store the following lines:

#! /bin/sh
awk -F, '{print $4 ", " $0}' $* |
sort |
awk -F, '
$1 == LastState{print "\t" $2}
$1 != LastState{LastState = $1
      print $1
      print "\t" $2}
'

We execute this script like this:

sed -f 1_Inputfiles/nameState 1_Inputfiles/list | sh byState  

To dissect things, it is always useful to split a script apart; i.e. we can run the first lines like this:

#step1
sed -f 1_Inputfiles/nameState 1_Inputfiles/list | awk -F, '{print $4 ", " $0}' 

#step2
sed -f 1_Inputfiles/nameState 1_Inputfiles/list | awk -F, '{print $4 ", " $0}' | sort 

#step3
sed -f 1_Inputfiles/nameState 1_Inputfiles/list | awk -F, '{print $4 ", " $0}' | sort | awk -F, '$1 == LastState{print "\t" $2}$1 != LastState{LastState = $1;print $1; print "\t" $2}'

We see that the script:

  1. Adjusts the state names (the sed part does this)
  2. Puts the state name into the first column
  3. Sorts by the different states

In the second part of the command, we do some programming:

  1. We look at the first field of each record (the state) and check if it is the same as in the previous record.
  2. If they are not the same, the name of the state is printed, followed by the person’s name.
  3. If they are the same, then only the person’s name is printed.