Record separators

So far, you've seen examples where awk automatically splits input line by line based on the newline character. Just like you can control how those lines are further split into fields using FS and other features, awk provides a way to control what constitutes a line in the first place. In awk parlance, the term record is used to describe the contents that get placed in the $0 variable. And similar to OFS, you can control the string that gets added at the end of the output by the print function. This chapter will also discuss how you can use special variables that have information related to record (line) numbers.

The example_files directory has all the files used in the examples.

Input record separator

The RS special variable is used to control how the input content is split into records. The default is the newline character, as evident from the examples used in the previous chapters. The special variable NR keeps track of the current record number.

# change the input record separator to a comma character
# note the content of the 2nd record where newline is just another character
$ printf 'this,is\na,sample,text' | awk -v RS=, '{print NR ")", $0}'
1) this
2) is
a
3) sample
4) text

Recall that the default FS will split the input record based on spaces, tabs and newlines. Now that you've seen how RS can be something other than \n, here's an example to show the full effect of the default field splitting.

$ s='   a\t\tb:1000\n\n\t \n\n123 7777:x  y \n \n z  :apple banana cherry'
$ printf '%b' "$s" | awk -v RS=: -v OFS=, '{$1=$1} 1'
a,b
1000,123,7777
x,y,z
apple,banana,cherry

Similar to FS, the RS value is treated as a string literal and then converted to a regexp. For now, consider an example with multiple characters for RS but without needing regexp metacharacters.

$ cat report.log
blah blah Error: second record starts
something went wrong
some more details Error: third record
details about what went wrong

# use 'Error:' as the input record separator
# print all the records that contain 'something'
$ awk -v RS='Error:' '/something/' report.log
 second record starts
something went wrong
some more details

If IGNORECASE is set, it will affect record separation as well, except when the record separator is a single character. You can work around that exception by using a character class.

$ awk -v IGNORECASE=1 -v RS='error:' 'NR==1' report.log
blah blah 

# when RS is a single character
$ awk -v IGNORECASE=1 -v RS='e' 'NR==1' report.log
blah blah Error: s
$ awk -v IGNORECASE=1 -v RS='[e]' 'NR==1' report.log
blah blah

The default line ending for text files varies between different platforms. For example, a text file downloaded from the internet or a file originating from Windows OS would typically have lines ending with carriage return and line feed characters. So, you'll have to use RS='\r\n' for such files. See also stackoverflow: Why does my tool output overwrite itself and how do I fix it? for a detailed discussion and mitigation methods.
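Here's a minimal sketch of the RS='\r\n' suggestion. The sample input is made up for illustration, and the command assumes an awk implementation such as GNU awk that accepts a multicharacter RS.

```shell
# Sample input: two records with DOS-style (CRLF) line endings.
# With RS='\r\n', the carriage return is removed along with the newline,
# so the last field is a clean '42' instead of '42' followed by '\r'.
out=$(printf 'apple 42\r\nfig 7\r\n' | awk -v RS='\r\n' '{print $2 ":" $1}')
printf '%s\n' "$out"
# prints:
# 42:apple
# 7:fig
```

With the default RS, the carriage return would stay attached to the last field and would typically cause the kind of overwritten terminal output discussed in the linked thread.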

Output record separator

The ORS special variable is used to customize the output record separator. ORS is the string that gets added to the end of every call to the print function. The default value for ORS is a single newline character, just like RS.

# change NUL record separator to dot and newline
$ printf 'apple\0banana\0cherry\0' | awk -v RS='\0' -v ORS='.\n' '1'
apple.
banana.
cherry.

$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.
# here ORS is an empty string
$ awk -v RS='-\n' -v ORS= '1' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.

Note that the $0 variable is assigned after removing trailing characters matched by RS. Thus, you cannot directly manipulate those characters. With tools that don't automatically strip the record separator, such as perl, the previous example can be solved as perl -pe 's/-\n//' msg.txt.
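If you'd rather stay within awk, one possible approach is to keep the default RS and decide the output separator based on whether the line ends with a hyphen. This is only a sketch of one alternative. The input is recreated with printf so that the command is self-contained.

```shell
# Same content as the msg.txt file used earlier.
# sub() returns 1 if it removed a trailing hyphen and 0 otherwise.
# ORS becomes an empty string for hyphenated lines (so the next line
# gets joined) and a newline for every other line.
out=$(printf 'Hello there.\nIt will rain to-\nday. Have a safe\nand pleasant jou-\nrney.\n' |
      awk '{ORS = sub(/-$/, "") ? "" : "\n"} 1')
printf '%s\n' "$out"
# prints:
# Hello there.
# It will rain today. Have a safe
# and pleasant journey.
```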

Many times, you need to change ORS depending upon the contents of the input record or some other condition. The cond ? expr1 : expr2 ternary operator is often used in such scenarios. The example below assumes that the number of input lines is evenly divisible by the group size; you'll have to add more logic if that is not the case.

# can also use RS instead of "\n" here
$ seq 6 | awk '{ORS = NR%3 ? "-" : "\n"} 1'
1-2-3
4-5-6
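When the number of input lines is not a multiple of three, the command above would leave a trailing - and no final newline. Here's one possible way to add the extra logic, shown as a sketch rather than the only solution: emit the separator before each record instead of after it, and add the final newline in the END block.

```shell
# Separator printed before each record:
#   first record        -> nothing
#   start of a new row  -> newline
#   otherwise           -> hyphen
# The END block adds the final newline, provided there was some input.
out=$(seq 7 | awk '{printf "%s%s", (NR==1 ? "" : NR%3==1 ? "\n" : "-"), $0}
                   END{if (NR) print ""}')
printf '%s\n' "$out"
# prints:
# 1-2-3
# 4-5-6
# 7
```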

If the last line of input doesn't end with the input record separator, a separator still gets added to the output when print is used, because ORS is always appended.

# here last line of the input doesn't end with a newline character
# but gets added via ORS when 'print' is used
$ printf '1\n2' | awk '1; END{print 3}'
1
2
3

Regexp RS and RT

As mentioned before, the value passed to RS is treated as a string literal and then converted to a regexp. Here are some examples.

# set input record separator as one or more digit characters
# print records containing both 'i' and 't'
$ printf 'Sample123string42with777numbers' | awk -v RS='[0-9]+' '/i/ && /t/'
string
with

# similar to FS, the value passed to RS is treated as a string
# which is then converted to a regexp, so need \\ instead of \ here
$ printf 'load;err_msg--ant,r2..not' | awk -v RS='\\W+' '/an/'
ant

The first record will be empty if RS matches from the start of the input file. However, if RS matches until the very last character of the input file, there won't be an empty record as the last record. This is different from how FS behaves if it matches until the last character.

# first record is empty and the last record is a newline character
# change 'echo' command to 'printf' and see what changes
$ echo '123string42with777' | awk -v RS='[0-9]+' '{print NR ") [" $0 "]"}'
1) []
2) [string]
3) [with]
4) [
]

# difference between FS and RS when they match till the end of the input
$ printf '123string42with777' | awk -v FS='[0-9]+' '{print NF}'
4
$ printf '123string42with777' | awk -v RS='[0-9]+' 'END{print NR}'
3

The RT special variable contains the text that was matched by RS. This variable gets updated for every input record.

# print record number and the value of RT for that record
# last record has empty RT because it didn't end with digits
$ echo 'Sample123string42with777numbers' | awk -v RS='[0-9]+' '{print NR, RT}'
1 123
2 42
3 777
4
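One use of RT is to put the matched separators back into the output after modifying the records. The sketch below assumes GNU awk, since RT is a gawk extension, and uses toupper only as a stand-in for whatever processing you need.

```shell
# Assign RT to ORS so that each record is followed by the exact text
# that separated it from the next record in the input.
# The last record has an empty RT, so nothing extra is appended to it
# (its own trailing newline is still part of the record).
out=$(printf 'Sample123string42with777numbers\n' |
      awk -v RS='[0-9]+' '{ORS = RT; print toupper($0)}')
printf '%s\n' "$out"
# prints (with GNU awk):
# SAMPLE123STRING42WITH777NUMBERS
```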

Paragraph mode

As a special case, when RS is set to an empty string, one or more consecutive empty lines are used as the input record separator. Consider the below sample file:

$ cat para.txt
Hello World

Hi there
How are you

Just do-it
Believe it

banana
papaya
mango

Much ado about nothing
He he he
Adios amigo

Here's an example of processing input paragraph wise:

# print all paragraphs containing 'do'
# note that there'll be an empty line after the last record
$ awk -v RS= -v ORS='\n\n' '/do/' para.txt
Just do-it
Believe it

Much ado about nothing
He he he
Adios amigo

The empty line at the end is a common problem when dealing with custom record separators. You could either process the output further to remove it or add logic to handle the issue in awk itself. Here's one possible workaround for the previous example:

# here ORS is left as the default newline character
# uninitialized variable 's' will be empty for the first match
# afterwards, 's' will provide the empty line separation
$ awk -v RS= '/do/{print s $0; s="\n"}' para.txt
Just do-it
Believe it

Much ado about nothing
He he he
Adios amigo

Paragraph mode is not the same as using RS='\n\n+' because awk does a few more operations when RS is empty. See gawk manual: multiline records for details. Important points are quoted below and illustrated with examples.

However, there is an important difference between RS = "" and RS = "\n\n+". In the first case, leading newlines in the input data file are ignored

$ s='\n\n\na\nb\n\n12\n34\n\nhi\nhello\n'

# paragraph mode
$ printf '%b' "$s" | awk -v RS= -v ORS='\n---\n' 'NR<=2'
a
b
---
12
34
---

# RS is '\n\n+' instead of paragraph mode
$ printf '%b' "$s" | awk -v RS='\n\n+' -v ORS='\n---\n' 'NR<=2'

---
a
b
---

and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done.

$ s='\n\n\na\nb\n\n12\n34\n\nhi\nhello\n'

# paragraph mode
$ printf '%b' "$s" | awk -v RS= -v ORS='\n---\n' 'END{print}'
hi
hello
---

# RS is '\n\n+' instead of paragraph mode
$ printf '%b' "$s" | awk -v RS='\n\n+' -v ORS='\n---\n' 'END{print}'
hi
hello

---

When RS is set to the empty string and FS is set to a single character, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS. When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: FS = " "

$ s='a:b\nc:d\n\n1\n2\n3'

# FS is a single character in paragraph mode
$ printf '%b' "$s" | awk -F: -v RS= -v ORS='\n---\n' '{$1=$1} 1'
a b c d
---
1 2 3
---

# FS is a regexp in paragraph mode
$ printf '%b' "$s" | awk -F'[:]' -v RS= -v ORS='\n---\n' '{$1=$1} 1'
a b
c d
---
1
2
3
---

# FS is a single character and RS is '\n\n+' instead of paragraph mode
$ printf '%b' "$s" | awk -F: -v RS='\n\n+' -v ORS='\n---\n' '{$1=$1} 1'
a b
c d
---
1
2
3
---
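A common way to take advantage of this behavior is to set FS to the newline character itself, so that every line of a paragraph becomes a field. The input below is a made-up two-paragraph sample, and the command is a sketch of that idea.

```shell
# In paragraph mode with FS set to newline, $1 is the first line of the
# paragraph and NF is the number of lines in it.
out=$(printf 'Hi there\nHow are you\n\nbanana\npapaya\nmango\n' |
      awk -v RS= -v FS='\n' '{print NF ": " $1}')
printf '%s\n' "$out"
# prints:
# 2: Hi there
# 3: banana
```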

NR vs FNR

There are two special variables related to record numbering. You've seen NR earlier in the chapter, but here are some more examples.

# same as: head -n2
$ seq 5 | awk 'NR<=2'
1
2

# same as: tail -n1
$ awk 'END{print}' table.txt
yellow banana window shoes 3.14

# change the first field content only for the second line
$ awk 'NR==2{$1="green"} 1' table.txt
brown bread mat hair 42
green cake mug shirt -7
yellow banana window shoes 3.14

All the examples with NR so far have been with a single input file. If there are multiple input files, you can choose between NR and the second special variable, FNR. The difference is that NR contains the total number of records read so far, whereas FNR contains the record number within only the current file being processed. Here are some examples to show them in action. You'll see more examples in later chapters as well.

$ awk -v OFS='\t' 'BEGIN{print "NR", "FNR", "Content"}
                   {print NR, FNR, $0}' report.log table.txt
NR      FNR     Content
1       1       blah blah Error: second record starts
2       2       something went wrong
3       3       some more details Error: third record
4       4       details about what went wrong
5       1       brown bread mat hair 42
6       2       blue cake mug shirt -7
7       3       yellow banana window shoes 3.14

# same as: head -q -n1
$ awk 'FNR==1' report.log table.txt
blah blah Error: second record starts
brown bread mat hair 42

For large input files, use exit to avoid unnecessary record processing.

$ seq 3542 4623452 | awk 'NR==2452{print; exit}'
5993
$ seq 3542 4623452 | awk 'NR==250; NR==2452{print; exit}'
3791
5993

# here is a sample time comparison
$ time seq 3542 4623452 | awk 'NR==2452{print; exit}' > f1
real    0m0.004s
$ time seq 3542 4623452 | awk 'NR==2452' > f2
real    0m0.395s

Summary

This chapter showed you how to change the way input content is split into records and how to set the string to be appended when print is used. The paragraph mode is useful for processing multiline records separated by empty lines. You also learned about two special variables related to record numbers and when to use them.

So far, you've used awk to manipulate file content without modifying the source file. The next chapter will discuss how to write back the changes to the original input files.

Exercises

The exercises directory has all the files used in this section.

1) The input file jumbled.txt consists of words separated by various delimiters. Display all words that contain an or at or in or it, one per line.

$ cat jumbled.txt
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic

$ awk ##### add your solution here
overcoats
furrowing
wavering
joint
intuition
titanic

2) Emulate paste -sd, with awk.

# this command joins all input lines with the ',' character
$ paste -sd, addr.txt
Hello World,How are you,This game is good,Today is sunny,12345,You are funny
# make sure there's no ',' at end of the line
# and that there's a newline character at the end of the line
$ awk ##### add your solution here
Hello World,How are you,This game is good,Today is sunny,12345,You are funny

# if there's only one line in input, again make sure there's no trailing ','
$ printf 'fig' | paste -sd,
fig
$ printf 'fig' | awk ##### add your solution here
fig

3) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.

$ awk ##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100

4) For the input file sample.txt, extract paragraphs containing do and exactly two lines.

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he

# note that there's no extra empty line at the end of the output
$ awk ##### add your solution here
Just do-it
Believe it

Much ado about nothing
He he he

5) For the input file sample.txt, change each paragraph to a single line by joining lines using . and a space character as the separator. Also, add a final . to each paragraph.

# note that there's no extra empty line at the end of the output
$ awk ##### add your solution here
Hello World.

Good day. How are you.

Just do-it. Believe it.

Today is sunny. Not a bit funny. No doubt you like it too.

Much ado about nothing. He he he.

6) The various input/output separators can be changed dynamically and come into effect during the next input/output operation. For the input file mixed_fs.txt, retain only the first two fields from each input line. The field separators should be space for the first two lines and , for the rest of the lines.

$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
green,brown,black,purple
apple,banana,cherry

$ awk ##### add your solution here
rose lily
pink blue
car,mat
green,brown
apple,banana

7) For the input file table.txt, print all the lines other than the second line.

$ awk ##### add your solution here
brown bread mat hair 42
yellow banana window shoes 3.14

8) For the table.txt file, print only the line number for lines containing air or win.

$ awk ##### add your solution here
1
3

9) For the input file table.txt, calculate the sum of numbers in the last column, excluding the second line.

$ awk ##### add your solution here
45.14

10) Print the second and fourth line for every block of five lines.

$ seq 15 | awk ##### add your solution here
2
4
7
9
12
14

11) For the input file odd.txt, surround all whole words with {} that start and end with the same word character. This is a contrived exercise to make you use the RT variable (sed -E 's/\b(\w)(\w*\1)?\b/{&}/g' odd.txt would be a simpler solution).

$ cat odd.txt
-oreo-not:a _a2_ roar<=>took%22
RoaR to wow-

$ awk ##### add your solution here
-{oreo}-not:{a} {_a2_} {roar}<=>took%{22}
{RoaR} to {wow}-

12) Print only the second field of the third line, if any, from these input files: addr.txt, sample.txt and copyright.txt. Consider space as the field separator.

$ awk ##### add your solution here
game
day
bla

13) The input file ip.txt has a varying number of empty lines between the records. Change them to always be two empty lines. Also, remove the empty lines at the start and end of the file.

$ awk ##### add your solution here
hello


world


apple
banana
cherry


tea coffee
chocolate

14) The sample string shown below uses cat as the record separator (irrespective of case). Display only the even numbered records separated by a single empty line.

$ s='applecatfigCaT12345cAtbananaCATguava:caT:mangocat3'
$ echo "$s" | awk ##### add your solution here
fig

banana

:mango

15) Input has the ASCII NUL character as the record separator. Change it to dot and newline characters as shown below.

$ printf 'apple\npie\0banana\ncherry\0' | awk ##### add your solution here
apple
pie.
banana
cherry.

CLI text processing with GNU awk