join
The join
command helps you to combine lines from two files based on a common field. This works best when the input is already sorted by that field.
Default join
By default, join
combines two files based on the first field content (also referred as key). Only the lines with common keys will be part of the output.
The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). Rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.
# sample sorted input files
$ cat shopping_jan.txt
apple 10
banana 20
soap 3
tshirt 3
$ cat shopping_feb.txt
banana 15
fig 100
pen 2
soap 1
# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1
If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join
will also ensure to add a final newline character even if it wasn't present in the input.
$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y
Note that the collating order used for
join
should be same as the one used tosort
the input files. Usejoin -i
to ignore case, similar tosort -f
usage.
If the input files are not sorted,
join
will produce an error if there are unpairable lines. You can use the--nocheck-order
option to ignore this error. However, as per the documentation, this option "is not guaranteed to produce any particular output."
Non-matching lines
By default, only the lines having common keys are part of the output. You can use the -a
option to also include the non-matching lines from the input files. Use 1
and 2
as the argument for the first and second file respectively. You'll later see how to fill missing fields with a custom string.
# includes non-matching lines from the first file
$ join -a1 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
soap 3 1
tshirt 3
# includes non-matching lines from both the files
$ join -a1 -a2 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
fig 100
pen 2
soap 3 1
tshirt 3
If you use -v
instead of -a
, the output will have only the non-matching lines.
$ join -v2 shopping_jan.txt shopping_feb.txt
fig 100
pen 2
$ join -v1 -v2 shopping_jan.txt shopping_feb.txt
apple 10
fig 100
pen 2
tshirt 3
Change field separator
You can use the -t
option to specify a single byte character as the field separator. The output field separator will be same as the value used for the -t
option. Use \0
to specify NUL as the separator. Empty string will cause entire input line content to be considered as keys. Depending on your shell you can use ANSI-C quoting to use escapes like \t
instead of a literal tab character.
$ cat marks.csv
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
$ cat dept.txt
CSE
ECE
# get all lines from marks.csv based on the first field keys in dept.txt
$ join -t, <(sort marks.csv) dept.txt
CSE,Amy,67
CSE,Moi,62
CSE,Surya,81
ECE,Joel,72
ECE,Om,92
ECE,Raj,53
Files with headers
Use the --header
option to ignore first lines of both the input files from sorting consideration. Without this option, the join
command might still work correctly if unpairable lines aren't found, but it is preferable to use --header
when applicable. This option will also help when --check-order
option is active.
$ cat report_1.csv
Name,Maths,Physics
Amy,78,95
Moi,88,75
Raj,67,76
$ cat report_2.csv
Name,Chemistry
Amy,85
Joel,78
Raj,72
$ join --check-order -t, report_1.csv report_2.csv
join: report_1.csv:2: is not sorted: Amy,78,95
$ join --check-order --header -t, report_1.csv report_2.csv
Name,Maths,Physics,Chemistry
Amy,78,95,85
Raj,67,76,72
Change key field
By default, the first field of both the input files are used to combine the lines. You can use -1
and -2
options followed by a field number to specify a different field number. You can use the -j
option if the field number is the same for both the files.
Recall that the key field is the first field in the output. You'll later see how to customize the output field order.
$ cat names.txt
Amy
Raj
Tia
# combine based on the second field of the first file
# and the first field of the second file (default)
$ join -t, -1 2 <(sort -t, -k2,2 marks.csv) names.txt
Amy,CSE,67
Raj,ECE,53
Raj,EEE,88
Tia,EEE,72
Customize output field list
Use the -o
option to customize the fields required in the output and their order. Especially useful when the first field isn't the key. Each output field is specified as file number followed by a .
character and then the field number. You can specify multiple fields separated by a ,
character. As a special case, you can use 0
to indicate the key field.
# output field order is 1st, 2nd and 3rd fields from the first file
$ join -t, -1 2 -o 1.1,1.2,1.3 <(sort -t, -k2,2 marks.csv) names.txt
CSE,Amy,67
ECE,Raj,53
EEE,Raj,88
EEE,Tia,72
# 1st field from the first file, 2nd field from the second file
# and then 2nd and 3rd fields from the first file
$ join --header -t, -o 1.1,2.2,1.2,1.3 report_1.csv report_2.csv
Name,Chemistry,Maths,Physics
Amy,85,78,95
Raj,72,67,76
Same number of output fields
If you use auto
as the argument for the -o
option, first line of both the input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.
$ join <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q r x y z
$ join -o auto <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q x y
If the other lines have lesser number of fields, the -e
option will determine the string to be used as a filler (empty string is the default).
# the second line has two empty fields
$ join -o auto <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p x
$ join -o auto -e '-' <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p - x -
As promised earlier, here are some examples of filling fields for non-matching lines:
$ join -o auto -a1 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
soap 3 1
tshirt 3 NA
$ join -o auto -a1 -a2 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
fig NA 100
pen NA 2
soap 3 1
tshirt 3 NA
Set operations
This section covers whole line set operations you can perform on already sorted input files. Equivalent sort
and uniq
solutions will also be mentioned as comments (useful for unsorted inputs). Assume that there are no duplicate lines within an input file.
These two sorted input files will be used for the examples to follow:
$ paste colors_1.txt colors_2.txt
Blue Black
Brown Blue
Orange Green
Purple Orange
Red Pink
Teal Red
White White
Here's how you can get union and symmetric difference results. Recall that -t ''
will cause the entire input line content to be considered as keys.
# union
# unsorted input: sort -u colors_1.txt colors_2.txt
$ join -t '' -a1 -a2 colors_1.txt colors_2.txt
Black
Blue
Brown
Green
Orange
Pink
Purple
Red
Teal
White
# symmetric difference
# unsorted input: sort colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v1 -v2 colors_1.txt colors_2.txt
Black
Brown
Green
Pink
Purple
Teal
Here's how you can get intersection and difference results. The equivalent comm
solutions for sorted input is also mentioned in the comments.
# intersection, same as: comm -12 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt | uniq -d
$ join -t '' colors_1.txt colors_2.txt
Blue
Orange
Red
White
# difference, same as: comm -13 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v2 colors_1.txt colors_2.txt
Black
Green
Pink
# difference, same as: comm -23 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt colors_2.txt | uniq -u
$ join -t '' -v1 colors_1.txt colors_2.txt
Brown
Purple
Teal
As mentioned before, join
will display all the combinations if there are duplicate entries. Here's an example to show the differences between sort
, comm
and join
solutions for displaying common lines:
$ paste list_1.txt list_2.txt
apple cherry
banana cherry
cherry mango
cherry papaya
cherry
cherry
# only one entry per common line
$ sort list_1.txt list_2.txt | uniq -d
cherry
# minimum of 'no. of entries in file1' and 'no. of entries in file2'
$ comm -12 list_1.txt list_2.txt
cherry
cherry
# 'no. of entries in file1' multiplied by 'no. of entries in file2'
$ join -t '' list_1.txt list_2.txt
cherry
cherry
cherry
cherry
cherry
cherry
cherry
cherry
NUL separator
Use the -z
option if you want to use NUL character as the line separator. In this scenario, join
will ensure to add a final NUL character even if not present in the input.
$ join -z <(printf 'a 1\0b x') <(printf 'a 2\0b y') | cat -v
a 1 2^@b x y^@
Alternatives
Here are some alternate commands you can explore if join
isn't enough to solve your task. These alternatives do not require input to be sorted.
- zet — set operations on one or more input files
- Comparing lines between files section from my GNU grep ebook
- Two file processing chapter from my GNU awk ebook, has examples for both line and field based comparisons
- Two file processing chapter from my Perl one-liners ebook, has examples for both line and field based comparisons
Exercises
The exercises directory has all the files used in this section.
Assume that the input files are already sorted for these exercises.
1) Use appropriate options to get the expected outputs shown below.
# no output
$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')
# expected output 1
##### add your solution here
fig 5 10
# expected output 2
##### add your solution here
apple 2
fig 5 10
mango 4
2) Use the join
command to display only the non-matching lines based on the first field.
$ cat j1.txt
apple 2
fig 5
lemon 10
tomato 22
$ cat j2.txt
almond 33
fig 115
mango 20
pista 42
# first field items present in j1.txt but not j2.txt
##### add your solution here
apple 2
lemon 10
tomato 22
# first field items present in j2.txt but not j1.txt
##### add your solution here
almond 33
mango 20
pista 42
3) Filter lines from j1.txt
and j2.txt
that match the items from s1.txt
.
$ cat s1.txt
apple
coffee
fig
honey
mango
pasta
sugar
tea
##### add your solution here
apple 2
fig 115
fig 5
mango 20
4) Join the marks_1.csv
and marks_2.csv
files to get the expected output shown below.
$ cat marks_1.csv
Name,Biology,Programming
Er,92,77
Ith,100,100
Lin,92,100
Sil,86,98
$ cat marks_2.csv
Name,Maths,Physics,Chemistry
Cy,97,98,95
Ith,100,100,100
Lin,78,83,80
##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Ith,100,100,100,100,100
Lin,92,100,78,83,80
5) By default, the first field is used to combine the lines. Which options are helpful if you want to change the key field to be used for joining?
6) Join the marks_1.csv
and marks_2.csv
files to get the expected output with specific fields as shown below.
##### add your solution here
Name,Programming,Maths,Biology
Ith,100,100,100
Lin,100,78,92
7) Join the marks_1.csv
and marks_2.csv
files to get the expected output shown below. Use 50
as the filler data.
##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Cy,50,50,97,98,95
Er,92,77,50,50,50
Ith,100,100,100,100,100
Lin,92,100,78,83,80
Sil,86,98,50,50,50
8) When you use the -o auto
option, what'd happen to the extra fields compared to those in the first lines of the input data?
9) From the input files j3.txt
and j4.txt
, filter only the lines are unique — i.e. lines that are not common to these files. Assume that the input files do not have duplicate entries.
$ cat j3.txt
almond
apple pie
cold coffee
honey
mango shake
pasta
sugar
tea
$ cat j4.txt
apple
banana shake
coffee
fig
honey
mango shake
milk
tea
yeast
##### add your solution here
almond
apple
apple pie
banana shake
coffee
cold coffee
fig
milk
pasta
sugar
yeast
10) From the input files j3.txt
and j4.txt
, filter only the lines are common to these files.
##### add your solution here
honey
mango shake
tea