join

The join command helps you to combine lines from two files based on a common field. This works best when the input is already sorted by that field.

Default join

By default, join combines two files based on the first field content (also referred as key). Only the lines with common keys will be part of the output.

The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). Rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.

# sample sorted input files
$ cat shopping_jan.txt
apple   10
banana  20
soap    3
tshirt  3
$ cat shopping_feb.txt
banana  15
fig     100
pen     2
soap    1

# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1

If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join will also ensure to add a final newline character even if it wasn't present in the input.

$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y

Note that the collating order used for join should be same as the one used to sort the input files. Use join -i to ignore case, similar to sort -f usage.

If the input files are not sorted, join will produce an error if there are unpairable lines. You can use the --nocheck-order option to ignore this error. However, as per the documentation, this option "is not guaranteed to produce any particular output."

Non-matching lines

By default, only the lines having common keys are part of the output. You can use the -a option to also include the non-matching lines from the input files. Use 1 and 2 as the argument for the first and second file respectively. You'll later see how to fill missing fields with a custom string.

# includes non-matching lines from the first file
$ join -a1 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
soap 3 1
tshirt 3

# includes non-matching lines from both the files
$ join -a1 -a2 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
fig 100
pen 2
soap 3 1
tshirt 3

If you use -v instead of -a, the output will have only the non-matching lines.

$ join -v2 shopping_jan.txt shopping_feb.txt
fig 100
pen 2

$ join -v1 -v2 shopping_jan.txt shopping_feb.txt
apple 10
fig 100
pen 2
tshirt 3

Change field separator

You can use the -t option to specify a single byte character as the field separator. The output field separator will be same as the value used for the -t option. Use \0 to specify NUL as the separator. Empty string will cause entire input line content to be considered as keys. Depending on your shell you can use ANSI-C quoting to use escapes like \t instead of a literal tab character.

$ cat marks.csv
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
$ cat dept.txt
CSE
ECE

# get all lines from marks.csv based on the first field keys in dept.txt
$ join -t, <(sort marks.csv) dept.txt
CSE,Amy,67
CSE,Moi,62
CSE,Surya,81
ECE,Joel,72
ECE,Om,92
ECE,Raj,53

Files with headers

Use the --header option to ignore first lines of both the input files from sorting consideration. Without this option, the join command might still work correctly if unpairable lines aren't found, but it is preferable to use --header when applicable. This option will also help when --check-order option is active.

$ cat report_1.csv
Name,Maths,Physics
Amy,78,95
Moi,88,75
Raj,67,76
$ cat report_2.csv
Name,Chemistry
Amy,85
Joel,78
Raj,72

$ join --check-order -t, report_1.csv report_2.csv
join: report_1.csv:2: is not sorted: Amy,78,95
$ join --check-order --header -t, report_1.csv report_2.csv
Name,Maths,Physics,Chemistry
Amy,78,95,85
Raj,67,76,72

Change key field

By default, the first field of both the input files are used to combine the lines. You can use -1 and -2 options followed by a field number to specify a different field number. You can use the -j option if the field number is the same for both the files.

Recall that the key field is the first field in the output. You'll later see how to customize the output field order.

$ cat names.txt
Amy
Raj
Tia

# combine based on the second field of the first file
# and the first field of the second file (default)
$ join -t, -1 2 <(sort -t, -k2,2 marks.csv) names.txt
Amy,CSE,67
Raj,ECE,53
Raj,EEE,88
Tia,EEE,72

Customize output field list

Use the -o option to customize the fields required in the output and their order. Especially useful when the first field isn't the key. Each output field is specified as file number followed by a . character and then the field number. You can specify multiple fields separated by a , character. As a special case, you can use 0 to indicate the key field.

# output field order is 1st, 2nd and 3rd fields from the first file
$ join -t, -1 2 -o 1.1,1.2,1.3 <(sort -t, -k2,2 marks.csv) names.txt
CSE,Amy,67
ECE,Raj,53
EEE,Raj,88
EEE,Tia,72

# 1st field from the first file, 2nd field from the second file
# and then 2nd and 3rd fields from the first file
$ join --header -t, -o 1.1,2.2,1.2,1.3 report_1.csv report_2.csv
Name,Chemistry,Maths,Physics
Amy,85,78,95
Raj,72,67,76

Same number of output fields

If you use auto as the argument for the -o option, first line of both the input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.

$ join <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q r x y z

$ join -o auto <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q x y

If the other lines have lesser number of fields, the -e option will determine the string to be used as a filler (empty string is the default).

# the second line has two empty fields
$ join -o auto <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p  x 

$ join -o auto -e '-' <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p - x -

As promised earlier, here are some examples of filling fields for non-matching lines:

$ join -o auto -a1 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
soap 3 1
tshirt 3 NA

$ join -o auto -a1 -a2 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
fig NA 100
pen NA 2
soap 3 1
tshirt 3 NA

Set operations

This section covers whole line set operations you can perform on already sorted input files. Equivalent sort and uniq solutions will also be mentioned as comments (useful for unsorted inputs). Assume that there are no duplicate lines within an input file.

These two sorted input files will be used for the examples to follow:

$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange
Red     Pink
Teal    Red
White   White

Here's how you can get union and symmetric difference results. Recall that -t '' will cause the entire input line content to be considered as keys.

# union
# unsorted input: sort -u colors_1.txt colors_2.txt
$ join -t '' -a1 -a2 colors_1.txt colors_2.txt
Black
Blue
Brown
Green
Orange
Pink
Purple
Red
Teal
White

# symmetric difference
# unsorted input: sort colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v1 -v2 colors_1.txt colors_2.txt
Black
Brown
Green
Pink
Purple
Teal

Here's how you can get intersection and difference results. The equivalent comm solutions for sorted input is also mentioned in the comments.

# intersection, same as: comm -12 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt | uniq -d
$ join -t '' colors_1.txt colors_2.txt
Blue
Orange
Red
White

# difference, same as: comm -13 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v2 colors_1.txt colors_2.txt
Black
Green
Pink

# difference, same as: comm -23 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt colors_2.txt | uniq -u
$ join -t '' -v1 colors_1.txt colors_2.txt
Brown
Purple
Teal

As mentioned before, join will display all the combinations if there are duplicate entries. Here's an example to show the differences between sort, comm and join solutions for displaying common lines:

$ paste list_1.txt list_2.txt
apple   cherry
banana  cherry
cherry  mango
cherry  papaya
cherry  
cherry  

# only one entry per common line
$ sort list_1.txt list_2.txt | uniq -d
cherry

# minimum of 'no. of entries in file1' and 'no. of entries in file2'
$ comm -12 list_1.txt list_2.txt
cherry
cherry

# 'no. of entries in file1' multiplied by 'no. of entries in file2'
$ join -t '' list_1.txt list_2.txt
cherry
cherry
cherry
cherry
cherry
cherry
cherry
cherry

NUL separator

Use the -z option if you want to use NUL character as the line separator. In this scenario, join will ensure to add a final NUL character even if not present in the input.

$ join -z <(printf 'a 1\0b x') <(printf 'a 2\0b y') | cat -v
a 1 2^@b x y^@

Alternatives

Here are some alternate commands you can explore if join isn't enough to solve your task. These alternatives do not require input to be sorted.

zet — set operations on one or more input files
Comparing lines between files section from my GNU grep ebook
Two file processing chapter from my GNU awk ebook, has examples for both line and field based comparisons
Two file processing chapter from my Perl one-liners ebook, has examples for both line and field based comparisons

Exercises

The exercises directory has all the files used in this section.

Assume that the input files are already sorted for these exercises.

1) Use appropriate options to get the expected outputs shown below.

# no output
$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')

# expected output 1
##### add your solution here
fig 5 10

# expected output 2
##### add your solution here
apple 2
fig 5 10
mango 4

2) Use the join command to display only the non-matching lines based on the first field.

$ cat j1.txt
apple   2
fig     5
lemon   10
tomato  22
$ cat j2.txt
almond  33
fig     115
mango   20
pista   42

# first field items present in j1.txt but not j2.txt
##### add your solution here
apple 2
lemon 10
tomato 22

# first field items present in j2.txt but not j1.txt
##### add your solution here
almond 33
mango 20
pista 42

3) Filter lines from j1.txt and j2.txt that match the items from s1.txt.

$ cat s1.txt
apple
coffee
fig
honey
mango
pasta
sugar
tea

##### add your solution here
apple 2
fig 115
fig 5
mango 20

4) Join the marks_1.csv and marks_2.csv files to get the expected output shown below.

$ cat marks_1.csv
Name,Biology,Programming
Er,92,77
Ith,100,100
Lin,92,100
Sil,86,98
$ cat marks_2.csv
Name,Maths,Physics,Chemistry
Cy,97,98,95
Ith,100,100,100
Lin,78,83,80

##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Ith,100,100,100,100,100
Lin,92,100,78,83,80

5) By default, the first field is used to combine the lines. Which options are helpful if you want to change the key field to be used for joining?

6) Join the marks_1.csv and marks_2.csv files to get the expected output with specific fields as shown below.

##### add your solution here
Name,Programming,Maths,Biology
Ith,100,100,100
Lin,100,78,92

7) Join the marks_1.csv and marks_2.csv files to get the expected output shown below. Use 50 as the filler data.

##### add your solution here
Name,Biology,Programming,Maths,Physics,Chemistry
Cy,50,50,97,98,95
Er,92,77,50,50,50
Ith,100,100,100,100,100
Lin,92,100,78,83,80
Sil,86,98,50,50,50

8) When you use the -o auto option, what'd happen to the extra fields compared to those in the first lines of the input data?

9) From the input files j3.txt and j4.txt, filter only the lines are unique — i.e. lines that are not common to these files. Assume that the input files do not have duplicate entries.

$ cat j3.txt
almond
apple pie
cold coffee
honey
mango shake
pasta
sugar
tea
$ cat j4.txt
apple
banana shake
coffee
fig
honey
mango shake
milk
tea
yeast

##### add your solution here
almond
apple
apple pie
banana shake
coffee
cold coffee
fig
milk
pasta
sugar
yeast

10) From the input files j3.txt and j4.txt, filter only the lines are common to these files.

##### add your solution here
honey
mango shake
tea

CLI text processing with GNU Coreutils