join

The join command helps you to combine lines from two files based on a common field. This works best when the input is already sorted by that field.

Default join

By default, join combines two files based on the first field content (also referred as key). Only the lines with common keys will be part of the output.

The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). Rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.

# sample sorted input files
$ cat shopping_jan.txt
apple   10
banana  20
soap    3
tshirt  3
$ cat shopping_feb.txt
banana  15
fig     100
pen     2
soap    1

# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1

If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join will also ensure to add a final newline character even if not present in the input.

$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y

info Note that the collating order used for join should be same as the one used to sort the input files. Use join -i to ignore case, similar to sort -f usage.

info If the input files are not sorted, join will produce an error if there are unpairable lines. You can use the --nocheck-order option to ignore this error. However, as per the documentation, this option "is not guaranteed to produce any particular output."

Non-matching lines

By default, only the lines having common keys are part of the output. You can use the -a option to also include the non-matching lines from the input files. Use 1 and 2 as the argument for the first and second file respectively. You'll later see how to fill missing fields with a custom string.

# includes non-matching lines from the first file
$ join -a1 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
soap 3 1
tshirt 3

# includes non-matching lines from both the files
$ join -a1 -a2 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
fig 100
pen 2
soap 3 1
tshirt 3

If you use -v instead of -a, the output will have only the non-matching lines.

$ join -v1 shopping_jan.txt shopping_feb.txt
apple 10
tshirt 3

$ join -v1 -v2 shopping_jan.txt shopping_feb.txt
apple 10
fig 100
pen 2
tshirt 3

Change field separator

You can use the -t option to specify a single byte character as the field separator. The output field separator will be same as the value used for the -t option. Use \0 to specify NUL as the separator. Empty string will cause entire input line content to be considered as keys. Depending on your shell you can use ANSI-C quoting to use escapes like \t instead of a literal tab character.

$ cat marks.csv 
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
$ cat dept.txt
CSE
ECE

# get all lines from marks.csv based on the first field keys in dept.txt
$ join -t, <(sort marks.csv) dept.txt
CSE,Amy,67
CSE,Moi,62
CSE,Surya,81
ECE,Joel,72
ECE,Om,92
ECE,Raj,53

Files with headers

Use the --header option to ignore first lines of both the input files from sorting consideration. Without this option, the join command might still work correctly if unpairable lines aren't found, but it is preferable to use --header when applicable. This option will also help when --check-order option is active.

$ cat report_1.csv
Name,Maths,Physics
Amy,78,95
Moi,88,75
Raj,67,76
$ cat report_2.csv
Name,Chemistry
Amy,85
Joel,78
Raj,72

$ join --check-order -t, report_1.csv report_2.csv
join: report_1.csv:2: is not sorted: Amy,78,95
$ join --check-order --header -t, report_1.csv report_2.csv
Name,Maths,Physics,Chemistry
Amy,78,95,85
Raj,67,76,72

Change key field

By default, the first field of both the input files are used to combine the lines. You can use -1 and -2 options followed by a field number to specify a different field number. You can use the -j option if the field number is the same for both the files.

Recall that the key field is the first field in the output. You'll later see how to customize the output field order.

$ cat names.txt
Amy
Raj
Tia

# combine based on second field of the first file
# and first field of the second file (default)
$ join -t, -1 2 <(sort -t, -k2,2 marks.csv) names.txt
Amy,CSE,67
Raj,ECE,53
Raj,EEE,88
Tia,EEE,72

Customize output field list

You can use the -o option to customize the fields required in the output and their order. Especially useful when the first field isn't the key. Each output field is specified as file number followed by a . character and then the field number. You can specify multiple fields separated by a , character. As a special case, you can use 0 to indicate the key field.

# output field order is 1st, 2nd and 3rd fields from the first file
$ join -t, -1 2 -o 1.1,1.2,1.3 <(sort -t, -k2,2 marks.csv) names.txt
CSE,Amy,67
ECE,Raj,53
EEE,Raj,88
EEE,Tia,72

# 1st field from the first file, 2nd field from the second file
# and then 2nd and 3rd fields from the first file
$ join --header -t, -o 1.1,2.2,1.2,1.3 report_1.csv report_2.csv
Name,Chemistry,Maths,Physics
Amy,85,78,95
Raj,72,67,76

Same number of output fields

If you use auto as the argument for the -o option, first line of both the input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.

$ join <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q r x y z

$ join -o auto <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q x y

If the other lines have lesser number of fields, the -e option will determine the string to be used as a filler (default is empty string).

$ join -o auto <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p  x 

$ join -o auto -e '-' <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p - x -

As promised earlier, here's an example of filling fields for non-matching lines:

$ join -o auto -a1 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
soap 3 1
tshirt 3 NA

$ join -o auto -a1 -a2 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
fig NA 100
pen NA 2
soap 3 1
tshirt 3 NA

Set operations

This section covers whole line set operations you can perform on already sorted input files. Equivalent sort and uniq solutions will also be mentioned as comments (useful for unsorted inputs). Assume that there are no duplicate lines within an input file.

These two sorted input files will be used for the examples to follow:

$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange
Red     Pink
Teal    Red
White   White

Here's how you can get union and symmetric difference results. Recall that -t '' will cause entire input line content to be considered as keys.

# union
# unsorted input: sort -u colors_1.txt colors_2.txt
$ join -t '' -a1 -a2 colors_1.txt colors_2.txt
Black
Blue
Brown
Green
Orange
Pink
Purple
Red
Teal
White

# symmetric difference
# unsorted input: sort colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v1 -v2 colors_1.txt colors_2.txt
Black
Brown
Green
Pink
Purple
Teal

Here's how you can get intersection and difference results. The equivalent comm solutions for sorted input is also mentioned in the comments.

# intersection, same as: comm -12 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt | uniq -d
$ join -t '' colors_1.txt colors_2.txt
Blue
Orange
Red
White

# difference, same as: comm -13 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v2 colors_1.txt colors_2.txt
Black
Green
Pink

# difference, same as: comm -23 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt colors_2.txt | uniq -u
$ join -t '' -v1 colors_1.txt colors_2.txt
Brown
Purple
Teal

As mentioned before, join will display all the combinations if there are duplicate entries. Here's an example to show the differences between sort, comm and join solutions for displaying common lines:

$ paste list_1.txt list_2.txt
apple   cherry
banana  cherry
cherry  mango
cherry  papaya
cherry  
cherry  

# only one entry per common line
$ sort list_1.txt list_2.txt | uniq -d
cherry

# minimum of 'no. of entries in file1' and 'no. of entries in file2'
$ comm -12 list_1.txt list_2.txt
cherry
cherry

# 'no. of entries in file1' multiplied by 'no. of entries in file2'
$ join -t '' list_1.txt list_2.txt
cherry
cherry
cherry
cherry
cherry
cherry
cherry
cherry

NUL separator

Use -z option if you want to use NUL character as the line separator. In this scenario, join will ensure to add a final NUL character even if not present in the input.

$ join -z <(printf 'a 1\0b x') <(printf 'a 2\0b y') | cat -v
a 1 2^@b x y^@

Alternatives

Here's some alternate commands you can explore if join isn't enough to solve your task. These alternatives do not require input to be sorted.