The `join` command helps you combine lines from two files based on a common field. This works best when the input is already sorted by that field.
By default, `join` combines two files based on the content of the first field (also referred to as the key). Only the lines with common keys will be part of the output.
The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). The rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.
```bash
# sample sorted input files
$ cat shopping_jan.txt
apple 10
banana 20
soap 3
tshirt 3
$ cat shopping_feb.txt
banana 15
fig 100
pen 2
soap 1

# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1
```
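The handling of blanks described above can be checked with a small sketch. The scratch directory, file names and data below are hypothetical and exist only for this illustration. The first file has leading blanks, a run of spaces and a tab between fields, while the second file uses single spaces.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# first file: leading blanks, multiple spaces and a tab between fields
printf '   apple    10\nbanana\t20\n' > "$tmp/f1.txt"
# second file: fields separated by a single space
printf 'apple 5\nbanana 15\n' > "$tmp/f2.txt"

# LC_ALL=C keeps the collating order predictable for this demo
blank_demo=$(LC_ALL=C join "$tmp/f1.txt" "$tmp/f2.txt")
printf '%s\n' "$blank_demo"

rm -r "$tmp"
```

With GNU `join`, the output fields should be separated by a single space regardless of how the input was spaced.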
If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, `join` also adds a final newline character even if it is not present in the input.
```bash
$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y
```
Note that the collating order used for `join` should be the same as the one used to `sort` the input files. Use `join -i` to ignore case, similar to `sort -f`.
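Here is a minimal sketch of the `-i` option, assuming GNU `join`. The file names and data are hypothetical. Both files are already in case-insensitive order, and the keys differ only in case.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# keys differ only by case between the two files
printf 'Apple 1\nbanana 2\n' > "$tmp/f1.txt"
printf 'apple x\nBanana y\n' > "$tmp/f2.txt"

# -i ignores case differences when comparing the keys
icase_demo=$(LC_ALL=C join -i "$tmp/f1.txt" "$tmp/f2.txt")
printf '%s\n' "$icase_demo"

rm -r "$tmp"
```

For paired lines, the key in the output is spelled as it appears in the first file.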
If the input files are not sorted, `join` will produce an error if there are unpairable lines. You can use the `--nocheck-order` option to ignore this error. However, as per the documentation, this option "is not guaranteed to produce any particular output."
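Rather than suppressing the check, a safer route is usually to sort both inputs first, using the same locale for `sort` and `join`. The sketch below assumes GNU coreutils; the file names and data are hypothetical. The `-k1b,1` key sorts on the first field while ignoring leading blanks, which matches how `join` finds its default key.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# unsorted inputs
printf 'soap 3\napple 10\nbanana 20\n' > "$tmp/jan.txt"
printf 'soap 1\nfig 100\nbanana 15\n' > "$tmp/feb.txt"

# sort on the first field, using the same locale that join will use
LC_ALL=C sort -k1b,1 "$tmp/jan.txt" > "$tmp/jan_sorted.txt"
LC_ALL=C sort -k1b,1 "$tmp/feb.txt" > "$tmp/feb_sorted.txt"

sorted_demo=$(LC_ALL=C join "$tmp/jan_sorted.txt" "$tmp/feb_sorted.txt")
printf '%s\n' "$sorted_demo"

rm -r "$tmp"
```

Temporary files are used here instead of process substitution so that the sketch also runs in shells that lack `<(...)`.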
By default, only the lines having common keys are part of the output. You can use the `-a` option to also include the non-matching lines from the input files. Use `1` and `2` as the argument for the first and second file respectively. You'll later see how to fill missing fields with a custom string.
```bash
# includes non-matching lines from the first file
$ join -a1 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
soap 3 1
tshirt 3

# includes non-matching lines from both the files
$ join -a1 -a2 shopping_jan.txt shopping_feb.txt
apple 10
banana 20 15
fig 100
pen 2
soap 3 1
tshirt 3
```
If you use `-v` instead of `-a`, the output will have only the non-matching lines.
```bash
$ join -v1 shopping_jan.txt shopping_feb.txt
apple 10
tshirt 3

$ join -v1 -v2 shopping_jan.txt shopping_feb.txt
apple 10
fig 100
pen 2
tshirt 3
```
You can use the `-t` option to specify a single-byte character as the field separator. The output field separator will be the same as the value used for the `-t` option. Use `\0` to specify NUL as the separator. An empty string will cause the entire input line to be treated as the key. Depending on your shell, you can use ANSI-C quoting to write escapes like `\t` instead of a literal tab character.
```bash
$ cat marks.csv
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
$ cat dept.txt
CSE
ECE

# get all lines from marks.csv based on the first field keys in dept.txt
$ join -t, <(sort marks.csv) dept.txt
CSE,Amy,67
CSE,Moi,62
CSE,Surya,81
ECE,Joel,72
ECE,Om,92
ECE,Raj,53
```
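For tab-separated input, here is a small sketch that avoids relying on ANSI-C quoting by building the tab character with `printf`. The file names and data are hypothetical. Since only the tab is treated as a separator, spaces inside a field are preserved.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# a literal tab character, obtained without shell-specific quoting
tab=$(printf '\t')

printf 'a\tred apple\nb\tblue sky\n' > "$tmp/f1.tsv"
printf 'a\t1\nb\t2\n' > "$tmp/f2.tsv"

# the tab is used as both the input and the output field separator
tab_demo=$(LC_ALL=C join -t "$tab" "$tmp/f1.tsv" "$tmp/f2.tsv")
printf '%s\n' "$tab_demo"

rm -r "$tmp"
```

In shells that support ANSI-C quoting, `-t $'\t'` would be an equivalent way to pass the tab.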
Use the `--header` option to exclude the first line of both input files from sorting consideration. Without this option, the `join` command might still work correctly if unpairable lines aren't found, but it is preferable to use `--header` when applicable. This option also helps when the `--check-order` option is active.
```bash
$ cat report_1.csv
Name,Maths,Physics
Amy,78,95
Moi,88,75
Raj,67,76
$ cat report_2.csv
Name,Chemistry
Amy,85
Joel,78
Raj,72

$ join --check-order -t, report_1.csv report_2.csv
join: report_1.csv:2: is not sorted: Amy,78,95
$ join --check-order --header -t, report_1.csv report_2.csv
Name,Maths,Physics,Chemistry
Amy,78,95,85
Raj,67,76,72
```
By default, the first field of both input files is used to combine the lines. You can use the `-1` and `-2` options followed by a field number to specify a different key field for the first and second file respectively. You can use the `-j` option if the field number is the same for both files.
Recall that the key field is the first field in the output. You'll later see how to customize the output field order.
```bash
$ cat names.txt
Amy
Raj
Tia

# combine based on second field of the first file
# and first field of the second file (default)
$ join -t, -1 2 <(sort -t, -k2,2 marks.csv) names.txt
Amy,CSE,67
Raj,ECE,53
Raj,EEE,88
Tia,EEE,72
```
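The `-j` option isn't used in the example above, so here is a minimal sketch with hypothetical file names and data. Both files have the key in the second field and are already sorted on it.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# the key is the second field in both files
printf 'CSE,Amy,67\nECE,Raj,53\n' > "$tmp/marks.csv"
printf '2019,Amy,Chennai\n2020,Tia,Delhi\n' > "$tmp/info.csv"

# -j 2 is shorthand for -1 2 -2 2
j_demo=$(LC_ALL=C join -t, -j 2 "$tmp/marks.csv" "$tmp/info.csv")
printf '%s\n' "$j_demo"

rm -r "$tmp"
```

The key is printed first, followed by the remaining fields of the first file and then those of the second file.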
You can use the `-o` option to customize the fields required in the output and their order. This is especially useful when the first field isn't the key. Each output field is specified as the file number followed by a `.` character and then the field number. You can specify multiple fields separated by a `,` character. As a special case, you can use `0` to indicate the key field.
```bash
# output field order is 1st, 2nd and 3rd fields from the first file
$ join -t, -1 2 -o 1.1,1.2,1.3 <(sort -t, -k2,2 marks.csv) names.txt
CSE,Amy,67
ECE,Raj,53
EEE,Raj,88
EEE,Tia,72

# 1st field from the first file, 2nd field from the second file
# and then 2nd and 3rd fields from the first file
$ join --header -t, -o 1.1,2.2,1.2,1.3 report_1.csv report_2.csv
Name,Chemistry,Maths,Physics
Amy,85,78,95
Raj,72,67,76
```
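The special `0` specifier isn't used in the examples above. Here is a small sketch that uses hypothetical file names and a few lines of data modeled on `marks.csv` (already sorted on the second field). It prints the key, followed by the third and first fields of the first file.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# first file is sorted on its second field, which is the key
printf 'CSE,Amy,67\nECE,Raj,53\nEEE,Raj,88\nEEE,Tia,72\n' > "$tmp/marks_by_name.csv"
printf 'Amy\nRaj\nTia\n' > "$tmp/names.txt"

# 0 refers to the key field, whichever file and field it comes from
o_demo=$(LC_ALL=C join -t, -1 2 -o 0,1.3,1.1 "$tmp/marks_by_name.csv" "$tmp/names.txt")
printf '%s\n' "$o_demo"

rm -r "$tmp"
```

Using `0` instead of `1.2` keeps the field list independent of which field was chosen as the key.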
If you use `auto` as the argument for the `-o` option, the first line of both input files will be used to determine the number of output fields. If the other lines have extra fields, they will be discarded.
```bash
$ join <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q r x y z

$ join -o auto <(printf 'a 1 2\nb p q r') <(printf 'a 3 4\nb x y z')
a 1 2 3 4
b p q x y
```
If the other lines have fewer fields, the `-e` option will determine the string to be used as a filler (default is an empty string).
```bash
# missing fields are filled with an empty string by default
$ join -o auto <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p  x 

$ join -o auto -e '-' <(printf 'a 1 2\nb p') <(printf 'a 3 4\nb x')
a 1 2 3 4
b p - x -
```
As promised earlier, here's an example of filling fields for non-matching lines:
```bash
$ join -o auto -a1 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
soap 3 1
tshirt 3 NA

$ join -o auto -a1 -a2 -e 'NA' shopping_jan.txt shopping_feb.txt
apple 10 NA
banana 20 15
fig NA 100
pen NA 2
soap 3 1
tshirt 3 NA
```
This section covers whole-line set operations you can perform on already sorted input files. Equivalent `sort` and `uniq` solutions will also be mentioned as comments (useful for unsorted inputs). Assume that there are no duplicate lines within an input file.
These two sorted input files will be used for the examples to follow:
```bash
$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange
Red     Pink
Teal    Red
White   White
```
Here's how you can get union and symmetric difference results. Recall that `-t ''` will cause the entire input line to be treated as the key.
```bash
# union
# unsorted input: sort -u colors_1.txt colors_2.txt
$ join -t '' -a1 -a2 colors_1.txt colors_2.txt
Black
Blue
Brown
Green
Orange
Pink
Purple
Red
Teal
White

# symmetric difference
# unsorted input: sort colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v1 -v2 colors_1.txt colors_2.txt
Black
Brown
Green
Pink
Purple
Teal
```
Here's how you can get intersection and difference results. The equivalent `comm` solutions for sorted input are also mentioned in the comments.
```bash
# intersection, same as: comm -12 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt | uniq -d
$ join -t '' colors_1.txt colors_2.txt
Blue
Orange
Red
White

# difference, same as: comm -13 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_1.txt colors_2.txt | uniq -u
$ join -t '' -v2 colors_1.txt colors_2.txt
Black
Green
Pink

# difference, same as: comm -23 colors_1.txt colors_2.txt
# unsorted input: sort colors_1.txt colors_2.txt colors_2.txt | uniq -u
$ join -t '' -v1 colors_1.txt colors_2.txt
Brown
Purple
Teal
```
As mentioned before, `join` will display all the combinations if there are duplicate entries. Here's an example to show the differences between `uniq`, `comm` and `join` solutions for displaying common lines:
```bash
$ paste list_1.txt list_2.txt
apple   cherry
banana  cherry
cherry  mango
cherry  papaya
cherry
cherry

# only one entry per common line
$ sort list_1.txt list_2.txt | uniq -d
cherry

# minimum of 'no. of entries in file1' and 'no. of entries in file2'
$ comm -12 list_1.txt list_2.txt
cherry
cherry

# 'no. of entries in file1' multiplied by 'no. of entries in file2'
$ join -t '' list_1.txt list_2.txt
cherry
cherry
cherry
cherry
cherry
cherry
cherry
cherry
```
Use the `-z` option if you want to use the NUL character as the line separator. In this scenario, `join` will add a final NUL character even if it is not present in the input.
```bash
$ join -z <(printf 'a 1\0b x') <(printf 'a 2\0b y') | cat -v
a 1 2^@b x y^@
```
Here are some alternative commands you can explore if `join` isn't enough to solve your task. These alternatives do not require the input to be sorted.
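As one illustration, here is a minimal sketch using `awk`. Choosing `awk` is an assumption made for this example, as are the file names and data. It uses the common two-file idiom: the first file is loaded into an array indexed by key, and matching lines from the second file are then printed. Neither input needs to be sorted.

```bash
# hypothetical scratch directory and file names, used only for this sketch
tmp=$(mktemp -d)

# unsorted inputs
printf 'tshirt 3\napple 10\nsoap 3\nbanana 20\n' > "$tmp/jan.txt"
printf 'soap 1\nbanana 15\npen 2\nfig 100\n' > "$tmp/feb.txt"

# NR==FNR is true only while reading the first file argument (feb.txt):
# save its second field, indexed by the key in the first field.
# For jan.txt, print the lines whose key was seen in feb.txt.
awk_demo=$(awk 'NR==FNR{feb[$1]=$2; next} $1 in feb{print $1, $2, feb[$1]}' \
    "$tmp/feb.txt" "$tmp/jan.txt")
printf '%s\n' "$awk_demo"

rm -r "$tmp"
```

The output follows the line order of the second file argument (`jan.txt`) instead of the sorted key order that `join` would produce. This sketch also assumes that keys are unique within the first file argument.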