This is a work-in-progress draft version.
zet
From github: zet:
This is a command-line utility for doing set operations on files considered as sets of lines. For instance, zet union x y z outputs the lines that occur in any of x, y, or z, and zet intersect x y z those that occur in all of them.
Zet reads entire files into memory. Its memory usage is roughly proportional to the file size of its largest argument plus the size of the (eventual) output.
Installation
See zet: releases for details.
Commands and documentation
From zet --help:
Each operation prints lines meeting a different condition:
Operation   Prints lines appearing in
==========  =========================
intersect:  EVERY file
union:      ANY file
diff:       the FIRST file, and no other
single:     exactly ONE file
multiple:   MORE THAN one file
Each line is output at most once, no matter how many times it occurs in the file(s). Lines are not sorted, but are printed in the order they occur in the input.
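The conditions in the table can be approximated with a single awk pass that counts how many input files each line appears in. This is only a sketch of the semantics, not how zet is implemented: zet_sketch and its variable names are made up for illustration, and it assumes every input file is non-empty (awk cannot see an empty file here, which would skew diff).

```shell
# Hypothetical helper, not part of zet.
# Usage: zet_sketch OPERATION FILE...
zet_sketch() {
    op=$1
    shift
    awk -v op="$op" -v nfiles="$#" '
        FNR == 1 { fidx++ }                      # a new (non-empty) input file starts
        !(($0, fidx) in seen_in) {               # first time this line shows up in this file
            seen_in[$0, fidx] = 1
            if (!($0 in count)) order[++n] = $0  # remember first-occurrence order
            count[$0]++                          # number of files containing this line
            if (fidx == 1) in_first[$0] = 1
        }
        END {
            for (i = 1; i <= n; i++) {
                line = order[i]
                c = count[line]
                if ((op == "union") ||
                    (op == "intersect" && c == nfiles) ||
                    (op == "diff" && (line in in_first) && c == 1) ||
                    (op == "single" && c == 1) ||
                    (op == "multiple" && c > 1))
                    print line
            }
        }
    ' "$@"
}
```

For example, zet_sketch multiple a.txt b.txt c.txt would print the lines found in at least two of the three files, each once and in the order they first occur.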
Removing duplicates
Sample input files:
$ cat color_list1.txt
teal
light blue
yellow
green
$ cat color_list2.txt
light blue
black
dark green
yellow
Applying union will preserve only the first copy of duplicate lines. This works even with a single input file.
$ zet union color_list1.txt color_list2.txt
teal
light blue
yellow
green
black
dark green
$ zet union <(printf 'car\nbike\ncar\n')
car
bike
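For comparison, the well-known awk idiom !seen[$0]++ gives the same keep-the-first-copy behavior. A minimal sketch (dedup_lines is a hypothetical wrapper name, not a zet feature):

```shell
# Hypothetical helper, not part of zet.
# Prints each distinct line once, in the order it is first seen.
# Reads the given files, or standard input when no file is given.
dedup_lines() {
    # seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++ is true only then
    awk '!seen[$0]++' "$@"
}
```

Running printf 'car\nbike\ncar\n' | dedup_lines prints car and then bike, matching the zet union output above.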
Finding common lines
The intersect command gives the lines that are present in all of the input files.
$ zet intersect color_list1.txt color_list2.txt
light blue
yellow
$ zet intersect color_list1.txt color_list2.txt <(printf 'red\nyellow')
yellow
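For the two-file case, a rough awk equivalent is the classic two-pass lookup: load one file into an array, then filter the other against it. The sketch below is an illustration only (intersect_two is a hypothetical name) and handles exactly two files.

```shell
# Hypothetical helper, not part of zet.
# Usage: intersect_two FILE1 FILE2
# Prints lines of FILE1 that also occur in FILE2, each once, in FILE1 order.
intersect_two() {
    # FILE2 is read first (NR == FNR holds only for the first file awk reads),
    # then FILE1 is filtered against it; seen[] drops repeated lines.
    awk 'NR == FNR { in_second[$0] = 1; next }
         ($0 in in_second) && !seen[$0]++' "$2" "$1"
}
```

With the sample files, intersect_two color_list1.txt color_list2.txt prints light blue and yellow, the same as the first zet intersect example.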
Lines not present in other files
When you apply diff, lines from the first file argument are displayed in the output only if they are not found in any of the other file arguments.
$ zet diff color_list1.txt color_list2.txt
teal
green
$ zet diff color_list2.txt color_list1.txt
black
dark green
$ zet diff color_list2.txt color_list1.txt <(echo 'black')
dark green
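The diff condition amounts to set subtraction: start with the lines of the first file and discard any that turn up in a later file. A small awk sketch of that idea follows; diff_lines is a hypothetical name, and the sketch assumes the first file is non-empty.

```shell
# Hypothetical helper, not part of zet.
# Usage: diff_lines FIRST OTHER...
# Prints lines of FIRST that occur in none of the OTHER files, each once, in FIRST order.
diff_lines() {
    awk '
        FNR == 1 { fidx++ }          # a new (non-empty) input file starts
        fidx == 1 {                  # first file: collect candidate lines in order
            if (!($0 in keep)) { keep[$0] = 1; order[++n] = $0 }
            next
        }
        { delete keep[$0] }          # later files: drop any candidate they contain
        END { for (i = 1; i <= n; i++) if (order[i] in keep) print order[i] }
    ' "$@"
}
```

With the sample files, diff_lines color_list1.txt color_list2.txt prints teal and green, as in the first zet diff example above.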
Lines present only once
To get lines that are present in only one of the file arguments, use the single command. A line that is repeated within just one of the files is still displayed once in the output.
# 'car' occurs multiple times, but only in one of the files
# 'yellow' is present in more than one file, so it is not part of the output
$ zet single color_list1.txt color_list2.txt <(printf 'car\nbike\ncar\n')
teal
green
black
dark green
car
bike
Lines present in two or more files
The intersect command gives lines that are common to all the file arguments. If you want lines that are present in at least two of the files, use the multiple command instead.
$ zet intersect color_list1.txt color_list2.txt <(printf 'red\nyellow')
yellow
$ zet multiple color_list1.txt color_list2.txt <(printf 'red\nyellow')
light blue
yellow
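If zet is not available, the single and multiple conditions can be roughly reproduced with sort and uniq: deduplicate each file on its own, merge the results, then let uniq pick lines that occur once (-u) or more than once (-d). This is a sketch under the assumption that sorted output is acceptable, since unlike zet it does not preserve input order; by_file_count is a hypothetical name.

```shell
# Hypothetical helper, not part of zet.
# Usage: by_file_count -u FILE...   lines found in exactly one file (like single)
#        by_file_count -d FILE...   lines found in more than one file (like multiple)
# Output is sorted (C locale), not in input order.
by_file_count() {
    flag=$1
    shift
    for f in "$@"; do
        LC_ALL=C sort -u "$f"        # collapse duplicates within each file
    done | LC_ALL=C sort | LC_ALL=C uniq "$flag"
}
```

With the inputs from the multiple example above, by_file_count -d prints light blue and yellow.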
Speed comparison
- Case 1: shorter file passed as the first argument
$ time frawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > f1
real 0m0.091s
$ time zet intersect words.txt SCOWL-wl.txt > f2
real 0m0.104s
- Case 2: longer file passed as the first argument
$ time frawk 'NR==FNR{a[$0]; next} $0 in a' SCOWL-wl.txt words.txt > f1
real 0m0.204s
$ time zet intersect SCOWL-wl.txt words.txt > f2
real 0m0.118s