Warning: This is a work-in-progress draft version.



zet

From the zet repository on GitHub:

This is a command-line utility for doing set operations on files considered as sets of lines. For instance, zet union x y z outputs the lines that occur in any of x, y, or z, and zet intersect x y z those that occur in all of them.

Warning: Zet reads entire files into memory. Its memory usage is roughly proportional to the file size of its largest argument plus the size of the (eventual) output.

Installation

See the zet releases page for details.

Commands and documentation

From zet --help:

Each operation prints lines meeting a different condition:
    Operation  Prints lines appearing in
    ========== =========================
    intersect: EVERY file
    union:     ANY file
    diff:      the FIRST file, and no other
    single:    exactly ONE file
    multiple:  MORE THAN one file

Each line is output at most once, no matter how many times it occurs in the file(s). Lines are not sorted, but are printed in the order they occur in the input.
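
The table above can be modeled by counting, for each distinct line, how many input files contain it. The following is only a rough awk sketch of those rules, not how zet itself is implemented. The helper name setop and its calling convention are made up for illustration, and the sketch assumes plain newline-terminated text passed as file arguments.

```shell
# setop is a hypothetical helper, not part of zet.
# Usage: setop OPERATION FILE...
setop() {
    op=$1
    shift
    awk -v op="$op" '
        BEGIN {
            total = ARGC - 1                      # number of file arguments
            for (f = 1; f <= total; f++) {
                while ((getline line < ARGV[f]) > 0) {
                    if ((f, line) in mark) continue   # already counted for this file
                    mark[f, line]
                    if (!(line in count)) order[++n] = line   # first-seen order
                    count[line]++                 # number of files containing the line
                    if (f == 1) first[line]       # remember lines of the first file
                }
                close(ARGV[f])
            }
            for (i = 1; i <= n; i++) {
                line = order[i]; c = count[line]
                if (op == "union" ||
                    (op == "intersect" && c == total) ||
                    (op == "diff" && c == 1 && (line in first)) ||
                    (op == "single" && c == 1) ||
                    (op == "multiple" && c > 1))
                    print line
            }
        }
    ' "$@"
}
```

For example, setop multiple a.txt b.txt c.txt would print the lines found in at least two of the three (hypothetical) files. Like zet, this sketch holds every distinct line in memory.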

Removing duplicates

Sample input files:

$ cat color_list1.txt
teal
light blue
yellow
green

$ cat color_list2.txt
light blue
black
dark green
yellow

Applying union preserves only the first copy of duplicate lines. This works even with a single input file.

$ zet union color_list1.txt color_list2.txt
teal
light blue
yellow
green
black
dark green

$ zet union <(printf 'car\nbike\ncar\n')
car
bike
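
For comparison, the well-known awk idiom for first-occurrence deduplication should give the same result on simple inputs like the ones above. The sketch below wraps it in a hypothetical helper named union_lines (not part of zet) and assumes plain newline-terminated text. It makes no attempt to mirror anything else zet may do with its input.

```shell
# union_lines is a hypothetical helper, not part of zet.
# seen[$0]++ evaluates to 0 (false) the first time a line appears,
# so !seen[$0]++ is true only for the first copy of each line.
union_lines() {
    awk '!seen[$0]++' "$@"
}

# Reads stdin when no file arguments are given
printf 'car\nbike\ncar\n' | union_lines
```

The last command prints car followed by bike, matching the zet union output shown above.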

Finding common lines

The intersect command will give lines that are present in all of the input files.

$ zet intersect color_list1.txt color_list2.txt
light blue
yellow

$ zet intersect color_list1.txt color_list2.txt <(printf 'red\nyellow')
yellow

Lines not present in other files

When you apply diff, a line from the first file argument is displayed in the output only if it is not found in any of the other file arguments.

$ zet diff color_list1.txt color_list2.txt
teal
green

$ zet diff color_list2.txt color_list1.txt
black
dark green

$ zet diff color_list2.txt color_list1.txt <(echo 'black')
dark green

Lines present only once

To get lines that are present in exactly one of the file arguments, use the single command. A line duplicated within only one of the files is still displayed once in the output.

# 'car' occurs multiple times, but only in one of the files
# 'yellow' is present in more than one file, so it is not part of the output
$ zet single color_list1.txt color_list2.txt <(printf 'car\nbike\ncar\n')
teal
green
black
dark green
car
bike

Lines present in two or more files

The intersect command gives lines that are common to all the file arguments. If you want lines that are present in at least two of the files, use the multiple command instead.

$ zet intersect color_list1.txt color_list2.txt <(printf 'red\nyellow')
yellow
$ zet multiple color_list1.txt color_list2.txt <(printf 'red\nyellow')
light blue
yellow

Speed comparison

  • Case 1: shorter file passed as the first argument
$ time frawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > f1
real    0m0.091s

$ time zet intersect words.txt SCOWL-wl.txt > f2
real    0m0.104s

  • Case 2: longer file passed as the first argument
$ time frawk 'NR==FNR{a[$0]; next} $0 in a' SCOWL-wl.txt words.txt > f1
real    0m0.204s

$ time zet intersect SCOWL-wl.txt words.txt > f2
real    0m0.118s
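
The two commands are not guaranteed to write byte-identical files. The awk-style idiom prints matching lines of the second file in that file's order, repeating any line that is duplicated there, while zet prints each line at most once. A minimal sketch for checking that f1 and f2 contain the same set of lines, using a hypothetical helper named same_line_set and assuming mktemp is available:

```shell
# same_line_set is a hypothetical helper, not part of zet or frawk.
# Exit status 0 if both files contain the same set of distinct lines,
# regardless of order or repetition; non-zero otherwise.
same_line_set() {
    a=$(mktemp) && b=$(mktemp) || return 2
    LC_ALL=C sort -u "$1" > "$a"    # sort and drop duplicate lines
    LC_ALL=C sort -u "$2" > "$b"
    cmp -s "$a" "$b"                # silent byte-wise comparison
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Example call (f1 and f2 are the output files from the timings above):
# same_line_set f1 f2 && echo 'same set of lines'
```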