Removing duplicates irrespective of field order

I posted a coding challenge in the tenth issue of learnbyexample weekly. I discuss the problem and various solutions in this blog post.

Problem statement

Retain only the first copy of duplicate lines irrespective of the order of the fields. Input order should be maintained. Assume space as the field separator with exactly two fields on each line. For example, hehe haha and haha hehe will be considered as duplicates.

$ cat twos.txt
hehe haha
door floor
haha hehe
6;8 3-4
true blue
hehe bebe
floor door
3-4 6;8
tru eblue
haha hehe

Expected output for the above sample:

hehe haha
door floor
6;8 3-4
true blue
hehe bebe
tru eblue

Python solution

Here's one possible solution for this problem:

filename = 'twos.txt'
keys = set()

with open(filename) as f:
    for line in f:
        fields = line.split()
        key1 = f'{fields[0]} {fields[1]}'
        key2 = f'{fields[1]} {fields[0]}'
        if not (key1 in keys or key2 in keys):
            print(line, end='')
            keys.add(key1)

The main trick in the above solution is to check both the input field order and the reversed order against the elements of a set. A subtle point to note is that the split() string method, when called without arguments, also removes whitespace from the start and end of the input line, including the line ending. If you had to use another field delimiter (for example, comma), you'd have to remove the line ending before splitting the input.
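To illustrate the point about other delimiters, here's a minimal sketch of the same idea for comma-separated input. The function name unique_pairs and the sample data are made up for this illustration and aren't part of the original challenge. It assumes exactly two fields per line.

```python
def unique_pairs(lines, sep=','):
    """Yield the first copy of each two-field line, ignoring field order."""
    seen = set()
    for line in lines:
        # split(sep) keeps the trailing newline on the last field,
        # so remove the line ending before splitting
        fields = line.rstrip('\n').split(sep)
        key1 = f'{fields[0]}{sep}{fields[1]}'
        key2 = f'{fields[1]}{sep}{fields[0]}'
        if not (key1 in seen or key2 in seen):
            seen.add(key1)
            yield line

# hypothetical sample input
sample = ['hehe,haha\n', 'haha,hehe\n', 'true,blue\n']
print(''.join(unique_pairs(sample)), end='')
```

Without the rstrip('\n') call, 'haha,hehe\n'.split(',') gives ['haha', 'hehe\n'], so the reversed key would not match the earlier hehe,haha entry.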

And here's a generic solution for any number of fields, which also makes the solution look simpler:

filename = 'twos.txt'
keys = set()

with open(filename) as f:
    for line in f:
        fields = line.split()
        sorted_key = ' '.join(sorted(fields))
        if sorted_key not in keys:
            print(line, end='')
            keys.add(sorted_key)

In case you are wondering why space is used to join the field contents, it is necessary to avoid false matches. tru eblue shouldn't be considered as a duplicate of true blue or blue true. Space is a safe character to use since it is the field separator and thus cannot appear within a field.
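Here's a small sketch of how false matches can arise when the field boundary is lost. The helper make_key and the ab c / a bc sample are hypothetical and chosen only for this illustration.

```python
# plain concatenation loses the field boundary
print('true' + 'blue' == 'tru' + 'eblue')  # True

def make_key(fields, sep=' '):
    """Build an order-independent key from a list of fields."""
    return sep.join(sorted(fields))

a = 'ab c'.split()
b = 'a bc'.split()

# joining with an empty string: both keys become 'abc'
print(make_key(a, sep='') == make_key(b, sep=''))  # True (false match)

# joining with the field separator keeps the keys distinct
print(make_key(a) == make_key(b))  # False
```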

See my 100 Page Python Intro ebook if you already know programming basics but are new to Python.

GNU awk one-liner

Here's a solution for CLI enthusiasts:

$ awk '!(($1,$2) in seen || ($2,$1) in seen); {seen[$1,$2]}' twos.txt
hehe haha
door floor
6;8 3-4
true blue
hehe bebe
tru eblue

The above solution is similar to the first Python solution, with one notable difference. The comma-separated array subscript joins the fields using the value of SUBSEP, which defaults to \034 (a non-printing character that is usually not present in text files).

A solution using the field separator instead of \034 would look like:

awk '!(($1 FS $2) in seen || ($2 FS $1) in seen); {seen[$1 FS $2]}'
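For comparison, a rough Python counterpart of the awk approach is to store tuples in the set, which avoids choosing a join character altogether. This is only a sketch. The function name unique_unordered_pairs and the sample list are made up for this illustration, and it assumes exactly two fields per line.

```python
def unique_unordered_pairs(lines):
    """Yield the first copy of each two-field line, ignoring field order."""
    seen = set()
    for line in lines:
        f1, f2 = line.split()
        # tuples keep the fields separate, so no join character is needed
        if (f1, f2) in seen or (f2, f1) in seen:
            continue
        seen.add((f1, f2))
        yield line

# hypothetical sample input
sample = ['hehe haha\n', 'haha hehe\n', 'true blue\n', 'tru eblue\n']
print(''.join(unique_unordered_pairs(sample)), end='')
```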

See my CLI text processing with GNU awk ebook if you are interested in such one-liners.
