Dreaming solutions
This SO question was interesting and attracted several different approaches. Here's a sample that illustrates the problem to be solved:
$ cat ip.txt
caller_number=034082394234324, clear_number=33335345435, direction=1,
caller_number=83479234234, clear_number=34836424733, direction=2,
caller_number=83479234234, clear_number=64237384533, direction=2,
$ cat list.txt
642
3333
534234235
$ cat op.txt
caller_number=83479234234, clear_number=64237384533, direction=2,
Any entry in list.txt has to match immediately after clear_number= and the input line must also have the direction=2, field. In the sample above, the first line matches 3333 but fails the second criterion. The second line fails even though it contains 642, since that is not immediately after clear_number=. The list.txt file can have 10K-50K lines and ip.txt is around 10GB.
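To make the two criteria concrete, here is a small sketch that reports them separately for each sample line. The three keys are hard-coded into a regular expression purely for illustration (this would not scale to a real list.txt), and the variable names report, key_ok and dir_ok are made up for this example.

```shell
report=$(printf '%s\n' \
    'caller_number=034082394234324, clear_number=33335345435, direction=1,' \
    'caller_number=83479234234, clear_number=34836424733, direction=2,' \
    'caller_number=83479234234, clear_number=64237384533, direction=2,' |
awk '{
    # criterion 1: one of the sample keys directly follows clear_number=
    key_ok = ($2 ~ /^clear_number=(642|3333|534234235)/)
    # criterion 2: the third field is exactly direction=2,
    dir_ok = ($3 == "direction=2,")
    printf "line %d: key=%s direction=%s\n", NR, (key_ok ? "yes" : "no"), (dir_ok ? "yes" : "no")
}')
printf '%s\n' "$report"
```

Only the third line should report yes for both checks, which agrees with op.txt.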
Here's a slightly modified answer based on existing solutions in that thread. Since an entry in list.txt only has to match the leading part of the digits after clear_number=, a single direct lookup against the keys saved in arr is not possible. This solution loops over all the keys for every input line that satisfies the direction=2, criterion, and stops at the first match.
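# first file (list.txt): save each entry as a key, prefixed with "="
FNR==NR{ arr["=" $0]; next }
# second file (ip.txt): only lines whose third field is direction=2,
$3=="direction=2,"{
    # "=" occurs only once in $2, so a hit means the key directly follows clear_number=
    for(i in arr)
        if(index($2,i)){
            print
            next
        }
}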
To run either solution, save it as script.awk and use: mawk -f script.awk list.txt ip.txt
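As a quick sanity check, the sample files can be recreated in a scratch directory and the script run against them. This is only a sketch: the temporary directory and the variable names dir and out are assumptions, and plain awk is used in place of mawk so that it runs wherever a POSIX awk is available.

```shell
# scratch directory for the sample files (hypothetical location)
dir=$(mktemp -d)

cat > "$dir/ip.txt" <<'EOF'
caller_number=034082394234324, clear_number=33335345435, direction=1,
caller_number=83479234234, clear_number=34836424733, direction=2,
caller_number=83479234234, clear_number=64237384533, direction=2,
EOF

printf '%s\n' 642 3333 534234235 > "$dir/list.txt"

# the key-looping script from above
cat > "$dir/script.awk" <<'EOF'
FNR==NR{ arr["=" $0]; next }
$3=="direction=2,"{
    for(i in arr)
        if(index($2,i)){
            print
            next
        }
}
EOF

# swap awk for mawk if it is installed
out=$(awk -f "$dir/script.awk" "$dir/list.txt" "$dir/ip.txt")
printf '%s\n' "$out"
```

The only line printed should be the one shown in op.txt.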
In my dreams that night, I realized that the solution can be improved drastically by looping over the digits after clear_number= instead of looping over the keys saved in arr. Checking whether a key exists in arr is effectively O(1), so the time saving is huge: the inner loop now runs a maximum of 12 times (the length of the digits after clear_number=) instead of up to 10K-50K times! With a 35M sample input file and 12K keys that I created for testing, I found this solution to be about 200 times faster.
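# first file (list.txt): save each entry as a key
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
    # text after "clear_number=" (13 characters), including the trailing comma
    val=substr($2,14)
    # test prefixes of increasing length; i<length(val) skips the trailing comma
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){
            print
            next
        }
}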
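One way to gain some confidence that the prefix loop selects the same lines as the key loop is to run both on generated data and compare the outputs. The sketch below does that under several assumptions: the file names, the seeds and the sizes (200 keys of 3 to 5 digits, 5000 input lines with 11-digit numbers) are all made up for illustration, and plain awk stands in for mawk. It checks equivalence only and makes no attempt to measure speed.

```shell
dir=$(mktemp -d)

# key-looping version
cat > "$dir/loop_keys.awk" <<'EOF'
FNR==NR{ arr["=" $0]; next }
$3=="direction=2,"{
    for(i in arr)
        if(index($2,i)){ print; next }
}
EOF

# prefix-looping version
cat > "$dir/loop_prefix.awk" <<'EOF'
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
    val=substr($2,14)
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){ print; next }
}
EOF

# hypothetical keys: 200 random digit strings, 3 to 5 digits each
awk 'BEGIN{
    srand(42)
    for(n=1; n<=200; n++){
        len = 3 + int(rand()*3)
        key = ""
        for(k=1; k<=len; k++) key = key int(rand()*10)
        print key
    }
}' > "$dir/list.txt"

# hypothetical input: 5000 lines, 11-digit clear_number, direction 1 or 2
awk 'BEGIN{
    srand(7)
    for(n=1; n<=5000; n++){
        num = ""
        for(k=1; k<=11; k++) num = num int(rand()*10)
        printf "caller_number=83479234234, clear_number=%s, direction=%d,\n", num, 1+int(rand()*2)
    }
}' > "$dir/ip.txt"

awk -f "$dir/loop_keys.awk"   "$dir/list.txt" "$dir/ip.txt" > "$dir/out_keys.txt"
awk -f "$dir/loop_prefix.awk" "$dir/list.txt" "$dir/ip.txt" > "$dir/out_prefix.txt"

if cmp -s "$dir/out_keys.txt" "$dir/out_prefix.txt"; then
    echo "outputs match"
else
    echo "outputs differ"
fi
```

Note that the two versions agree only when every key is shorter than the text after clear_number=, which holds for the generated data here.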