Dreaming solutions

This SO question was interesting and attracted several different approaches. Here's a sample to illustrate the problem to be solved:

$ cat ip.txt
caller_number=034082394234324, clear_number=33335345435,  direction=1,
caller_number=83479234234,     clear_number=34836424733, direction=2,
caller_number=83479234234,     clear_number=64237384533, direction=2,

$ cat list.txt
642
3333
534234235

$ cat op.txt
caller_number=83479234234,     clear_number=64237384533, direction=2,

Any entry in list.txt has to match immediately after clear_number= and the input line must also have direction=2,. In the sample above, the first line matches 3333 but fails the second criterion. The second line fails even though it contains 642, since that isn't immediately after clear_number=. The list.txt file can have 10K-50K lines and ip.txt is around 10GB.

Here's a slightly modified answer based on existing solutions on that thread. Since an entry from list.txt only has to match the leading portion of the digits after clear_number=, a single direct lookup against the keys saved in arr is not possible. This solution loops over all the keys for every input line that meets the direction=2, criterion, stopping as soon as a match is found.

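# first file (list.txt): save each entry with a leading "="
# so that it can only match right after "clear_number="
FNR==NR{ arr["=" $0]; next }

$3=="direction=2,"{
    for(i in arr)
        if(index($2,i)){
            print
            next
        }
}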

Note: To run the solutions, use mawk -f script.awk list.txt ip.txt

In my dreams that night, I realized that the solution can be improved drastically by looping over the digits after clear_number= instead of looping over the keys saved in arr. Checking whether a key exists in arr is O(1) on average, so the time saving is huge: the inner loop now runs at most about 12 times (the number of characters after clear_number=) instead of up to 10K-50K times! With a 35M sample input file and 12K keys that I created for testing, I found this solution to be about 200 times faster.

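# first file (list.txt): save each entry as a key
FNR==NR{ arr[$0]; next }

$3=="direction=2,"{
    # text after "clear_number=" (13 characters), trailing comma included
    val=substr($2,14)
    # try each leading substring; i<length(val) skips the trailing comma
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){
            print
            next
        }
}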
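
As a quick sanity check, both scripts can be run against the sample files shown earlier and their outputs compared. This is only a sketch: loop.awk, prefix.awk, out_loop.txt and out_prefix.txt are placeholder names I picked for illustration, and plain awk is used so that it runs even where mawk isn't installed.

```shell
# recreate the sample input and the list of prefixes
cat > ip.txt <<'EOF'
caller_number=034082394234324, clear_number=33335345435,  direction=1,
caller_number=83479234234,     clear_number=34836424733, direction=2,
caller_number=83479234234,     clear_number=64237384533, direction=2,
EOF
printf '%s\n' 642 3333 534234235 > list.txt

# loop.awk (placeholder name): loops over all the keys for each line
cat > loop.awk <<'EOF'
FNR==NR{ arr["=" $0]; next }
$3=="direction=2,"{
    for(i in arr)
        if(index($2,i)){
            print
            next
        }
}
EOF

# prefix.awk (placeholder name): loops over the digits instead
cat > prefix.awk <<'EOF'
FNR==NR{ arr[$0]; next }
$3=="direction=2,"{
    val=substr($2,14)
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){
            print
            next
        }
}
EOF

awk -f loop.awk list.txt ip.txt > out_loop.txt
awk -f prefix.awk list.txt ip.txt > out_prefix.txt

# cmp is silent and succeeds only if the two outputs are identical
cmp out_loop.txt out_prefix.txt && cat out_prefix.txt
```

If the two outputs agree, the last command prints the matching lines, which for this sample should be the same as op.txt. For a larger input, prefixing each awk command with time gives a rough speed comparison on your own data.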