[Solved] comparing two text file and find out the related word in python

Question

While it is still unclear what you mean by a “partial query”, the code below can do that, simply by you redefining a partial query in the function filter_out_common_queries. E.g. if you are looking for an exact match of the query in search.txt, you could replace # add your logic here by return [' '.join(querylist), ].

import datetime as dt
from collections import defaultdict

def filter_out_common_queries(querylist):
    # add your logic here
    return querylist

queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)  

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries = filter_out_common_queries(fields[1].split())  # "adidas watches for men" -> "adidas" "watches" "for" "men". "for" is a very generic keyword. You should do well to filter these out
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))

Output based on your input data:

19:00:15  , mouse , FALSE - []
19:00:15  , branded luggage bags and trolley , TRUE - []
19:00:15  , Leather shoes for men , FALSE - []
19:00:15  , printers , TRUE - []
19:00:16  , adidas watches for men , TRUE - ['adidas', 'adidas', 'adidas', 'adidas', 'adidas', 'adidas']
19:00:16  , Mobile Charger Stand/Holder black , FALSE - ['black']
19:00:16  , watches for men , TRUE - []

Remark that a match for ‘black’ in “Mobile Charger Stand/Holder black” was found. That’s because in the code above, I looked for each separate word in itself.

Edit: To implement your comment, you would redefine filter_out_common_queries like so:

def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2,len(basequery)+1):
        querylist.append(basequery[:n])
    return querylist

Accepted Answer

While it is still unclear what you mean by a “partial query”, the code below can do that, simply by you redefining a partial query in the function filter_out_common_queries. E.g. if you are looking for an exact match of the query in search.txt, you could replace # add your logic here by return [' '.join(querylist), ].

import datetime as dt
from collections import defaultdict

def filter_out_common_queries(querylist):
    # add your logic here
    return querylist

queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)  

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries = filter_out_common_queries(fields[1].split())  # "adidas watches for men" -> "adidas" "watches" "for" "men". "for" is a very generic keyword. You should do well to filter these out
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))

Output based on your input data:

19:00:15  , mouse , FALSE - []
19:00:15  , branded luggage bags and trolley , TRUE - []
19:00:15  , Leather shoes for men , FALSE - []
19:00:15  , printers , TRUE - []
19:00:16  , adidas watches for men , TRUE - ['adidas', 'adidas', 'adidas', 'adidas', 'adidas', 'adidas']
19:00:16  , Mobile Charger Stand/Holder black , FALSE - ['black']
19:00:16  , watches for men , TRUE - []

Remark that a match for ‘black’ in “Mobile Charger Stand/Holder black” was found. That’s because in the code above, I looked for each separate word in itself.

Edit: To implement your comment, you would redefine filter_out_common_queries like so:

def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2,len(basequery)+1):
        querylist.append(basequery[:n])
    return querylist