[Solved] Fastest way to read file searching for pattern matches

Accepted Answer

‘grep’ contains decades’ worth of optimizations, and re-implementing it in any programming language, not just Python, will almost certainly be slower. *1

Therefore, if speed is important to you, your technique of calling ‘grep’ directly is probably the way to go. To do this using ‘subprocess’, without having to write any temporary files, use the ‘subprocess.PIPE’ mechanism:

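from subprocess import Popen, PIPE

# The shell runs the whole pipeline; only grep's output comes back to Python.
COMMAND = 'zcat file* | grep oldconfig'
process = Popen(COMMAND, shell=True, stderr=PIPE, stdout=PIPE)
output, errors = process.communicate()
# The shell reports grep's exit status: 0 = match found, 1 = no match, 2 = error.
assert process.returncode == 0, process.returncode
# communicate() returns bytes on Python 3, so test for emptiness instead of comparing to ''.
assert not errors, errors
print('{} lines match'.format(len(output.splitlines())))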

This works for me on Python 3.5. I’ve avoided using any of the higher-level interfaces added on top of ‘subprocess’ recently, so it should work fine on older versions of Python too.
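One caveat with the snippet above: grep exits with status 1 when nothing matches, so the assertion on the return code fails for a pattern with no hits. Below is a minimal sketch of a wrapper that treats "no match" as a count of zero. The function name count_matching_lines is made up for illustration, and it assumes the pipeline ends in grep, so that the shell's exit status is grep's.

```python
import subprocess


def count_matching_lines(command):
    """Run a shell pipeline that ends in grep and count the lines it prints.

    Hypothetical helper, not part of the original answer.
    """
    process = subprocess.Popen(
        command, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    output, errors = process.communicate()

    # Exit status 1 with nothing on stderr is grep's way of saying "no match".
    if process.returncode == 1 and not errors:
        return 0
    # Anything else that is non-zero, or any stderr output, is a real failure
    # (for example zcat complaining about a missing file).
    if process.returncode != 0 or errors:
        raise RuntimeError(
            'pipeline failed with status {}: {!r}'.format(process.returncode, errors)
        )
    return len(output.splitlines())


# Example call, using the command from the answer:
#     count_matching_lines('zcat file* | grep oldconfig')
```

The command still runs through the shell, so it should only be built from trusted strings, never from user input.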

(*1 For example, even compared with an empty ‘for’ loop over the lines, as you show in your question, grep is likely to still be faster, because it does not read the input line by line. Instead, it works out the maximum number of characters it can safely skip forwards through the input, ignoring newlines completely, reads one char after each skip, and checks whether that char could be part of a match. Only if it finds a candidate does it then look at the characters surrounding it, to see whether the rest of the regex matches and where the enclosing newlines are. On top of that, it precomputes lookup tables from the given pattern and runs a tightly unrolled inner loop over them, meaning it executes around 3 x86 instructions per input byte that it examines, and it skips examining most input bytes completely.)
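The skipping described in the footnote is the idea behind the Boyer-Moore family of string searches. The sketch below uses the simpler Horspool variant for a fixed byte string, purely to show why most input bytes are never examined. It is not grep's actual implementation, and horspool_find is a name invented for this example.

```python
def horspool_find(haystack, needle):
    """Return (index, probes) for the first occurrence of needle in haystack.

    index is -1 if needle is absent. probes counts how many window
    alignments were tried; each one reads a single byte first and does a
    full comparison only if that byte matches the needle's last byte.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0, 0

    # For every possible byte value, how far the window may jump when that
    # byte is found under the needle's last position.
    shift = [m] * 256
    for i, byte in enumerate(needle[:-1]):
        shift[byte] = m - 1 - i

    pos = 0
    probes = 0
    while pos + m <= n:
        probes += 1
        last = haystack[pos + m - 1]
        if last == needle[-1] and haystack[pos:pos + m] == needle:
            return pos, probes
        # A byte that does not occur in the needle lets us skip m bytes at once.
        pos += shift[last]
    return -1, probes
```

Searching for a 9-byte pattern in text that rarely contains any of its characters, the loop advances 9 bytes per probe, so only a fraction of the input is ever read. Newlines play no part in the search; a tool like grep only has to locate them around an actual match.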