ibnaleem/bgrep
Search contents of a file using binary search
bgrep
Grep contents of a file using binary search
⚠️ Disclaimer
This project is only to learn how binary search works for searching through files. It is NOT meant to be used in production.
Binary Search Overview
Binary Search is an O(log n) time-complexity algorithm that searches through a sorted array by halfing its search space with each iteration. It does this by computing a midpoint and checking whether or not the element at that midpoint index equals the target; if the value is greater than the target, it computes the midpoint between the start and midpoint - 1 :
if arr[middle] > target:
return find_target(target, arr, start, middle - 1)If the value at the midpoint index is less than the target, it simply increases the start:
if arr[middle] < target:
return find_target(target, arr, middle + 1, end)If you guessed a person's number being 50 and they told you to go lower, your midpoint = (1 + 49) / 2) = 25. Your new guess is 25.
Getting Started
$ git clone https://github.com/ibnaleem/bgrep.git
$ cd bgrep.git
$ python3 bgrep.py -f /path/to/file -t TARGET -s TrueYou can also sort the wordlist yourself:
$ LC_ALL=C sort wordlist.dict > sorted-wordlist.dict
$ python3 bgrep.py -f sorted-wordlist.dict -t superman** You must use LC_ALL=C or else it won't sort correctly**
Benchmarks
Test script:
import time
target = input("Enter a password to search: ").strip()
start_time = time.perf_counter()
with open('rockyou.txt', 'r', encoding='utf-8', errors='ignore') as f:
lines = [line.strip() for line in f]
if target in lines:
print("Found")
else:
print("Not found")
end_time = time.perf_counter()
elapsed_time = end_time - start_time
hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)
seconds = int(elapsed_time % 60)
microseconds = int((elapsed_time % 1) * 1_000_000)
print(f"in {hours} hours, {minutes} minutes, {seconds} seconds, {microseconds} microseconds")Searching superman using -s option
Raw search time
python3 bgrep.py -f rockyou.txt -t superman -s True ✔
[*] Finding superman in rockyou.txt...
[*] Sorted option enabled: sorting rockyou.txt...
[+] superman found @ index 13121375 in rockyou.txt
[+] Found in 0 hours, 0 minutes, 0 seconds, 59 microsecondsOverall time (timing sort as well)
python3 bgrep.py -f rockyou.txt -t superman -s True ✔
[*] Finding superman in rockyou.txt...
[*] Sorted option enabled: sorting rockyou.txt...
[+] superman found @ index 13121375 in rockyou.txt
[+] Found in 0 hours, 0 minutes, 2 seconds, 411741 microsecondsSearching superman on sorted wordlist
python3 bgrep.py -f sorted-rockyou.txt -t superman ✔
[*] Finding superman in sorted-rockyou.txt...
[+] superman found @ index 13121351 in sorted-rockyou.txt
[+] Found in 0 hours, 0 minutes, 0 seconds, 172 microsecondsSearching superman using iteration
python3 test.py
Enter a password to search: superman
Found
in 0 hours, 0 minutes, 1 seconds, 368325 microsecondsVerdict
- Pure Binary Search:
11,898xfaster (115 microseconds vs 1 seconds, 368325 microseconds) - Sorting + Binary Search:
~76%slower (20,971×slower compared to pure binary search)
Therefore the binary search is only faster if the wordlist is already sorted. Using the sorting option drastically slows down the time