GitHunt

bgrep

Grep contents of a file using binary search

⚠️ Disclaimer

This project is only to learn how binary search works for searching through files. It is NOT meant to be used in production.

Binary Search Overview

Binary Search is an O(log n) time-complexity algorithm that searches through a sorted array by halfing its search space with each iteration. It does this by computing a midpoint and checking whether or not the element at that midpoint index equals the target; if the value is greater than the target, it computes the midpoint between the start and midpoint - 1 :

if arr[middle] > target:
    return find_target(target, arr, start, middle - 1)

If the value at the midpoint index is less than the target, it simply increases the start:

if arr[middle] < target:
    return find_target(target, arr, middle + 1, end)

If you guessed a person's number being 50 and they told you to go lower, your midpoint = (1 + 49) / 2) = 25. Your new guess is 25.

Getting Started

$ git clone https://github.com/ibnaleem/bgrep.git
$ cd bgrep.git
$ python3 bgrep.py -f /path/to/file -t TARGET -s True

You can also sort the wordlist yourself:

$ LC_ALL=C sort wordlist.dict > sorted-wordlist.dict
$ python3 bgrep.py -f sorted-wordlist.dict -t superman

** You must use LC_ALL=C or else it won't sort correctly**

Benchmarks

Test script:

import time

target = input("Enter a password to search: ").strip()

start_time = time.perf_counter()

with open('rockyou.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = [line.strip() for line in f]

    if target in lines:
        print("Found")
    else:
        print("Not found")

end_time = time.perf_counter()

elapsed_time = end_time - start_time

hours = int(elapsed_time // 3600)
minutes = int((elapsed_time % 3600) // 60)
seconds = int(elapsed_time % 60)
microseconds = int((elapsed_time % 1) * 1_000_000)

print(f"in {hours} hours, {minutes} minutes, {seconds} seconds, {microseconds} microseconds")

Searching superman using -s option

Raw search time

python3 bgrep.py -f rockyou.txt -t superman -s True                           ✔
[*] Finding superman in rockyou.txt...
[*] Sorted option enabled: sorting rockyou.txt...
[+] superman found @ index 13121375 in rockyou.txt
[+] Found in 0 hours, 0 minutes, 0 seconds, 59 microseconds

Overall time (timing sort as well)

python3 bgrep.py -f rockyou.txt -t superman -s True                           ✔
[*] Finding superman in rockyou.txt...
[*] Sorted option enabled: sorting rockyou.txt...
[+] superman found @ index 13121375 in rockyou.txt
[+] Found in 0 hours, 0 minutes, 2 seconds, 411741 microseconds

Searching superman on sorted wordlist

python3 bgrep.py -f sorted-rockyou.txt -t superman                            ✔
[*] Finding superman in sorted-rockyou.txt...
[+] superman found @ index 13121351 in sorted-rockyou.txt
[+] Found in 0 hours, 0 minutes, 0 seconds, 172 microseconds

Searching superman using iteration

python3 test.py
Enter a password to search: superman
Found
in 0 hours, 0 minutes, 1 seconds, 368325 microseconds

Verdict

  • Pure Binary Search: 11,898x faster (115 microseconds vs 1 seconds, 368325 microseconds)
  • Sorting + Binary Search: ~76% slower (20,971× slower compared to pure binary search)

Therefore the binary search is only faster if the wordlist is already sorted. Using the sorting option drastically slows down the time

ibnaleem/bgrep | GitHunt