oaphotodna.py

oaphotodna.py computes PhotoDNA-like hashes for images (based on the reverse-engineered implementation available at https://github.com/ArcaneNibble/open-alleged-photodna), compares two images using a normalized similarity score, and supports a local FAISS-backed nearest-neighbor index for fast lookup of visually similar images.

This version adds:

a FAISS local vector index
persistent on-disk metadata in meta.json
exact L2 nearest-neighbor search
similarity scores normalized to the same 0..1 scale as direct image comparison
query-time filtering by minimum similarity or maximum Euclidean distance

Requirements

Python 3.8+
Pillow
NumPy (recommended)
faiss-cpu for FAISS features

Install dependencies:

pip install pillow numpy faiss-cpu

What the script does

The script supports four main workflows:

Compute the hash of a single image.
Compute hashes for every file in a directory and emit JSON.
Compare two images using either Euclidean or Manhattan distance.
Build and query a local FAISS index of previously hashed images.

The PhotoDNA-like hash is represented internally as a flat vector of 144 values. FAISS stores these vectors and searches for nearest neighbors using L2 distance.

Help

Print the top-level help:

python bin/oaphotodna.py --help

The CLI uses traditional, flag-prefixed arguments (for example --hash, --compare, --faiss-query) rather than positional subcommands.

adulau@blakley:~/git/photodna/bin$ python3 oaphotodna.py
usage: oaphotodna.py [-h] (--hash IMAGE | --hash-dir DIRECTORY | --compare IMAGE1 IMAGE2 | --faiss-build ARG [ARG ...] | --faiss-add ARG [ARG ...] | --faiss-query ARG [ARG ...]) [--metric {euclidean,manhattan}]
                     [--min-similarity MIN_SIMILARITY] [--max-distance MAX_DISTANCE]

Compute and compare PhotoDNA-like hashes, with optional FAISS local indexing.

options:
  -h, --help            show this help message and exit
  --hash IMAGE          Compute the hash of one image
  --hash-dir DIRECTORY  Compute hashes for every file in a directory and output JSON
  --compare IMAGE1 IMAGE2
                        Compare two images
  --faiss-build ARG [ARG ...]
                        Create a new FAISS index: INDEX META IMAGE [IMAGE ...]
  --faiss-add ARG [ARG ...]
                        Append images to an existing FAISS index: INDEX META IMAGE [IMAGE ...]
  --faiss-query ARG [ARG ...]
                        Find closest indexed matches: INDEX META QUERY_IMAGE [TOP_K]
  --metric {euclidean,manhattan}
                        Distance metric for --compare
  --min-similarity MIN_SIMILARITY
                        With --faiss-query, filter results below this similarity threshold [0,1]
  --max-distance MAX_DISTANCE
                        With --faiss-query, filter results above this Euclidean distance

Basic usage

1) Hash a single image

python bin/oaphotodna.py --hash image.jpg

Output:

73,71,74,32,...
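If another program consumes this output, the comma-separated line can be turned back into a list of integers. The following is a minimal sketch, assuming the hash is printed as a single line of 144 comma-separated values in the range 0..255; the helper name `parse_hash_line` is hypothetical and not part of the script.

```python
from typing import List, Optional


def parse_hash_line(line: str, expected_len: Optional[int] = 144) -> List[int]:
    """Parse one comma-separated hash line into a list of ints.

    Hypothetical helper: assumes the --hash output is a single line of
    comma-separated integers. Pass expected_len=None to skip the length check.
    """
    values = [int(token) for token in line.strip().split(",")]
    if expected_len is not None and len(values) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(values)}")
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("hash values must be in the range 0..255")
    return values
```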

2) Hash every file in a directory as JSON

python bin/oaphotodna.py --hash-dir tests/monochrome

Example output (hash truncated for brevity):

[
  {
    "filename": "55147310088_42a69416d3_5k.jpg",
    "path": "/full/path/to/tests/monochrome/55147310088_42a69416d3_5k.jpg",
    "photodna": [73, 71, 74, 32]
  }
]

Each JSON object includes the base filename, the absolute file path, and the 144-byte PhotoDNA-like vector. Files are processed in sorted filename order, and non-file directory entries are skipped.
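The JSON can be loaded with the standard library and keyed by path for later lookups. This is a sketch under the assumption that the script writes the JSON array to standard output (for example redirected into a file); the function name `load_hash_dir_output` is hypothetical, while the keys `filename`, `path`, and `photodna` come from the example above.

```python
import json
from typing import Dict, List


def load_hash_dir_output(json_text: str) -> Dict[str, List[int]]:
    """Map each absolute path to its hash vector.

    Hypothetical helper: expects the JSON array emitted by --hash-dir,
    where every record has "filename", "path", and "photodna" keys.
    """
    records = json.loads(json_text)
    return {record["path"]: record["photodna"] for record in records}
```

One possible way to use it is to redirect the script output to a file, read the file contents, and pass the text to `load_hash_dir_output`.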

3) Compare two images

Default metric is Euclidean:

python bin/oaphotodna.py --compare image1.jpg image2.jpg

Use Manhattan distance instead:

python bin/oaphotodna.py --compare image1.jpg image2.jpg --metric manhattan

Example output:

Distance (euclidean): 3.7417
Similarity: 0.998779
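For reference, the two metrics can be written in a few lines of plain Python. This is an illustrative sketch of the standard Euclidean and Manhattan definitions, not the script's own code; the function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Square root of the sum of squared element-wise differences."""
    if len(a) != len(b):
        raise ValueError("hash vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Sum of absolute element-wise differences."""
    if len(a) != len(b):
        raise ValueError("hash vectors must have the same length")
    return float(sum(abs(x - y) for x, y in zip(a, b)))
```

Two hashes that differ by 1, 2, and 3 in three positions and are otherwise equal have a squared Euclidean distance of 14, which gives a Euclidean distance of about 3.7417 and a Manhattan distance of 6.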

Similarity scale

The script reports a normalized similarity value between 0 and 1.

1.0 means identical hashes
values close to 1.0 mean very similar hashes
values closer to 0.0 mean more distant hashes

For Euclidean distance, similarity is derived from the maximum possible distance for a 144-dimensional hash with values in the range 0..255:

similarity = 1 - (euclidean_distance / max_possible_distance)

The FAISS query path uses the same normalization so that the similarity reported by --faiss-query is directly comparable to the Similarity: line from --compare.
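The normalization can be sketched as follows. It assumes the maximum possible distance is the one between an all-0 and an all-255 vector of 144 values, which is sqrt(144 × 255²) = 3060; the constant and function names are hypothetical, and the script may use a slightly different bound or rounding, so the last digits can differ from its output.

```python
import math

HASH_DIM = 144      # vector length stated in this README
MAX_VALUE = 255     # upper end of the 0..255 value range
# Assumed bound: distance between an all-0 and an all-255 vector.
MAX_EUCLIDEAN = math.sqrt(HASH_DIM * MAX_VALUE ** 2)


def euclidean_similarity(distance: float) -> float:
    """Map a Euclidean distance to a 0..1 similarity (1.0 = identical)."""
    similarity = 1.0 - distance / MAX_EUCLIDEAN
    # Clamp to guard against tiny floating-point overshoot.
    return max(0.0, min(1.0, similarity))
```

Under these assumptions a distance of 0 maps to 1.0, a distance of 306 maps to 0.9, and the maximum distance of 3060 maps to 0.0.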

FAISS local database

Files used

The local database consists of two files:

index.faiss — the FAISS vector index
meta.json — sidecar metadata used to map FAISS IDs back to files and hashes

What `meta.json` contains

FAISS itself stores only vectors and numeric IDs, so meta.json holds the application-level information needed to interpret search results:

dimension — vector length, normally 144
metric — stored metric type
next_id — next numeric ID to assign
items — indexed records

Each item in items contains:

id — numeric FAISS ID
path — canonicalized file path
hash — stored 144-element hash
extra — optional metadata placeholder
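To make the layout concrete, here is a sketch of what such a file might look like once loaded, plus a lookup table from FAISS ID to path. The key names come from the lists above; everything else is an assumption for illustration: that `items` is a list, the `"l2"` metric string, the example paths, and the helper name `paths_by_id`.

```python
import json
from typing import Any, Dict

# Illustrative structure only; the values are made up and the real file
# written by the script may differ in detail.
example_meta: Dict[str, Any] = {
    "dimension": 144,
    "metric": "l2",  # assumed value, not confirmed by the README
    "next_id": 2,
    "items": [
        {"id": 0, "path": "/data/images/img1.jpg", "hash": [0] * 144, "extra": None},
        {"id": 1, "path": "/data/images/img2.jpg", "hash": [1] * 144, "extra": None},
    ],
}


def paths_by_id(meta: Dict[str, Any]) -> Dict[int, str]:
    """Build an ID -> path table so FAISS result IDs can be resolved."""
    return {item["id"]: item["path"] for item in meta["items"]}


# The structure survives a JSON round trip unchanged.
reloaded = json.loads(json.dumps(example_meta))
```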

Build an index

Create a new index from a set of images:

python bin/oaphotodna.py --faiss-build index.faiss meta.json img1.jpg img2.jpg img3.jpg

Expected output:

Indexed 3 file(s) into index.faiss

Add images to an existing index

Append more images later:

python bin/oaphotodna.py --faiss-add index.faiss meta.json img4.jpg img5.jpg

Expected output:

Added 2 file(s) into index.faiss

Query the index

Search for the closest matches to a query image:

python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg

Specify the number of results to return:

python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20

Example output:

Query: query.jpg
Results: 2

[1] /data/images/img2.jpg
    id=17
    distance=3.7417
    similarity=0.998779
    distance_squared=14.0000

[2] /data/images/img7.jpg
    id=42
    distance=5.2915
    similarity=0.998273
    distance_squared=28.0000

Filter query results by similarity

Only return matches at or above a similarity threshold:

python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --min-similarity 0.95

Filter query results by Euclidean distance

Only return matches at or below a maximum Euclidean distance:

python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --max-distance 12

Combine both filters

python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --min-similarity 0.98 --max-distance 8
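The filtering behaviour described above can be sketched as a small post-processing step. This assumes each result is a dictionary with `distance` and `similarity` fields, and that when both thresholds are given a result must satisfy both; the function name `filter_results` is hypothetical and this is not the script's actual code.

```python
from typing import Any, Dict, List, Optional


def filter_results(
    results: List[Dict[str, Any]],
    min_similarity: Optional[float] = None,
    max_distance: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Keep results at or above min_similarity and at or below max_distance."""
    kept = []
    for result in results:
        if min_similarity is not None and result["similarity"] < min_similarity:
            continue  # too dissimilar
        if max_distance is not None and result["distance"] > max_distance:
            continue  # too far away
        kept.append(result)
    return kept
```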

FAISS distance notes

FAISS L2 searches return squared L2 distances rather than Euclidean distances.

The script converts that into:

distance_squared — raw FAISS value
distance — Euclidean distance (sqrt(distance_squared))
similarity — normalized 0..1 score derived from Euclidean distance
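To illustrate the relationship between the raw and converted values, here is a pure-Python stand-in for an exact L2 search. It does not use FAISS; it only mirrors the idea of ranking by squared distance and taking the square root afterwards. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance, the quantity an L2 search ranks by."""
    return float(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[int, float, float]]:
    """Brute-force nearest neighbors.

    Returns (position, distance_squared, distance) tuples, closest first.
    """
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return [(i, d2, math.sqrt(d2)) for i, d2 in scored[:top_k]]
```

Ranking by squared distance gives the same order as ranking by Euclidean distance, which is why the square root only needs to be applied to the results that are reported.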
