GitHunt
KN

knit-bee/paradedup

Find near-duplicates on paragraph level

paradedup

Find near-duplicates in a text corpus of short documents.

Installation

$ pip install git+https://github.com/knit-bee/paradedup.git

Requirements

  • Python>=3.8
  • datasketch>=1.5
  • mmhash3>=3.0

Usage

$ paradedup --help
usage: paradedup [-h] [--output OUTPUT] [--permutations PERMUTATIONS] [--lsh-threshold LSH_THRESHOLD]
                 [--shingle-size SHINGLE_SIZE] [--character-shingle] [--case-insensitive]
                 [--ignore-numbers] [--ignore-whitespace] [--ignore-punctuation]
                 directory

Find near-duplicates among documents by using minhashing and locality-sensitive hashing. See below for
options for minhashing and preprocessing.

positional arguments:
  directory             Directory of files to process

options:
  -h, --help              show this help message and exit
  --output OUTPUT, -o OUTPUT
                          Name of output file to store results of near-duplicate detection. Default is
                          'output.json'.
  --permutations PERMUTATIONS, -p PERMUTATIONS
                          Number of permutations to use for minhashing. Default is 128. This value should
                          be at least 2 and maximal 2**32.
  --lsh-threshold LSH_THRESHOLD, -l LSH_THRESHOLD
                          Threshold for locality sensitive hashing. Takes value between 0.0 and 1.0.
                          Default is 0.9
  --shingle-size SHINGLE_SIZE, -k SHINGLE_SIZE
                          Size of shingles/n-grams for set representation of documents. Default is 3.
  --character-shingle, -s
                          Create shingles/n-grams on character level. If this option is not used, shingles
                          are created on token level which whitespace and word boundary tokenization
                          performed.
  --case-insensitive, -c
                          Convert documents to lowercase.
  --ignore-numbers, -n    Remove digits from documents.
  --ignore-whitespace, -w
                          Strip whitespace characters from documents. This implies the use of --character-
                          shingle.
  --ignore-punctuation, -i
                          Strip punctuation and other special symbols from documents

Example

$ ls my-dir
file1.txt     
file2.txt     
file3.txt
$ paradedup my-dir -w --lsh-threshold 0.5 --character-shingle
$ cat output.json
{"my-dir/file1.txt": [], "my-dir/file2.txt": [["my-dir/file3.txt", 0.3984375]], "my-dir/file3.txt": [["my-dir/file2.txt", 0.3984375]]}

License

This project is licensed under the GNU General Public License v3.0.

Languages

Python100.0%

Contributors

GNU General Public License v3.0
Created August 26, 2022
Updated February 13, 2023
knit-bee/paradedup | GitHunt