redis-developer/docai_pipeline
Invoice de-duplication via Azure Form Recognizer, OpenAI, Apache Airflow and Redis Enterprise VSS
Invoice De-duplication Demo
Summary
This is a demonstration of duplicate detection for invoice documents. It uses Apache Airflow to run a task flow that performs the following:
- Local file deposit triggering of workflow
- OCR of a given invoice file via Azure Form Recognizer
- Embedding of OCR output via Azure OpenAI
- De-duplication via Redis Vector Similarity Search (VSS)
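The last step of the flow above can be illustrated without any external service. The following is a minimal sketch, not the repository's implementation: it stands in for Redis VSS with an in-memory dictionary and plain cosine similarity. The function names, the toy three-dimensional vectors, and the 0.95 threshold are all assumptions made for illustration; real embeddings from Azure OpenAI have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_duplicate(candidate, stored, threshold=0.95):
    """Return the name of the closest stored vector if it meets the
    (assumed) similarity threshold, otherwise None."""
    best_name, best_score = None, -1.0
    for name, vector in stored.items():
        score = cosine_similarity(candidate, vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical embeddings of two invoices that are already stored.
stored = {
    "invoice-a": [1.0, 0.0, 0.0],
    "invoice-b": [0.0, 1.0, 0.0],
}
near_copy = [0.99, 0.05, 0.0]  # e.g. a re-scanned copy of invoice-a
unrelated = [0.0, 0.0, 1.0]    # a new invoice

print(find_duplicate(near_copy, stored))
print(find_duplicate(unrelated, stored))
```

In the demo itself the nearest-neighbour lookup is delegated to Redis, so a linear scan like this one would only be appropriate for very small collections.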
Features
- Kubernetes architecture (local - Kind)
- Redis Enterprise: 3 node cluster
- Apache Airflow-managed workflow (DAG)
- Azure Document Intelligence form parsing (Form Recognizer)
- Azure OpenAI embedding
- Redis vector/metadata storage + search (VSS)
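To make the vector search feature more concrete, here is a hedged sketch of how a K-nearest-neighbour query for Redis could be assembled. It only builds the `FT.SEARCH` argument list and does not contact a server. The index name `invoice_idx` and field name `embedding` are hypothetical, and the sketch assumes the index stores `FLOAT32` vectors; the repository may use different names and settings.

```python
import struct


def pack_float32(vector):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def knn_search_args(index, field, vector, k=1):
    """Build the argument list for an FT.SEARCH KNN query.

    The query vector is passed as the $vec parameter, and the
    distance is returned under the alias 'score'.
    """
    query = f"*=>[KNN {k} @{field} $vec AS score]"
    return [
        "FT.SEARCH", index, query,
        "PARAMS", "2", "vec", pack_float32(vector),
        "SORTBY", "score",
        "DIALECT", "2",
    ]


# Hypothetical index and field names, toy two-dimensional vector.
args = knn_search_args("invoice_idx", "embedding", [1.0, 0.5])
print(args[2])
```

With the redis-py client, an argument list like this could be sent with `client.execute_command(*args)`; the client also offers a higher-level search API that may be preferable in practice.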
Prerequisites
- kind
- kubectl
- docker
- azure cli
- azure account
Installation
git clone https://github.com/Redislabs-Solution-Architects/docai_pipeline.git && cd docai_pipeline
- Note 1: This is scripted to be fully automatic; however, the first use of Azure's Document Intelligence API(s) requires a manual step: building a resource/deployment and then accepting Azure's AI-usage terms.
- Note 2: Apache Airflow will be writing to the local 'invoices' directory. Airflow operates with a uid of 50000 and gid of 0 (root). You will need to change the group of the dags, invoices, and logs directories such that Airflow has access to them.
sudo chgrp -R root dags
sudo chgrp -R root invoices
sudo chgrp -R root logs
Usage
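Before starting the environment, it can help to confirm that the group change described in the installation notes has taken effect. The sketch below is an optional pre-flight check, not part of the repository: the helper name `dirs_needing_chgrp` is hypothetical, and it only inspects the group id of each directory (gid 0, as stated in Note 2), not the permission bits.

```python
import os


def dirs_needing_chgrp(paths, expected_gid=0):
    """Return the existing directories whose group id differs from
    expected_gid. Paths that are not directories are skipped."""
    return [
        path for path in paths
        if os.path.isdir(path) and os.stat(path).st_gid != expected_gid
    ]


if __name__ == "__main__":
    # Directory names taken from the installation notes; run from the repo root.
    pending = dirs_needing_chgrp(["dags", "invoices", "logs"])
    if pending:
        print("Group is not root (gid 0) for:", ", ".join(pending))
    else:
        print("No existing directory has a group other than gid 0.")
```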
Start
Kubernetes Environment Build Out
./start.sh
DAG Trigger
The Invoice DAG is currently set with no schedule. This is for demo purposes. In a normal setting, this DAG would be scheduled for hourly or daily execution. Use the Airflow Admin UI to start the DAG manually via the 'Trigger DAG' button. Username: admin Password: admin
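As an alternative to the UI button, an unscheduled DAG can usually also be started through Airflow's REST API. The sketch below only constructs the HTTP request and does not send it. It assumes an Airflow 2.x deployment with the stable REST API and basic authentication enabled, which this demo may or may not configure; the base URL `http://localhost:8080` and the DAG id `invoice_dag` are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password):
    """Build (but do not send) a POST request that creates a new DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": {}}).encode(),  # empty run configuration
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder URL and DAG id; credentials as given for the demo UI.
req = build_trigger_request("http://localhost:8080", "invoice_dag", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would submit it; whether it succeeds depends on how the Airflow webserver in this environment is exposed and which API auth backend it uses.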
Stop
./stop.sh
Architecture
Task Flow
Results
Input Dataset
There are a total of 10 sample invoices. Four are duplicates:
- Adatum-1-converted.png: File format conversion of Adatum-1 (PDF to PNG)
- Adatum-2-rotated.png: Format conversion and 90 degree rotation of Adatum-2
- Contoso-3-reduced.png: Format conversion and size reduction of Contoso-3
- Contoso-4-blurred.jpg: Format conversion and blurring of Contoso-4
Airflow Variables
Airflow Status
Invoice File Dispositions
Languages
Python 56.4%, Shell 42.7%, Dockerfile 0.9%
Other
Created September 15, 2023
Updated June 21, 2024






