boorjanunezz/SmartInvoice-ETL
Intelligent solution for digitizing and managing invoices. Transforms unstructured PDF documents into processable SQL data using AI, streamlining the financial workflow.
SmartInvoice-ETL
SmartInvoice-ETL is a robust pipeline designed to extract data from invoice PDFs using Azure Document Intelligence and store it in a SQL Server database. It includes a simulation mode for development and testing without Azure costs or external dependencies.
Features
- Azure Integration: Extracts key fields (Invoice Number, Date, Client, NIF, Amount) from PDF invoices.
- SQL Server Storage: Automatically inserts extracted data into a structured relational database.
- Simulation Mode: Generates realistic mock data using Faker to test the pipeline without Azure API calls.
- Robust Error Handling: Logging and error management for production reliability.
- Secure Configuration: Uses environment variables for sensitive credentials.
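To make the simulation mode concrete, here is a minimal sketch of a mock invoice generator covering the five fields listed above. It is an illustration only: it uses the standard library's `random` module as a stand-in for Faker so it runs without extra dependencies, and the function names and dictionary keys (`fake_invoice`, `invoice_number`, and so on) are hypothetical rather than taken from the project's code.

```python
import random
import string
from datetime import date, timedelta

# Check-letter table for Spanish NIF/DNI numbers (index = number mod 23).
NIF_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def fake_nif(rng):
    """Return a syntactically valid NIF: 8 digits plus its check letter."""
    number = rng.randrange(0, 100_000_000)
    return f"{number:08d}{NIF_LETTERS[number % 23]}"


def fake_invoice(rng):
    """Build one mock invoice record (hypothetical keys, stand-in for Faker)."""
    issued = date(2024, 1, 1) + timedelta(days=rng.randrange(0, 366))
    return {
        "invoice_number": f"INV-{rng.randrange(1, 100_000):05d}",
        "date": issued.isoformat(),
        "client": "Client " + "".join(rng.choices(string.ascii_uppercase, k=4)),
        "nif": fake_nif(rng),
        "amount": round(rng.uniform(10.0, 5000.0), 2),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs give the same records
    for _ in range(3):
        print(fake_invoice(rng))
```

Passing a seeded `random.Random` instance keeps test runs reproducible, which is useful when mock records are compared against rows already inserted in the database.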
Prerequisites
- Python 3.8+
- SQL Server (Express or Standard)
- ODBC Driver 17 for SQL Server
Setup
- Clone the repository:
    git clone https://github.com/yourusername/SmartInvoice-ETL.git
    cd SmartInvoice-ETL
- Create a virtual environment:
    python -m venv venv
    .\venv\Scripts\activate      # Windows
    # source venv/bin/activate   # Linux/Mac
- Install dependencies:
    pip install -r requirements.txt
- Configure Environment:
  Create a .env file based on .env.example:
    AZURE_ENDPOINT="your_endpoint"
    AZURE_KEY="your_key"
    SQL_SERVER="localhost\SQLEXPRESS"
    SQL_DB="facturas"
    SIMULATE_DATA="True"  # Set to False to use real Azure extraction
- Initialize Database:
    python src/setup_db.py
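The variables from the Configure Environment step might be consumed along the following lines. This is a sketch under stated assumptions, not the contents of `src/config.py`: the helper names `load_settings` and `odbc_connection_string` are hypothetical, the defaults are guesses based on the example values, and Windows integrated authentication (`Trusted_Connection=yes`) is assumed because the example `.env` carries no SQL username or password. It also assumes the `.env` file has already been loaded into the process environment, for example by a tool such as python-dotenv.

```python
import os

# Strings treated as "true" when parsing SIMULATE_DATA (an assumption).
TRUTHY = {"1", "true", "yes", "on"}


def load_settings(env=os.environ):
    """Collect pipeline settings from environment variables."""
    return {
        "azure_endpoint": env.get("AZURE_ENDPOINT", ""),
        "azure_key": env.get("AZURE_KEY", ""),
        "sql_server": env.get("SQL_SERVER", r"localhost\SQLEXPRESS"),
        "sql_db": env.get("SQL_DB", "facturas"),
        # Environment values are strings, and the string "False" is truthy
        # in Python, so the flag is compared explicitly instead of bool().
        "simulate": env.get("SIMULATE_DATA", "True").strip().lower() in TRUTHY,
    }


def odbc_connection_string(settings):
    """Build a connection string for ODBC Driver 17 (integrated auth assumed)."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={settings['sql_server']};"
        f"DATABASE={settings['sql_db']};"
        "Trusted_Connection=yes;"
    )


if __name__ == "__main__":
    settings = load_settings()
    print("Simulation mode:", settings["simulate"])
    print(odbc_connection_string(settings))
```

Parsing the flag explicitly matters here: calling `bool("False")` would return `True`, so a naive conversion would leave simulation mode switched on even after setting `SIMULATE_DATA="False"`.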
Usage
Run the Pipeline
Place PDF invoices in data/input and run:
    python src/main.py
Processed files will move to data/processed, and errors to data/error.
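The file routing described here could look roughly like the sketch below. It is not the code in `src/main.py`: the function `route_pdfs` and its `process` callback are hypothetical names, and the sketch assumes that any exception raised while handling a PDF means the file belongs in the error directory.

```python
import logging
import shutil
from pathlib import Path


def route_pdfs(base_dir, process):
    """Run `process` on each PDF in <base_dir>/input, then move the file.

    Files that process cleanly go to <base_dir>/processed; files whose
    processing raises an exception go to <base_dir>/error.
    Returns a dict mapping each file name to its destination folder name.
    """
    base = Path(base_dir)
    processed, failed = base / "processed", base / "error"
    processed.mkdir(parents=True, exist_ok=True)
    failed.mkdir(parents=True, exist_ok=True)

    outcome = {}
    for pdf in sorted((base / "input").glob("*.pdf")):
        try:
            process(pdf)  # e.g. extract fields and insert them into SQL
            target = processed
        except Exception:
            # Log the traceback and keep going with the remaining files.
            logging.exception("Failed to process %s", pdf.name)
            target = failed
        shutil.move(str(pdf), str(target / pdf.name))
        outcome[pdf.name] = target.name
    return outcome
```

Catching the exception per file means one malformed invoice does not stop the rest of the batch, and moving failures aside keeps them from being retried on every run.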
Test Data Simulation
To insert simulated data directly without files:
    python src/insert_mock_data.py
Project Structure
SmartInvoice-ETL/
├── data/ # Input, processed, and error directories
├── logs/ # Execution logs
├── sql/ # SQL scripts for schema creation
├── src/
│ ├── main.py # Main ETL pipeline
│ ├── config.py # Configuration management
│ ├── utils.py # Helper functions
│ ├── setup_db.py # Database initialization script
│ └── mock_data.py # Mock data generator
├── .env.example # Template for environment variables
├── requirements.txt # Python dependencies
└── README.md # Project documentation
License
MIT