boorjanunezz/SmartInvoice-ETL

An intelligent solution for digitizing and managing invoices. It transforms unstructured PDF documents into processable SQL data using AI, streamlining the financial workflow.

SmartInvoice-ETL

SmartInvoice-ETL is a pipeline that extracts data from invoice PDFs using Azure Document Intelligence and stores it in a SQL Server database. It includes a simulation mode for development and testing without Azure costs or calls to external services.

Features

  • Azure Integration: Extracts key fields (invoice number, date, client, NIF, amount) from PDF invoices using Azure Document Intelligence.
  • SQL Server Storage: Automatically inserts extracted data into a structured relational database.
  • Simulation Mode: Generates realistic mock data using Faker to test the pipeline without Azure API calls.
  • Robust Error Handling: Writes execution logs to logs/ and moves invoices that fail processing to data/error, so a bad file does not block the rest of the run.
  • Secure Configuration: Uses environment variables for sensitive credentials.
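
As an illustration of the Simulation Mode feature above, here is a minimal sketch of a mock invoice generator. It uses Python's standard `random` module as a stand-in for Faker so that it runs without extra packages. The `Invoice` dataclass, `nif_for`, `make_mock_invoice`, the sample client names, and the `INV-` number format are all hypothetical and are not taken from the project's code.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

# Control letters for Spanish NIF/DNI numbers, indexed by (number mod 23).
NIF_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


@dataclass
class Invoice:
    """Hypothetical record holding the five fields listed under Features."""
    invoice_number: str
    invoice_date: date
    client: str
    nif: str
    amount: float


def nif_for(number: int) -> str:
    """Return an 8-digit NIF followed by its mod-23 control letter."""
    return f"{number:08d}{NIF_LETTERS[number % 23]}"


def make_mock_invoice(rng: random.Random) -> Invoice:
    """Build one mock invoice; stdlib `random` stands in for Faker here."""
    return Invoice(
        invoice_number=f"INV-{rng.randint(1, 99999):05d}",
        invoice_date=date(2024, 1, 1) + timedelta(days=rng.randint(0, 364)),
        client=rng.choice(["Acme S.L.", "Globex S.A.", "Initech S.L."]),
        nif=nif_for(rng.randint(0, 99_999_999)),
        amount=round(rng.uniform(10.0, 5000.0), 2),
    )


if __name__ == "__main__":
    # A seeded generator makes test runs repeatable.
    print(make_mock_invoice(random.Random(42)))
```

Passing in a seeded `random.Random` keeps the generated data reproducible between runs, which can make pipeline tests easier to debug.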

Prerequisites

  • Python 3.8+
  • SQL Server (Express or Standard)
  • ODBC Driver 17 for SQL Server

Setup

  1. Clone the repository:

    git clone https://github.com/yourusername/SmartInvoice-ETL.git
    cd SmartInvoice-ETL
  2. Create a virtual environment:

    python -m venv venv
    .\venv\Scripts\activate  # Windows
    # source venv/bin/activate  # Linux/Mac
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure Environment:
    Create a .env file based on .env.example:

    AZURE_ENDPOINT="your_endpoint"
    AZURE_KEY="your_key"
    SQL_SERVER="localhost\SQLEXPRESS"
    SQL_DB="facturas"
    SIMULATE_DATA="True"  # Set to False to use real Azure extraction
  5. Initialize Database:

    python src/setup_db.py
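
The configuration step above can be sketched as follows. This is a minimal example, not the project's actual `config.py`: it assumes the `.env` values have already been loaded into the process environment (for example by a loader such as python-dotenv, which the project may or may not use) and that SQL Server is reached with Windows authentication (`Trusted_Connection=yes`). The `Settings` class, `load_settings`, and `connection_string` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Strings accepted as "true" for SIMULATE_DATA (an assumption of this sketch).
TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    azure_endpoint: str
    azure_key: str
    sql_server: str
    sql_db: str
    simulate_data: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from the .env example into a typed object."""
    return Settings(
        azure_endpoint=env.get("AZURE_ENDPOINT", ""),
        azure_key=env.get("AZURE_KEY", ""),
        sql_server=env.get("SQL_SERVER", r"localhost\SQLEXPRESS"),
        sql_db=env.get("SQL_DB", "facturas"),
        # Compare the text explicitly: bool("False") would be True in Python.
        simulate_data=env.get("SIMULATE_DATA", "True").strip().lower() in TRUTHY,
    )


def connection_string(settings: Settings) -> str:
    """Build an ODBC connection string for ODBC Driver 17 for SQL Server."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={settings.sql_server};"
        f"DATABASE={settings.sql_db};"
        "Trusted_Connection=yes;"  # assumption: Windows authentication
    )


if __name__ == "__main__":
    settings = load_settings()
    print("Simulation mode:", settings.simulate_data)
    print(connection_string(settings))
```

Parsing `SIMULATE_DATA` by comparing text avoids a common pitfall: any non-empty string, including `"False"`, is truthy when passed to `bool()`.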

Usage

Run the Pipeline

Place PDF invoices in data/input and run:

python src/main.py

Successfully processed files are moved to data/processed; files that fail are moved to data/error.
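
The file routing described above might look roughly like the sketch below. It is an assumption about how such a loop could be written, not a copy of `src/main.py`: `route_invoices` and the `process` callback are hypothetical, and the callback stands in for whatever extraction and database insert the real pipeline performs.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable, Dict

logger = logging.getLogger("smartinvoice")


def route_invoices(base: Path, process: Callable[[Path], None]) -> Dict[str, int]:
    """Run `process` on each PDF in base/input, then move the file to
    base/processed on success or base/error on failure."""
    input_dir = base / "input"
    processed_dir = base / "processed"
    error_dir = base / "error"
    for directory in (input_dir, processed_dir, error_dir):
        directory.mkdir(parents=True, exist_ok=True)

    counts = {"processed": 0, "error": 0}
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            process(pdf)
        except Exception:
            # Log the traceback and keep going so one bad file
            # does not stop the rest of the batch.
            logger.exception("Failed to process %s", pdf.name)
            target, key = error_dir, "error"
        else:
            target, key = processed_dir, "processed"
        shutil.move(str(pdf), str(target / pdf.name))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Placeholder callback: a real run would extract fields and insert them.
    print(route_invoices(Path("data"), lambda pdf: None))
```

Catching the exception per file rather than around the whole loop is what lets the remaining invoices continue after a single failure.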

Test Data Simulation

To insert simulated data directly into the database, without any input PDFs:

python src/insert_mock_data.py
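
For readers who want to see the general shape of such an insert, here is a minimal sketch. To stay runnable without a SQL Server instance it uses Python's built-in `sqlite3` with an in-memory database as a stand-in; both `sqlite3` and pyodbc use `?` as the parameter marker, so the parameterized `INSERT` carries over, but the `CREATE TABLE` statement here uses SQLite types. The `invoices` table, its column names, and `insert_invoices` are hypothetical and do not come from the project's `sql/` scripts.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical schema, written with SQLite types for this stand-in demo.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS invoices (
    invoice_number TEXT PRIMARY KEY,
    invoice_date   TEXT NOT NULL,
    client         TEXT NOT NULL,
    nif            TEXT NOT NULL,
    amount         REAL NOT NULL
)
"""

# Parameterized statement: values are bound, never formatted into the SQL text.
INSERT_SQL = (
    "INSERT INTO invoices (invoice_number, invoice_date, client, nif, amount) "
    "VALUES (?, ?, ?, ?, ?)"
)

Row = Tuple[str, str, str, str, float]


def insert_invoices(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert all rows in one batch and commit once at the end."""
    cursor = conn.cursor()
    cursor.executemany(INSERT_SQL, list(rows))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_SQL)
    insert_invoices(conn, [
        ("INV-00001", "2024-01-15", "Acme S.L.", "12345678Z", 120.50),
        ("INV-00002", "2024-02-03", "Globex S.A.", "00000000T", 980.00),
    ])
    print(conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0])
```

Binding values through `?` markers instead of building SQL strings by hand guards against injection and quoting errors, which matters when client names come from extracted documents.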

Project Structure

SmartInvoice-ETL/
├── data/               # Input, processed, and error directories
├── logs/               # Execution logs
├── sql/                # SQL scripts for schema creation
├── src/
│   ├── main.py         # Main ETL pipeline
│   ├── config.py       # Configuration management
│   ├── utils.py        # Helper functions
│   ├── setup_db.py     # Database initialization script
│   └── mock_data.py    # Mock data generator
├── .env.example        # Template for environment variables
├── requirements.txt    # Python dependencies
└── README.md           # Project documentation

License

MIT

Languages

Python 96.9%, TSQL 3.1%
Created February 17, 2026
Updated February 17, 2026