GitHunt
SR

srmklive/simple-llm-extractor

LLM Knowledge Extractor

A tiny FastAPI service that accepts unstructured text and uses an LLM to produce a 1–2 sentence summary and structured metadata, plus locally computed keywords (the 3 most frequent nouns). Results are persisted in SQLite and searchable by topic/keyword.


Local Environment Setup

  1. Setup a Python virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  1. Install dependencies:
pip install -r requirements.txt
  1. Update the environment variables:

A. Create a .env file.
B. Set the OpenAI API key.
C. (Optional) Set the LLM model and mode.

OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
LLM_MODE=mock

or you can run the following commands in the terminal:

export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini
export LLM_MODE=mock
  1. Run the app:
uvicorn app.main:app --reload

The docs (Swagger) can be found at http://127.0.0.1:8000/docs.

Example

curl -s -X POST http://127.0.0.1:8000/analyze \  -H "Content-Type: application/json" \  -d '{"text":"OpenAI announced a new AI model today. Cloud providers reacted positively."}' | jq .

curl -s "http://127.0.0.1:8000/search?topic=AI" | jq .

Endpoints

  • POST /analyze → analyze one text or a batch

    • Body:
      • text: string or texts: string[]
    • Returns: results: AnalysisPayload[] (each contains id, title, summary, sentiment, topics[3], keywords[3], confidence)
  • GET /search?topic=xyz → return stored analyses with matching topic/keyword


Persistence

Uses SQLite (analyzer.db) via SQLAlchemy.

  • The analyses table stores title, summary, sentiment, topics (JSON), keywords (JSON), raw input, and timestamp.

Responses and Error Handling

  • Empty input → 400 with a clear message.
  • LLM API failure → 502 with error details; the server stays healthy.
  • You can set LLM_MODE=mock to avoid external API calls entirely.

Keywords (Noun) Extraction

The app/keywords.py module does not call an LLM. It implements a light tokenizer and counts probable nouns using:

  • capitalization (non-sentence-start),
  • morphology suffixes (e.g., -tion, -ment, -ity), and
  • hyphenated tokens.

Bonus

  • Tests: pytest -q (uses LLM_MODE=mock).
  • Docker:
    docker build -t llm-extractor:latest .
    docker run -p 8000:8000 -e LLM_MODE=mock llm-extractor:latest
  • Confidence score: a simple heuristic combining text length and keyword variety (0.0–1.0).

Design Choices

I chose FastAPI + SQLite to ship a full vertical slice quickly with minimal dependencies and great developer ergonomics. The LLM integration is abstracted behind app/llm.py, which supports a mock mode for deterministic tests and local runs without network access. Keyword extraction is implemented locally in app/keywords.py with a small heuristic (no LLM calls) to satisfy the “implement yourself” requirement while avoiding heavyweight model downloads. Data storage is normalized into a single analyses table with JSON columns for topics/keywords so that search is simple but still flexible. Error handling ensures empty inputs and LLM failures return useful messages rather than crashing the process.


Trade-offs

  • The noun detector is intentionally simple; adding spaCy POS tagging would improve accuracy at the cost of setup time.
  • Search uses LIKE over JSON strings; for larger datasets, you'd switch to Postgres with JSONB + GIN indexes or FTS.
  • The prompt parser assumes a reasonably well-formed LLM reply; production code would use a stricter schema and retries.
  • Batch processing is sequential for simplicity; a worker queue would scale better if needed.

Languages

Python97.0%Dockerfile3.0%

Contributors

Created September 15, 2025
Updated September 15, 2025
srmklive/simple-llm-extractor | GitHunt