# LLM Knowledge Extractor
A tiny FastAPI service that accepts unstructured text and uses an LLM to produce a 1–2 sentence summary and structured metadata, plus locally computed keywords (the 3 most frequent nouns). Results are persisted in SQLite and searchable by topic/keyword.
## Local Environment Setup
- Set up a Python virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows use venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Update the environment variables:

  A. Create a `.env` file.
  B. Set the OpenAI API key.
  C. (Optional) Set the LLM model and mode.

  ```
  OPENAI_API_KEY=sk-...
  OPENAI_MODEL=gpt-4o-mini
  LLM_MODE=mock
  ```

  Alternatively, run the following commands in the terminal:

  ```bash
  export OPENAI_API_KEY=sk-...
  export OPENAI_MODEL=gpt-4o-mini
  export LLM_MODE=mock
  ```

- Run the app:

  ```bash
  uvicorn app.main:app --reload
  ```

  The docs (Swagger UI) are at http://127.0.0.1:8000/docs.
## Example

```bash
curl -s -X POST http://127.0.0.1:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"text":"OpenAI announced a new AI model today. Cloud providers reacted positively."}' | jq .
```

```bash
curl -s "http://127.0.0.1:8000/search?topic=AI" | jq .
```

## Endpoints
- `POST /analyze` → analyze one text or a batch
  - Body: `text: string` or `texts: string[]`
  - Returns: `results: AnalysisPayload[]` (each contains `id`, `title`, `summary`, `sentiment`, `topics[3]`, `keywords[3]`, `confidence`)
- `GET /search?topic=xyz` → return stored analyses with a matching topic/keyword
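For illustration, a single-text `/analyze` call might return a payload shaped like this (the field values below are made up, not real output):

```json
{
  "results": [
    {
      "id": 1,
      "title": "OpenAI announces a new model",
      "summary": "OpenAI announced a new AI model; cloud providers reacted positively.",
      "sentiment": "positive",
      "topics": ["ai", "openai", "cloud"],
      "keywords": ["model", "providers", "openai"],
      "confidence": 0.62
    }
  ]
}
```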
## Persistence

Uses SQLite (`analyzer.db`) via SQLAlchemy.

- The `analyses` table stores title, summary, sentiment, topics (JSON), keywords (JSON), raw input, and timestamp.
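As a rough sketch, the ORM model could look like the following (assumed; the real definitions in the repo may use different names or types):

```python
# A sketch of the ORM model, assuming typical SQLAlchemy usage.
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Analysis(Base):
    __tablename__ = "analyses"

    id = Column(Integer, primary_key=True)
    title = Column(String)
    summary = Column(Text)
    sentiment = Column(String)     # e.g., "positive" / "neutral" / "negative"
    topics = Column(Text)          # JSON-encoded list of 3 topics
    keywords = Column(Text)        # JSON-encoded list of 3 keywords
    raw_text = Column(Text)        # the original input text
    created_at = Column(DateTime, default=datetime.utcnow)
```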
## Responses and Error Handling
- Empty input → 400 with a clear message.
- LLM API failure → 502 with error details; the server stays healthy.
- You can set `LLM_MODE=mock` to avoid external API calls entirely.
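The behavior can be pictured with a sketch like this (illustrative only; `call_llm` is a hypothetical stand-in for the real wrapper in `app/llm.py`):

```python
# Illustrative sketch of the error-handling behavior; not the project's actual handlers.
from fastapi import FastAPI, HTTPException

app = FastAPI()

def call_llm(text: str) -> dict:
    """Stand-in for the real LLM wrapper (app/llm.py)."""
    return {"summary": text[:80], "sentiment": "neutral"}

@app.post("/analyze")
def analyze(payload: dict) -> dict:
    text = (payload.get("text") or "").strip()
    if not text:
        # Empty input -> 400 with a clear message
        raise HTTPException(status_code=400, detail="Input text must not be empty.")
    try:
        result = call_llm(text)
    except Exception as exc:
        # LLM API failure -> 502 with error details; the process keeps serving
        raise HTTPException(status_code=502, detail=f"LLM call failed: {exc}") from exc
    return {"results": [result]}
```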
## Keyword (Noun) Extraction
The `app/keywords.py` module does not call an LLM. It implements a light tokenizer and counts probable nouns using:

- capitalization (non-sentence-start),
- morphology suffixes (e.g., `-tion`, `-ment`, `-ity`), and
- hyphenated tokens.
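A minimal sketch of that heuristic (assumed; the actual `app/keywords.py` may differ in detail):

```python
# A minimal sketch of the noun heuristic; the real module may differ.
import re
from collections import Counter

SUFFIXES = ("tion", "ment", "ity")  # example noun-ish suffixes; the full list is assumed

def top_keywords(text: str, k: int = 3) -> list[str]:
    """Return the k most frequent probable nouns."""
    counts = Counter()
    # Split on sentence-ending punctuation so sentence-initial capitals can be ignored.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[A-Za-z][A-Za-z-]*", sentence)
        for i, tok in enumerate(tokens):
            noun_like = (
                (i > 0 and tok[0].isupper())       # capitalized, not sentence-start
                or tok.lower().endswith(SUFFIXES)  # morphology suffix
                or "-" in tok                      # hyphenated token
            )
            if noun_like:
                counts[tok.lower()] += 1
    return [word for word, _ in counts.most_common(k)]
```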
## Bonus

- Tests: `pytest -q` (uses `LLM_MODE=mock`).
- Docker:

  ```bash
  docker build -t llm-extractor:latest .
  docker run -p 8000:8000 -e LLM_MODE=mock llm-extractor:latest
  ```

- Confidence score: a simple heuristic combining text length and keyword variety (`0.0`–`1.0`).
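One possible shape for such a heuristic (an assumption for illustration, not the project's exact formula):

```python
# An assumed illustration of a confidence heuristic; not the project's exact formula.
def confidence_score(text: str, keywords: list[str]) -> float:
    """Combine text length and keyword variety into a 0.0-1.0 score."""
    length_part = min(len(text) / 1000, 1.0)   # longer inputs give the LLM more signal
    variety_part = len(set(keywords)) / 3      # up to 3 distinct keywords
    return round(0.5 * length_part + 0.5 * variety_part, 2)
```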
## Design Choices
I chose FastAPI + SQLite to ship a full vertical slice quickly with minimal dependencies and great developer ergonomics. The LLM integration is abstracted behind `app/llm.py`, which supports a mock mode for deterministic tests and local runs without network access. Keyword extraction is implemented locally in `app/keywords.py` with a small heuristic (no LLM calls) to satisfy the “implement yourself” requirement while avoiding heavyweight model downloads. Data storage is normalized into a single `analyses` table with JSON columns for topics/keywords so that search is simple but still flexible. Error handling ensures empty inputs and LLM failures return useful messages rather than crashing the process.
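As a rough illustration of that abstraction (a sketch, not the actual `app/llm.py`; the prompt, response parsing, and client setup here are assumptions):

```python
# A rough sketch of the mock-mode switch; the real app/llm.py likely differs.
import os

def analyze_text(text: str) -> dict:
    if os.getenv("LLM_MODE", "").lower() == "mock":
        # Deterministic output for tests and offline runs
        return {
            "summary": text[:100],
            "sentiment": "neutral",
            "topics": ["topic-a", "topic-b", "topic-c"],  # placeholder values
        }
    # Real mode: call OpenAI (requires the openai package and OPENAI_API_KEY)
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize and tag this text:\n{text}"}],
    )
    return {"summary": resp.choices[0].message.content}
```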
## Trade-offs
- The noun detector is intentionally simple; adding spaCy POS tagging would improve accuracy at the cost of setup time.
- Search uses `LIKE` over JSON strings; for larger datasets, you'd switch to Postgres with JSONB + GIN indexes or FTS.
- The prompt parser assumes a reasonably well-formed LLM reply; production code would use a stricter schema and retries.
- Batch processing is sequential for simplicity; a worker queue would scale better if needed.
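For reference, the `LIKE`-based lookup is roughly of this shape (an assumed sketch reusing the `Analysis` model sketched under Persistence; the actual query may differ):

```python
# Assumed sketch of the LIKE-based search over JSON-encoded columns.
from sqlalchemy.orm import Session

def search_by_topic(session: Session, topic: str):
    pattern = f"%{topic}%"
    return (
        session.query(Analysis)
        .filter(Analysis.topics.like(pattern) | Analysis.keywords.like(pattern))
        .all()
    )
```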