Before touching a single node in n8n, I asked Gemini a sensible question: for this problem, is n8n or Python the better choice?
Gemini said n8n, clearly and confidently. It was the right tool for orchestration, it said. Visual workflows, lower barrier, easier to iterate. Perfect for this use case.
I want to note this for the record, because it becomes relevant later.
The architecture I had in mind had several parts.
The harvester. A one-time bootstrap job to take all existing Paperless documents and push their content into ChromaDB as vector embeddings. Once that is done, the system has a knowledge base of everything I have ever filed. For new documents going forward, the same processing runs on document upload — triggered by a Paperless webhook, not on a schedule.
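The harvester's shape can be sketched in a few lines. This is a minimal sketch, not the actual workflow: `fetch_page`, `embed`, and `store` stand in for the Paperless API, the Mistral embedding call, and the ChromaDB write, and the paginated `{"results": [...], "next": ...}` response shape is an assumption about the Paperless listing endpoint.

```python
# Minimal sketch of the one-time bootstrap harvester. The dependencies
# are injected so nothing here pretends to be the real Paperless or
# ChromaDB client.
def paginate(fetch_page):
    """Yield documents across a paginated listing endpoint.

    fetch_page(page) is assumed to return a dict with 'results'
    (a list of documents) and 'next' (falsy on the last page).
    """
    page = 1
    while True:
        data = fetch_page(page)
        yield from data["results"]
        if not data.get("next"):
            break
        page += 1

def harvest(fetch_page, embed, store):
    """Bootstrap: embed every existing document and store the vector."""
    for doc in paginate(fetch_page):
        store(doc["id"], embed(doc["content"]), {"title": doc["title"]})
```

The same `harvest` body, pointed at a single webhook payload instead of the full listing, covers the per-upload path too.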
For the embeddings I used Mistral’s embedding model. It supports batch processing, which matters more than it sounds. Sending 850 documents one at a time would have cost roughly fourteen euro cents. Sending them in batches: seven cents. Fifty percent cost reduction for zero extra complexity. I ran the harvester and watched it tick through the documents. It worked almost immediately. I had not expected that. Solid day’s effort, start to finish.
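Batching here just means packing many document texts into each embeddings request instead of sending one request per document. A sketch of the chunking, assuming a batch size of 100 (the article does not state the size actually used):

```python
def batches(items, size=100):
    """Split items into request-sized chunks.

    The batch size of 100 is an illustrative assumption, not a
    documented Mistral limit.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]

# One embeddings API call per chunk instead of one per document,
# e.g. client.embeddings.create(model=..., inputs=chunk) for each chunk.
```

With 850 documents and batches of 100, that is nine API calls instead of 850.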
The RAG decision logic. For each new document coming into Paperless, the workflow would run OCR, produce clean text, embed it, and query ChromaDB for similar documents. Then a decision:
- Very similar (high cosine similarity): this document looks almost identical to existing ones. Take the metadata from the nearest match and apply it directly.
- Somewhat similar: the document type is probably the same, but some fields need fresh attention. Use the RAG context to guide the LLM.
- Not similar at all: the LLM handles it from scratch, and the document gets flagged with a check_structure tag for human review.
That last category is what I was calling human-in-the-loop. The idea being that I spend less time on documents the system is confident about, and more time on genuinely unusual ones.
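The three-way routing boils down to a single function over the cosine-similarity score of the nearest match. The thresholds below are illustrative assumptions; the article does not give the actual cut-offs:

```python
def route(similarity, high=0.92, low=0.75):
    """Map a cosine-similarity score to one of the three paths.

    The 0.92/0.75 thresholds are made up for illustration, not the
    values actually used in the workflow.
    """
    if similarity >= high:
        return "copy_metadata"      # near-duplicate: reuse nearest match's metadata
    if similarity >= low:
        return "rag_guided_llm"     # same type: LLM with RAG context
    return "llm_from_scratch"       # unknown: also gets the check_structure tag
```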
The tag ontology. This was the part I had been neglecting. Tags in Paperless are not just labels. Used well, they represent a structure. I reorganised mine into two parallel systems. First, spheres: not generic categories like “household” or “finance”, but specific personal context names — one per person or life domain in my household. Second, semantic document-type tags: Rechnung (invoice), Bescheid (official notice), Vertrag (contract), Kontoauszug (bank statement). And alongside those, system flags such as steuerrelevant (tax-relevant) and laufende_Kosten (recurring expense) that cut across all spheres. Colour-coded in Paperless: blue tags for sphere assignments, purple for system classifications. The LLM needed a defined vocabulary to write into, not an open field.
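As a sketch, that defined vocabulary can live as plain sets the workflow checks LLM output against. The sphere names are omitted here because they are personal; the rest mirrors the tags named above, plus the check_structure flag. Treat this as an assumed shape, not the actual configuration:

```python
# Illustrative tag vocabulary; sphere tags (one per person or life
# domain) are placeholders since the real names are personal.
SPHERES = {"sphere_placeholder_1", "sphere_placeholder_2"}
DOC_TYPES = {"Rechnung", "Bescheid", "Vertrag", "Kontoauszug"}
SYSTEM_FLAGS = {"steuerrelevant", "laufende_Kosten", "check_structure"}
ALLOWED = SPHERES | DOC_TYPES | SYSTEM_FLAGS

def validate_tags(tags):
    """Drop anything the LLM invented outside the defined vocabulary."""
    return [t for t in tags if t in ALLOWED]
```

The point of the validator is exactly the closing sentence above: the LLM writes into a closed vocabulary, and anything outside it is silently discarded rather than created as a new tag.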
Financial data in MariaDB. For invoices, the workflow would extract line items, amounts, and VAT figures and write them to the database. The long-term goal: a proper BI dashboard. Nothing fancy yet, just structured data somewhere I can query it. The ten-year-old MariaDB gets a new purpose.
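Before anything reaches MariaDB, the extracted line items need aggregating into the net, VAT, and gross figures a dashboard would query. A sketch, assuming each extracted item carries a net amount and a VAT rate; the real extraction schema is not specified in the article:

```python
from decimal import Decimal

def totals(line_items):
    """Aggregate extracted invoice line items for the BI tables.

    Each item as {'net': Decimal, 'vat_rate': Decimal} is an assumed
    shape. Decimal (not float) keeps money arithmetic exact.
    """
    net = sum((i["net"] for i in line_items), Decimal("0"))
    vat = sum((i["net"] * i["vat_rate"] for i in line_items), Decimal("0"))
    return net, vat, net + vat
```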
The gold standard loop. When I verify metadata manually and mark a document as correct, that verified metadata should go back into ChromaDB. The system learns from human corrections. Future queries get better results. The flywheel spins.
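Mechanically, the feedback step is an upsert: overwrite the document's stored entry so the verified metadata wins over whatever the first pass wrote. A sketch assuming a ChromaDB-style collection with an `upsert` method; the `verified` field is illustrative:

```python
def feed_back(collection, doc_id, embedding, verified_metadata):
    """Write human-verified metadata back into the vector store.

    `collection` is assumed to follow ChromaDB's upsert signature
    (ids/embeddings/metadatas lists); future similarity queries then
    return the corrected metadata for this document.
    """
    collection.upsert(
        ids=[str(doc_id)],
        embeddings=[embedding],
        metadatas=[{**verified_metadata, "verified": True}],
    )
```

Marking the entry as verified also gives the retrieval side something to rank on later, if verified neighbours should outweigh unverified ones.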
I want to be honest about how this felt at the design stage. It was exciting. The pieces fit together logically. RAG is not a complicated concept once you have seen it in practice: you embed things, you store them, you query by similarity, you use the results to inform the LLM’s response. Mistral’s embedding model handles the chunking cleanly. ChromaDB is a perfectly reasonable vector database at homelab scale. The whole thing made sense.
I had ChromaDB running. I had the harvester working. The architecture was clear.
What I had not yet done was build the main workflow.
That is where Gemini’s advice about n8n being my friend was going to be tested.
Lernreise 4/7. Follow the lernreise tag for the full series.