Building BuchhalterPython: Architecture Before the First Commit (Part 2)

Part 2 of the BuchhalterPython series. Part 1 covers how we set up agentic infrastructure — six specialised agents, golden standards, and token optimisation — before writing any application code.

We spent a full day on architecture before writing a single line of business logic. No features, no endpoints, no database schemas. Just decisions. By the end of that day, we had five architectural choices that each prevented at least one production failure. Three of those failures would have been silent.

That’s the part I want to dwell on: silent failures. A crashed service gives you a stack trace. A silent error gives you wrong financial data in your accounting system, quietly, for months, until someone notices a discrepancy. Silent errors in financial software aren’t bugs. They’re time bombs.

Three Entry Points, Not One Webhook

The first question was how to trigger the pipeline. The obvious answer: one webhook from Paperless-ngx, then an if statement branching based on what kind of document came in.

The less obvious answer, which we chose: three separate entry points.

New documents arrive with a Neu tag. They need OCR. The full pipeline runs.

Re-classification requests arrive with a ReRAG tag. The text already exists, extracted in a previous run. We don’t re-run OCR. We re-run the RAG classification against an updated goldstandard.

Human-in-the-loop promotions are triggered by the absence of a tag. When a reviewer removes the Check_Struktur tag from a document, that’s an approval signal. The document gets promoted to the ChromaDB goldstandard.

These are three fundamentally different situations. They share almost no code path. Collapsing them into one entry point with branching logic would have created an if-forest that nobody, human or agent, could safely modify six months later. Three entry points mean three independently testable, independently deployable flows.
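To make the separation concrete, here is a minimal dispatch sketch. The payload shape, handler names, and return values are illustrative assumptions, not BuchhalterPython's actual code; the point is that each signal routes to exactly one independent flow, with no shared if-forest.

```python
# Sketch of the three-entry-point dispatch. Payload shape and handler
# names are illustrative assumptions, not the project's actual API.

def handle_new_document(doc_id: int) -> str:
    # EP1: "Neu" tag -> OCR plus full classification
    return f"full_pipeline:{doc_id}"

def handle_reclassification(doc_id: int) -> str:
    # EP2: "ReRAG" tag -> text already extracted, re-run RAG only
    return f"rag_only:{doc_id}"

def handle_promotion(doc_id: int) -> str:
    # EP3: "Check_Struktur" removed -> promote to goldstandard
    return f"promote:{doc_id}"

def dispatch(event: dict) -> str:
    """Route a Paperless-ngx signal to exactly one independent flow."""
    tags = set(event.get("tags", []))
    if "Neu" in tags:
        return handle_new_document(event["doc_id"])
    if "ReRAG" in tags:
        return handle_reclassification(event["doc_id"])
    if event.get("removed_tag") == "Check_Struktur":
        return handle_promotion(event["doc_id"])
    raise ValueError("no matching entry point")
```

Each handler can be tested and deployed on its own, which is the whole argument for three entry points.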

[Diagram: Paperless-ngx signals routed to three flows.
EP1, New Document (tag "Neu"): Mistral OCR, full classification (RAG + LLM), then ChromaDB store + Paperless update.
EP2, Re-classification (tag "ReRAG"): text exists, skip OCR; RAG classification against the updated goldstandard, then Paperless update.
EP3, Human Approval ("Check_Struktur" removed): promote document to the ChromaDB goldstandard.]
Three entry points — each triggered by a different Paperless-ngx signal

Six Dimensions, Not a Tag Graveyard

The second decision was taxonomic. Paperless-ngx uses tags. Tags are flat. The temptation is to put everything in tags: categories, states, triggers, topics. The result is a graveyard where nobody knows what any given tag means or whether it’s still being used.

We defined six distinct organisational dimensions instead:

Technical owner: the Paperless user who physically manages the document. This is distinct from sphere. A technical owner may manage documents on behalf of family members or dependants, precisely because they carry responsibility for those people’s affairs. There are currently two technical owners in this installation. The technical owner is a Paperless user account, not a tag.

Spheres (blue, #1976d2): the entity a document belongs to. Not the person who manages it, but the person, animal, or organisation it concerns; spheres for current entities are loaded live from the Paperless API. A technical owner can have multiple spheres. One of ours is Conchita, a family member who happened to be a dog, and who had vet bills, insurance documents, and a paper trail that deserved its own sphere just as much as any human's. If a sphere needs to be hardcoded because the entity no longer exists and will never appear in a live Paperless API pull, that is a valid architectural decision. Conchita gets her sphere.

Semantic system tags (purple, #7b1fa2): content properties. What kind of content is this? Also live from the Paperless API.

Control tags (orange, #f57c00): process triggers. Neu and ReRAG get removed after processing. noKI is permanent. These tags drive pipeline behaviour.

Correspondents: who sent the document. A structured Paperless entity, not a free-text tag.

doc_type_semantic: a custom field holding an empirically derived type from real documents. Not a taxonomy invented on a whiteboard. Built from observation.

The colour coding is not cosmetic. In the Paperless UI, you see at a glance which tags are control tags (orange) and which are content properties (purple). This matters when a human reviewer is working through a queue at speed. Good information architecture removes the need for documentation.
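The taxonomy can be captured as a small configuration map so that humans and agents read the same contract. The colour codes and tag names below come from the taxonomy above; the dictionary layout itself is a hypothetical sketch, not the project's actual config format.

```python
# Hypothetical configuration for the six organisational dimensions.
# Colour codes and tag names are from the taxonomy; the structure
# is an illustrative assumption.

DIMENSIONS = {
    "technical_owner":   {"kind": "paperless_user", "source": "account"},
    "sphere":            {"kind": "tag", "colour": "#1976d2", "source": "paperless_api"},
    "semantic":          {"kind": "tag", "colour": "#7b1fa2", "source": "paperless_api"},
    "control":           {"kind": "tag", "colour": "#f57c00", "source": "hardcoded"},
    "correspondent":     {"kind": "entity", "source": "paperless_api"},
    "doc_type_semantic": {"kind": "custom_field", "source": "empirical"},
}

# Control tags drive pipeline behaviour: two are removed after
# processing, one is permanent.
CONTROL_TAGS = {"Neu": "temporary", "ReRAG": "temporary", "noKI": "permanent"}

def is_control_tag(name: str) -> bool:
    return name in CONTROL_TAGS
```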

[Diagram: Document Organisation: Six Dimensions.
Technical Owner (technical_owner): Paperless user who manages docs (currently 2 owners); a Paperless user account, not a tag.
Spheres: who is this for? Person, animal, organisation, e.g. Max, Conchita, Company; live from the Paperless API (colour #1976d2).
Semantic Tags: what kind of content? e.g. steuerrelevant, laufende_Kosten, einmalig; live from the Paperless API (colour #7b1fa2).
Control Tags: what action to trigger? Neu, ReRAG (temporary), noKI (permanent); hardcoded, drive pipeline behaviour.
Correspondents: who sent it? Structured Paperless entity with name, city, country; not a tag.
Doc Type: what type of document? Rechnung, Vertrag, Bescheid, Kontoauszug, Garantiezertifikat...; custom field that drives the storage path template.]
Six organisational dimensions — each with a distinct role, colour, and data source

Storage Paths as Semantic Contracts

The third decision addressed where documents live on disk. We use Jinja templates to derive storage paths from document metadata.

Periodic documents, such as invoices and statements, are year-based: sphere/correspondents/year/doc_type. Timeless documents, such as contracts and certificates, omit the year folder: sphere/correspondents/doc_type.
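The two schemes can be sketched in a few lines. The real pipeline uses Jinja templates inside Paperless-ngx; this is a plain-Python approximation, and the set of timeless document types is an illustrative assumption.

```python
# Plain-Python approximation of the two storage-path schemes. The real
# pipeline derives paths via Jinja templates in Paperless-ngx; the
# TIMELESS_TYPES set here is an illustrative assumption.

TIMELESS_TYPES = {"Vertrag", "Garantiezertifikat"}  # contracts, certificates

def storage_path(sphere: str, correspondent: str, doc_type: str, year: int) -> str:
    if doc_type in TIMELESS_TYPES:
        # Timeless documents omit the year folder.
        return f"{sphere}/{correspondent}/{doc_type}"
    # Periodic documents (invoices, statements) are year-based.
    return f"{sphere}/{correspondent}/{year}/{doc_type}"
```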

This isn’t just organisation. The storage path is embedded in the ChromaDB embed text. Every document in the goldstandard carries its own filing address as part of its semantic representation. That means RAG doesn’t just return a document type. It returns a complete filing instruction. And it improves automatically as more goldstandard documents are added, because each new example teaches the system where a document like this belongs.

The filing scheme becomes a self-improving knowledge base.
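One way to make the filing address part of the semantic representation is simply to include it in the text that gets embedded. The field layout below is a hypothetical sketch, not the project's exact embed format.

```python
def build_embed_text(doc_type: str, correspondent: str, summary: str,
                     storage_path: str) -> str:
    """Compose the text sent to the embedding model so every
    goldstandard example carries its own filing address. The field
    layout is an illustrative assumption."""
    return (
        f"doc_type: {doc_type}\n"
        f"correspondent: {correspondent}\n"
        f"summary: {summary}\n"
        f"storage_path: {storage_path}"
    )
```

Because the path is embedded alongside the content, a RAG hit returns a filing instruction, not just a label.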

The RAG_HIGH Safety Trap

The fourth decision was the most important for data integrity. When a RAG search returns a similarity score of 0.90 or higher, it’s tempting to treat that as a match and copy everything from the goldstandard example. We called this the RAG_HIGH trap.

A similarity score of 0.90 means structural similarity, not content equality. Every invoice from the same telecoms provider is structurally identical. Same layout, same fields, same formatting. The similarity score will be very high. The actual content (amounts, dates, document numbers) will be completely different.

We made a hard rule. Structural fields (doc_type, sphere, correspondent, storage_path_id) can be inherited from the goldstandard. Transactional fields (amounts, dates, document numbers) always go to Mistral. Always. No threshold overrides this. No confidence score is high enough.
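The rule translates into a strict field split that ignores the similarity score entirely. A sketch, with assumed field names:

```python
# Hard split between inheritable and never-inheritable fields.
# Field names are assumptions for illustration.

STRUCTURAL = {"doc_type", "sphere", "correspondent", "storage_path_id"}
TRANSACTIONAL = {"amount", "date", "document_number"}

def merge_fields(goldstandard: dict, mistral_extraction: dict,
                 similarity: float) -> dict:
    """Inherit structural fields from the goldstandard match; always
    take transactional fields from the fresh Mistral extraction.
    No similarity threshold overrides this split."""
    result = {}
    for field in STRUCTURAL:
        if field in goldstandard:
            result[field] = goldstandard[field]
    for field in TRANSACTIONAL:
        # Deliberately ignore the goldstandard value, even at similarity 1.0.
        result[field] = mistral_extraction[field]
    return result
```

Note that `similarity` is accepted but never consulted for the transactional fields; that is the trap, encoded as an invariant.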

Silent errors in financial data leave no stack trace. If BuchhalterPython copies a wrong invoice amount from a goldstandard example, it will look correct in the UI, pass any automated checks, and sit quietly in the accounting system. This rule exists precisely because we won’t necessarily notice when it breaks.

noKI as the First Check, Not the Last

The fifth decision was about privacy. Some documents must never be processed by AI: identity documents, health records, anything sensitive. We implemented a permanent noKI tag for these.

The natural instinct is to check for noKI at the end of the pipeline, as a filter before output. That’s wrong. noKI must be the first check. Before OCR, before RAG, before anything.

The reason is ChromaDB. If a sensitive document gets as far as a RAG lookup, it’s been read and it’s potentially been used as retrieval context for other queries. Even if the system never directly extracts data from it, the content has been ingested. That’s a privacy violation, even if silent.

The guard rail works only if it’s first.
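The ordering can be enforced structurally, so the check cannot drift behind OCR during a refactor. A minimal sketch with assumed function shapes:

```python
def run_pipeline(doc: dict, ocr, classify):
    """Hypothetical pipeline entry. The noKI check runs before any AI
    stage: if it fires, the document is never read by OCR and never
    gets anywhere near RAG or ChromaDB."""
    if "noKI" in doc.get("tags", []):
        return None  # hard stop: no OCR, no RAG, no ingestion
    text = ocr(doc)
    return classify(text)
```

Putting the guard at the single entry point, rather than as a filter in each downstream stage, means there is no code path that touches a noKI document.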

Zero Lines of Code, Five Prevented Failures

By the end of the day, we had written no business logic. We had designed entry points, built a tagging taxonomy, defined a storage path strategy, established rules for RAG confidence thresholds, and wired in a privacy guard rail.

Three of those five decisions prevent silent failures. Two of them protect financial data specifically. None of them would have been obvious mid-implementation, when the pressure to ship code is highest.

This is what I keep returning to when people ask whether agentic AI development is just “writing code faster”. The honest answer is more complicated than that.

AI-assisted development is not fire-and-forget. Every decision documented here required active steering, repeated corrections, and constant pushback — even with all context materials provided up front. The agent did not autonomously surface the RAG_HIGH trap or the noKI ordering problem. I did, through directed questions, and then by correcting answers that were wrong. The agent got things wrong, sometimes repeatedly. I caught it because I knew the domain.

What the agent partnership actually gave us was speed. Not correctness. Not autonomous problem-finding. Speed. The thinking that produced these five decisions would have happened with or without AI assistance — but it would have taken longer to write down, structure, and iterate on each decision.

That’s the correct framing: AI as a fast, tireless collaborator that requires a knowledgeable human actively in the loop. Not a system that catches the errors you would have missed. A system that handles the routine work quickly enough that you have time to do the thinking that actually matters.

A full day of architecture work, no code produced. Still worth it. Just not magic.

Next in this series: Building BuchhalterPython: Provisioning Infrastructure (Part 3) — four LXC containers, a naming convention, and what happens when an agent generates a Vault key for a service that isn’t in the project.