After Part 3 left four LXC containers running and the src/ directories empty, Phase 4 starts writing actual code. The first day of implementation produces two closed issues, one rejected framework, and one bug that had already been caught once.
Part 4 of the Building BuchhalterPython series. Part 3 covers provisioning four LXC containers with OpenTofu and Ansible.
The Bug That Appeared Twice
Issue #12 had already documented a type error in the Pydantic model for the Paperless API: tags: list[str] instead of list[int]. Paperless returns tag IDs as integers. The model had used strings. That bug was fixed, documented in llm/domain-knowledge/01-paperless-api-reference.md, and noted in ERROR_PATTERNS.md.
Issue #19 opened because the same model had the same error.
The agent writing the implementation had not read the domain knowledge file. The file existed. The type was documented on page 1. The agent built the model from intuition: in the context of German accounting software, tags feel like strings (names like “Neu” or “steuerrelevant”), not integers (IDs like 53, 102, 7). The intuition is wrong. The API returns integers.
The TDD-Agent wrote fifteen unit tests with the correct types drawn from the domain knowledge file, including an explicit rejection test:
```python
def test_paperless_api_document_rejects_string_tags():
    with pytest.raises(ValidationError):
        PaperlessApiDocument(
            id=1234, title="Test",
            tags=["Neu", "ReRAG"],  # strings — wrong type
            correspondent=None, created="2026-01-15T10:30:00Z"
        )
```
The Backend-Engineer implemented against the failing tests. Drone Build #8: 100% coverage.
The fix to prevent a third occurrence: the domain knowledge file is now referenced at the top of CLAUDE.md in its own section, with a warning and a concrete example pointing to #12 by number. Concrete past failures are harder to skip than abstract warnings.
Two Models, One Reason
Issue #20 introduced a second Pydantic model: PaperlessDocument, the resolved version of PaperlessApiDocument.
PaperlessApiDocument holds what the API returns: tags: [53, 102, 7], correspondent: 42. Correct for deserialisation. Unusable for everything else.
The LLM classifier needs tag names. ChromaDB needs readable strings for embedding. Monitoring needs human-readable output. So: a second model where integer IDs have been resolved to names.
Two models. No isinstance(tags[0], int) conditionals. No “if it looks like a name, treat it as a name” logic. The type system enforces the distinction.
StrictUndefined in Jinja2
Storage path rendering uses Jinja2 templates. Jinja2’s default behaviour for missing variables: silently substitute an empty string.
For a storage path, that produces conchita/2026//Telekommunikationsanbieter/Rechnung_1234. A document filed to a path with an empty segment. No error. No warning. Discoverable only by manually browsing the filesystem.
The fix is one parameter: Environment(undefined=StrictUndefined). A missing variable becomes an exception immediately, before the document is filed anywhere. Drone Build #9: 100% coverage.
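The difference in one runnable sketch; the template string and variable names are illustrative, with `doc_type` deliberately left out of the render context:

```python
from jinja2 import Environment, StrictUndefined, UndefinedError

# Illustrative storage-path template; variable names are assumptions
TEMPLATE = "{{ owner }}/{{ year }}/{{ doc_type }}/{{ correspondent }}/Rechnung_{{ doc_id }}"
values = dict(owner="conchita", year=2026,
              correspondent="Telekommunikationsanbieter", doc_id=1234)
# note: doc_type is missing from values

# Default environment: the missing variable renders as "" -- a silent empty segment
lenient = Environment().from_string(TEMPLATE)
path = lenient.render(**values)
print(path)  # conchita/2026//Telekommunikationsanbieter/Rechnung_1234

# StrictUndefined: the same missing variable raises before anything is filed
strict = Environment(undefined=StrictUndefined).from_string(TEMPLATE)
try:
    strict.render(**values)
except UndefinedError as exc:
    print(f"blocked: {exc}")  # 'doc_type' is undefined
```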
Why We Rejected LangChain
This decision gets more space because it is the most important one made this session.
LangChain is the default recommendation for LLM applications in Python. Tutorials start with it. The appeal is real: a RetrievalQA chain with ChromaDB and Mistral configured in ten lines looks clean. The problems only appear when you try to test it, debug it, or upgrade it.
Chains are black boxes. With a direct Mistral SDK call, the exact prompt sent to the API is in the code. With LangChain, it is assembled internally through PromptTemplates and ChatPromptTemplates. When the LLM misclassifies a document, debugging means reading framework source code, not your own.
Unit tests become impossible. The TDD approach in this project requires test-first development with clear mock boundaries. A LangChain chain has no natural seam for mocking without either mocking the entire LLM object (which stops testing the prompt) or making real API calls (which are integration tests, not unit tests).
Direct code has clear interfaces:
```python
def classify_document(text: str, rag_context: str, llm_client: MistralClient) -> DocumentMetadata:
    prompt = build_classifier_prompt(text, rag_context)
    response = llm_client.chat(prompt)  # mockable
    return parse_metadata(response)  # testable in isolation
```
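To make the seam concrete, here is a self-contained sketch with hypothetical stand-ins for `build_classifier_prompt`, `parse_metadata`, and the client; only the shapes matter, none of this is the project's actual code:

```python
from dataclasses import dataclass


@dataclass
class DocumentMetadata:
    doc_type: str


def build_classifier_prompt(text: str, rag_context: str) -> str:
    # Illustrative: the exact prompt is assembled in plain sight
    return f"Context:\n{rag_context}\n\nClassify this document:\n{text}"


def parse_metadata(response: str) -> DocumentMetadata:
    return DocumentMetadata(doc_type=response.strip())


def classify_document(text, rag_context, llm_client) -> DocumentMetadata:
    return parse_metadata(llm_client.chat(build_classifier_prompt(text, rag_context)))


class FakeLLM:
    """Replaces the real client at the seam: records the prompt, returns a canned answer."""
    def __init__(self):
        self.seen_prompt = None

    def chat(self, prompt: str) -> str:
        self.seen_prompt = prompt
        return "Rechnung"


fake = FakeLLM()
meta = classify_document("Telekom invoice text", "prior invoices", fake)
assert meta.doc_type == "Rechnung"                  # parsing tested, no API call
assert "Telekom invoice text" in fake.seen_prompt  # the exact prompt is assertable
```

The fake client slots in where `MistralClient` would, so the prompt assembly and the response parsing are both exercised as real unit tests.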
LangChain breaks its own APIs regularly: the migration from langchain to langchain-community to langchain-core, RetrievalQA deprecated in favour of LCEL, classes moving between packages across minor versions. Every upgrade is a re-test.
The n8n analogy. We left n8n because visual abstractions do not scale: 80+ nodes, no unit tests, no meaningful git diffs. LangChain is the same problem in Python form. Chain objects instead of visual nodes, .pipe() and .invoke() instead of n8n connections. The underlying issue is identical: abstractions that hide complexity make systems harder to understand, test, and debug.
Instead, direct calls:
```python
mistral = Mistral(api_key=settings.MISTRAL_API_KEY)
chroma = chromadb.HttpClient(host=settings.CHROMADB_HOST, port=settings.CHROMADB_PORT)
```
More lines. Every line readable. Every line testable. No framework between the code and the API.
LangSmith (observability for LangChain pipelines) was also evaluated and rejected: it is a SaaS cloud service, and BuchhalterPython runs fully self-hosted. Structured JSON logging with trace IDs covers the same ground without data leaving the homelab.
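A minimal sketch of that approach using only the standard library; the formatter, logger name, and field names are illustrative, not the project's actual logging setup:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Minimal structured formatter: one JSON object per log line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


logger = logging.getLogger("buchhalter")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One trace ID per document, attached to every pipeline step,
# so a document's whole journey is grep-able from the logs.
trace_id = uuid.uuid4().hex
logger.info("classification started", extra={"trace_id": trace_id})
```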
Parallel Agents and Permissions
Today’s session tested running a Wiki-Writer agent and a TDD-Agent in parallel: one updating documentation, one implementing code. The idea works. The first attempt did not.
The TDD-Agent could not run bash commands. Tests would not start. Git pushes failed.
The cause: settings.json had granular bash permissions that inadvertently excluded subagents. The fix was "Bash(*)" as the permitted action, with an explicit deny list for destructive operations. Broad permission with named exceptions, not narrow permission with gaps that emerge under real workloads.
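The shape of the resulting configuration, sketched from Claude Code's `permissions` allow/deny scheme; the deny entries here are illustrative examples, not the project's actual list:

```json
{
  "permissions": {
    "allow": [
      "Bash(*)"
    ],
    "deny": [
      "Bash(rm -rf *)",
      "Bash(git push --force*)"
    ]
  }
}
```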
As the Lernreise series noted: AI is not fire-and-forget. The same applies to the configuration around it. Narrow configurations produce debugging sessions at the worst moments.
One new process established: before any issue is closed, a Drone build URL must appear in the issue comment. Local green is not sufficient. Drone runs in a clean environment with no local venv, no cached artefacts. What passes locally can fail in CI. The URL makes the result reproducible and verifiable.
End State
Issue #19: PaperlessApiDocument ✅ Drone Build #8
Issue #20: StoragePath + Resolved ✅ Drone Build #9
settings.json: Bash(*) + deny list ✅
CLAUDE.md: domain knowledge warning at top ✅
Wiki: Testing-Strategy.md + Decision-Log.md ✅
The agent wrote the spec that documented the type rules. The agent ignored the spec when implementing the model. The information was in the context window. I caught it on review. The fix was not better AI — it was making the domain knowledge file impossible to overlook by putting it first, with a concrete past failure as the example.
Related: What AI Actually Can (and Cannot) Do covers the structural limits of AI assistance in more depth.