May 11, 2026

Connecting Finance Agents to Accounting Systems with Codemode

AI finance agents need to keep up with real bookkeeping volume, not just single receipts.
The wrong integration shape can turn month-end close into a two-hour wait.
Codemode gave us the shape that worked: 197 documents booked and audited in about 20 minutes.

Anthropic recently released agents for financial services: templates for work like general-ledger reconciliation, month-end close, statement review, KYC, research, and pitchbook preparation. The announcement packages each template around skills, connectors, and subagents. Skills give the agent task knowledge, connectors give it governed access to live data, and MCP apps can put a provider’s own tools inside Claude.

For accounting work, that connector is not optional. A month-end closer or general-ledger reconciler needs the accounting or ERP system, not just a folder of PDFs and spreadsheets. It has to read invoices, vouchers, suppliers, accounts, financial years, attachments, payment status, and audit trails. It also has to write back journal entries, voucher rows, file links, and close artifacts.

Fortnox is the example throughout this post, but the API shape is common across accounting and ERP systems: hundreds of endpoints, paginated lists, file attachments, and multi-step writes. The same design choices transfer.

To work effectively against an accounting system, an agent has to scale beyond single-record interactions. Even a single receipt is rarely one API call: posting the voucher, uploading the PDF, and connecting the file to the voucher are three distinct Fortnox operations. A naive MCP server—one tool per operation—turns each of those into its own model round-trip.

In practice, that means: list a page, fetch one record, post a voucher, refetch to confirm, and move on. The conversation fills with paginated JSON the model never needed to remember. At 135 receipts and three operations each, that is already 405 round-trips before any audit, reconciliation, or correction work. Every one of those round-trips is wall-clock time the user has to sit through. The token bill matters, but the cost you feel is the person waiting for the next answer.

Codemode is what made this workload finish in about 20 minutes and 100K tokens, instead of an estimated 2 hours and 700K with the naive shape. The rest of this post is how.

I tested this layer with Fortnox by creating a test account and asking Claude Cowork to set up the books for it. The input was 197 documents: receipts, invoices, card statements, and bank statements spanning January 2025 through May 2026.

I started with the hardened Fortnox API clients we use at Quantledger and placed them behind an MCP server that uses codemode. Quantledger has been running generative AI against Fortnox, Open Banking, and Nordic banks in its own pipelines for a while. What changed here was doing the same work from inside a chat agent like Claude Cowork.

The generated client covers 358 operation methods across 74 endpoint groups. That is 358 unique HTTP method/path pairs over 218 URL paths, with verbs like GET, POST, PUT, and DELETE on the same path counted separately. The Fortnox API has more operations than this; the 358 are what we currently generate from the public OpenAPI spec.

A common way to expose an API of this size through MCP would be one tool per operation: create_voucher, get_voucher, list_supplier_invoices, and so on. I built mine differently. For raw Fortnox API access, the MCP server exposes one execution tool, fortnox_execute. You give it a JavaScript function, and that function runs server-side with a typed proxy into those same 358 operation methods. A separate fortnox_search tool returns endpoint signatures on demand when the model needs them.

The server is built on @cloudflare/codemode, Cloudflare’s implementation of the pattern: the model writes code against a TypeScript-shaped API, and the code runs in a sandboxed Dynamic Worker. The Fortnox layer puts less into context than the full generated API would. The whole API stays reachable through a typed proxy, but endpoint signatures are loaded only when the model asks for them.
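To make the "typed proxy" idea concrete, here is a minimal sketch of how a single client object can route fx.<group>.<method>(args) into a table of operations. It is an illustration, not the generated Quantledger client or the @cloudflare/codemode API: the makeClient helper, the operation table, the get_root method name, and the fake transport are all assumptions made for the example. Only post_root appears in the routine shown later in this post.

```javascript
// Hypothetical operation table. A real one would be generated from the
// OpenAPI spec; these three entries are illustrative only.
const OPERATIONS = {
  "vouchers.post_root": { method: "POST", path: "/3/vouchers" },
  "vouchers.get_root": { method: "GET", path: "/3/vouchers" },
  "supplierinvoices.get_root": { method: "GET", path: "/3/supplierinvoices" },
};

// Two-level Proxy: the first property access picks the endpoint group,
// the second picks the operation and returns a callable.
function makeClient(operations, transport) {
  return new Proxy({}, {
    get(_target, group) {
      return new Proxy({}, {
        get(_inner, name) {
          const key = `${String(group)}.${String(name)}`;
          const op = operations[key];
          if (!op) {
            throw new Error(`Unknown operation: ${key}`);
          }
          return (args = {}) => transport(op, args);
        },
      });
    },
  });
}

// Fake transport that records calls instead of doing HTTP.
// A real transport would be async and would add auth, retries, and limits.
const calls = [];
const fx = makeClient(OPERATIONS, (op, args) => {
  calls.push({ ...op, args });
  return { status: 200 };
});

fx.vouchers.post_root({ financialyear: 1, data: { Voucher: {} } });
console.log(calls.length, calls[0].method, calls[0].path); // 1 POST /3/vouchers
```

The point of the shape is that the model only needs to know the names it uses. The table can hold hundreds of entries without any of them appearing in the tool definitions.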

With codemode, the model writes a short program once. The program runs on the server. Pagination, detail fetches, retries, projections, and per-row error handling stay inside the routine. The model gets one aggregated result back at the end.

Most of the work in this run was writing data: posting vouchers from receipts, recording supplier invoices, and attaching PDFs. A typical write routine looked like this:

async (ctx) => {
  // Typed proxy into the Fortnox operation methods for this connection.
  const fx = ctx.fortnox("conn_...");
  const RECEIPTS = [ /* ~27 receipt payloads parsed earlier in the session */ ];

  const results = [];
  for (const r of RECEIPTS) {
    try {
      // One voucher per receipt: net expense and input VAT as debits,
      // the receipt total as a single credit.
      const resp = await fx.vouchers.post_root({
        financialyear: r.fy,
        data: {
          Voucher: {
            VoucherSeries: "A",
            TransactionDate: r.date,
            Description: r.description,
            VoucherRows: [
              { Account: r.expenseAccount, Debit: r.net, TransactionInformation: r.description },
              { Account: 2641, Debit: r.vat, TransactionInformation: "Ingående moms" },
              { Account: 2890, Credit: r.total, TransactionInformation: r.paymentLabel },
            ],
          },
        },
      });
      results.push({ ok: true, stem: r.stem, vk: `A-${resp.data.Voucher.VoucherNumber}/${r.fy}` });
    } catch (e) {
      // Record the failure and continue, so one bad receipt does not stop the batch.
      results.push({ ok: false, stem: r.stem, err: String(e).slice(0, 200) });
    }
  }
  // One aggregated result: posted count, failed rows, and every per-row outcome.
  return { posted: results.filter(x => x.ok).length, failed: results.filter(x => !x.ok), results };
}

Per-row failures were collected and returned in the same result, so a single bad receipt did not force another model round-trip.

PDF attachments needed a slightly different path. The files lived on the agent’s local disk, where the routine could not reach them. Instead, ctx.prepareUpload returned about 180 short-lived signed URLs in a single routine call, each pointing at a /bypass/upload/<jwt> route on the Worker. A shell xargs | curl pipeline then PUT the PDF bytes to those URLs in parallel, straight from local disk through the Worker into Fortnox. The bytes never touched the codemode sandbox or the model context.
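The run itself used xargs and curl for this step. For readers who would rather stay in JavaScript, the same fan-out can be sketched as a small worker pool that PUTs files to pre-signed URLs with bounded parallelism. This is a swapped-in technique, not the pipeline from the run: uploadAll, the job shape, and the injected putFile function are hypothetical, and the URLs below are placeholders.

```javascript
// Hypothetical sketch: run putFile(job) for every job, at most
// `concurrency` at a time, and collect a per-file outcome.
async function uploadAll(jobs, putFile, concurrency = 8) {
  const results = new Array(jobs.length);
  let next = 0;

  async function worker() {
    while (next < jobs.length) {
      const i = next++; // no await between the check and the increment
      const job = jobs[i];
      try {
        await putFile(job);
        results[i] = { ok: true, path: job.path };
      } catch (e) {
        results[i] = { ok: false, path: job.path, err: String(e).slice(0, 200) };
      }
    }
  }

  const workers = Array.from({ length: Math.min(concurrency, jobs.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Demo with a fake putFile. In real use it might read job.path from disk
// and send the bytes to job.url with an HTTP PUT.
const demoJobs = [
  { path: "receipt-001.pdf", url: "https://example.invalid/upload/one" },
  { path: "receipt-002.pdf", url: "https://example.invalid/upload/two" },
];
const fakePut = async () => {};

uploadAll(demoJobs, fakePut, 2).then((out) => {
  console.log(out.filter((r) => r.ok).length, "of", out.length, "uploaded");
});
```

Either way, the design choice is the same: the bytes travel from disk to the upload route without passing through the model context.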

A second small routine connected the uploaded files to their vouchers. Supplier-invoice PDFs had to land in the inbox_s archive folder first, or they would attach to the underlying voucher but never appear in the supplier-invoice UI. That nuance is not in the OpenAPI spec.

For the receipt side of the workload, the round-trip math looked like this:

Posting 135 receipt vouchers in 5 chunks: 5 model turns, 135 voucher creates.
Minting 135 PDF upload URLs: 1 model turn, 135 prepareUpload calls.
xargs | curl uploads in bash, outside MCP: 0 model turns, 135 HTTP PUTs.
Connecting files to vouchers: 1 model turn, 135 file-connect calls.

That is seven model turns for 405 Fortnox API calls plus 135 binary uploads. A naive endpoint-per-tool MCP server would have needed 405 model round-trips for the same three operations per receipt.
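The arithmetic above can be checked in a few lines. The figures are the ones from this post; the chunk helper is a hypothetical illustration of how 135 receipts split into 5 batches of 27.

```javascript
// Split a list into consecutive batches of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const receipts = 135;
const batches = chunk(Array.from({ length: receipts }, (_, i) => i), 27);

// Model turns per step in the codemode run.
const turns = {
  postVouchers: batches.length, // one routine per batch
  mintUploadUrls: 1,
  curlUploads: 0, // runs in the shell, outside MCP
  connectFiles: 1,
};
const modelTurns = Object.values(turns).reduce((a, b) => a + b, 0); // 7

// Voucher creates, prepareUpload calls, and file-connect calls.
const apiCalls = receipts * 3; // 405

// Endpoint-per-tool: one model round-trip per operation per receipt.
const naiveTurns = receipts * 3; // 405

console.log({ batches: batches.length, modelTurns, apiCalls, naiveTurns });
```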

Audit routines used the same shape on the read side: paginate, fetch detail per row, project to compact lines, return. The series-A FY1 voucher audit made about 118 Fortnox calls in one model turn; the supplier-invoice audit made 29. A raw voucher detail response was often around 1,000-1,500 tokens, but the pipe-delimited line returned to the model was closer to 80-120. The model kept full read access while the intermediate JSON stayed out of context.
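A read-side routine of that shape might look roughly like the sketch below. It runs against a mock client so it can be tried standalone. The listPage and getDetail methods and the response fields are hypothetical stand-ins, not the Fortnox response format or the generated client's method names.

```javascript
// Paginate, fetch detail per row, project each voucher to one compact line.
async function auditVouchers(client) {
  const lines = [];
  let calls = 0;
  let page = 1;
  let totalPages = 1;

  do {
    const res = await client.listPage(page);
    calls++;
    totalPages = res.totalPages;
    for (const row of res.rows) {
      const v = await client.getDetail(row.id);
      calls++;
      const debit = v.rows.reduce((sum, r) => sum + (r.debit || 0), 0);
      const credit = v.rows.reduce((sum, r) => sum + (r.credit || 0), 0);
      const flag = Math.abs(debit - credit) < 0.005 ? "OK" : "UNBALANCED";
      lines.push([v.id, v.date, v.description, debit.toFixed(2), credit.toFixed(2), flag].join("|"));
    }
    page++;
  } while (page <= totalPages);

  // Only the compact lines and a call count go back to the model.
  return { calls, lines };
}

// Mock data: two list pages, three vouchers, one of them unbalanced.
const PAGES = [[{ id: "A-1" }, { id: "A-2" }], [{ id: "A-3" }]];
const DETAILS = {
  "A-1": { id: "A-1", date: "2025-01-05", description: "Office supplies", rows: [{ debit: 100 }, { debit: 25 }, { credit: 125 }] },
  "A-2": { id: "A-2", date: "2025-01-09", description: "Software", rows: [{ debit: 200 }, { credit: 200 }] },
  "A-3": { id: "A-3", date: "2025-02-01", description: "Travel", rows: [{ debit: 80 }, { credit: 70 }] },
};
const mockClient = {
  async listPage(page) { return { rows: PAGES[page - 1], totalPages: PAGES.length }; },
  async getDetail(id) { return DETAILS[id]; },
};

auditVouchers(mockClient).then((r) => console.log(r.calls, r.lines));
```

A real routine would also need to escape or strip the delimiter from free-text fields such as the description; the sketch skips that.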

The cost difference came from two places.

The first was tool-definition overhead. The two codemode tools serialize to under 4KB of minified tool text. An endpoint-per-tool version with method/path, summary, Fortnox description, and callable signature serializes to about 237KB. Even a stripped version with only method/path plus signature is about 117KB.

The second was intermediate response volume. In the endpoint-per-tool design, the model sees raw responses for each call unless the server adds custom projection to every endpoint. In the codemode design, the projection happens inside the routine. The model can ask for the exact shape it needs: pipe-delimited audit lines, grouped totals, failed writes, missing attachments, duplicate voucher candidates, or account-level summaries.
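As one example of a projection the model might ask for, the sketch below groups vouchers by date and amount and returns only the groups that look like duplicates. The input shape (id, date, total) and the matching rule are assumptions chosen for illustration; a real check would probably also compare supplier or description.

```javascript
// Return one pipe-delimited line per group of vouchers that share
// the same date and total: "date|total|id,id,...".
function duplicateCandidates(vouchers) {
  const groups = new Map();
  for (const v of vouchers) {
    const key = `${v.date}|${v.total.toFixed(2)}`;
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(v.id);
  }
  return [...groups.entries()]
    .filter(([, ids]) => ids.length > 1)
    .map(([key, ids]) => `${key}|${ids.join(",")}`);
}

const sample = [
  { id: "A-1", date: "2025-01-05", total: 125 },
  { id: "A-2", date: "2025-01-05", total: 125 },
  { id: "A-3", date: "2025-01-06", total: 80 },
];
console.log(duplicateCandidates(sample)); // [ '2025-01-05|125.00|A-1,A-2' ]
```

However many vouchers go in, the model sees only the suspicious groups.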

For this 197-document workload, the observed and estimated numbers were:

Tool definitions in context: under 4KB with codemode, about 117-237KB with endpoint-per-tool.
Full session token usage: about 100K vs about 700K.
Wall clock: 15-20 minutes vs about 2 hours.
Fortnox operation coverage: same 358 methods in both designs.
Model round-trips for repetitive API work: one per routine vs one per API call.

The endpoint-per-tool estimate is based on expressing the same Fortnox actions as sequential tool calls. The codemode numbers come from the actual run.

For one-off questions, both designs can work well. A standalone get_supplier_invoice tool is simple and readable. Codemode becomes useful when the task has loops: bookkeeping backfills, customer migrations, close checks, inventory reconciliation, attachment matching, and audits across many records.

There are also costs to this design. The server has to treat model-written code as untrusted code. That means sandboxing, execution limits, scoped connections, audit logs, permission checks, and predictable error handling. Write access needs extra guardrails: dry-run modes, idempotency, approval gates for destructive operations, and row-level reporting when a batch partially fails.
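Two of those write guardrails, dry-run mode and idempotency, can be sketched as a thin wrapper around whatever function performs the write. Everything here is hypothetical: guardedWriter, the key format, and the status strings are invented for the example, and the in-memory key set only protects a single routine execution. A production version would need to persist keys across runs.

```javascript
// Wrap a write function with dry-run, duplicate-key skipping,
// and row-level reporting.
function guardedWriter(write, { dryRun = false } = {}) {
  const seen = new Set();
  const report = [];

  async function post(key, payload) {
    if (seen.has(key)) {
      report.push({ key, status: "skipped-duplicate" });
      return;
    }
    seen.add(key);
    if (dryRun) {
      report.push({ key, status: "dry-run" });
      return;
    }
    try {
      await write(payload);
      report.push({ key, status: "posted" });
    } catch (e) {
      seen.delete(key); // allow a later retry of a failed row
      report.push({ key, status: "failed", err: String(e).slice(0, 200) });
    }
  }

  return { post, report };
}

// Demo: the second call reuses a key and is skipped; nothing is written
// because dryRun is on.
const written = [];
const writer = guardedWriter(async (p) => { written.push(p); }, { dryRun: true });

(async () => {
  await writer.post("receipt-001", { total: 125 });
  await writer.post("receipt-001", { total: 125 });
  console.log(writer.report.map((r) => r.status), written.length);
})();
```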

In return, 358 Fortnox operation methods remain available without becoming 358 MCP tools in the model context. The agent can still call the accounting API, but the repetitive parts run as code close to the data.

For the workload in this run, that design choice was the difference between a twenty-minute session and a two-hour one.