LLM Automation: From Demo to Dependable

Autonomous LLM agents work, and the distance between a working demo and a dependable enterprise process is the entire job. Where the model actually fits, and what reliable autonomy costs to engineer.

Jacek Pietsch

Principal Solution Architect

I have worked with LLMs since nearly the beginning, with every major model, coding in projects small and large. From that vantage point the next step looked obvious: hand an agent all the tools (inbox, calendar, browser, card), describe the outcome, and let it run without me in the loop. So I built it. Not the engine: I built my own personal autonomous agent on top of Hermes, an existing agent framework. It lives on a server, takes instructions over Telegram, runs scheduled tasks on its own, and browses the web like a person. One week with it can be summed up in a single number. The agent was told to find one cheap product, add it to the cart, and pay. It did, and it spent about thirteen dollars to do it.

This is not an article about AI failing. AI works, and it works impressively. It is about what LLM automation actually is once you step out of the chat window into a system meant to run itself, and why the distance between a smooth demo and a dependable product is the entire job.

The chat window hides who is doing the work

In a chat the model feels precise: you ask, it answers; you correct, it adapts. The conclusion writes itself: if it understands this well, I just describe the goal and it handles the means.

The catch: in a chat, you are the control loop. You read every answer, judge it, ask for the fix, decide when to stop. Hundreds of micro-corrections made on reflex and never counted. Build an autonomous agent and you remove yourself from that loop. What looked like understanding turns out to be the impression of understanding, and it holds only while you are watching its hands.

An LLM is not a precise employee. It is an employee handed a two-sentence brief, who took it seriously.

Hand it thirty sentences instead and you get thirty interpretations, sometimes each made independently of the others. "Understanding," as an LLM performs it, is not predictability.

In a working agent, the model is roughly a tenth of the system

If the model cannot be taken at its word, something deterministic has to keep it honest. In a working agent the model is maybe ten percent of the system; the rest is scaffolding that checks, retries, throttles, validates, guards credentials, and decides when to cut the model off. That scaffolding is the product. Look at where the model actually sits in three of my flows:

Email triage is mostly Python. Fetching the mailbox, parsing, deduping, applying the unambiguous rules, logging, retrying, and the schedule that keeps it alive are all deterministic code. The LLM is invoked only for the ambiguous minority, and each such message in a fresh, isolated context, so cost and behavior stay bounded. The "AI inbox assistant" is a small reasoning call fenced inside a pipeline that is overwhelmingly plumbing.
Capacity management runs as a constraint solver, not a chat. The deterministic layer owns the source of truth (calendar, project tracker, task list) and hands the model a structured snapshot of the day’s state: open slots, deadlines, dependencies, load. The LLM gets one reasoning call: propose how to rank, batch, and place the work. The response comes back as a structured plan, validated against the same constraints (does the proposal fit, does it respect existing holds, does it stay inside working hours), and only then touches the calendar or the task store. The “AI planner” is a single fenced reasoning step inside a system that already knows the rules.
Browser navigation runs on trained paths, not improvisation. For every site the agent operates against, the system stores a recorded path (the selectors and waits that get from landing page to outcome) and replays it deterministically. The LLM is invoked only when a path breaks: a selector goes missing, a layout shifts, a new modal appears. A diagnostic tool chain compares the current DOM against the stored path, hands the model the minimal slice that has changed, and asks for a localised fix. The model sees a few kilobytes, not the full transcript of a fresh exploration. Token burn drops by an order of magnitude per run, and the system stays resilient to the small redesigns that would otherwise break it every week.
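
The email triage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the system described: the Email type, the two rules in apply_rules, and the classify_with_llm callable are all hypothetical stand-ins. What it shows is the shape: deterministic rules run first, and the model is called only for messages no rule matches, each in a prompt that contains that one message and nothing else.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Email:
    sender: str
    subject: str
    body: str


def apply_rules(email: Email) -> Optional[str]:
    """Deterministic rules for unambiguous mail (hypothetical examples)."""
    if email.sender.lower().startswith("noreply@"):
        return "newsletter"
    if "unsubscribe" in email.body.lower():
        return "newsletter"
    if "invoice" in email.subject.lower():
        return "finance"
    return None  # no rule matched: the message is ambiguous


def triage(emails: list[Email],
           classify_with_llm: Callable[[str], str]) -> list[str]:
    """Return one label per email, in order; call the model only as a fallback."""
    labels = []
    for email in emails:
        label = apply_rules(email)
        if label is None:
            # Fresh, isolated context: the prompt holds this one message only,
            # never the history of earlier messages or earlier model calls.
            prompt = f"Subject: {email.subject}\n\n{email.body}"
            label = classify_with_llm(prompt)
        labels.append(label)
    return labels
```

Passing the model in as a plain callable keeps the pipeline testable with a stub, and makes the number of model calls per batch something the surrounding code can count and cap.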

The pattern repeats: the model is a small, fenced reasoning component; the deterministic code around it wraps it, gates it, blinds it, and decides when to stop it. The model's intelligence is not the system's reliability. Reliability is earned one operation at a time, in exactly the code no one puts in a demo.
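
What "gating" means in practice can be shown with the capacity-management flow. The sketch below is an assumption-laden illustration, not the system's actual validator: the Slot type, the 09:00 to 17:00 working day, and the function names are hypothetical. It takes a plan as the model might propose it and returns a list of violations; the calendar would be touched only when that list is empty.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Slot:
    start: datetime
    end: datetime


def overlaps(a: Slot, b: Slot) -> bool:
    """True when the two half-open intervals [start, end) intersect."""
    return a.start < b.end and b.start < a.end


def validate_plan(proposed: list[Slot], holds: list[Slot],
                  day_start: time = time(9, 0),
                  day_end: time = time(17, 0)) -> list[str]:
    """Return the violations found; an empty list means the plan may be applied."""
    errors = []
    for i, slot in enumerate(proposed):
        if slot.end <= slot.start:
            errors.append(f"slot {i}: end is not after start")
            continue
        same_day = slot.start.date() == slot.end.date()
        in_hours = slot.start.time() >= day_start and slot.end.time() <= day_end
        if not (same_day and in_hours):
            errors.append(f"slot {i}: outside working hours")
        if any(overlaps(slot, hold) for hold in holds):
            errors.append(f"slot {i}: collides with an existing hold")
        if any(overlaps(slot, earlier) for earlier in proposed[:i]):
            errors.append(f"slot {i}: overlaps an earlier proposed slot")
    return errors
```

The validator never asks the model whether its plan is valid. It checks the proposal against the same source of truth the deterministic layer already owns, so a confident but wrong answer is rejected like any other bad input.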

Every flow is its own project

At first glance, LLM agents look generic: the same model, the same prompt scaffold, pointed in turn at email, a calendar, a checkout. The versatility is real, and it is what makes the demos compelling. In production, the impression thins. Email triage runs reliably and cheaply only once the deterministic plumbing around the model is tuned to its specific volume, rules, and edge cases. A purchase at one store needs its own selectors and recovery paths. The same purchase at a different store is fresh work: different layout, different anti-bot defenses, different checkout. The agent does not “learn stores” the way a person does after one trip; each scenario gets its own pass of deterministic optimisation before the business case lands.

An untuned agent will usually find a way to complete the task. That is the seductive part of the technology, and it is also the trap. The path the agent finds by trial and error is expensive, slow, and unpredictable, because every step re-feeds the full accumulated history and every retry is billed on top of the attempts before it. A single cheap purchase reached thirteen dollars in inference fees on one of our test runs because the store’s defenses forced retry after retry, and the order that eventually went through was billed against the whole transcript. The same flow can be made cheap, fast, and predictable, but only by closing the LLM’s degrees of freedom with deterministic optimisation: trimming context, locking the toolset, caching recovery paths, gating retries. Until that work is done, technically working and economically deployable are not the same thing.
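
The arithmetic behind that bill is worth making explicit. If every step re-sends the whole history, a run of n steps that starts from b tokens and grows by g tokens per step sends n*b + g*n*(n-1)/2 input tokens in total, so cost grows with the square of the step count. The sketch below simply adds this up. All numbers in it are illustrative placeholders, not measurements from the run described above, and it ignores output tokens and any provider-side prompt caching.

```python
def total_input_tokens(steps: int, base_tokens: int,
                       tokens_added_per_step: int) -> int:
    """Sum the input tokens sent when each step re-sends the full history."""
    total = 0
    history = base_tokens
    for _ in range(steps):
        total += history                  # this step is billed for everything so far
        history += tokens_added_per_step  # its result joins the history
    return total


def run_cost_usd(steps: int, base_tokens: int, tokens_added_per_step: int,
                 usd_per_million_input_tokens: float) -> float:
    """Input-side cost of one run at a given (hypothetical) token price."""
    tokens = total_input_tokens(steps, base_tokens, tokens_added_per_step)
    return tokens * usd_per_million_input_tokens / 1_000_000


# Illustrative only: a 2,000-token start, 3,000 tokens added per step.
print(total_input_tokens(10, 2_000, 3_000))  # 155000
print(total_input_tokens(40, 2_000, 3_000))  # 2420000
```

With these made-up figures, four times as many steps means more than fifteen times as many billed input tokens. That is why a store that forces retry after retry is so expensive, and why trimming context and gating retries pay off disproportionately.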

That cost can be driven down by trimming re-sent history, narrowing the toolset, or pinning one model provider. None of these is a switch; each is its own piece of deep engineering, stacked run after run. And even on the cheapest models, even fully optimized, the result lands right at the line where automating the job barely beats doing it by hand.

"The agent will do your shopping for you" meant, in practice: it can spend thirteen dollars doing one small piece of your shopping.

Business takeaway: before counting what automation saves, count two things it costs: the engineering required to make the per-run price bearable at all, and the price of every attempt that fights back.

The tooling is early

This is not one homemade system's failing. The whole class (terminal-launched agents, autonomy frameworks) is early and unoptimized. A wave of agents with real graphical interfaces is arriving now, from research labs and enterprise vendors alike. That so many serious players are starting at once does not weaken the immaturity claim; it confirms it. The advice is unromantic: experiment, by all means, but do not build a money-critical process on it yet.

Where AI is genuinely excellent

To be fair to the end: this technology is superb where there is a hard truth to anchor to.

Programming. It is deterministic. The keyword "class" always means the same thing; syntax cannot be read thirty ways. It is no accident that an AI assistant led most of the diagnosis and fixes that same week, and did it well.
The "text calculator." Defined input, defined notion of the output: translate, summarize, classify, extract signal from mess. Transformation and extraction are its strength.

It is weak where it must create from scratch, or infer what you actually meant when you never fully said it. That frontier, between transforming what you were given and inventing what you did not say, is where the real front line of today's automation runs, and it deserves its own piece.
