How We Refined Our Agents' Best Ideas
We pointed Claude Code at a URL and said "recreate this website, make no mistakes". It did not work. So, we spent the next month iterating on our own algorithm. The concept is simple: the Flint agent visits your websites, clones your brand, and then builds new landing pages for you. Passing the design bar for real users is not simple.
Engineering our agent for the long tail of websites taught us how blurred the lines have become between workflows, agents, and their tools. Our algorithm flip-flopped between semi-deterministic workflows and an agent quite a few times. A blend of monitoring the agent's behavior and applying human intuition landed us at a happy medium.
Starting From The Extremes
Besides pointing an agent at the problem, the simplest approach would be to download the page's rendered HTML and CSS and naively re-serve it from our application. While technically on brand, one big HTML and CSS file is poor context for an agent that wants to build new pages, and all JavaScript interactions are lost.
Out of the box, Claude could produce an extensible code structure, but unsurprisingly, the visual output was not very good. We ended up with a purple shadcn-ified site that matched the original layout if you squinted hard enough.
We cycled between the deterministic and agentic bounds of our problem. Our agent's behavior gave us insight into which context provided the most signal. Then we built workflows to distill that context into a higher-signal-per-token format and passed it back to the next version of our agent.
After each iteration of this cycle, the agent started at a higher floor and could save its budget for addressing the problems that fell through the cracks. Adapting the learnings from context pre-processing into skills and tools then helped the agent more effectively handle the edge cases. For example, we often saw Claude write scripts to brute force search CSS files. We monitored what patterns were most useful and wrote a workflow to optimize common searches.
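As a concrete illustration of hardening that search pattern, a workflow step can pre-build a small index the agent queries instead of rewriting grep scripts every run. This is a minimal sketch, not our production tooling; `index_css` and its regex are hypothetical simplifications that ignore nested at-rules and media queries.

```python
import re
from collections import defaultdict

def index_css(css: str) -> dict[str, list[str]]:
    """Index declaration blocks by selector so common agent queries
    ("what styles apply to .nav?") become a dict lookup instead of a
    fresh brute-force search script on each run."""
    index = defaultdict(list)
    for match in re.finditer(r"([^{}]+)\{([^{}]*)\}", css):
        selectors, body = match.group(1), match.group(2)
        for selector in selectors.split(","):
            index[selector.strip()].append(body.strip())
    return dict(index)

index = index_css(".nav { color: red } .nav, .footer { margin: 0 }")
# index[".nav"] -> ["color: red", "margin: 0"]
```

The point is not the regex but the shape of the move: a behavior the agent kept re-deriving becomes a deterministic step whose output is handed to it for free.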
Standard advice is "start with a workflow until you need an agent", but if the bounds are not well defined, you can flip this pattern upside down and let the agent do the discovery.
The Distillation Loop
When you do not know what your workflow should look like or if the model can even do the task, start by letting the agent try. Then experiment across a few prompt and tool variations. Evaluate what behavior is most effective and iterate into a harness with supporting workflows from there.
1. Let the agent attempt the whole task end-to-end
- Instrument everything: tool calls, context usage, failures
- Run on diverse inputs - enough to see patterns, few enough to review by hand
2. Watch what it does, then extract the boring parts
- Find the repeated, stable sequences (it always fetches X, then builds Y from it)
- Harden those into deterministic workflow steps
- What's left is the hard part - that stays with the agent
3. Re-run, but start the agent at the hard part
- Execute the extracted steps first
- Feed their output as high-signal context to the agent
4. Watch the agent again
- It develops new patterns on the harder problem
- Extract the stable ones into workflow steps (same as step 2)
5. Build tools for what's left
- Some patterns are useful but not stable enough to hardcode
- Give the agent tools that handle these - structured access to context it was manually gathering, or capabilities it was hacking around
- Prepare context for tool use in your workflow steps
6. Repeat from 3.
The workflow expands around the agent, pushing it toward increasingly creative work. Each cycle, the agent starts with a higher floor.
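The core of steps 2 and 3, running the hardened workflow first and feeding its output to the agent, can be sketched in a few lines. This is a hedged illustration, assuming hypothetical `Step` and `agent` callables rather than our actual harness:

```python
from typing import Callable

# A hardened workflow step: a plain function that distills raw input
# into higher-signal context for the agent.
Step = Callable[[dict], dict]

def run_pipeline(raw: dict, steps: list[Step],
                 agent: Callable[[dict], str]) -> str:
    context = dict(raw)
    for step in steps:       # deterministic workflow runs first
        context.update(step(context))
    return agent(context)    # agent starts at the hard part
```

Each cycle, a stable agent behavior gets promoted into `steps`, and the agent's remaining budget goes to whatever still cannot be hardcoded.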
Some of the context pre-processing is best suited for simple code, but not every extracted workflow needs to be deterministic. It is often useful to use one-shot prompts, or even fully agentic steps that produce a structure for the downstream agent to consume. The line between subagent, tool call, and workflow step blurs here.
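A pre-processor step that is itself a single LLM call might look like the following sketch. `PromptStep` and its fields are illustrative names, and the model callable is left abstract so it can be a real API client or a stub:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptStep:
    """A workflow step that is one LLM call: fill a prompt template
    from the current context and emit structured output under a known
    key for the downstream agent to consume."""
    template: str
    llm: Callable[[str], str]  # any model client, injected
    output_key: str

    def __call__(self, context: dict) -> dict:
        return {self.output_key: self.llm(self.template.format(**context))}
```

Whether you call this a workflow step, a tool, or a one-turn subagent is mostly a matter of framing; the interface is identical.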
When introducing LLM calls or agents into pre-processor steps, it is useful to think about the returns to intelligence of your entire process. If models get smarter overnight, do you reap the rewards, or have you overly constrained certain steps? One of our initial workflows over-constrained the LLM to tiny chunks of work. By letting the model bite off bigger pieces, we saw a large increase in its performance and in its ability to handle a wider distribution of data.
New model capabilities can also reverse the need to extract steps entirely. Something you hardcoded into a workflow step because the agent was too slow or too unreliable can go back to being agentic, but now with better surrounding context and a more capable model.
Observability is a Prerequisite
You cannot run this loop without observability. We found that file-based artifacts of each run were the easiest way to analyze results with the help of an agent. We captured every intermediate workflow output, prompt, agent log, tool call, and more into a predictable file schema that we were able to co-locate with the final results of our agent. In parallel to our main process, we built specialized agent skills to analyze the outputs of the run and look for emergent patterns and failure modes. This let us automate the rote analysis and cover a large sample size with only two engineers on the project.
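As a rough sketch of what file-based run artifacts can look like (the layout and the `record_artifact` helper here are hypothetical, not our exact schema):

```python
import json
from pathlib import Path

def record_artifact(run_dir: Path, name: str, payload: dict) -> Path:
    """Write one artifact of a run (a prompt, tool call, or workflow
    output) into a predictable file layout, so an analysis agent can
    later walk the run directory without knowing anything about the
    main process that produced it."""
    path = run_dir / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Because every intermediate lands on disk next to the final result, the analysis skills can operate on plain files instead of live process state.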
One interesting learning from this is that current models are helpful for finding issues, but not for integrating a fix. If the output of our investigation skill was still in context, the new agent prompts would always be overfit to that specific issue. Human intelligence was key here for deciding on a direction after running a batch of experiments and collating the results. This level of task generalization does not seem in scope for the models yet, and our ability to gain intuition from a loosely connected data set still gives us a key advantage over agents. Figuring out how to maximize the intersection of human intuition and agentic intelligence is essential for iterating quickly.
Failure Modes
Our most interesting failure modes were around inter-model communication and tool trust. Our agent had access to LLM-based tool calls backed by different models. The primary agent would not trust the tool's results until we aligned their speaking styles. Because the tool was nondeterministic, whenever it claimed too much accuracy, its results were disregarded entirely. Similarly, if the tool gave an answer that conflicted with a previous call, the agent would distrust and then abandon it. The solution was tuning the tool's expressed confidence in its results and giving it internal memory of its past turns.
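A minimal sketch of those two fixes, calibrated confidence plus memory of past turns, might look like this. `HedgedTool` and its backend signature are invented for illustration, not our production wrapper:

```python
class HedgedTool:
    """Wrap a nondeterministic LLM-backed tool so that it (1) reports
    calibrated confidence instead of absolute claims, and (2) remembers
    past answers so it can acknowledge, rather than silently contradict,
    an earlier turn."""

    def __init__(self, backend):
        self.backend = backend  # callable: query -> (answer, score in [0, 1])
        self.history = []

    def __call__(self, query: str) -> str:
        answer, score = self.backend(query)
        note = ""
        for past_query, past_answer in self.history:
            if past_query == query and past_answer != answer:
                note = f" (revises earlier answer: {past_answer!r})"
        self.history.append((query, answer))
        return f"{answer} (confidence {score:.0%}){note}"
```

Surfacing the revision explicitly gives the primary agent a reason for the conflict, instead of leaving it to conclude the tool is unreliable and abandon it.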
We initially faced token anxiety, which newer model releases have mostly solved, and sometimes watched the agent give up on the task entirely. Our initial prompt told the agent to work until it hit a 95% score on our verification metric. It would claim the task was too hard and fail with "FATAL_ERROR: Unable to achieve 95%+ scores".
Our stopping condition was also 95% achievement on multiple internal scoring mechanisms. In the first iterations of our algorithm, the agent would race toward solving the first metric, then hit token anxiety or turn limits, and die. We revamped our agent prompt to jointly maximize across all metrics to avoid getting stuck and saw much better results.
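A joint stopping condition over all metrics, rather than racing the first one past 95%, can be as simple as the following sketch (the metric names are illustrative):

```python
def should_stop(scores: dict[str, float], threshold: float = 0.95) -> bool:
    """Stop only when every internal metric clears the bar."""
    return all(score >= threshold for score in scores.values())

def next_focus(scores: dict[str, float]) -> str:
    """Point the agent at its current worst metric, so it cannot
    race one score to threshold and stall on the rest."""
    return min(scores, key=scores.get)
```

Prompting the agent with `next_focus` each turn keeps it maximizing jointly instead of optimizing the first metric until it runs out of budget.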
The Engineer and The Agent
Engineers no longer need to solve these problems alone. Our job is to design experiments, monitor behavior, and architect harnesses to give agents maximum leverage. Generalizing over sparse data and long time horizons is our superpower for improving agents. They help automate pattern discovery. We analyze the results and steer them in the right direction. Agents could not have solved this problem on their own, but this product would not be possible without them.