How File-System First Design Makes Agents Easier to Operate

Rare Ivy, Marketing Manager

Why agent infrastructure is starting to look like a folder tree

The first wave of agent prototypes had a simple charm to them. You pointed a model at a task, added a prompt, wired in a couple of tools, and waited to see if the thing could do something useful without breaking into interpretive dance. That gets you far enough for a demo. It doesn’t get you far enough for a system that has to run every day, handle changing inputs, and survive contact with real users.

Once an agent starts doing real work, the shape changes. The prompt still matters, of course, but it stops being the whole story. A mature agent is usually a bundle of smaller decisions that have to cooperate over time: which model it uses, what data it can read, which tools it may call, when it runs, how it retries, where it stores state, and what happens when it gets handed off to something else. Miss one of those pieces and the system can look fine in code while behaving oddly in production.

A production agent is less a clever prompt and more a set of files, permissions, timers, and state that have to agree with each other.

That’s why the filesystem keeps creeping into the conversation. Engineers already know how to read a folder tree. They know that a directory named tools probably contains callable actions, that prompts or policies are where behavior gets defined, and that a state folder is where the messy bits live. A tree gives you a map before you read a line of implementation.

That readability matters because agent infrastructure isn’t just code. Code is one layer, and sometimes not even the most fragile one. Identity controls what the agent can act as. Access controls decide what it can touch. Timing decides when it wakes up, when it pauses, and when it should try again later instead of hammering the same endpoint like an overcaffeinated intern. State decides what it remembers, what it forgets, and what gets written down for the next run. In production, those concerns aren’t side quests. They shape the behavior just as much as the prompt does.

Once you see agents through that lens, file-system first design starts to feel practical rather than decorative. It gives each concern a home. A reviewer can open a branch and see that a change touches a prompt file, a tool definition, and a schedule, instead of spelunking through one sprawling service to figure out what moved. An operator can inspect the tree and answer a simple question: if this agent behaves badly, where do I look first? A teammate taking over ownership doesn’t need a tour of the whole codebase before they understand the moving parts.

That makes the day-to-day work calmer. Changes become easier to reason about because the structure tells you what kind of change you’re making. If behavior changes, you edit the behavior files. If permissions change, you touch the access layer. If timing changes, you adjust the schedule. That sounds ordinary, even a little boring, and that’s the point. Boring systems are easier to keep upright.

So the folder tree isn’t a cosmetic choice. It’s the shape agent infrastructure takes when teams stop treating agents like one-off demos and start treating them like software that has to be operated. The next question is what goes wrong when everything is crammed into one app, because that’s where the mess usually shows up first.

What goes wrong when everything lives in one agent app

A monolithic agent app feels tidy at first. One repo, one deployment, one place to poke around when the thing says something odd. That simplicity usually lasts until the agent stops being a demo and starts doing actual work on a schedule, with tools, state, retries, handoffs, and a few sharp edges around permissions.

Then the whole setup gets fussy.

The core problem is coupling. A prompt change that sounds harmless in review can alter tool selection. A tool schema tweak can break a downstream parser. A schedule adjustment can change when state gets loaded, which changes the prompt context, which changes the model’s output, which changes which branch of the orchestration runs next. None of those pieces is difficult on its own. Put them in one lump and every edit starts to feel like a small wiring job in a wall you can’t fully see.

That’s where agent orchestration gets annoying in the real sense of the word. Orchestration is a pile of decisions happening over time. When those decisions live beside each other in one codebase, a developer trying to add a new tool can accidentally touch retry behavior, and a person adjusting a nightly schedule can trip a prompt path that only runs when the clock hits a certain minute. If the app also depends on structured outputs, the blast radius gets wider. A field rename that looks trivial in the code review can become the reason a run fails to deserialize cleanly. The structured outputs guide exists because model output shape matters, and in a monolith that shape is often shared with too many other concerns.

The nastiest agent bugs usually hide in the seams between prompt text, tool calls, timing, and stored state.

Debugging gets messy for the same reason. When control flow, configuration, and state all live in the same place, you spend half your time asking basic questions the code should have made obvious. Which prompt was active for this run? Which version of the tool schema did the model see? Did the scheduler trigger this job, or did a manual retry start it? Was the previous run’s state carried forward on purpose, or did a stale file stick around and shape the next response?

If the answer to any of those is “not sure,” you’re already in the weeds.
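
One way to keep those questions answerable is to write a small manifest at the start of every run. The sketch below is an illustration rather than part of any SDK: it assumes prompts and tools live in their own prompts/ and tools/ directories, and the field names, the runs/ location, and the write_run_manifest helper are all hypothetical.

```python
import hashlib
import json
import time
from pathlib import Path


def file_digest(path: Path) -> str:
    """Short content hash, so a run can be tied to the exact file it saw."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]


def write_run_manifest(agent_dir: Path, run_id: str, trigger: str) -> Path:
    """Record which prompt and tool files were active and what started the run.

    `trigger` is a free-form label such as "schedule" or "manual-retry".
    """
    prompts = sorted((agent_dir / "prompts").glob("*.md"))
    tools = sorted((agent_dir / "tools").glob("*.py"))
    manifest = {
        "run_id": run_id,
        "trigger": trigger,
        "started_at": time.time(),
        "prompts": {p.name: file_digest(p) for p in prompts},
        "tools": {p.name: file_digest(p) for p in tools},
    }
    out = agent_dir / "runs" / f"{run_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

With a record like this next to each run, "which prompt was active?" becomes a file lookup instead of an archaeology project.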

The docs for OpenAI’s Agents SDK and tools guide are useful because they separate concepts that production systems tend to smash together. In practice, though, teams often build a single app that contains the model config, tool wiring, cron-like schedules, retries, and whatever handoff logic seemed easiest that week. If answering a basic question about a run requires reading the full app from top to bottom, the architecture has become a support problem.

Ownership gets blurry next, and that part hurts in a very human way. One codebase can make it unclear who actually owns what. Prompt edits might be reviewed by an engineer who understands language quality but not rate limits or tool latency. Tool changes may be signed off by the person who wrote the integration, even though the failure is really in the agent’s decision path. Scheduling changes are often the orphan child in this setup. They’re easy to overlook until the on-call person gets paged because a job ran at the wrong time and the model inherited the wrong context.

That blur matters during handoffs too. A teammate can inherit the system and still not know where to start. The repo might contain a dozen small knobs, but they all sit in one thick layer of application code, so the path from symptom to cause is never obvious. Reviews become broader than they should be. A simple prompt revision now needs a glance at tool behavior, and a small tool change pulls in scheduling and state logic, even when those pieces have nothing to do with the feature itself. It’s noisy, and noise slows people down.

Production support suffers in a more direct way. A monolith gives you brittle releases because unrelated behavior is tied together. Change the delegation logic and you may also alter a retry branch that nobody meant to touch. Update a prompt and the agent may start choosing a different tool, which changes latency, which changes timeout behavior, which changes failure rate. The failure mode then shows up as a vague “agent didn’t do the thing” report, which is a lovely sentence for a ticket and a terrible sentence for diagnosis.

Testing gets awkward for the same reason. With AI agents, the interesting bugs often depend on timing, prior state, and model output that changes slightly from one run to the next. If the whole system is fused together, you can’t test prompt behavior without dragging in tools. You can’t test scheduling without booting the orchestration layer. You can’t test handoff rules without simulating the rest of the app. The tests either become enormous and brittle, or they skip the messy paths entirely and give everyone a false sense of comfort.

And that’s the real pain point. A single large agent app hides the boundaries that operators need in order to reason about change. When everything sits in one place, a small tweak can ripple through prompt behavior, tool calls, timing, and stored state before anyone notices. The next section is where the fix starts to look ordinary, almost boring in a good way: break the agent into parts you can actually name, inspect, and own.

A practical file-system-first layout for agents

Once you stop treating an agent like a single app and start treating it like a bundle of parts, the folder structure gets a lot less decorative. That’s the whole point of a folder tree architecture: each concern gets a place where people can find it without guessing which Python file happens to contain the truth this week.

A simple layout might look like this:

agent/
  agent.yaml
  model.yaml
  prompts/
    system.md
    policy.md
    handoff.md
  tools/
    search.py
    fetch.py
    writeback.py
  schedules/
    daily-summary.cron
    retry-window.yaml
  knowledge/
    product-specs/
    faq/
  state/
    schema.json
    checkpoints/
  runs/
  logs/
  tests/

That tree isn’t fancy. It doesn’t need to be. What it gives you is a way to separate the bits that define the agent from the bits the agent produces while it runs. The agent.yaml and model.yaml files can hold the static setup: which model to call, what permissions it has, which tools it may use, and the basic operating rules. The prompts/ directory can carry the human-readable instructions, policies, and special-case behavior. tools/ contains executable code. schedules/ stores timing rules or triggers. state/ is where you define how the agent remembers things, and runs/ or logs/ can hold the messy aftermath of actual execution.
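
The split between what defines the agent and what it produces can be checked mechanically. Here is a minimal sketch, assuming the tree shown above; the DEFINITION and ARTIFACTS lists and the check_layout helper are hypothetical names, and which entries belong in which list is a judgment call for each team.

```python
from pathlib import Path

# Entries people edit (assumed from the example tree).
DEFINITION = [
    "agent.yaml", "model.yaml", "prompts", "tools",
    "schedules", "knowledge", "state/schema.json", "tests",
]
# Entries the runtime writes (also an assumption).
ARTIFACTS = ["state/checkpoints", "runs", "logs"]


def check_layout(agent_dir: Path) -> dict[str, list[str]]:
    """Report missing definition entries and create missing artifact directories."""
    missing = [name for name in DEFINITION if not (agent_dir / name).exists()]
    created = []
    for name in ARTIFACTS:
        target = agent_dir / name
        if not target.exists():
            # Artifact directories are safe to create: nothing in them is hand-written.
            target.mkdir(parents=True)
            created.append(name)
    return {"missing": missing, "created": created}
```

The asymmetry is deliberate: a missing definition file is reported for a human to fix, while a missing artifact directory is simply created.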

If you can’t tell what changes the agent and what the agent produces, the folder tree isn’t doing its job.

That split matters because static definitions and mutable artifacts behave differently. A prompt file is edited by a developer or operator. A checkpoint file is written by the system at runtime. A schedule file tells the agent when to wake up. A run log tells you what happened after it woke up and immediately regretted it. If those live together, people start treating everything as editable, or nothing as trustworthy. Both are annoying. Neither helps on-call.

The same logic applies to knowledge. If an agent depends on documents, keep those documents in a dedicated place, then connect them to retrieval explicitly. OpenAI’s Agents guide and file search tool docs both point toward this separation: instructions, tools, and files are different things, even if they end up working together at runtime. A knowledge/ directory makes that obvious. So does a manifest that says which files are indexed and which ones are just archived for reference.
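
A manifest like that can be generated from the directory itself. The following is a sketch under stated assumptions: the build_knowledge_manifest helper and the idea of marking whole top-level folders as indexed are hypothetical, and nothing here calls a real retrieval or file search API.

```python
from pathlib import Path


def build_knowledge_manifest(knowledge_dir: Path, indexed_dirs: set[str]) -> dict[str, list[str]]:
    """Split files under knowledge/ into 'indexed' and 'archived' by top-level folder."""
    manifest: dict[str, list[str]] = {"indexed": [], "archived": []}
    for path in sorted(knowledge_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(knowledge_dir)
        # Files directly under knowledge/ count as archived unless named explicitly.
        bucket = "indexed" if rel.parts[0] in indexed_dirs else "archived"
        manifest[bucket].append(rel.as_posix())
    return manifest
```

For the example tree, calling it with indexed_dirs={"product-specs", "faq"} would list everything in those two folders as indexed and anything else as archived.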

This also helps with agent state management, which gets awkward fast if you leave it buried inside orchestration code. State needs a schema. It needs boundaries. It needs to be clear about what survives between runs and what gets rebuilt every time. If you’re using a graph-based runtime, LangGraph’s state model guide is a good example of how to treat state as something explicit rather than magical. That usually means one file for the schema, another for serialization rules, and separate storage for checkpoints or cached outputs. The shape of the state becomes visible instead of implied by half a dozen function calls and a prayer.
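
To make "what survives between runs" concrete, here is a small sketch using plain Python dataclasses. It does not use LangGraph’s API; the AgentState fields, the persist metadata flag, and the checkpoint helpers are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, fields
from pathlib import Path


@dataclass
class AgentState:
    # Survives between runs: written to the checkpoint.
    last_run_id: str = ""
    pending_items: list[str] = field(default_factory=list)
    # Rebuilt every run: marked so it is never written to disk.
    scratch: dict = field(default_factory=dict, metadata={"persist": False})


def save_checkpoint(state: AgentState, path: Path) -> None:
    """Write only the fields that are meant to outlive the run."""
    persisted = {
        f.name: getattr(state, f.name)
        for f in fields(state)
        if f.metadata.get("persist", True)
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(persisted, indent=2))


def load_checkpoint(path: Path) -> AgentState:
    """Start from defaults when no checkpoint exists yet."""
    if not path.exists():
        return AgentState()
    return AgentState(**json.loads(path.read_text()))
```

The schema now lives in one readable place, and the rule about what gets carried forward is a field annotation rather than a side effect of orchestration code.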

Delegation and handoff logic belong close to the agent definition too. They don’t need to hide inside a giant routing layer that nobody wants to touch on Friday afternoon. They can live in a Markdown file such as prompts/handoff.md or in a small Python module, depending on how much logic you need. The point isn’t the file extension. The point is proximity. When someone opens the agent folder, they should see where responsibility starts, where it stops, and what happens next.

That becomes even clearer if you separate policy from mechanism. A tool file can implement the actual API call. A schedule file can decide when the agent runs. A state file can define what gets carried forward. None of those concerns needs to be trapped in one orchestration blob. In practice, that blob usually grows teeth, then legs, then a personal vendetta against maintainability.

The filesystem works well here because engineers already know how to read it. A folder named prompts means one thing. A folder named tools means another. state isn’t logs, and logs aren’t knowledge, even if all three are full of text. That gives the team a contract they can inspect without opening every module. You can review a pull request and ask a simple question: did this change alter the agent’s definition, its runtime behavior, or just the records it leaves behind?
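
That review question can be answered from the list of changed paths alone. A minimal sketch follows; the CONCERNS mapping is an assumption about how a team might sort the example tree, and classify_change is a hypothetical helper, not a feature of any review tool.

```python
from pathlib import PurePosixPath

# Assumed mapping from top-level entries to the concern they represent.
CONCERNS = {
    "definition": {"agent.yaml", "model.yaml", "prompts", "schedules", "knowledge"},
    "runtime behavior": {"tools", "state"},
    "records": {"runs", "logs"},
}


def classify_change(changed_paths: list[str]) -> dict[str, list[str]]:
    """Group changed paths (relative to the agent folder) by the concern they touch."""
    result: dict[str, list[str]] = {}
    for raw in changed_paths:
        top = PurePosixPath(raw).parts[0]
        concern = next(
            (name for name, tops in CONCERNS.items() if top in tops),
            "other",  # anything outside the mapping, such as tests/
        )
        result.setdefault(concern, []).append(raw)
    return result
```

A pull request that lands in a single bucket is usually easy to review; one that spans all three is a hint that the change deserves a closer look.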

That clarity pays off when the system grows a second agent, then a third. One tree can hold several definitions, each with its own prompts, schedules, tools, and handoff rules, without forcing every decision through the same central file. The structure stays readable because the responsibilities stay separated. And once that happens, the next change usually feels less like spelunking and more like editing a map that already makes sense.

The operational upside: simpler changes, safer runs, clearer ownership

Once the files are split out, the day-two experience changes pretty fast. A new teammate can open the tree, see where prompts live, where tools are registered, where schedules are defined, and where runtime state gets written. That sounds almost too plain to mention, which is exactly the point. In backend architecture terms, the folder tree becomes the first map people trust. It brings the time to “I know where to look” down to a few seconds.

A folder tree won’t make an agent clever. It will make it much harder for cleverness to turn into chaos.

That matters during reviews. When a change touches one file or one directory, the blast radius is easier to judge. A prompt tweak can be reviewed as a prompt tweak. A schedule change can be inspected without dragging along model settings and access rules. If a rollout goes sideways, rollback is less of a guessing game because the pieces are separated in a way that matches the failure. You’re not sifting through a single monolith and hoping the right toggle falls out. You know which part changed, which part is still safe, and which part should be left alone.

The same structure helps with access control, which tends to get messy fast once agents start touching real systems. A team may want product managers to read policy files, but not edit credentials. An operator might need permission to adjust schedules, while only the runtime can write state. A file-system-first layout gives you somewhere to hang those boundaries. It also makes audits less annoying, since sensitive pieces don’t need to be hidden inside a giant app where every function can seem one import away from everything else. When state lives in its own place, it’s easier to tell what is source material and what was produced at runtime.
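
Those boundaries can be written down as a table keyed by directory. The sketch below is illustrative only: the role names, the ACCESS table, and the is_allowed helper are hypothetical, and a real deployment would more likely enforce this through file permissions or repository review rules than through an in-process check.

```python
from pathlib import PurePosixPath

# Hypothetical roles, mapped to the top-level directories they may read or write.
ACCESS = {
    "product_manager": {"read": {"prompts"}, "write": set()},
    "operator": {
        "read": {"prompts", "schedules", "runs", "logs"},
        "write": {"schedules"},
    },
    "runtime": {
        "read": {"prompts", "tools", "schedules", "knowledge", "state"},
        "write": {"state", "runs", "logs"},
    },
}


def is_allowed(role: str, action: str, path: str) -> bool:
    """Return True if `role` may perform `action` ("read" or "write") on `path`."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    rules = ACCESS.get(role, {})  # unknown roles get no access
    return parts[0] in rules.get(action, set())
```

Because each rule names a directory rather than a function, the table can be audited by someone who has never read the application code.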

That separation also makes experimentation less risky. If someone wants to test a new model against the same tools, they can change the model config without poking at access policies. If a team wants to try a different delegation rule, they can modify that file and leave the rest of the system alone. This is the sort of restraint that sounds boring until the first late-night incident. Then boring starts to look pretty elegant. Less shared state means fewer accidental side effects. Fewer accidental side effects mean a calmer pager.

There’s also a human side to this that gets overlooked. Engineers are faster when the system matches the way they already think about parts and boundaries. A tree of folders is a blunt instrument, sure, but it gives ownership a physical shape. One directory can belong to the prompt owner. Another can belong to the scheduling logic. Another can hold runtime artifacts that should never be edited by hand.

For agent teams, that’s the real payoff. The folder structure doesn’t just keep code tidy. It keeps behavior legible, changes local, and access controlled in a way people can actually maintain. That’s boring in the best sense of the word. It gives the team a stable operational map, and in agent work, a stable map is worth more than a clever all-in-one file that tries to do everything and remembers none of it.
