Help! My coding agent can run code
Last week we talked briefly about how you can make tools safer when they’re tightly scoped. This week I’m going to discuss why this approach doesn’t generalise and what you can do instead.
Tightly scoped tools
Last week we introduced two principles that allowed for safe reading, writing, and searching:
Only read and write in the current directory (or its children). This relies on two assumptions. Firstly, reads are safe, because you’re implicitly giving read permission by your choice of working directory. Secondly, if you’re using git (which you should be, especially when working with agents!), writes are safe, because you can easily undo any changes.
Never read or write files that start with a dot (.). Dot files often contain sensitive information like API keys or passwords, so you want to keep them away from the agent. There’s still some danger if the user has stored secrets in other files, but that’s bad practice regardless, since it’s very easy to commit those secrets to git and accidentally share them with your human colleagues (or the whole world!).
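To make those two rules concrete, here is a minimal sketch of a path check. The helper name is_safe_path() is my own invention for illustration, not something from last week's post, and it deliberately errs on the side of rejecting: it does not follow symlinks, and it refuses any path containing a .. component rather than trying to resolve it.

```r
# Hypothetical helper: TRUE only when `path` stays inside `root` and no
# component below `root` starts with a dot.
is_safe_path <- function(path, root = getwd()) {
  root <- normalizePath(root, winslash = "/", mustWork = TRUE)

  # Treat "/x", "~/x" and "C:/x" as absolute; everything else is relative
  # to root.
  is_absolute <- grepl("^(/|~|[A-Za-z]:)", path)
  full <- if (is_absolute) path.expand(path) else file.path(root, path)
  full <- gsub("\\\\", "/", full)

  # Rule 1: the path must sit below the working directory.
  prefix <- paste0(sub("/+$", "", root), "/")
  if (!startsWith(full, prefix)) {
    return(FALSE)
  }

  # Rule 2: no dot files or dot directories. This also rejects "..",
  # so a path can't climb back out of root. Symlinks are NOT resolved.
  rel <- substring(full, nchar(prefix) + 1)
  parts <- strsplit(rel, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts) & parts != "."]
  !any(startsWith(parts, "."))
}
```

A read or write tool would call this before touching the file system and return a refusal to the model when it yields FALSE.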
Since we can make these narrowly scoped tools safe, you might wonder if we should give our agent a large set of simple tools that we have carefully analysed for safety. Unfortunately there are two reasons that this doesn’t work:
Empirically, agents aren’t very good at managing large numbers of tools. They generally seem to be better with fewer tools, especially when those tools can run code. This is presumably because LLMs have much more training data about how people write code, rather than how they pick from a smorgasbord of special-purpose tools.
A coding agent is much more useful if it can do whatever the user imagines, not what the agent author imagines. Creating a walled garden is the right decision in many cases (e.g. creating a customer support agent) but is not what you want from a coding agent.
So harnesses are generally more effective when they give agents general tools like the one I showed last week, or a similarly dangerous tool that runs arbitrary R code:
eval_env <- new.env(parent = globalenv())
run_r_code <- function(code) {
  expr <- parse(text = code)
  eval(expr, envir = eval_env)
}
chat$register_tool(tool(
  run_r_code,
  description = "Run arbitrary code in a persistent environment",
  arguments = list(code = type_string("R code to execute"))
))
From a safety perspective, this tool is exactly equivalent to the tool from my last post, because you can call R from the command line (with Rscript), and the command line from R (with system()).
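If you want to see that round trip for yourself, the following sketch goes from R to an external process and back into R. It uses system2() rather than system() only because it makes passing arguments a little tidier, and it assumes the Rscript binary lives in the usual place inside your R installation.

```r
# Path to the Rscript binary that ships with the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# R -> external process: system2() launches a program from R.
# External process -> R: that program is Rscript, which evaluates R code.
out <- system2(rscript, c("-e", shQuote("cat(1 + 1)")), stdout = TRUE)
out
```

Anything an agent can do with one of these tools, it can do with the other.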
Here be dragons
If we can’t rely on narrow tools that are provably safe, what can we do? We need to first define what “danger” means. I think there’s a fairly broad definition that covers the cases we care about: a dangerous operation is something that is expensive, time-consuming, or flat-out impossible to undo. Here are a few examples roughly ordered from easiest to hardest to undo:
Deleting or modifying a file in a directory not tracked by git. It should be possible to recover from a backup (I hope you have backups!), but restoring from a backup is usually a pain.
Deleting a database table. Again, you should have backups, but you’re probably only backing up on a schedule, so you’ll lose anything that’s changed since.
Spending money. I truly hope your agent doesn’t have access to your credit card, but even without that a tool might get stuck in a loop and spend a bunch of your tokens. Good luck getting that money back!
Sending an email. Once a message has left your outbox there’s no recalling it; the best you can do is beg the recipient not to read it.
Revealing a secret like a password or API key, personally identifiable information (PII), or just something about your business that should be kept confidential. You can rotate a password or API key, but there’s no way to undo sharing PII or company secrets.
So how do we protect against these dangers? There are three basic approaches: asking for permission, sandboxing, and using an LLM to analyse the code. These three approaches are arranged in increasing order of sophistication, which is also the order in which they were added to real tools.
Ask for permission
The first approach to avoiding danger is to require explicit approval for each potentially dangerous action. The key problem with this approach is that it quickly becomes security theatre: 99.9% of requests will be safe, so you are trained to click yes, without deeply inspecting the code to be run.
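As a rough sketch of what this looks like in code, you can wrap a tool function so that nothing runs until something says yes. The names with_approval() and ask_user() are hypothetical, and the approval function is passed in as an argument so that you can swap the console prompt for a dialog (or a stub).

```r
# Hypothetical console prompt: returns TRUE only for an explicit "y"/"yes".
# In a non-interactive session readline() returns "", so the answer is no.
ask_user <- function(code) {
  cat("The agent wants to run:\n", code, "\n", sep = "")
  answer <- readline("Allow? [y/N] ")
  tolower(trimws(answer)) %in% c("y", "yes")
}

# Wrap any tool function so that it only runs after approval.
with_approval <- function(tool_fun, approve = ask_user) {
  function(code) {
    if (!isTRUE(approve(code))) {
      # Return a message rather than throwing, so the model sees the refusal.
      return("Denied: the user did not approve this code.")
    }
    tool_fun(code)
  }
}

# Stand-in for a code-running tool, evaluated in a throwaway environment.
run_code <- function(code) eval(parse(text = code), envir = new.env())

always_no <- with_approval(run_code, approve = function(code) FALSE)
always_no("1 + 1")
```

The wrapper is trivial to write. The hard part is the human on the other end of approve().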
One way to reduce the number of manual approvals is to declare an entire prefix to be safe: i.e. approve anything that starts with git st or ls. But this unfortunately doesn’t work very well because command line tools are not designed with the idea of splitting up their API surface into safe and dangerous operations based on the prefix. For example, git push is safe, but git push --force is often dangerous; or gh issue list is safe, but gh issue delete is dangerous. It’s even less obvious why the following longer commands are unsafe:
find . -type f -name '*.log' -mtime +30 -size +10M -path './var/*' -delete
curl -fsSL https://example.com/install.sh -o /dev/stdout | sh
tar -xzf backup.tar.gz -C /tmp/restore --strip-components=1 -P
(If you want to know why, ask your favourite LLM 😀)
And this only scratches the surface of the shell commands that you can use to chain together operations that might be safe individually but dangerous collectively.
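Here is a small sketch of how a prefix allowlist goes wrong. The allowlist and the is_preapproved() helper are made up for illustration (as is the issue number), but the failure mode is real: a plain prefix match can't tell a safe command from a dangerous one that happens to start the same way.

```r
# Hypothetical allowlist of "safe" prefixes.
approved_prefixes <- c("git st", "git push", "gh issue", "ls")

# TRUE if the command starts with any approved prefix.
is_preapproved <- function(command, prefixes = approved_prefixes) {
  any(startsWith(command, prefixes))
}

commands <- c(
  "git status",           # what you meant to allow
  "git push --force",     # same prefix as a plain push
  "gh issue delete 123",  # same prefix as `gh issue list`
  "ls; rm -rf ./data"     # a second command chained after the safe one
)

# Every one of these is waved through.
vapply(commands, is_preapproved, logical(1))
```

You can tighten the matching, but then you are writing a shell parser, and you still need to know what every flag of every tool does.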
Sandbox
So if manual permission doesn’t work, what does? The next step up in sophistication is sandboxing. This is a service provided by your operating system that constrains a process to a sandbox of safe actions. Typically this means forbidding all reads and writes outside of the current directory, and categorically denying any network request.
This allows you to carve out a space for your agent to play that is guaranteed to be safe. But unfortunately there are a lot of details you need to work through. For example, take running R: it needs to be able to write to the temp directory or it fails with an obscure error, and if you can’t read from the default library, you can’t use any installed packages. What about writing to the packages directory? Without that you can’t install packages.
There are also lots of cases where you do want to access the network. Maybe you want to read an issue on GitHub, or a documentation website, or access data from a database. It’s possible to configure an allowlist to let specific websites through, but like managing directories this quickly gets annoying.
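To give a feel for the bookkeeping involved, here is a sketch of the kind of policy you end up writing for R. Everything in it is hypothetical (make_policy(), is_under(), is_allowed_host(), and the choice of hosts), and a real sandbox is enforced by the operating system rather than by R code like this. The sketch only shows which allowlists you would need to maintain.

```r
# Hypothetical policy for running R inside a project directory.
make_policy <- function(project = getwd()) {
  list(
    # R needs to read installed packages and its temp directory.
    read = c(project, .libPaths(), tempdir()),
    # Library paths are absent here, so installing packages would fail.
    write = c(project, tempdir()),
    # Example hosts only; your list will differ.
    hosts = c("github.com", "api.github.com")
  )
}

# TRUE if `path` is one of `dirs` or sits below one of them.
# Paths that don't exist yet are compared as given.
is_under <- function(path, dirs) {
  path <- normalizePath(path, winslash = "/", mustWork = FALSE)
  dirs <- normalizePath(dirs, winslash = "/", mustWork = FALSE)
  any(path == dirs | startsWith(path, paste0(sub("/+$", "", dirs), "/")))
}

# Pull the host out of a URL and compare it to the allowlist.
is_allowed_host <- function(url, hosts) {
  host <- sub("^[a-z]+://([^/:?#]+).*$", "\\1", tolower(url))
  host %in% hosts
}
```

Each new package, data source, or website the agent legitimately needs means another entry in one of these lists.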
Use an LLM to assess command safety
So that leads us to the state of the art today: run every tool call in a sandbox, and only when the sandbox rejects a call, ask another LLM to analyse the code (and the conversation) to determine if it’s safe. That way the vast majority of safe actions run without any overhead, and only the rare rejected call pays for a review. (This is what we just added to Posit Assistant.)
It’s worth noting that determining if a tool call is safe or dangerous is a non-trivial task. We mentioned git push --force before; that’s definitely a dangerous operation if other people are working on the same fork. So an LLM should never do it without asking, but if you explicitly request a force push, it’s probably reasonable to do. But other requests are so dangerous that the LLM should never allow them; even if you explicitly ask the LLM to delete all the files on your computer, it shouldn’t comply!
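The control flow can be sketched in a few lines. To be clear, this is not how Posit Assistant is implemented; every name here is hypothetical. I'm assuming that run_in_sandbox() signals a condition of class "sandbox_violation" when the sandbox blocks something, that review() wraps a call to a second LLM and returns "allow", "ask", or "deny", and that approved code is then re-run outside the sandbox.

```r
# Hypothetical orchestration: sandbox first, LLM review only on rejection.
run_tool_call <- function(code, conversation, run_in_sandbox, run_unsandboxed,
                          review, ask_user) {
  tryCatch(
    # Fast path: most calls succeed here and never reach the reviewer.
    run_in_sandbox(code),
    sandbox_violation = function(cnd) {
      # The reviewer sees the conversation too, so it can tell whether the
      # user explicitly asked for something risky.
      verdict <- review(code, conversation)
      approved <- switch(verdict,
        allow = TRUE,
        ask = isTRUE(ask_user(code)),
        FALSE # "deny", or anything unexpected
      )
      if (isTRUE(approved)) {
        run_unsandboxed(code)
      } else {
        "Denied: blocked by safety review."
      }
    }
  )
}
```

The three verdicts map onto the cases above: routine operations are allowed, something like a force push goes back to the user, and deleting everything on your computer is denied no matter who asks.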
Next week we’ll continue to talk about security, including the lethal trifecta, a particularly insidious scenario that arises when an agent can access private data, can read untrusted content, and has the ability to communicate externally. We’ll also talk more about some of the techniques people use to trick the LLM into thinking that untrusted content is actually safe.


