
Agents that write their own tools

By Alan Zabihi

For an agent to be really useful, it needs tools — specialized pieces of code that help complete specific tasks, like browsing the web. Today, these tools are built manually by developers who try to anticipate what the agent might need. When they miss something, work stops until a developer notices and builds a new tool.

We propose a different approach: agents that write their own tools when they encounter new challenges.

How agents identify tool needs

For example, imagine an agent working on a weather forecasting project in Superagent. It might break this down into tasks like:

  1. Import historical weather datasets
  2. Wait for human to verify analysis assumptions
  3. Build seasonal forecasting model
  4. Generate report with findings

If the agent discovers it can't identify seasonal patterns with its current tools, it recognizes this gap and starts planning a new tool to fill it.
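
As a rough sketch of what that gap detection could look like (the class and function names here are illustrative, not Superagent's actual API), the agent can compare a task's required capabilities against what its existing tools provide:

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    capabilities: set[str]

@dataclass
class Task:
    description: str
    required_capabilities: set[str]

def find_capability_gaps(task: Task, toolbox: list[Tool]) -> set[str]:
    """Return the capabilities the task needs that no existing tool provides."""
    available = set().union(*(tool.capabilities for tool in toolbox))
    return task.required_capabilities - available

# The forecasting task needs seasonal decomposition, which no current tool offers.
toolbox = [Tool("dataset_importer", {"load_csv", "load_parquet"})]
task = Task("Build seasonal forecasting model", {"load_csv", "seasonal_decomposition"})

missing = find_capability_gaps(task, toolbox)
if missing:
    print(f"Planning a new tool to cover: {missing}")  # {'seasonal_decomposition'}
```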

Writing new tools

Once the agent knows what tool it needs, it follows a test-driven development (TDD) approach:

First, the agent writes tests that define what the tool should do. With the tests in place, the agent then (as sketched in code after this list):

  1. Implements the tool
  2. Runs it in an isolated testing environment
  3. Fixes the tool when tests fail
  4. Repeats until all tests pass
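
Sketched in Python, the loop looks something like this. The generate_code and run_tests_in_sandbox callables are placeholders for the agent's model call and a sandboxed test runner (one possible implementation of the latter appears below), not a specific framework API:

```python
MAX_ITERATIONS = 5  # give up and escalate to a human if the tool still fails after this many attempts

def develop_tool(spec: str, generate_code, run_tests_in_sandbox) -> str:
    """Test-driven loop: write tests first, then iterate on the implementation
    until every test passes."""
    tests = generate_code(f"Write executable assert-based tests for a tool that: {spec}")
    implementation = generate_code(f"Implement a tool that: {spec}")

    for attempt in range(MAX_ITERATIONS):
        result = run_tests_in_sandbox(implementation, tests)
        if result.passed:
            return implementation  # all tests green: the tool is ready to register
        # Feed the failure output back so the next attempt targets the specific errors.
        implementation = generate_code(
            f"These tests failed with:\n{result.output}\n"
            f"Fix this implementation:\n{implementation}"
        )

    raise RuntimeError(f"Tool for '{spec}' did not pass its tests after {MAX_ITERATIONS} attempts")
```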

The sandbox environment keeps the development process safe by preventing unintended interactions with other systems. We recommend E2B - an open-source runtime for executing AI-generated code in secure cloud sandboxes. Environment secrets can also be managed inside the sandbox, keeping sensitive data like API keys and credentials secure.
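
For illustration, here is one way the run_tests_in_sandbox helper from the sketch above could be built on E2B. Treat this as a sketch against the e2b_code_interpreter Python SDK rather than a definitive implementation: exact class and method names vary between SDK versions, and it assumes the generated tests are plain assert statements that run when executed.

```python
# pip install e2b-code-interpreter   (requires an E2B_API_KEY in your environment)
from dataclasses import dataclass
from e2b_code_interpreter import Sandbox

@dataclass
class TestResult:
    passed: bool
    output: str

def run_tests_in_sandbox(implementation: str, tests: str) -> TestResult:
    """Execute the candidate tool and its tests inside an isolated E2B cloud sandbox,
    so buggy or malicious generated code never runs on the host machine."""
    sandbox = Sandbox()  # fresh, isolated cloud sandbox for each test run
    try:
        execution = sandbox.run_code(implementation + "\n\n" + tests)
        output = "\n".join(execution.logs.stdout + execution.logs.stderr)
        if execution.error:  # any uncaught exception, including a failed assert, counts as a failure
            output += f"\n{execution.error}"
        return TestResult(passed=execution.error is None, output=output)
    finally:
        sandbox.kill()  # always tear the sandbox down, even when the tests crash
```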

Tool evaluations

Tool evals focus on whether a tool effectively completes its intended task, and help agents improve at writing and using new tools. The evaluation process combines:

  • Quantitative metrics (runtime, resource usage, error rates)
  • Task completion (did it solve the problem it was built for?)
  • User feedback (was the solution useful?)
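
For example, each tool run can be logged as a structured evaluation record that carries all three signals (the field names and weights below are illustrative placeholders, not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolEvaluation:
    """One record per tool run: quantitative metrics, task completion, and user feedback."""
    tool_name: str
    runtime_seconds: float
    peak_memory_mb: float
    error_rate: float                  # fraction of runs that raised errors
    task_completed: bool               # did it solve the problem it was built for?
    user_rating: Optional[int] = None  # e.g. 1-5, when a human reviewed the output

def overall_score(ev: ToolEvaluation) -> float:
    """Collapse an evaluation into a single score the agent can learn from."""
    score = 0.6 * float(ev.task_completed) + 0.3 * (1.0 - ev.error_rate)
    if ev.user_rating is not None:
        score += 0.1 * (ev.user_rating / 5.0)
    return score
```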

These evaluations feed into a reinforcement fine-tuning process where human feedback guides the agent's learning. The agent trains a reward model from human evaluations of tool performance, uses this model to optimize its tool-writing capabilities, and continuously improves as new evaluations come in.
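
A heavily simplified illustration of that loop follows. A production setup would fine-tune the underlying model on the reward signal; here, a least-squares linear reward model over the evaluation features stands in for the idea:

```python
import numpy as np

def train_reward_model(features: np.ndarray, human_scores: np.ndarray) -> np.ndarray:
    """Fit a linear reward model mapping evaluation features
    (e.g. runtime, error rate, task completion) to human-assigned scores."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
    weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return weights

def predict_reward(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Score new candidate tools without waiting for a human review."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ weights

# Three evaluated tools: [runtime_seconds, error_rate, task_completed]
features = np.array([[1.2, 0.0, 1.0],
                     [9.5, 0.4, 0.0],
                     [2.1, 0.1, 1.0]])
human_scores = np.array([0.9, 0.1, 0.8])

weights = train_reward_model(features, human_scores)
print(predict_reward(weights, np.array([[1.5, 0.05, 1.0]])))  # estimated reward for a new tool
```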

Effects on human-AI collaboration

Current agent implementations require humans to constantly anticipate and provide for agent needs, creating a dependency that limits both human and agent potential. Agents that write their own tools fundamentally transform this relationship. Instead of spending time developing basic tools, human developers can focus on strategic oversight and direction-setting.

This shift changes how organizations work with AI in several important ways. Humans can now focus on providing high-level guidance rather than handling tactical implementation details. Organizations can scale their AI capabilities without needing to proportionally increase their development resources. Perhaps most significantly, there's a compound learning effect as tools build upon tools, accelerating the development of new capabilities.

As agents become more proficient at tool creation, the human role evolves from tool provider to strategic evaluator and guide. This evolution not only improves efficiency but enables organizations to tackle increasingly complex challenges by leveraging the growing capabilities of their AI systems.

