
CodeAct + Hyperlight: One Code Block, Dozens of Tool Calls
Summary
Collapse agent tool-call loops into one sandboxed Python program and cut latency in half.
Most production agents aren't slow because the model is dumb. They're slow because of the wiring. Every tool call is a separate model turn: the LLM picks a tool, you run it, you send the result back, the LLM picks the next tool. A task that needs forty small lookups burns forty round trips, and you pay for the full conversation history on every single one.
At Build 2026 this week, the Microsoft Agent Framework team put a number on the fix. CodeAct support, shipped in the new agent-framework-hyperlight package, lets the model write one short Python program that calls all of your tools via call_tool(...), runs it once in a sandbox, and returns a consolidated answer. On Microsoft's own benchmark (computing order totals across eight users, dozens of tool calls), the same task with the same model and the same tools dropped from 27.81s to 13.23s and from 6,890 tokens to 2,489. That's a 52.4% latency cut and 63.9% fewer tokens from changing nothing but the wiring.
The part that makes this safe enough to be a default is Hyperlight: each execute_code call runs in a fresh micro-VM that boots in milliseconds, with no host filesystem or network access unless you opt in. This guide walks through wiring CodeAct into a Python agent, builds the order-totals workload end to end, covers the approval model (which has a real gotcha), and shows how to grant the sandbox controlled file and network access.
Prerequisites
- Python 3.10+ on Linux or Windows. The Hyperlight backend doesn't support macOS yet; execute_code raises a clear runtime error on unsupported platforms.
- Microsoft Agent Framework 1.0+ (agent-framework), which hit GA on April 2, 2026.
- An LLM endpoint: Foundry, Azure OpenAI, or OpenAI all work through MAF chat clients.
- Basic familiarity with MAF's Agent and @tool decorator. If you've written any tool-calling agent, you're fine.
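The platform and version requirements above can be checked before installing anything. A minimal sketch, assuming only the constraints listed here (Python 3.10+, Linux or Windows); the helper name check_prerequisites is hypothetical and is not part of any MAF package:

```python
import platform
import sys

SUPPORTED_SYSTEMS = {"Linux", "Windows"}  # per the prerequisites above
MIN_PYTHON = (3, 10)


def check_prerequisites(system: str, python_version: tuple) -> list:
    """Return a list of human-readable problems; an empty list means the host looks OK."""
    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported OS: {system}")
    if tuple(python_version) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )
    return problems


if __name__ == "__main__":
    # platform.system() reports "Darwin" on macOS, which is not in the supported set.
    issues = check_prerequisites(platform.system(), sys.version_info[:2])
    print("ok" if not issues else "\n".join(issues))
```

Running this in CI or at application start gives a clearer failure than waiting for the first execute_code call to raise.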
Step 1: Install the alpha package
CodeAct ships in a separate package, currently alpha, so you need the prerelease flag:
pip install agent-framework-hyperlight --pre
# or with uv:
uv add --prerelease=allow agent-framework-hyperlight
The package depends on agent-framework-core but deliberately installs no connectors. If you don't already have one, add the client for your model provider, e.g. pip install agent-framework-openai --pre or the Foundry equivalent. Verify the install:
python -c "from agent_framework_hyperlight import HyperlightCodeActProvider; print('ok')"
# ok
Step 2: Wire up your first CodeAct agent
The recommended entry point is HyperlightCodeActProvider, a context provider you attach to the agent. It does two things on every run: registers a single execute_code tool, and injects system-prompt instructions that tell the model what's available in the sandbox and which of your tools it can reach through call_tool(...).
import asyncio

from agent_framework import Agent, tool
from agent_framework_hyperlight import HyperlightCodeActProvider
from agent_framework.openai import OpenAIChatClient


@tool
def get_weather(city: str) -> dict[str, float | str]:
    """Return the current weather for a city."""
    return {"city": city, "temperature_c": 21.5, "conditions": "partly cloudy"}


codeact = HyperlightCodeActProvider(
    tools=[get_weather],
    approval_mode="never_require",
)

agent = Agent(
    client=OpenAIChatClient(model_id="gpt-5.5"),
    name="CodeActAgent",
    instructions="You are a helpful assistant.",
    context_providers=[codeact],
)


async def main():
    result = await agent.run(
        "Get the weather for Seattle and Amsterdam and compare them."
    )
    print(result)


asyncio.run(main())
Run it and watch the trace. With traditional wiring this is two tool calls and three model turns. With CodeAct the model emits one execute_code call containing something like:
# Code the model wrote inside execute_code (one turn, one sandbox run)
seattle = call_tool("get_weather", city="Seattle")
amsterdam = call_tool("get_weather", city="Amsterdam")
diff = seattle["temperature_c"] - amsterdam["temperature_c"]
result = {
    "seattle": seattle,
    "amsterdam": amsterdam,
    "temperature_difference_c": diff,
}
result
Example final output: "Seattle and Amsterdam are both 21.5°C and partly cloudy right now — effectively identical conditions, with a temperature difference of 0°C." Same tools, same model, one turn instead of three.
Step 3: Understand the sandbox/host split
This is the mental model that makes everything else in the package make sense. Hyperlight isolates the model-generated code, not your tools. The Python the LLM writes runs inside a fresh micro-VM with its own memory, no host filesystem, and no network. Your tools run on the host, in your application process, with whatever access your process has: files, credentials, internal APIs.
call_tool(...) is the bridge between the two. When sandboxed code calls it, Hyperlight marshals the call out to your runtime, executes your reviewed, deterministic tool there, and returns the result into the sandbox. The untrusted glue code that decides which tools to call and how to combine results stays locked in the VM; the trusted code you shipped keeps the access it needs. Because each execute_code call gets a brand-new guest that boots in milliseconds, there's no shared state between calls and no container daemon to manage.
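The data flow across that bridge can be sketched in plain Python. This is an illustration of the split only, not Hyperlight's implementation: the names HOST_TOOLS, register, and run_guest are hypothetical, and exec() stands in for the micro-VM purely to show which side each piece of code lives on. exec() is not a security boundary.

```python
from typing import Any, Callable

# Host side: trusted tools you wrote, registered by name.
HOST_TOOLS: dict = {}


def register(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Add a host function to the registry the bridge dispatches to."""
    HOST_TOOLS[fn.__name__] = fn
    return fn


@register
def get_weather(city: str) -> dict:
    """Trusted host tool; runs with the host process's privileges."""
    return {"city": city, "temperature_c": 21.5}


def call_tool(name: str, **kwargs: Any) -> Any:
    """The bridge: the only host capability handed to the guest code."""
    if name not in HOST_TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return HOST_TOOLS[name](**kwargs)


def run_guest(code: str) -> Any:
    """Run guest code with call_tool as its only injected name.

    NOTE: exec() offers no isolation; it only mimics the shape of the
    guest/host interaction for illustration.
    """
    namespace = {"call_tool": call_tool}
    exec(code, namespace)
    return namespace.get("result")


# Untrusted glue code, as a model might write it.
GUEST_CODE = '''
seattle = call_tool("get_weather", city="Seattle")
result = seattle["temperature_c"]
'''

print(run_guest(GUEST_CODE))
```

The point of the sketch is the asymmetry: the guest decides what to call and how to combine results, while every side effect happens in a host function you reviewed.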
Step 4: A real workload — order totals across users
Here's a scaled-down, two-user version of the workload Microsoft benchmarks in the repo's codeact_benchmark.py sample: compute the grand total of every user's orders, with discounts and tax. With traditional wiring, the full benchmark explodes into dozens of tool calls (list users, fetch each user's orders, fetch rates, compute each line), every one a separate model turn.
import asyncio

from agent_framework import Agent, tool
from agent_framework.openai import OpenAIChatClient
from agent_framework_hyperlight import HyperlightCodeActProvider

USERS = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
ORDERS = {
    1: [{"sku": "A1", "qty": 3, "unit_price": 19.99},
        {"sku": "B2", "qty": 1, "unit_price": 249.00}],
    2: [{"sku": "C3", "qty": 5, "unit_price": 4.50}],
}


@tool
def list_users() -> list[dict]:
    """Return all users with their ids and names."""
    return USERS


@tool
def get_orders_for_user(user_id: int) -> list[dict]:
    """Return all order lines (sku, qty, unit_price) for a user id."""
    return ORDERS.get(user_id, [])


@tool
def get_discount_rate(user_id: int) -> float:
    """Return the user's discount rate as a fraction, e.g. 0.10."""
    return 0.10 if user_id == 1 else 0.0


@tool
def get_tax_rate() -> float:
    """Return the sales tax rate as a fraction."""
    return 0.21


@tool
def compute_line_total(qty: int, unit_price: float,
                       discount: float, tax: float) -> float:
    """Compute qty * unit_price with discount, then tax applied."""
    return round(qty * unit_price * (1 - discount) * (1 + tax), 2)


codeact = HyperlightCodeActProvider(
    tools=[list_users, get_orders_for_user, get_discount_rate,
           get_tax_rate, compute_line_total],
    approval_mode="never_require",
)

agent = Agent(
    # Same client as Step 2; any MAF chat client works here.
    client=OpenAIChatClient(model_id="gpt-5.5"),
    name="OrderTotalsAgent",
    instructions="Answer precisely. Use tools for all data and math.",
    context_providers=[codeact],
)


async def main():
    result = await agent.run(
        "Compute the grand total of every user's orders, "
        "applying their discount and the tax rate. Show per-user totals."
    )
    print(result)


asyncio.run(main())
The model's sandboxed program loops over users and order lines, calling the same five tools — but the whole plan executes in one turn. Example output:
Per-user totals (10% discount for Ada, 21% tax):
- Ada: (3 x 19.99 + 1 x 249.00) * 0.90 * 1.21 = 336.47
- Grace: (5 x 4.50) * 1.00 * 1.21 = 27.23
Grand total: 363.70
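For a sense of what "one turn" means in practice, the program the model writes for this task might look roughly like the lower half of the sketch below. Everything here is a stand-in: call_tool is a local stub that counts calls rather than the real sandbox bridge, and the stub data is hypothetical, with prices and rates chosen so the float arithmetic is exact. The actual generated code will vary by model and prompt.

```python
from collections import Counter

# Hypothetical stub data (not the guide's dataset), chosen to be float-exact.
_USERS = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
_ORDERS = {
    1: [{"qty": 2, "unit_price": 10.0}, {"qty": 1, "unit_price": 4.0}],
    2: [{"qty": 4, "unit_price": 2.5}],
}
_STUBS = {
    "list_users": lambda: _USERS,
    "get_orders_for_user": lambda user_id: _ORDERS.get(user_id, []),
    "get_discount_rate": lambda user_id: 0.5 if user_id == 1 else 0.0,
    "get_tax_rate": lambda: 0.25,
    "compute_line_total": lambda qty, unit_price, discount, tax: round(
        qty * unit_price * (1 - discount) * (1 + tax), 2
    ),
}
calls = Counter()


def call_tool(name, **kwargs):
    """Stand-in for the sandbox bridge: dispatch to a stub and count the call."""
    calls[name] += 1
    return _STUBS[name](**kwargs)


# --- what a model-written program might look like ---
tax = call_tool("get_tax_rate")
totals = {}
for user in call_tool("list_users"):
    discount = call_tool("get_discount_rate", user_id=user["id"])
    order_lines = call_tool("get_orders_for_user", user_id=user["id"])
    totals[user["name"]] = sum(
        call_tool("compute_line_total", qty=line["qty"],
                  unit_price=line["unit_price"], discount=discount, tax=tax)
        for line in order_lines
    )
result = {"per_user": totals, "grand_total": sum(totals.values())}

print(result)
print("tool calls in one execution:", sum(calls.values()))
```

Even with two users and three order lines, the program makes nine tool calls. Under traditional wiring each of those would be its own model turn; here they all happen inside a single execute_code run.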
On Microsoft's published run of the full benchmark (eight users, the same five tools, same prompt, same structured output schema, only the wiring changed):
| Wiring | Time | Tokens |
|---|---|---|
| Traditional tool-calling | 27.81s | 6,890 |
| CodeAct | 13.23s | 2,489 |
| Improvement | 52.4% | 63.9% |
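The improvement row follows directly from the two measured rows. A short check, using only the figures in the table (the function name percent_reduction is just a label for this sketch):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`, to one decimal place."""
    return round((before - after) / before * 100, 1)


latency = percent_reduction(27.81, 13.23)  # seconds
tokens = percent_reduction(6890, 2489)     # total tokens

print(f"latency: {latency}%  tokens: {tokens}%")
```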
The savings come from two places: fewer model turns (one instead of dozens) and not re-sending the growing conversation history on every step. The reasoning trace also gets better, not worse — the full plan lives in one auditable code block instead of being smeared across forty tool-call messages.
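The history effect is easy to underestimate because it grows with the square of the number of calls. A toy model, assuming a fixed prompt size and a fixed size per tool call/result pair (both numbers are hypothetical, and this is not Microsoft's measurement methodology; output tokens are ignored):

```python
def input_tokens(n_tool_calls: int, prompt: int, per_step: int) -> int:
    """Input tokens when every tool call is its own model turn.

    Turn k (0-based) re-sends the prompt plus the k call/result pairs
    accumulated so far. There are n_tool_calls + 1 turns: one per call,
    plus the final answer.
    """
    return sum(prompt + k * per_step for k in range(n_tool_calls + 1))


PROMPT = 500    # hypothetical: system prompt + user message
PER_STEP = 60   # hypothetical: one tool call plus its result

# CodeAct is modeled as a single tool call (execute_code).
codeact = input_tokens(1, PROMPT, PER_STEP)

for n in (2, 10, 40):
    traditional = input_tokens(n, PROMPT, PER_STEP)
    print(f"{n:>2} calls: traditional={traditional:>6}  codeact={codeact}")
```

In this model the traditional cost is roughly proportional to n squared once n is large, while the CodeAct cost stays flat because the intermediate results never leave the sandbox.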
Step 5: Mix in approval-gated tools the right way
Where you register a tool decides how it's gated. Tools passed to the provider are invisible to the model as direct tools; they're only reachable via call_tool(...) inside the sandbox, and approval applies to the whole execute_code block, not to individual calls inside it. Tools passed to Agent(tools=...) are first-class: each call is its own model turn and honors its own approval_mode.
@tool(approval_mode="always_require")
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email. Requires human approval on every call."""
    ...


agent = Agent(
    client=client,
    name="MixedToolsAgent",
    instructions="You are a helpful assistant.",
    context_providers=[codeact],  # cheap, safe, chainable tools live here
    tools=[send_email],           # side-effecting tools stay direct + gated
)
The rule of thumb from the MAF team: if a tool is cheap, pure, and safe to chain (lookups, computation, formatting, read-only APIs), register it on the provider so it can be composed into one turn. If it has side effects a human should gate individually (email, payments, production writes), keep it on the agent with always_require. You can also register the same tool in both places — the model then picks per step whether to call it directly (its own approval mode applies) or via call_tool (the execute_code approval mode applies).
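The placement rule and the block-level gating can be written down as two small functions. This sketch models the behavior as described in this guide, not the package's internals; ToolSpec, its has_side_effects flag, place_tool, and block_needs_approval are all hypothetical names you would maintain yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    has_side_effects: bool  # hypothetical flag: email, payments, prod writes


def place_tool(spec: ToolSpec) -> str:
    """Rule of thumb: chainable tools go on the provider, gated ones on the agent."""
    return "agent" if spec.has_side_effects else "provider"


def block_needs_approval(execute_code_mode: str, provider_tool_modes: list) -> bool:
    """The whole code block is gated if execute_code or ANY provider tool
    requires approval, regardless of which tools the code actually calls."""
    return "always_require" in [execute_code_mode, *provider_tool_modes]


specs = [ToolSpec("get_weather", False), ToolSpec("send_email", True)]
placement = {s.name: place_tool(s) for s in specs}
print(placement)

# One approval-gated tool on the provider gates every code block.
print(block_needs_approval("never_require", ["never_require", "always_require"]))
```

The second function is the reason for the first: putting a single always_require tool on the provider turns every execute_code call into an approval prompt.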
Step 6: Grant controlled file and network access
By default the sandbox can't touch the host at all. When the model-written code itself needs to read a dataset or hit an API, opt in explicitly:
codeact = HyperlightCodeActProvider(
    tools=[get_weather],
    approval_mode="never_require",
    file_mounts=[
        "/host/data",                          # same path inside the sandbox
        ("/host/models", "/sandbox/models"),   # host -> sandbox mapping
    ],
    allowed_domains=[
        "api.github.com",                      # all methods
        ("internal.api.example.com", "GET"),   # GET only
    ],
)
Mounted paths are advertised in the generated CodeAct instructions so the model knows where to read and write artifacts, and the domain allow-list is enforced at the sandbox boundary, not by convention. One important nuance: your tools always run on the host and are never constrained by these settings. If the model needs data outside the mounts, the recommended move is usually not to widen the sandbox — it's to expose a narrow host tool that does exactly that one operation, and keep the sandbox locked down.
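To make the allow-list semantics concrete, here is one way the two entry shapes (a bare domain, or a domain paired with a method) could be evaluated. This is an illustrative model only: is_request_allowed is a hypothetical helper, exact host matching with no subdomain wildcards is an assumption, and the real enforcement happens inside Hyperlight at the sandbox boundary.

```python
from urllib.parse import urlsplit

ALLOWED = [
    "api.github.com",                     # all methods
    ("internal.api.example.com", "GET"),  # GET only
]


def is_request_allowed(url: str, method: str, allowed: list) -> bool:
    """Return True if (host, method) matches any allow-list entry.

    Assumption: hosts must match exactly; a bare string allows every method.
    """
    host = urlsplit(url).hostname or ""
    for entry in allowed:
        if isinstance(entry, str):
            domain, allowed_method = entry, None
        else:
            domain, allowed_method = entry
        if host == domain and (
            allowed_method is None or method.upper() == allowed_method.upper()
        ):
            return True
    return False


print(is_request_allowed("https://api.github.com/repos", "POST", ALLOWED))
print(is_request_allowed("https://internal.api.example.com/v1", "POST", ALLOWED))
print(is_request_allowed("https://example.org/", "GET", ALLOWED))
```

A model like this is also handy in unit tests for your own configuration: you can assert that a method you never meant to allow is in fact rejected by the list you wrote.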
Step 7 (optional): The standalone execute_code tool
If you want full control — custom labeling, middleware wrapping, or a fully static agent definition — skip the provider and use HyperlightExecuteCodeTool directly. The catch: the provider injects the CodeAct instructions into the system prompt on every run; with the standalone tool that's your job, or the model sees a bare "run Python" tool with no idea which tools live inside the sandbox. build_instructions() generates exactly that prompt fragment once, at construction time:
from agent_framework_hyperlight import HyperlightExecuteCodeTool

execute_code = HyperlightExecuteCodeTool(
    tools=[get_weather],
    approval_mode="never_require",
)

agent = Agent(
    client=client,
    name="StandaloneToolAgent",
    instructions=(
        "You are a helpful assistant.\n\n"
        + execute_code.build_instructions()
    ),
    tools=[execute_code, send_email],
)
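To see why that prompt fragment matters, it helps to picture what kind of text the model needs. The sketch below hand-rolls a comparable fragment from function signatures and docstrings. It is not the output of build_instructions(), whose exact format this guide does not show; render_tool_instructions is a hypothetical helper written only for illustration.

```python
import inspect


def render_tool_instructions(tools: list) -> str:
    """Build a plain-text list of callable tools from signatures and docstrings."""
    lines = [
        "Inside execute_code you can call these tools via call_tool(name, **kwargs):"
    ]
    for fn in tools:
        # Use the first docstring line as the description, if there is one.
        doc = (inspect.getdoc(fn) or "No description.").splitlines()[0]
        lines.append(f"- {fn.__name__}{inspect.signature(fn)}: {doc}")
    return "\n".join(lines)


def get_weather(city: str) -> dict:
    """Return the current weather for a city."""
    return {"city": city, "temperature_c": 21.5}


def mystery(x):
    return x


print(render_tool_instructions([get_weather, mystery]))
```

The contrast between the two rendered lines is the practical argument for type hints and docstrings: an undocumented tool gives the model a bare name and an untyped parameter to guess at.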
Common pitfalls
- Approval granularity surprises people. If execute_code or any tool registered on the provider requires approval, the entire code block is gated behind a single prompt, even for tools the generated code never invokes. Per-call gating inside a block doesn't exist yet; keep individually-gated operations as direct agent tools.
- Weak docstrings hurt CodeAct more than normal tool-calling. The model writes Python that calls your tools by name, so docstrings, parameter annotations, and return-type hints are the contract it reasons about. Models are heavily tuned for direct tool calls; they get less help composing yours into code. Type-hint everything.
- Your tools are the security boundary, not the sandbox. Hyperlight contains the model's code, but call_tool executes your tool on the host with full process privileges. A provider-registered tool that writes files or spends money is now reachable from unattended generated code. Audit what you hand the provider.
- Don't CodeAct everything. If your agent makes one or two tool calls per turn, there's almost no overhead to collapse; you're adding an alpha dependency and a code-generation step for nothing. Benchmark your own tool set before committing; the repo's codeact_benchmark.py is a ready-made template.
- Platform limits are real right now. Linux and Windows only (macOS on the way), Python guest only (a .NET counterpart and other guest languages are planned), and the package is alpha, so pin your version and expect API movement.
- Don't put tools needing fresh per-call context on the provider. Results flow between call_tool calls as plain Python values inside the sandbox. If a tool's output is huge, all of it lives in the program's memory and whatever the model returns gets serialized back, so summarize inside the generated code rather than returning raw dumps.
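The last point, summarizing inside the generated code, might look like the following. The data is a hypothetical stand-in for a large tool result, and summarize_rows is an illustrative helper rather than anything the package provides; the idea is simply that only the small dictionary leaves the sandbox.

```python
def summarize_rows(rows: list, key: str, top_n: int = 3) -> dict:
    """Reduce a large tool result to a count, a total, and the top-N rows by `key`."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=True)
    return {
        "count": len(rows),
        "total": sum(row[key] for row in rows),
        "top": ordered[:top_n],
    }


# Stand-in for a large call_tool(...) result (hypothetical data).
rows = [{"sku": f"S{i}", "amount": i} for i in range(1000)]

# Return the summary, not the 1,000 raw rows.
result = summarize_rows(rows, "amount")
print(result)
```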
Quick reference
| What | How |
|---|---|
| Install | pip install agent-framework-hyperlight --pre |
| Main entry point | HyperlightCodeActProvider(tools=[...]) via context_providers |
| Sandboxed tool calls | call_tool("name", ...) inside model-written Python |
| Approval modes | never_require or always_require (gates whole code block) |
| Side-effecting tools | Register on Agent(tools=...) with always_require |
| File access | file_mounts=[...], each entry "/path" or (host, sandbox) |
| Network access | allowed_domains=[...], each entry "domain" or (domain, method) |
| Static wiring | HyperlightExecuteCodeTool + build_instructions() |
| Platforms | Linux + Windows; Python guest; alpha |
Next steps
Run the official benchmark against your own tool set — swap the five demo tools in codeact_benchmark.py for yours and see whether your workload is chainable enough to win. From there, the natural follow-ons are the Agent Harness (which adds context compaction, todo tracking, and skills on top of any chat client) and Foundry Hosted Agents for deploying the result with scale-to-zero and per-session VM isolation — both also went big at Build 2026. The CodeAct pattern itself comes from Wang et al., 2024 (arXiv:2402.01030) if you want the research grounding, and the MAF team is collecting alpha feedback on approval ergonomics and non-Python guests in the GitHub discussions.
Sources: Microsoft Agent Framework BUILD 2026 announcements and the CodeAct with Hyperlight deep-dive on the official MAF devblog, the microsoft/agent-framework repository, and the Hyperlight project.