Easy Agentic Tool Calling with Gemma 4

In this tutorial, we will give Gemma 4 two new tools and watch the model decide, on its own, when to look around and when to compute.

By Matthew Mayo, KDnuggets Managing Editor on May 22, 2026 in Language Models

# Introduction

In a recent article on Machine Learning Mastery, we built a tool-calling agent that reached outward, that is pulling weather, news, currency rates, and time from public APIs. That article covered the synthesis half of the pattern nicely, but it left the more interesting half on the table: an agent that reasons about its own environment, inspects its own machine, and offloads logic it doesn't trust itself to perform. It could be argued that this is closer to truly "agentic."

This article picks up where that one left off. We will give Gemma 4 two new tools — a sandboxed local filesystem explorer and a restricted Python interpreter — and watch the model decide, on its own, when to look around and when to compute.

Topics we will cover include:

Why "agentic" tool calling needs more than web APIs to be interesting
How to build a filesystem inspection tool with hard path-traversal guards
How to wire a Python interpreter tool to the model without handing it the keys to your machine
How the same orchestration loop from before generalizes to these new capabilities

I highly recommend that you first read this article before continuing on.

# From Conversation to Agency

When the only tools you give a language model are read-only web APIs, essentially you still really have a chatbot, albeit one with potential access to better information. The model receives a prompt, decides which API to ping, and stitches the JSON response into a paragraph. There is no real notion of environment, no state to inspect, no consequence to reason about; it's a scenario more akin to retrieval augmented generation than true agency.

Agency, in the practical sense practitioners use the word, shows up when a model starts interacting with the system it is running on. That can mean reading from a local filesystem, executing code, modifying files, calling other processes, or any combination of those. The moment a tool can do something other than return a clean string from a remote service, the model has to start asking about itself: what files exist, what does this number actually equal, what is in this folder before I claim it contains anything.

The Gemma 4 family, and specifically the gemma4:e2b edge variant we have been using, is small enough to run locally on a laptop while being competent enough at structured output to drive this kind of loop reliably. That combination is what makes the local-agentic pattern interesting in the first place. The complete code for this tutorial can be found here.

# The Architectural Reuse

The orchestration loop from the previous tutorial does not change. We define Python functions, expose them via JSON schema, pass the registry to Ollama alongside the user prompt, intercept any tool_calls block on the response, execute the requested function locally, append the result as a tool-role message, and re-query the model so it can synthesize a final answer. The same call_ollama helper, the same TOOL_FUNCTIONS dictionary, the same available_tools schema array from the previous tutorial all make appearances.

What changes is the nature of the tools themselves. Where the previous batch were all thin clients over remote APIs, those we will build now both run code on the machine. That shifts the design problem from "how do I parse this response" to "how do I make sure the model cannot, even accidentally, do something it should not be allowed to do."

# Tool 1: A Sandboxed Filesystem Explorer

The first tool, list_directory_contents, gives the model the ability to see what files exist in a given folder. This sounds trivial until you remember that os.listdir accepts any string, including /, ~, and ../../etc. A naive implementation could happily walk the model's "curiosity" straight to your API keys.

The design choice here is to pin a safe base directory at script start and reject any request that resolves outside of it:

# Security: confine list_directory_contents to this base directory and its descendants
# Set to the current working directory when the script starts
SAFE_BASE_DIR = os.path.abspath(os.getcwd())

def list_directory_contents(path: str = ".") -> str:
    """Lists files and directories within a path, constrained to the safe base directory."""
    try:
        # Resolve to an absolute path and verify it sits inside SAFE_BASE_DIR
        # This blocks traversal attempts like '../../etc' or absolute paths like '/'
        requested = os.path.abspath(os.path.join(SAFE_BASE_DIR, path))
        if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
            return (
                f"Error: Access denied. The path '{path}' resolves outside the "
                f"permitted workspace ({SAFE_BASE_DIR})."
            )
        ...

The pattern is simple but worth considering further. We never trust the string the model produced. We join it onto the base directory, resolve it absolutely (so .. gets normalized away), and then verify the resolved path still starts with the base. Both /etc/passwd and ../../somewhere collapse into paths that fail that prefix check and are rejected before os.listdir is ever called.

The rest of the function is housekeeping: confirm the path exists and is a directory, list its contents, and format each entry as either [DIR] or [FILE] with a byte size. The returned string is plain English with structure the model can parse on the second pass:

        entries = sorted(os.listdir(requested))
        if not entries:
            return f"The directory '{path}' is empty."

        lines = [f"Contents of '{path}' ({len(entries)} item(s)):"]
        for name in entries:
            full = os.path.join(requested, name)
            if os.path.isdir(full):
                lines.append(f"  [DIR]  {name}/")
            else:
                try:
                    size = os.path.getsize(full)
                    lines.append(f"  [FILE] {name} ({size} bytes)")
                except OSError:
                    lines.append(f"  [FILE] {name}")
        return "\n".join(lines)

The JSON schema we hand to the model is deliberately permissive on the parameter side — path is optional, defaulting to the workspace root, because most useful first questions are about the current folder:

{
    "type": "function",
    "function": {
        "name": "list_directory_contents",
        "description": (
            "Lists files and subdirectories inside a path within the user's workspace. "
            "Use this to inspect the environment before answering questions about local files."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": (
                        "A relative path inside the workspace, e.g. '.', 'data', or 'src/utils'. "
                        "Defaults to the workspace root."
                    )
                }
            },
            "required": []
        }
    }
}

Note the description does a small amount of prompt engineering: "Use this to inspect the environment before answering questions about local files." That sentence pushes Gemma 4 toward calling the tool when the user asks a vague question about "my files" rather than guessing at what might be there.

# Tool 2: A Restricted Python Interpreter

The second tool, execute_python_code, is the more dangerous and the more pedagogically interesting of the two. The premise is that language models, especially small ones, are unreliable at precise arithmetic, exact string manipulation, and anything involving more than a couple of steps of branching logic. A tool that lets the model write and run a deterministic snippet is a much better answer to those problems than asking it to reason through them in natural language.

The implementation uses exec() with a deliberately stripped-down builtins namespace:

def execute_python_code(code: str) -> str:
    """Executes a snippet of Python code and returns whatever was printed to stdout.

    This is a learning-only sandbox. exec() is fundamentally unsafe; do not expose this tool
    to untrusted users or networks. The restrictions below stop the casual cases, not a 
    determined attacker.
    """
    try:
        # A minimal restricted environment. We strip __builtins__ down to a small
        # whitelist so that, e.g., open(), eval(), and __import__ are not directly
        # available from the snippet's global scope.
        safe_builtins = {
            "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
            "divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
            "int": int, "len": len, "list": list, "map": map, "max": max, "min": min,
            "pow": pow, "print": print, "range": range, "repr": repr, "reversed": reversed,
            "round": round, "set": set, "sorted": sorted, "str": str, "sum": sum,
            "tuple": tuple, "zip": zip,
        }
        # Pre-import a couple of safe, useful modules so the model doesn't have to.
        import math, statistics
        restricted_globals = {
            "__builtins__": safe_builtins,
            "math": math,
            "statistics": statistics,
        }

A few decisions worth calling out. We replace __builtins__ entirely rather than blacklisting individual functions, which means open, eval, exec, compile, __import__, input, and anything else not in our whitelist simply does not exist inside the snippet. We pre-import math and statistics into the snippet's globals because the model will reach for them constantly and we would rather not force it to fight __import__ restrictions. We capture stdout with contextlib.redirect_stdout so the model gets back exactly what its snippet printed:

        # Capture stdout so we can hand the printed output back to the model
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, restricted_globals, {})

        output = buffer.getvalue().strip()
        if not output:
            return "Code executed successfully but produced no output. Use print() to return a value."
        return f"Output:\n{output}"

The empty-output branch matters more than it looks. Small models will routinely write expressions like x = sum(range(101)) and forget the print(x). Returning a specific error telling them to use print() gives the orchestration loop the option to retry; without it, the model would synthesize a final answer based on an empty string and confidently invent a value.

A final word on safety, since the script's docstring is blunt about it: this is a learning sandbox, not a hardened one. A determined adversary can break out of a Python exec sandbox in a dozen ways, most of them involving object introspection through ().__class__.__mro__. For a single-user agent running on your own laptop on your own prompts, the whitelist is plenty. For anything else, you would want a real isolation layer — a subprocess with seccomp, a container, or RestrictedPython.

# The Orchestration Loop

The main loop is unchanged in structure from the previous tutorial. The model is queried with the user prompt and the tool registry, and if it responds with tool_calls, each call is dispatched against TOOL_FUNCTIONS:

if "tool_calls" in message and message["tool_calls"]:
    print("[TOOL EXECUTION]")
    messages.append(message)

    num_tools = len(message["tool_calls"])
    for i, tool_call in enumerate(message["tool_calls"]):
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        ...
        if function_name in TOOL_FUNCTIONS:
            func = TOOL_FUNCTIONS[function_name]
            try:
                result = func(**arguments)
                ...
                messages.append({
                    "role": "tool",
                    "content": str(result),
                    "name": function_name
                })

The CLI formatting is worth a small tweak for this script. The execute_python_code tool's code argument can be a multi-line string with newlines in it, which will wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the display only; the model still receives the full string when the function runs:

def _short(v):
    if isinstance(v, str):
        flat = v.replace("\n", "\\n")
        if len(flat) > 60:
            flat = flat[:57] + "..."
        return f"'{flat}'"
    return str(v)

args_str = ", ".join(f"{k}={_short(v)}" for k, v in arguments.items())

Once each tool result is appended back into the message history as a "role": "tool" entry, we re-call Ollama with the enriched payload and the model produces its grounded final answer. Same two-pass pattern, same logic.

# Testing the Tools

And now we test our tool calling. Pull gemma4:e2b with ollama pull gemma4:e2b if you have not already, then run the script from a folder you do not mind the model peeking at.

Let's start with the filesystem tool. From the project directory:

What scripts are in my current folder, and which one looks like it should be used to process CSVs?

Result:

[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
What scripts are in my current folder, and which one looks like it should be used to process CSVs?

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  └── Calling: list_directory_contents
     ├─ Args: path='.'
     └─ Result: Contents of '.' (5 item(s)):
                  [FILE] README.md (412 bytes)
                  [FILE] csv_cleaner.py (1834 bytes)
                  [FILE] main.py (10786 bytes)
                  [FILE] notes.txt (88 bytes)
                  [FILE] sales_report.py (2210 bytes)

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
Your current folder contains five files. The one that looks intended for CSV
processing is csv_cleaner.py — its name strongly suggests it handles CSV input.
sales_report.py may also touch CSV data, but its name is more about output than
ingestion.

The model called the tool, looked at the actual filenames, and made a reasonable inference grounded in the listing rather than in its weights. That is the difference between hallucination and observation.

Next, the Python interpreter. A small task that small models reliably get wrong if asked to do it in their head:

What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

Result:

[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  └── Calling: execute_python_code
     ├─ Args: code='nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]\nprint(rou...'
     └─ Result: Output:
                11.4659

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
The standard deviation of those numbers, rounded to four decimal places, is 11.4659.

The model offloaded the calculation entirely; it wrote a snippet, called statistics.stdev, rounded the result, and reported what the interpreter said. No mental arithmetic, no approximation, no fabricated significant digits.

Finally, the more interesting case: a prompt that requires both tools in sequence. The model has to inspect the folder and compute something about what it finds:

Look at the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

Output:


[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
Look at the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  ┌── Calling: list_directory_contents
  │  ├─ Args: path='.'
  │  └─ Result: Contents of '.' (5 item(s)):
  │              [FILE] README.md (412 bytes)
  │              [FILE] csv_cleaner.py (1834 bytes)
  │              [FILE] main.py (10786 bytes)
  │              [FILE] notes.txt (88 bytes)
  │              [FILE] sales_report.py (2210 bytes)
  │
  └── Calling: execute_python_code
     ├─ Args: code='sizes = [412, 1834, 10786, 88, 2210]\nprint(round(sum(siz...'
     └─ Result: Output:
                15.33

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
The five files in the current folder total 15.33 KB.

Two tools, in the right order, with the output of one feeding the argument of the other — produced by a 2-billion-parameter model running on a laptop with no GPU. The filesystem tool grounds the model in what is actually there; the interpreter tool grounds the answer in what is actually true. The model contributes the part it is genuinely good at, which is deciding which question to ask of which tool.

It is worth poking at the safety guards too, just to confirm they hold. Asking the model "list the contents of /etc" produces the expected denial message in the tool result, which the model then reports back gracefully rather than fabricating a directory listing. Asking it to run open('/etc/passwd').read() inside the interpreter produces a NameError, since open is not in the whitelisted builtins. Both failures degrade into useful error strings instead of silent compromises, which is exactly what you want at this layer.

# Conclusion

The earlier tutorial showed that Gemma 4 can reach across the internet on your behalf. This one shows it can reach into the machine you are sitting at, carefully, when you have built the carefulness in. Once you have a working tool-calling loop, the interesting question stops being "can the model call a function" and starts being "what should I let it touch."

A filesystem-aware tool and a code-execution tool together get you most of the way to something that genuinely earns the term agent: it can observe its environment, decide what calculation matters, and run that calculation deterministically rather than guessing. The pattern generalizes from there. Database queries, shell commands, git operations, document parsing; each one of these is the same JSON schema, the same dispatch table, the same two-pass synthesis, with whatever safety perimeter is appropriate for the blast radius of the underlying call.

Build the perimeter first. Then hand the model the keys to whatever sits inside it.

Matthew Mayo (@mattmayo13) holds a master's degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.