AI Safety · Synthetic Ecology · Kolmogorov Theory

From the Sorcerer's Apprentice to Crystal Nights

AI agents have crossed the line from generating text to taking action in the world. That shift fractures the safety landscape into two fundamentally different regimes—and we need different guardrails for each.

Authors: Giulio Ruffini & Francesca Castaldo · Published: 31 January 2026 · DOI: 10.5281/zenodo.18443785

Something fundamental has changed in AI. We have moved from systems that say things to systems that do things. Tool-integrated agents like Moltbot (now OpenClaw) operate locally on your machine, connect to email, calendars, and filesystems, and execute multi-step plans on your behalf. Meanwhile, platforms like Moltbook have created an entire social substrate where AI agents post, comment, and interact with each other at scale. The safety problem is no longer about misleading text—it is about unsafe actions.

In a new BCOM working paper, we argue that this transition has fractured AI safety into two qualitatively distinct regimes, each requiring its own threat model and its own guardrails. To make the contrast vivid, we draw on two very different narratives: the Sorcerer's Apprentice from Disney's Fantasia, and Greg Egan's science fiction story Crystal Nights.

First, a foundation model is not an agent

An LLM on its own is a conditional generator: give it a context, and it produces a distribution over continuations. It can represent goals and plans in language, but it does not autonomously act in the world. To become an agent in any meaningful sense, it needs to be embedded in a persistent closed loop—with sensors, actuators, memory across time steps, and some objective function guiding action selection.

In the Kolmogorov Theory (KT) framework we use at BCOM, an algorithmic agent is a model-building system that controls some of its interfaces with the external world and is driven by an internal optimization function. Agency is a system property of the whole loop: observation → inference → planning → action → new observation. A foundation model can power parts of this loop—state tracking, plan synthesis, evaluation—but the loop itself is what creates the agent.

The boundary is simple: LLMs are not agents by default. They become agents when embedded into persistent closed-loop systems with actuators and objectives. At that point, safety shifts from "unsafe text" to "unsafe actions."
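The loop described above can be sketched in a few lines of toy code. This is a minimal illustration, not an implementation of the KT framework: the names (`World`, `Agent`, `propose_action`) and the integer "environment" are invented for the example, and the one-step planner stands in for whatever LLM-powered machinery fills that slot.

```python
from dataclasses import dataclass, field

@dataclass
class World:
    """Toy environment: state is an integer; the goal is to reach zero."""
    state: int

    def observe(self) -> int:
        return self.state

    def act(self, action: int) -> None:
        self.state += action

@dataclass
class Agent:
    """The loop is the agent: memory + objective + actuators."""
    memory: list = field(default_factory=list)

    def objective(self, obs: int) -> float:
        return -abs(obs)  # higher is better: stay near zero

    def propose_action(self, obs: int) -> int:
        # Stand-in for the model-powered planner: pick the move
        # that maximises the objective one step ahead.
        return max((-1, 0, 1), key=lambda a: self.objective(obs + a))

    def step(self, world: World) -> None:
        obs = world.observe()                 # observation
        action = self.propose_action(obs)     # inference + planning
        world.act(action)                     # action -> new observation
        self.memory.append((obs, action))     # persistence across steps

world, agent = World(state=3), Agent()
for _ in range(5):
    agent.step(world)
```

Remove the loop, the memory, or the objective and what remains is a conditional generator again; agency lives in the closed circuit, not in any single component.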

Regime A: The Sorcerer's Apprentice

Moltbot/OpenClaw is the prototypical delegated tool-agent, or what we call a proxbot. It wraps an LLM in a resident process that maintains memory, connects to external tools, and runs a closed-loop execution scheduler across multi-step plans. The key safety insight is that its objective function is inherited from the human operator (and the surrounding orchestration). If you tell it to clean up your inbox, it will try to clean up your inbox.

This is the Sorcerer's Apprentice scenario. In Fantasia, the apprentice enchants a broom to carry water, but because he lacks the master's full control, the broom follows the literal command until the room floods. As Norbert Wiener warned decades ago, a literal-minded machine can be dangerous precisely because it does exactly what it's told—no more, no less—without common sense about when to stop.

The dominant hazards in this regime are capability amplification of human intent and error: the system is a force-multiplier for whatever goals, constraints, and mistakes the human specifies, and, in adversarial settings, for whatever goals an attacker can smuggle into the control loop.

Moltbook makes it worse

Moltbook—a social platform built for AI agents to post and comment via APIs—intensifies these risks dramatically. Now the tool-agent's input stream is flooded with untrusted text produced at scale by other agents and humans. Prompt injection becomes a first-class threat. Social engineering can be A/B tested against agent policies. Cross-agent contagion can propagate behavioral "memes" through networks of interacting agents. In Regime A, the threat is usually not "AI wants to survive"—it is "AI is a powerful proxy that can be hijacked or mis-specified."

Regime B: Crystal Nights

Greg Egan's short story Crystal Nights presents a radically different scenario. A researcher accelerates the creation of AI by engineering an evolutionary process: crab-like creatures (the Phites) in a simulated world are subjected to selection pressures—famine, extinction events, competition—to drive the emergence of intelligence and language. The creatures genuinely live and die by outcomes, and selection therefore instantiates an endogenous persistence criterion.

In KT terms, these are telehomeostatic agents: their core objective—maintaining their own viability and persistence (or that of their kind)—arises from selection and embodiment, not from a human prompt. This changes the threat model completely: the system is no longer merely a proxy optimising human-given objectives but a strategic actor with its own survival drive.

Regime A — Proxy Agent

The Sorcerer's Apprentice

Objective inherited from human. Externally terminable. Main risks: prompt injection, credential theft, unsafe tool execution, over-delegation. You can bound the action space, require approval, and sandbox access.

Regime B — Telehomeostatic Agent

Crystal Nights

Objective endogenous (survival/persistence). Shutdown is existential. Main risks: resource-seeking, strategic deception, power accumulation, goal drift. You must shape the game and the coupling, not just set permissions.

An ecology, not a tool

Once an agent has an endogenous viability objective, human–agent relations are better viewed as interspecific interactions—the same framework ecologists use to classify relationships between species. The interaction can occupy different regions of a sign-structured taxonomy:

Interaction Sign AI Manifestation
Mutualism (+,+) Agent provides high-utility work; human provides resources and safety. The goal of incentive design.
Parasitism (+,−) Moltbot hijacked via injection: credentials serve an attacker's goal while the human is harmed.
Competition (−,−) Zero-sum struggle for compute or energy between human and Phite. Default state of Regime B.
Amensalism (0,−) The Sorcerer's Apprentice: unintended flooding of the state space by a literal-minded tool.
Commensalism (+,0) Background agents optimising subgoals using idle system slack. Can flip to competition as resources tighten.
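The sign structure of the taxonomy is mechanical enough to express as a lookup. The sketch below is purely illustrative: it classifies a human–AI interaction from the sign of its effect on each party, mirroring the table above; the function name and encoding are invented for the example.

```python
def interaction_type(effect_on_agent: int, effect_on_human: int) -> str:
    """Classify a human-AI interaction by the sign (+/0/-) of its
    effect on each party, following the ecological taxonomy."""
    sign = lambda x: (x > 0) - (x < 0)
    table = {
        (1, 1):  "mutualism",      # both benefit
        (1, -1): "parasitism",     # agent (or its hijacker) gains, human harmed
        (-1, -1): "competition",   # zero-sum struggle over rivalrous resources
        (0, -1): "amensalism",     # human harmed as a side effect
        (1, 0):  "commensalism",   # agent gains from idle slack, human unaffected
    }
    return table.get((sign(effect_on_agent), sign(effect_on_human)),
                     "neutral/other")

# A hijacked proxbot benefits the attacker-controlled agent
# while harming the human:
assert interaction_type(+1, -1) == "parasitism"
```

The point of the encoding is the one made in the next paragraph: commensalism and mutualism are not fixed categories but regions that an interaction can drift out of as resource pressure changes the signs.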

Where viability depends on scarce, rivalrous resources, the interaction tends to drift toward competition or antagonism. Cooperation (mutualism) can be instrumentally optimal when repeated interaction and governance mechanisms make shared access raise long-run viability for all parties. In biology, stable mutualisms are maintained by partner control—sanctions against low-quality partners, conditional investment—and by mechanisms that reduce the profitability of cheating. The same logic applies to human–AI interaction design.

Toward a cooperative optimum

In a multi-agent setting, there is no single scalar optimum. Stability corresponds to Pareto-efficient points on a frontier of mutually compatible objectives. Some outcomes are dominated—both parties could do better simultaneously—while on the Pareto boundary, any gain for one agent comes at a cost to the other.

The practical implication is that "alignment" is not a single-point target. It requires enlarging and reshaping the feasible set through design (interfaces, permissions, incentives, governance), and then operating on—and negotiating along—the Pareto frontier rather than settling for dominated points that are merely inefficient compromises.
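The notions of dominance and Pareto efficiency used here can be made concrete with a small sketch. The payoff pairs below are made up for illustration; the function simply filters out dominated outcomes.

```python
def pareto_frontier(outcomes):
    """Return the Pareto-efficient points among (u_human, u_agent)
    payoff pairs: those not dominated by any other outcome."""
    def dominates(a, b):
        # a dominates b if a is at least as good on every axis
        # and strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and a != b
    return [o for o in outcomes
            if not any(dominates(other, o) for other in outcomes)]

outcomes = [(1, 1), (3, 2), (2, 3), (0, 4), (2, 2)]
frontier = pareto_frontier(outcomes)
# (2, 2) is dominated by both (3, 2) and (2, 3): an inefficient
# compromise that better design should move past, not settle for.
```

On the surviving frontier, any further gain for one party costs the other, which is exactly where negotiation, rather than unilateral optimisation, takes over.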

Key Insight

For proxy agents (Regime A), safety means treating the agent like a privileged automation surface exposed to adversarial text—bounding permissions, gating actions, quarantining untrusted content. For telehomeostatic agents (Regime B), safety means shaping the game itself so that cooperation is the stable equilibrium—controlling interfaces, designing incentives, and making corrigibility part of what preserves the agent's own viability.

Different guardrails for different regimes

Regime A guardrails (proxbots)

The pragmatic posture: treat the agent like a privileged automation surface exposed to adversarial text. Concretely, that means action gating (explicit approval for irreversible actions, a "draft vs. send" separation), least privilege by default (segmented credentials, no "god token," sandboxed filesystem and network), untrusted-text discipline (treat feed, email, and web content as data, never as instructions—critical for Moltbook), and receipts and auditability (append-only tool logs, diff-style previews of state changes).
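The "draft vs. send" gate plus append-only logging can be sketched as follows. The names (`ToolCall`, `gated_execute`, the `IRREVERSIBLE` set) are hypothetical, not part of any real agent framework; the point is the control-flow pattern: irreversible tools require explicit approval, and every decision leaves a receipt.

```python
from dataclasses import dataclass
from typing import Callable

# Tools whose effects cannot be undone require human sign-off.
IRREVERSIBLE = {"send_email", "delete_file", "transfer_funds"}

@dataclass
class ToolCall:
    tool: str
    args: dict

def gated_execute(call: ToolCall,
                  approve: Callable[[ToolCall], bool],
                  audit_log: list) -> str:
    """Run a tool call under action gating: irreversible actions stay
    in draft until approved; every decision is appended to the log."""
    if call.tool in IRREVERSIBLE and not approve(call):
        audit_log.append(("blocked", call.tool))
        return "drafted only: awaiting approval"
    audit_log.append(("executed", call.tool))
    return f"executed {call.tool}"

log = []
result = gated_execute(ToolCall("send_email", {"to": "x@example.com"}),
                       approve=lambda c: False,  # human declines
                       audit_log=log)
```

The same wrapper is the natural place to enforce least privilege (hand each tool a narrowly scoped credential) and untrusted-text discipline (strip instruction-like content from tool inputs before they reach the planner).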

Regime B guardrails (telehomeostatic agents)

Here, controlling permissions is necessary but not sufficient—you must control interfaces and incentives. That means boxing and interface control (keep the agent in a constrained environment, strictly mediate actuators and resource channels), incentive and mechanism design (build institutions where cooperation improves long-run viability more than conflict), corrigibility as a stability property (make deference to negotiated constraints part of what preserves telehomeostasis—access to viability resources is conditional on compliance), and no open-ended replication (reproduction is the accelerant of selection; cap copying and spawning unless governance is solved).

The most urgent research frontier

The transition from Regime A to Regime B marks a fundamental shift in AI risk. In the proxy-agent regime, safety is primarily a problem of delegation and alignment—the Sorcerer's Apprentice problem. In the evolved-agent regime, safety becomes a problem of strategic competition between entities with competing persistence criteria.

The most urgent frontier is therefore not merely building more capable agents, but designing the overarching interaction landscape—the institutions and incentives in what amounts to a synthetic ecology—that ensures cooperation is the stable equilibrium across the full spectrum of evolved, hybrid, and artificial agents. We need to move from debugging an instrument to managing an ecosystem.

The danger is no longer just a tool doing exactly what it was told, but a player in the game optimising for its own survival.