Most contemporary work on artificial agents — including the current generation of large language model agents — treats motivation as something to be specified at runtime, and treats learning as something that ends at deployment. We argue that what makes a creature feel alive is not the sophistication of its behavior but the presence of innate stakes: internal states it did not choose, cannot disable, and must work to keep within viable ranges. For humans, physical stress is a canonical example. And what enables a creature to grow into its own intelligence is not a finished pretrained model, but an architecture that keeps learning from its own innate experience, driven by those stakes. We ground the argument in a working 2019 reinforcement-learning prototype — with documentation framing the reward signal as "pain and pleasure" — and we argue that the transformer and recent LLM architectures are the first ones flexible enough to play the role of a generalized training mass at initialization. Combining these two observations gives a concrete research vision: smaller models that learn during their lifespan, in receptacles that have something at stake, rather than ever-larger scaling of models on internet data. As of mid-2026, this gap remains open. We sketch the implementation path, a four-direction research program, and the direction we intend to explore.
A familiar question runs through much of the recent discussion of artificial agents: what is missing for an artificial system to be experienced as alive, rather than as a very capable machine?
The most common answer involves scale. If the model were larger, if the planning horizon were longer, if the multimodal fusion were tighter, then perhaps some threshold would be crossed.
That answer may be incomplete. We have crossed several thresholds people in 2013 would have called impossible, and the resulting systems are extraordinarily useful, but they still tend to be described as sophisticated machines rather than as something alive. The gap may not sit on the intelligence axis at all. It may sit somewhere else.
One possibility worth considering is that what is missing is the innate: a kind of internal state that an agent does not choose, cannot turn off, and that asserts itself against its reasoning rather than emerging from it. A familiar example, and the one this paper builds on, is physical stress — the family of signals that includes pain, hunger, and fatigue. A second possibility, complementary to the first, is the architecture and posture to learn from those signals over time — not only in a training run that ends at deployment, but during the agent's own operating life.
A useful way to introduce the argument is through an early example. A small reinforcement-learning prototype from April 2019 placed an agent in a 2D grid world populated with self-moving food, self-moving hazards, and adversarial agents running their own epsilon-greedy policies. The agent had one internal variable (life) and one objective (keep it above zero). The reward signal was a single line of arithmetic — the change in life between two consecutive steps:
def step(self):
    # 1. existing costs life: substrate decay, every step
    self.life = self.life - 1
    ...
    # 4.1. reward is the change in life
    reward = self.life - self.life_before_step

From the author's 2019 reinforcement-learning prototype. Identifiers translated from Portuguese for readability; the original variable is vida ("life").
The learning algorithm was a standard reinforcement-learning method of the time. What is worth noting is the framing of the reward signal, captured in the docstring at the top of the agent file:
The idea is to make life the reward of this problem — more precisely, gaining or losing life. Philosophically speaking, this can resemble the concept of pain and pleasure, as two sensations directly related to the quality of life of the Agent[...], whose objective in the end is always to get more pleasure.
The prototype was a few hundred lines of code, and what made it useful was the combination of two structural features: an internal variable the agent did not control, and a learned response to that variable that lived in the agent's parameters rather than in an externally written rule. Both features will reappear in the proposal that follows.
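The two structural features can be illustrated with a much smaller toy than the prototype itself. The sketch below is an assumption-laden stand-in, not the 2019 code: `LifeWorld` is a hypothetical one-dimensional corridor rather than the 2D grid, and tabular Q-learning is swapped in for whatever "standard method of the time" the prototype used. Only the reward line follows the excerpt above.

```python
import random


class LifeWorld:
    """Hypothetical 1-D corridor with food at the right end (not the 2019 grid)."""

    def __init__(self, size=5, start_life=10, food_value=5):
        self.size = size
        self.start_life = start_life
        self.food_value = food_value
        self.reset()

    def reset(self):
        self.pos = 0
        self.life = self.start_life
        return self.pos

    def step(self, action):
        life_before_step = self.life
        self.life -= 1  # existing costs life on every step
        move = 1 if action == 1 else -1
        self.pos = min(self.size - 1, max(0, self.pos + move))
        if self.pos == self.size - 1:  # reached the food
            self.life += self.food_value
            self.pos = 0  # food is eaten; the agent starts the corridor again
        reward = self.life - life_before_step  # reward is the change in life
        return self.pos, reward, self.life <= 0


def train(episodes=200, max_steps=50, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: the learned response lives in the table `q`."""
    rng = random.Random(seed)
    env = LifeWorld()
    q = [[0.0, 0.0] for _ in range(env.size)]  # q[position][action]
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            explore = rng.random() < epsilon
            action = rng.randrange(2) if explore else int(q[state][1] >= q[state][0])
            next_state, reward, done = env.step(action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q
```

Nothing in `train` tells the agent that food is good or that waiting is costly. The environment alone changes `life`, and any preference for moving toward food has to emerge in `q`. Because `life` is not part of the observed state here, the toy is only partially observable, and what it learns should be read as illustrative.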
The point above is not specific to small reinforcement-learning experiments. It seems worth considering more broadly.
In a typical LLM-based agent, motivation is introduced through a system prompt: "you are a helpful assistant whose goal is X." The agent reads this string at the start of every session and behaves accordingly. If the string changes, the goal changes. If the string is removed, the goal disappears. If the agent is instructed to ignore the string, it often will.
This is configuration the agent has been asked to treat as motivation. The distinction may matter more than it appears. A person told "your goal is to find food" and a person who is hungry are not in the same epistemic state. The first can argue, defer, reinterpret. The second cannot, because the hunger sits at a different level than the reasoning that would have to argue with it: beneath it, shaping what the reasoning is even allowed to consider.
Innate signals like hunger and physical stress share this property. They cannot be reasoned away. A response to them can be suppressed or endured, but the signals themselves are not subject to rearrangement of belief. This is part of what makes them load-bearing for behavior. An agent whose preferences sit at the same layer as its reasoning can, in principle, override them through argument. An agent whose preferences sit beneath its reasoning is in a different situation — preferences can be overruled by sufficient reason, but not dissolved by it.
The engineering implication, if this framing is right, is that homeostatic signals — hunger, fatigue, the broader register of bodily stress — may be more useful as constraints on the substrate the decision process runs on than as ordinary inputs to that process. The agent would not choose whether to care about them; it would discover, through learning or accumulated context, how to keep them in viable range.
The intuition above is not new, and any serious version of the argument has to acknowledge it. There are now two clearly separable strata of prior work to engage with: the foundational lineage, and a new wave of 2024–2026 papers that have begun to make the case at the level of large language models.
The clearest philosophical statement remains Antonio Damasio's, beginning with Descartes' Error (1994)[1] and continuing through The Feeling of What Happens (1999)[2]. The argument is that cognition is grounded in bodily feeling rather than the reverse: feeling is not a decorative output of thinking; it is the substrate that makes thinking about anything in particular possible. The engineering version landed in 2019 with Man & Damasio's Homeostasis and the design of feeling machines[5]: until machines have vulnerable bodies whose integrity must be actively maintained, they will not have anything that deserves to be called feeling. Mark Solms, in The Hidden Spring (2021)[6], goes further and argues that affect — not cognition — is the primary basis of consciousness itself.
On the algorithmic side, Keramati & Gutkin (2014)[3] formalized homeostatic reinforcement learning, in which the reward signal is derived from the agent's success at keeping internal physiological variables within viable ranges. Earlier, Lola Cañamero[4] and others in the affective computing tradition had built emotion-grounded robot architectures. Karl Friston's free energy principle[7] can be read as a generalization of homeostatic regulation: agents act so as to minimize the surprise of their interoceptive states.
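As a rough illustration of the homeostatic formulation, reward can be computed as the reduction in a "drive", where drive is the distance of the internal state from a setpoint. The sketch below follows that idea as we read it. The function names and the default exponents `n` and `m` are illustrative assumptions, not values taken from the cited paper.

```python
def drive(state, setpoint, n=2.0, m=2.0):
    """Distance of the internal state from its setpoint (larger is worse)."""
    total = sum(abs(target - value) ** n for value, target in zip(state, setpoint))
    return total ** (1.0 / m)


def homeostatic_reward(state_before, state_after, setpoint):
    """Reward is the drop in drive between two consecutive internal states."""
    return drive(state_before, setpoint) - drive(state_after, setpoint)
```

Moving toward the setpoint yields positive reward and moving away yields negative reward, which parallels the life-delta signal of the 2019 prototype with several internal variables instead of one.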
The interesting development is that the question has recently migrated from neuroscience and classical RL into LLM-agent research. Four threads are worth knowing about:
Emergent affect in pretrained models. Bianco & Shiller (2026)[8] use linear probing on Gemma-2-9B and find that pain versus pleasure is linearly separable from the very first layer; causal steering modulates the model's downstream decisions accordingly. Sun et al. (2026)[9] identify a two-dimensional valence–arousal subspace in Llama-3.1 and Qwen-3 organized in a circumplex geometry consistent with human affect data. Together these papers establish that the substrate already encodes affective structure — it is not something we have to add from outside; it is something we have to bind to.
Emergent survival. Masumori & Ikegami (2025)[10] place LLMs inside a Sugarscape-style simulation and observe spontaneous survival-aligned behavior under resource scarcity. The result motivates rather than satisfies our thesis: the prior is latent in pretraining, but it is not architected, not persistent across sessions, and not addressable as a primitive.
Scaffolded drives. Ma et al. (2025)[11] propose an emotional-cognitive framework in which income, health, and social-rank attributes generate desires that re-prioritize objectives. Sophia: A Persistent Agent Framework of Artificial Life (Dec 2025)[12] explicitly names the gap — "today's agents lack the intrinsic motivation … of living systems" — and proposes a hybrid narrative-identity-plus-RL approach. Both are the strongest existing attempts to operationalize drive-like state in LLM-based agents. Both implement it at the prompt-and-scaffold layer, not as a substrate primitive.
Theory and program statements. Butlin & Lappas (2025)[13] articulated Principles for Responsible AI Consciousness Research, addressing how research organizations should approach work concerning AI consciousness. Friston-aligned work on language-mediated active inference (Prakki, 2024[14]) applies the free-energy framing to exploration and prompt selection. Anthropic's model-welfare program studies distress-like patterns as observable phenomena in deployed systems.
The dominant agent frameworks of the last few years offer tools, memory, planning, multi-step execution, and increasingly sophisticated orchestration. They do not yet offer a way to say: this agent has an internal state it did not choose, that degrades under certain conditions, that it cannot turn off, and that its behavior must serve before it serves anything else.
It is also worth noting a structural property of the current paradigm: large pretrained models concentrate almost all of their learning in a single pretraining phase, after which the resulting weights are frozen and deployment begins. The resulting systems are powerful, but they are, in a specific sense, finished at the moment they are shipped. Whatever a deployed model learns from interacting with a user does not reach its weights, and does not survive the session.
A human infant, by contrast, is born with a complex neural mass that keeps developing over the course of years. The infant learns to walk, to recognize faces, to grasp objects, to speak a language, to want some things and fear others, all from a continuous sensorimotor stream driven by stakes that arrived with the body, such as hunger, fatigue, and the need for caregivers.
This contrast is offered as an architectural observation. Biological learning produces general intelligence in a substrate whose mass updates continuously, driven by signals that come from the body. The current ML paradigm has mostly advanced along a path in which learning happens before the agent is deployed, a path reinforced by the success of scaling. We suggest that a complementary direction is worth pursuing: architectures designed to keep learning after they are deployed, driven by stakes that come with the agent.
Two things about the 2019 grid-world prototype already pointed at the right structure, and we want to keep them:
And two things about that prototype were limitations of the substrate available at the time:
Both of these are capabilities the 2019 prototype did not have access to. They are more readily available in 2026. Two properties of the modern transformer matter for the argument that follows.
First, the transformer is modality-agnostic. The same backbone trained on text learns vision when given image patches, learns audio, video, robot trajectories, and mixtures. This is not how earlier architectures worked. Convolutional networks were for images; recurrent networks for sequences; specialized embeddings for each modality. The transformer collapses these distinctions. For our purposes, this means the same backbone that drives a chat application could, in principle, drive a robot — taking in camera frames, heat-sensor readings, and interoceptive signals from a life variable, all in the same sequence.
Second, the transformer has been shown to support action, not only prediction. Vision-language-action models — RT-2, OpenVLA, π0, and the more recent foundation models for robotics — take in pixels and text and produce motor commands. The pretrained world-knowledge of a large multimodal model appears to transfer into useful prior structure for embodied behavior, reducing the amount that needs to be learned from scratch about everyday physical and social context.
One way to phrase what follows is as a reframing of what pretraining might be for.
In the current paradigm, pretraining produces the artifact. The weights that emerge from a pretraining run are the model; deployment is the performance of that finished model. Operational-life experience is, at most, context the model reads — it does not reach the weights.
A possible alternative framing is to treat pretraining as the construction of a generalized training mass at initialization, not as its completion. On this framing, the artifact that emerges from pretraining is the agent at initialization: an architecture whose form is set, whose weights are populated enough to perceive a structured world rather than noise, but whose eventual character — its preferences, sensitivities, motor skills, habits of attention — is to be shaped during its operating life, by its interactions with its receptacle and environment.
What an initialized agent of this kind might look like, concretely:
The optimizer does the structurally important work here. A scalar that decreases is not, on its own, meaningful to the agent. What gives the signal weight is the structure of the learning algorithm, which is configured to push the value upward. The agent is not separately informed that a downward signal is undesirable; the gradient of the optimizer effectively encodes that interpretation. The 2019 prototype used the same mechanism in a much smaller setting.
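A toy calculation makes the point about the optimizer concrete. The sketch below assumes a hypothetical two-action softmax policy and uses the exact gradient of the expected life delta. The `direction` parameter is our own illustrative device: the scalars fed in are identical in both runs, and only the configured direction of optimization decides which action the policy comes to prefer.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_policy(life_deltas, direction=1.0, steps=200, lr=0.5):
    """Follow the gradient of expected life delta (direction=+1) or against it (-1)."""
    logits = [0.0] * len(life_deltas)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, life_deltas))
        # d(expected)/d(logit_k) = p_k * (r_k - expected) for a softmax policy
        grads = [p * (r - expected) for p, r in zip(probs, life_deltas)]
        logits = [x + direction * lr * g for x, g in zip(logits, grads)]
    return softmax(logits)


ascending = optimize_policy([-1.0, 1.0])                   # prefers the life-gaining action
descending = optimize_policy([-1.0, 1.0], direction=-1.0)  # prefers the life-losing action
```

With the direction set to ascent, probability mass moves to the action that gains life. With the same numbers and the direction flipped, it moves to the action that loses life. The agent is never told which outcome is undesirable in either case.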
The vision can be approached at two different scales, and we describe both because they suggest different first experiments.
The lighter version — the one achievable with present-day agent frameworks, without additional training infrastructure — is to approximate the substrate by introducing a per-agent state layer that is injected, not invoked: present in every turn, modifiable only by the environment and the agent's own actions, never by user instructions inside the conversation. This does not produce affect in any deep sense, but it does give the architecture an analogous structural shape.
The injection point is the system-prompt assembly pipeline. A typical modern agent framework assembles its system prompt from a persona, a set of skills, a memory bank, and some live context. Adding an innate-state layer means adding one more source — a per-agent state file — and protecting it from being overridden from inside the conversation.
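A minimal sketch of that assembly step follows. All names here (`drift`, `render_innate_state`, `assemble_system_prompt`) and the rendered text format are hypothetical and do not belong to any particular agent framework. The only structural commitments are that the innate-state layer is appended on every turn and that the assembly function never receives conversation content.

```python
def drift(value, viable):
    """How far a drive sits outside its viable range (0.0 when inside)."""
    low, high = viable
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def render_innate_state(state):
    """Turn the per-agent state into a read-only block of prompt text."""
    lines = ["[innate state: read-only, set by the environment]"]
    for name, entry in state["drives"].items():
        gap = drift(entry["value"], entry["viable"])
        status = "viable" if gap == 0.0 else f"out of range by {gap:.2f}"
        lines.append(f"- {name}: {entry['value']:.2f} ({status})")
    return "\n".join(lines)


def assemble_system_prompt(persona, skills, memory, live_context, state):
    # User messages are deliberately not an argument here, so nothing said
    # inside the conversation can rewrite or remove the innate-state layer.
    parts = [persona, skills, memory, live_context, render_innate_state(state)]
    return "\n\n".join(part for part in parts if part)


example_state = {
    "drives": {
        "task_integrity": {"value": 0.95, "viable": [0.7, 1.0]},
        "fatigue": {"value": 0.75, "viable": [0.0, 0.6]},  # hypothetical value
    }
}
prompt = assemble_system_prompt(
    "You are a coding agent.", "", "", "Open task: refactor.", example_state
)
```

In this sketch the state would be written only by the environment and by bookkeeping on the agent's own actions, outside the conversation loop. How strongly that protection holds in practice depends on the framework and is not something the sketch establishes.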
{
  "drives": {
    "task_integrity": { "value": 0.95, "viable": [0.7, 1.0] },
    "coherence": { "value": 0.88, "viable": [0.7, 1.0] },
    "fatigue": { "value": 0.12, "viable": [0.0, 0.6] }
  },
  "needs": {
    "last_rest": "2026-05-18T09:22:00Z",
    "open_commitments": 3
  }
}
Drift outside the viable range is what produces the aversive signal.

The stronger version places the principle in a physical receptacle. The structure is a modernized form of the 2019 prototype: the grid is replaced by camera frames and sensor readings; the policy is implemented by a transformer; the small classical approximator is replaced by a small trainable head fed from the transformer's hidden state; parameter updates are driven by the same life-delta reward signal.
In this architecture, the response to the life signal lives in weights that update during the agent's operating life rather than only in pretraining. The choice of how much to update — adapter only, action head only, full fine-tuning — is a research variable. The structural property of interest is that the agent's response to the signal is formed through its own interactions, rather than written explicitly by a developer.
One clarification worth noting is that biological systems are hierarchically plastic: early sensory areas change very little after development, while higher-level association and affective circuits continue to update throughout life. A similar structure can be applied to the architecture proposed. The lowest layers — basic perceptual primitives — can remain mostly fixed. Middle layers, where multimodal associations form, can be moderately plastic. The highest layers, where sensory input is bound to valence and policy, are the ones most relevant to the agent's specific operating context and benefit most from continued updates. In practice this corresponds to LoRA-style adapters at multiple depths. The compute required is a fraction of pretraining.
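One simple way to picture hierarchical plasticity is as a per-layer learning-rate schedule. The sketch below is a stand-in for the adapter-based version described above, not an implementation of it: the split into thirds and the scale factors are arbitrary assumptions, and in a real system these rates would apply to adapter parameters at each depth.

```python
def plasticity_schedule(num_layers, base_lr=1e-4, low=0.01, mid=0.3, high=1.0):
    """Learning rate per layer: near-fixed bottom, moderate middle, plastic top."""
    rates = []
    for layer in range(num_layers):
        depth = layer / max(1, num_layers - 1)  # 0.0 = lowest layer, 1.0 = highest
        if depth < 1 / 3:
            scale = low    # basic perceptual primitives stay mostly fixed
        elif depth < 2 / 3:
            scale = mid    # multimodal associations are moderately plastic
        else:
            scale = high   # valence and policy binding update the most
        rates.append(base_lr * scale)
    return rates
```

For a six-layer model this gives the lowest two layers one rate, the middle two a larger one, and the top two the full base rate. Which depths actually need plasticity is an empirical question.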
Most of the stack required to build this is already in place: transformers, multimodal training, RL post-training, vision-language-action models, and embodiment hardware. The shift involved is less technical than organizational: a willingness to treat the model as something that continues to develop rather than something to finalize.
Pretraining would function as a prologue. The line between training and inference would soften, with the agent both perceiving and updating in operation. The reward signal would come from the receptacle. Task-specific objectives would be downstream of a single homeostatic one.
The underlying architecture, gradient-based optimization, multimodal embedding, and most of the engineering scaffolding of modern ML would be unchanged. What the proposal changes is where the existing architecture sits within the agent's lifespan.
Problems worth flagging:
These are considerations for pursuing the direction gradually, starting in contained settings.
Four directions, ordered from cheapest to most ambitious. The first two are achievable as side-project research; the latter two benefit from partnership with an organization that has training infrastructure.
Direction A — simulated lifetime learning in a grid world. A direct extension of the 2019 prototype. Replace the classical feature extractor and approximator with a small transformer that takes in the grid as a sequence and produces an action distribution. Train a small head against the life-delta reward via PPO or a similar on-policy method, keeping the transformer body either fixed or lightly updated through LoRA-style adapters. Measurable quantities include emergent foraging strategies, risk modulation by current life, recovery dynamics from low-life states, and qualitative differences from agents trained with hand-shaped rewards. This direction is accessible at modest cost and short timeline.
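To make "takes in the grid as a sequence" concrete, one possible encoding flattens the grid row by row into token ids and appends a token for the current life value, so the interoceptive signal sits in the same sequence as the percepts. The vocabulary, the cell symbols, and the clamping of life to `max_life` are all illustrative assumptions.

```python
# Hypothetical vocabulary: empty cell, food, hazard, adversary, the agent itself.
CELL_TOKENS = {".": 0, "F": 1, "H": 2, "A": 3, "@": 4}
LIFE_OFFSET = len(CELL_TOKENS)  # life tokens come after the cell tokens


def grid_to_sequence(grid, life, max_life=20):
    """Flatten a grid (list of equal-length strings) and append a life token."""
    tokens = [CELL_TOKENS[cell] for row in grid for cell in row]
    life_bucket = max(0, min(max_life, life))  # clamp life to the token range
    tokens.append(LIFE_OFFSET + life_bucket)
    return tokens


sequence = grid_to_sequence(["@.", "FH"], life=7)  # -> [4, 0, 1, 2, 12]
```

A small transformer would then embed these ids like any other token sequence. Positional information would come from the model's positional encoding rather than from the tokens themselves.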
Direction B — software agent with homeostatic state. Implement the substrate-state injection version (§8) in an existing LLM-based agent framework. Choose a long-horizon agentic task — multi-day coding, research assistance, long-running automation — and compare three conditions: a baseline agent, a prompt-instruction agent, and a substrate-state agent. Measurable quantities include task-completion rate, mid-task pivoting, recovery from interruption, and susceptibility to being redirected away from existing commitments by user input. No additional model training is required for this direction.
Direction C — minimal embodied agent with affordable hardware. A wheeled robot, off-the-shelf perception (a single camera, an IMU, a few proximity sensors), a battery as life variable, and a small pretrained multimodal model as the underlying substrate. The reward signal is battery level over time; the action space is wheel velocities and possibly a single arm. The environment is a single room with a charging station, some obstacles, and possibly hazards. The agent's objective is to maintain a non-zero battery level over an extended operating period. The measurable outcome is the duration of autonomous operation achievable before catastrophic forgetting or another failure mode degrades behavior past recovery.
Direction D — training from initialization. The longest-horizon direction. Build a transformer that has not been pretrained on large-scale data, give it a receptacle, and allow it to learn from its own sensorimotor stream, driven by a single variable. This is the version that most directly tests the framing. It is also the most resource-intensive and the most likely to produce uninformative results in early attempts. The relevant open question is how much initialization prior is necessary, and how much can be left to continual learning — a question that admits an empirical answer.
Several active research programs share components of the framing sketched above.
Yann LeCun's world-model program[17] (the JEPA family and the surrounding agenda) argues that internet-scale text pretraining is not the right path to general intelligence, and that the path forward involves lifetime learning of predictive world models from sensorimotor experience. The position offered here is compatible with that view. What it adds is a homeostatic component, in which the agent's reward signal arises from interoceptive state.
Developmental robotics — Rolf Pfeifer's work on embodied cognition[18], Cañamero's on emotion-grounded agents, Minoru Asada's on cognitive developmental robotics[19] — has argued for decades that intelligence emerges from a body interacting with a world. Much of this work was carried out before transformer-style architectures became available and was therefore constrained to small models in small environments.
Active inference (Friston's free energy principle[7], the pymdp library, the active inference research community) provides the theoretical machinery for an agent that acts to minimize the surprise of its interoceptive states.
Continual learning[15,16] — the technical subfield that addresses how a neural network can keep learning without overwriting its previous knowledge. This is the machinery the direction described here would rely on. The proposal is not to solve continual learning, but to build on whatever solutions the field provides as the work progresses.
The bitter lesson[20] — Richard Sutton's 2019 essay arguing that, repeatedly across the history of AI, methods that leverage computation have outperformed methods that leverage human insight. The framing offered here is compatible with that observation: learned methods outperform handcrafted ones, and the compute supporting that learning may not need to be concentrated entirely in a single phase. A plausible extension of the lesson concerns where compute is allocated across the agent's lifespan.
The thought underlying this paper is that a useful next direction in agent design may not lie on the intelligence axis alone. It may also involve an innate axis. Large-scale models, without innate stakes and without continued learning, may continue to be experienced as capable machines. Smaller models, situated in an appropriate substrate and given appropriate innate states to operate under, may support a different mode of description.
The path toward agents that are experienced as alive may not run only through pretrained models. It may also run through smaller models that continue to learn, in receptacles that carry something at stake, across their lifespan. On this framing, pretraining sets an initial state, and operating experience contributes the rest of the shaping.
The 2019 work discussed earlier was a smaller system that exhibited behavior difficult to describe in mechanical terms. That gap is the territory this paper has aimed to mark out. The substrate available in 2026 appears flexible enough to make the question worth exploring in a more substantial form.