Active Inference: A Primer
Active Inference is a theoretical framework originating from neuroscience, aiming to provide a unified account of perception, action, learning, and decision-making under a single principle: Free Energy Minimization. It posits that biological agents (like brains) act to minimize their long-term average surprise, which is equivalent to maximizing the evidence for their internal model of the world.
Contents
- Motivation: Staying Alive
- Generative Process vs. Model
- Bayesian Inference Basics
- Example: Fruit Trees & Model Evidence
- Model Evidence & Surprise
- Free Energy: A Tractable Bound
- Deriving Free Energy
- Decomposing Free Energy
- Dynamic World & Perception
- Active Inference: Adding Action
- Planning as Policy Selection
- Expected Free Energy (EFE)
- Decomposing EFE
- Minimalistic Example
- Action Selection
- Summary & The Big Picture
- Learning & Limitations
- Applications
Motivation: Staying Alive
The fundamental drive, from this perspective, is survival. Biological agents strive to maintain themselves within viable physiological bounds (homeostasis). States outside these bounds (e.g., extreme temperature, low oxygen) are 'surprising' in a statistical sense – they are states the organism is not adapted to and rarely encounters while viable.
However, an agent doesn't directly perceive its internal physiological state; it only has access to sensory observations ($o$). Therefore, to avoid surprising internal states, the agent must minimize the surprise associated with its sensory observations. Minimizing sensory surprise turns out to be equivalent to building a better predictive model of the world, because accurate predictions lead to less surprising observations.
Chain of reasoning: Remain alive → Maintain homeostasis → Avoid surprising (non-viable) states → Avoid surprising sensory observations → Minimize an approximation to surprise (Free Energy).
This tutorial delves into the mechanics, requiring only a basic grasp of probability and Bayes' theorem.
Generative Process vs. Generative Model
We distinguish between two key concepts:
- Generative Process ($P$, or $P(o, s)$): This is how the actual environment works. Hidden states ($s$) in the world cause sensory observations ($o$). The agent doesn't have direct access to $s$. Example: It rained last night ($s$, hidden state), causing the grass to be wet ($o$, observation). The uppercase $P$ denotes the true, objective probabilities of the world.
- Generative Model ($p(o, s)$): This is the agent's internal, subjective model of how it thinks the environment works. The agent uses this model to infer the hidden states ($s$) that likely caused its current observations ($o$) via $p(s \mid o)$, using its prior beliefs ($p(s)$) and its understanding of how states cause observations (the likelihood, $p(o \mid s)$). The lowercase $p$ signifies the agent's potentially inaccurate beliefs.
[Diagram: Generative Process (World) vs. Generative Model (Agent)]
Active inference is about aligning the generative model ($p$) with the generative process ($P$) through perception (updating beliefs) and action (changing the world to match beliefs).
Bayesian Inference Basics
At the heart of Active Inference lies Bayesian belief updating. The brain is thought to constantly refine its understanding of the world (its generative model) by combining prior beliefs with new sensory evidence using Bayes' rule:
$$p(s \mid o) \;=\; \frac{p(o \mid s)\, p(s)}{p(o)}$$
The posterior belief about the hidden state ($s$) given an observation ($o$) becomes the prior for the next observation, enabling continuous learning.
Sequential Bayesian Updates
Below you can see how the prior belief (blue) and likelihood (green) combine to form the posterior belief (red). Drag the circles to adjust the distributions and see how they interact in Bayesian updating.
The posterior (red) is proportional to the product of the prior (blue) and likelihood (green).
Observe how beliefs evolve over time as new evidence arrives. Adjust the likelihood of making an observation given a particular state ($p(o \mid s)$) and click 'Update'. The posterior from one step becomes the prior for the next.
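For readers who prefer code, a minimal sketch of this sequential updating in Python (NumPy) is shown below. The two-state rain/no-rain example, the particular likelihood numbers, and the observation sequence are all invented for illustration:

```python
import numpy as np

# Hypothetical two-state world: s = 0 ("rained") or s = 1 ("did not rain").
# Observations: o = 0 ("grass wet") or o = 1 ("grass dry").
# Row o of `likelihood` holds p(o | s) for each state s.
likelihood = np.array([[0.9, 0.2],   # p(wet | rained), p(wet | no rain)
                       [0.1, 0.8]])  # p(dry | rained), p(dry | no rain)

def bayes_update(prior, o):
    """Posterior p(s | o) ∝ p(o | s) p(s); the evidence p(o) is the normaliser."""
    unnormalised = likelihood[o] * prior
    evidence = unnormalised.sum()          # p(o) = Σ_s p(o | s) p(s)
    return unnormalised / evidence, evidence

belief = np.array([0.5, 0.5])              # initial prior p(s)
for o in [0, 0, 1]:                        # a made-up observation sequence: wet, wet, dry
    belief, evidence = bayes_update(belief, o)
    print(f"o={o}  p(o)={evidence:.3f}  posterior={belief.round(3)}")
# The posterior after each observation becomes the prior for the next one.
```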
Example: Fruit Trees & Model Evidence
Imagine an orchard with apple and orange trees. The fruit type ($s$) is hidden. We observe features like location or size/sweetness ($o$). Let's say the true distribution (Generative Process $P$) has 70% oranges and 30% apples, and they fall in slightly different locations.
Visualizing potential locations ($o$) for apples and oranges.
The agent's tasks within its Generative Model ($p$) are:
- Inference: Given an observation (fruit location $o$), infer the probability it's an apple or orange ($p(s \mid o)$). This uses Bayes' rule: $p(s \mid o) = \frac{p(o \mid s)\, p(s)}{p(o)}$.
- Learning: Estimate the model parameters – the prior probability of fruit types ($p(s)$, e.g., initially believing 50/50) and the likelihood of observing a location given a fruit type ($p(o \mid s)$, e.g., the mean/variance of location for apples vs. oranges).
Crucially, performing inference via Bayes' rule requires calculating the denominator, $p(o)$, the probability of the observation itself according to the model:
$$p(o) \;=\; \sum_s p(o \mid s)\, p(s)$$
This is called the Model Evidence (or marginal likelihood). It represents how well the agent's current model ($p(s)$ and $p(o \mid s)$) predicts the observation $o$. A higher $p(o)$ indicates a better model fit to the data. Learning aims to adjust model parameters to maximize model evidence over encountered observations.
(Note: Technically, true model evidence involves marginalizing over model parameters too, not just states. We'll simplify for now and treat $p(o)$ as the likelihood of the parameters given the data, $p(o \mid \theta)$, where $\theta$ are the parameters.)
Interactive Model Evidence
Explore how Model Evidence changes based on the prior beliefs $p(s)$ and likelihood models $p(o \mid s)$. Adjust the sliders for the prior (Apple vs. Orange), the likelihood parameters (mean/std dev for each fruit's size distribution), and the current observation (size). Observe how the likelihoods at the observation point combine with the priors to calculate $p(o)$. Higher $p(o)$ means the current observation is more probable under the agent's model.
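For concreteness, here is a small Python sketch of the same calculation. The priors, the Gaussian size likelihoods, and the observed size are invented for illustration (SciPy's `norm` provides the Gaussian density):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical agent model: a prior over fruit type and a Gaussian likelihood over size.
prior = {"apple": 0.5, "orange": 0.5}            # p(s): the agent's (possibly wrong) prior
likelihood_params = {"apple": (7.0, 0.8),        # p(o | s): mean and std of size (cm)
                     "orange": (9.0, 1.0)}

def model_evidence(o_size):
    """p(o) = Σ_s p(o | s) p(s): the marginal likelihood (a density here, since size is continuous)."""
    return sum(norm.pdf(o_size, mu, sd) * prior[s]
               for s, (mu, sd) in likelihood_params.items())

def posterior(o_size):
    """p(s | o) = p(o | s) p(s) / p(o) for each fruit type."""
    p_o = model_evidence(o_size)
    return {s: norm.pdf(o_size, mu, sd) * prior[s] / p_o
            for s, (mu, sd) in likelihood_params.items()}

o = 8.2                                           # an observed fruit size
print("model evidence p(o):", round(model_evidence(o), 4))
print("posterior p(s|o):", {s: round(v, 3) for s, v in posterior(o).items()})
```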
Model Evidence & Surprise
As established, the agent wants to maximize its model evidence $p(o)$. This is equivalent to minimizing Surprise, defined as the negative log-probability of the observation:
$$\text{Surprise} \;=\; -\ln p(o)$$
Highly probable observations (high $p(o)$) have low surprise, while very improbable observations (low $p(o)$) have high surprise (approaching infinity as $p(o) \to 0$). Minimizing surprise drives the agent to seek out familiar, predictable sensory states consistent with its model and its existence.
Surprise vs. Probability
Why $-\ln p(o)$? The logarithm is used because it transforms multiplicative probabilities into additive quantities. The surprise of two independent events occurring is the sum of their individual surprises: $-\ln\big(p(o_1)\,p(o_2)\big) = -\ln p(o_1) - \ln p(o_2)$. This aligns with the additive nature of information (measured in bits or nats).
Surprise ($-\ln p(o)$) quantifies how unexpected an event is. Adjust the probability slider to see the relationship.
For example, a probability of 0.50 corresponds to a surprise of $-\ln 0.50 \approx 0.69$ nats.
Notice the non-linear increase in surprise as probability decreases. Rare events are disproportionately surprising.
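In code the mapping is a one-liner; the short snippet below (illustrative values only) also checks the additivity property mentioned above:

```python
import numpy as np

p = np.array([0.9, 0.5, 0.1, 0.01])
print((-np.log(p)).round(2))          # surprise in nats: [0.11 0.69 2.3  4.61]

# Additivity for independent events: -ln(p1 * p2) = -ln(p1) - ln(p2)
p1, p2 = 0.5, 0.1
print(-np.log(p1 * p2), -np.log(p1) - np.log(p2))   # both ≈ 2.996
```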
However, directly calculating surprise is often computationally intractable (difficult or impossible). Why? Because the sum $p(o) = \sum_s p(o \mid s)\, p(s)$ might involve summing over an astronomical number of possible hidden states $s$.
Free Energy: A Tractable Bound on Surprise
Since calculating surprise directly is hard, Active Inference uses a more manageable quantity called Variational Free Energy ($F$) as a proxy. Free Energy provides an upper bound on surprise, meaning $F \ge -\ln p(o)$. By minimizing Free Energy ($F$), the agent indirectly minimizes surprise ($-\ln p(o)$).
Deriving Free Energy
How do we get this bound? We start with the definition of surprise and introduce an arbitrary probability distribution $q(s)$ over the hidden states. This represents the agent's current belief or approximation to the true posterior belief $p(s \mid o)$ about the hidden state $s$. Multiplying and dividing inside the logarithm by $q(s)$ (which doesn't change the value) is a common technique in variational methods:
$$-\ln p(o) \;=\; -\ln \sum_s p(o, s) \;=\; -\ln \sum_s q(s)\,\frac{p(o, s)}{q(s)}$$
Recall the definition of the Expectation (average value) of a function $f(x)$ under a probability distribution $q(x)$:
$$\mathbb{E}_{q(x)}[f(x)] \;=\; \sum_x q(x)\, f(x)$$
The term inside the logarithm is now an expectation (an average) of the ratio $\frac{p(o,s)}{q(s)}$ under the distribution $q(s)$:
$$-\ln p(o) \;=\; -\ln \mathbb{E}_{q(s)}\!\left[\frac{p(o, s)}{q(s)}\right]$$
Now, we use Jensen's Inequality. This inequality relates the function of an expectation to the expectation of the function. For a convex function (like $-\ln(x)$, which curves upwards), the inequality states that $f(\mathbb{E}[x]) \le \mathbb{E}[f(x)]$. The function applied to the average value is less than or equal to the average of the function applied to all values.
Jensen's Inequality (Convexity): For a convex function $f$ (like $-\ln(x)$) and a random variable $x$ with distribution $q(x)$, the inequality $f\big(\mathbb{E}_{q(x)}[x]\big) \le \mathbb{E}_{q(x)}[f(x)]$ holds.
Applying this to $f(x) = -\ln(x)$ and $x = \frac{p(o,s)}{q(s)}$ allows us to establish that Free Energy ($F$) is an upper bound on Surprise ($-\ln p(o)$).
Applying Jensen's Inequality to our surprise expression (with $f(x) = -\ln(x)$ and $x = \frac{p(o,s)}{q(s)}$ under the expectation $\mathbb{E}_{q(s)}$):
$$-\ln p(o) \;=\; -\ln \mathbb{E}_{q(s)}\!\left[\frac{p(o, s)}{q(s)}\right] \;\le\; \mathbb{E}_{q(s)}\!\left[-\ln \frac{p(o, s)}{q(s)}\right]$$
Using the logarithm property $\ln\frac{a}{b} = \ln a - \ln b$, we arrive at the definition of Variational Free Energy ($F$):
$$F \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \;\ge\; -\ln p(o)$$
This is the quantity the agent seeks to minimize. The distribution $q(s)$ is called the variational or approximate posterior distribution. The agent can adjust its beliefs $q(s)$ (and its model parameters within $p$) to make $F$ as small as possible, thereby getting a tighter bound on the true surprise and improving its model.
Summary So Far
We started with the goal of minimizing Surprise ($-\ln p(o)$) to maintain homeostasis and make good predictions. However, calculating Surprise directly by summing over all hidden states ($p(o) = \sum_s p(o, s)$) is often intractable.
To overcome this, we introduced an approximate belief distribution $q(s)$ and used Jensen's Inequality on the definition of surprise. This allowed us to derive Variational Free Energy ($F$) as a computationally tractable upper bound on Surprise ($F \ge -\ln p(o)$).
Therefore, by minimizing $F$, the agent can indirectly minimize Surprise and improve its model of the world. The next step is to understand the components of $F$.
Decomposing Free Energy
Free Energy can be rearranged into two insightful forms using standard probability rules ($p(o, s) = p(o \mid s)\, p(s) = p(s \mid o)\, p(o)$) and properties of logarithms.
Decomposition 1: Complexity and Inaccuracy
$$F \;=\; \underbrace{D_{KL}\big[q(s)\,\|\,p(s)\big]}_{\text{Complexity}} \;-\; \underbrace{\mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]}_{\text{Accuracy}}$$
Here, Free Energy balances:
- Complexity: The KL divergence $D_{KL}[q(s)\,\|\,p(s)]$ measures how much the agent's approximate posterior belief $q(s)$ diverges from its prior belief $p(s)$. Minimizing this encourages simpler explanations that don't stray too far from prior assumptions.
- Inaccuracy: The expected negative log-likelihood of the observation given the state, $-\mathbb{E}_{q(s)}[\ln p(o \mid s)]$, under the agent's belief $q(s)$. Minimizing this term (equivalent to maximizing Accuracy, $\mathbb{E}_{q(s)}[\ln p(o \mid s)]$) means finding beliefs that make the observation likely.
Minimizing $F$ involves a trade-off: find beliefs $q(s)$ that accurately explain the data ($o$) without becoming overly complex (too different from the prior $p(s)$).
Decomposition 2: Approximation Error and Surprise
$$F \;=\; \underbrace{D_{KL}\big[q(s)\,\|\,p(s \mid o)\big]}_{\text{Approximation error}} \;+\; \underbrace{\big(-\ln p(o)\big)}_{\text{Surprise}}$$
This shows:
- Approximation Error: The KL divergence $D_{KL}[q(s)\,\|\,p(s \mid o)]$ measures how different the agent's approximate belief $q(s)$ is from the true posterior belief $p(s \mid o)$ (the ideal Bayesian belief given the observation).
- Surprise: The actual surprise $-\ln p(o)$.
Since KL divergence is always non-negative ($D_{KL} \ge 0$), this form clearly shows that $F \ge -\ln p(o)$, confirming $F$ is an upper bound on surprise. Minimizing $F$ with respect to $q(s)$ means making the agent's belief as close as possible to the true posterior $p(s \mid o)$, driving the KL term towards zero. When $q(s) = p(s \mid o)$, then $F = -\ln p(o)$.
Furthermore, if the agent can also adjust its model parameters (affecting $p(o, s)$), minimizing $F$ simultaneously improves the model (reduces surprise) and the belief accuracy (reduces the KL divergence).
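Both decompositions, and the bound itself, are easy to check numerically for a toy discrete model. The two-state numbers below are arbitrary; the point is only that the three expressions agree and that $F$ never falls below the surprise:

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence D_KL[q || p] in nats (assumes strictly positive entries)."""
    return np.sum(q * np.log(q / p))

# A tiny invented generative model with two hidden states and one observed outcome o.
p_s = np.array([0.7, 0.3])            # prior p(s)
p_o_given_s = np.array([0.2, 0.9])    # p(o | s) evaluated at the observation we received

p_joint = p_o_given_s * p_s           # p(o, s) = p(o | s) p(s) for the observed o
p_o = p_joint.sum()                   # model evidence p(o)
p_s_given_o = p_joint / p_o           # true posterior p(s | o)
surprise = -np.log(p_o)

q = np.array([0.5, 0.5])              # an arbitrary approximate posterior q(s)

F_def  = np.sum(q * (np.log(q) - np.log(p_joint)))       # E_q[ln q(s) - ln p(o, s)]
F_dec1 = kl(q, p_s) - np.sum(q * np.log(p_o_given_s))    # Complexity - Accuracy
F_dec2 = kl(q, p_s_given_o) + surprise                   # Approximation error + Surprise

print(F_def, F_dec1, F_dec2)          # all three agree (≈ 0.945)
print(F_def >= surprise)              # True: F bounds the surprise (≈ 0.892) from above
print(kl(p_s_given_o, p_s_given_o) + surprise)   # F equals the surprise when q = p(s | o)
```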
A Dynamic World: Introducing Time
So far, we dealt with a static situation, with one hidden state and one set of observations, but the real world is dynamic. We perceive and act over time. This means we have hidden states $s_t$ and observations $o_t$ at each point in time $t$. Since things tend to depend on what has just happened, we assume that the state $s_t$ depends on the state at the previous time point, $s_{t-1}$. For example, the probability of seeing a rainbow depends directly on whether it rained before.
The agent's task is still to learn a generative model that captures these temporal dependencies (e.g., $p(s_t \mid s_{t-1})$ and $p(o_t \mid s_t)$) and infer the current state $s_t$ using an approximate posterior $q(s_t)$ based on observations $o_{1:t}$. As before, minimizing Free Energy drives the updates to $q(s_t)$ and the model parameters.
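As a rough sketch, perception in such a dynamic model looks like a Bayesian filtering step: predict the next state with the transition model, then reweight by the likelihood of the new observation. The matrices and the observation sequence below are invented; for an unconstrained $q(s_t)$ this exact update is also the belief that minimizes Free Energy at each step:

```python
import numpy as np

# Hypothetical two-state dynamic model (e.g., "raining" vs "not raining").
B = np.array([[0.8, 0.3],    # p(s_t | s_{t-1}): column = previous state
              [0.2, 0.7]])
A = np.array([[0.9, 0.2],    # p(o_t | s_t): row = observation, column = current state
              [0.1, 0.8]])

def filter_step(q_prev, o):
    """One perception step: predict with the transitions, then update with the likelihood."""
    predicted = B @ q_prev            # Σ_{s_{t-1}} p(s_t | s_{t-1}) q(s_{t-1})
    posterior = A[o] * predicted      # ∝ p(o_t | s_t) · prediction
    return posterior / posterior.sum()

q = np.array([0.5, 0.5])
for o in [0, 1, 1]:                   # a made-up observation sequence
    q = filter_step(q, o)
    print(q.round(3))
```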
Active Inference: Adding Action
Passive observation isn't enough; agents act on the world. Active Inference incorporates action $a_t$. Now, the state at time $t$ depends not only on the previous state $s_{t-1}$ but also on the action $a_{t-1}$ taken at the previous step. The agent's model includes state transitions conditioned on actions: $p(s_t \mid s_{t-1}, a_{t-1})$.
A different action can lead to a different future. Now the agent must choose actions that help minimize Free Energy over time. But how far into the future?
Expected Free Energy (EFE)
Agents don't just react; they plan. Active Inference frames planning as selecting a policy ($\pi$), which is a sequence of future actions $\pi = (a_t, a_{t+1}, \dots, a_T)$ over some time horizon $T$.
Calculating the Free Energy for future time steps requires a slight modification because we haven't observed the future outcomes yet. We need to calculate the Expected Free Energy ($G$) under a given policy $\pi$, averaged over the agent's predictions about future states and observations under that policy:
$$G(\pi) \;=\; \mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}\big[\ln q(\tilde s \mid \pi) - \ln p(\tilde o, \tilde s)\big]$$
(Note: $\tilde s \equiv s_{t+1:T}$ and $\tilde o \equiv o_{t+1:T}$ denote future states and observations from time $t+1$ to $T$, the planning horizon.)
How can we decompose this EFE? Recall the two decompositions of Free Energy we saw earlier. The first decomposition (Complexity + Inaccuracy) is problematic for the future because the Inaccuracy term ($-\mathbb{E}_{q(s)}[\ln p(o \mid s)]$) requires knowing the future observation $o$, which we don't.
Recall: Free Energy Decompositions
- 1: $F = D_{KL}\big[q(s)\,\|\,p(s)\big] - \mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]$ (Requires $o$)
- 2: $F = D_{KL}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o)$
Therefore, we use the second decomposition (Approximation Error + Surprise) as a starting point, but averaged over predicted future outcomes under policy $\pi$:
$$G(\pi) \;=\; \mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}\big[\ln q(\tilde s \mid \pi) - \ln p(\tilde s, \tilde o)\big]$$
Let's expand the joint probability inside the logarithm using $p(\tilde s, \tilde o) = p(\tilde s \mid \tilde o)\, p(\tilde o)$:
$$G(\pi) \;=\; \mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}\big[\ln q(\tilde s \mid \pi) - \ln p(\tilde s \mid \tilde o) - \ln p(\tilde o)\big]$$
(Simplifying notation: $\tilde s$ and $\tilde o$ continue to denote the future states and outcomes, and we sometimes suppress the expectation $\mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}$ for clarity in the next steps.)
This decomposition shows EFE as the sum of a KL divergence term, $\mathbb{E}_{q(\tilde o \mid \pi)}\big[D_{KL}[q(\tilde s \mid \pi)\,\|\,p(\tilde s \mid \tilde o)]\big]$ (which approaches zero if beliefs are optimal), and an expected-preference term, $-\mathbb{E}_{q(\tilde o \mid \pi)}[\ln p(\tilde o)]$. Here, $p(\tilde o)$ represents the agent's prior preferences over outcomes, i.e., how much it desires certain future observations; this term encourages policies that lead to preferred outcomes.
However, the true posterior $p(\tilde s \mid \tilde o)$ is still difficult to compute, especially for the future. We can apply Bayes' rule within the KL divergence term to relate it back to the likelihood $p(\tilde o \mid \tilde s)$ (which the agent does model) and the predicted states $q(\tilde s \mid \pi)$:
$$G(\pi) \;\approx\; \mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}\big[\ln q(\tilde o \mid \pi) - \ln p(\tilde o) - \ln p(\tilde o \mid \tilde s)\big]$$
(Note: This step involves some approximations, where the agent's predictive distributions $q(\cdot \mid \pi)$ replace the true distributions $p(\cdot)$, reflecting that the agent operates based on its own model. For example, the true posterior $p(\tilde s \mid \tilde o)$ is approximated by $q(\tilde s \mid \tilde o, \pi)$, and we then work with $q(\tilde o \mid \pi)$ within the expectation.)
There can be many possible policies. Active Inference considers all plausible policies in parallel. For each potential policy $\pi$, the agent predicts the likely sequence of future states and observations that would result from executing that policy.
The goal remains Free Energy minimization. The agent evaluates each policy based on the Free Energy expected in the future if that policy were pursued. Policies that are expected to lead to lower future Free Energy are preferred.
Decomposing Expected Free Energy
Substituting the Bayesian expansion back into the EFE expression and rearranging leads to a common decomposition:
$$G(\pi) \;\approx\; \underbrace{\mathbb{E}_{q(\tilde o \mid \pi)}\big[\ln q(\tilde o \mid \pi) - \ln p(\tilde o)\big]}_{\text{Risk}} \;+\; \underbrace{\mathbb{E}_{q(\tilde o, \tilde s \mid \pi)}\big[-\ln p(\tilde o \mid \tilde s)\big]}_{\text{Ambiguity}}$$
Let's rewrite using the definitions of KL divergence and entropy ($H[p] = -\mathbb{E}_{p}[\ln p]$):
$$G(\pi) \;\approx\; D_{KL}\big[q(\tilde o \mid \pi)\,\|\,p(\tilde o)\big] \;+\; \mathbb{E}_{q(\tilde s \mid \pi)}\big[H[p(\tilde o \mid \tilde s)]\big]$$
(Note: The Risk term is often expressed using prior preferences $p(\tilde o)$ instead of preferences conditioned on the policy, $p(\tilde o \mid \pi)$. Assuming $p(\tilde o \mid \pi) = p(\tilde o)$ leads to the KL divergence form above.)
This decomposition provides intuition about policy selection:
- Risk (or Cost): The KL divergence between the predicted outcomes under the policy ($q(\tilde o \mid \pi)$) and the agent's prior preferences ($p(\tilde o)$). Minimizing this favors policies expected to lead to desirable outcomes. This term captures the Instrumental Value.
- Ambiguity: The expected uncertainty about outcomes, given the states predicted under the policy. It reflects the agent's uncertainty in its likelihood model ($p(\tilde o \mid \tilde s)$). Minimizing this favors policies expected to lead to states where the outcome is predictable (see the numerical sketch after this list).
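To make Risk and Ambiguity concrete, here is a small numerical sketch with an invented two-state, two-outcome model and two hypothetical policies that predict different state distributions (all numbers are arbitrary):

```python
import numpy as np

def kl(q, p):
    return np.sum(q * np.log(q / p))

def entropy(p):
    return -np.sum(p * np.log(p))

# Hypothetical model: A[o, s] = p(o | s); C = prior preferences p(o) over outcomes.
A = np.array([[0.9, 0.3],
              [0.1, 0.7]])
C = np.array([0.95, 0.05])           # the agent strongly prefers outcome 0

def expected_free_energy(q_s_pi):
    """G(pi) = Risk + Ambiguity for a policy's predicted state distribution q(s | pi)."""
    q_o_pi = A @ q_s_pi                               # predicted outcomes q(o | pi)
    risk = kl(q_o_pi, C)                              # D_KL[q(o | pi) || p(o)]
    ambiguity = sum(q_s_pi[s] * entropy(A[:, s])      # E_{q(s|pi)}[ H[p(o | s)] ]
                    for s in range(len(q_s_pi)))
    return risk + ambiguity, risk, ambiguity

for name, q_s in [("policy 1", np.array([0.9, 0.1])),
                  ("policy 2", np.array([0.2, 0.8]))]:
    G, risk, amb = expected_free_energy(q_s)
    print(f"{name}: G = {G:.3f}  (risk {risk:.3f}, ambiguity {amb:.3f})")
```

Policy 1, which mostly visits the state whose likely outcome matches the preferences, ends up with the lower EFE.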
Alternative Decomposition: Epistemic and Instrumental Value
Another insightful way to decompose EFE relates back to the idea of information gain.
Rearranging differently, focusing on the difference between expected future states and states conditioned on observations:
$$G(\pi) \;\approx\; -\underbrace{\mathbb{E}_{q(\tilde o \mid \pi)}\Big[D_{KL}\big[q(\tilde s \mid \tilde o, \pi)\,\|\,q(\tilde s \mid \pi)\big]\Big]}_{\text{Epistemic value}} \;-\; \underbrace{\mathbb{E}_{q(\tilde o \mid \pi)}\big[\ln p(\tilde o)\big]}_{\text{Instrumental value}}$$
(Note: The sign is often flipped, discussing maximizing negative EFE. Here $G$ is minimized.)
- Instrumental Value: The degree to which expected outcomes align with prior preferences ($\mathbb{E}_{q(\tilde o \mid \pi)}[\ln p(\tilde o)]$). Same as maximizing expected log preferences (or minimizing Risk).
- Epistemic Value: The expected information gain about hidden states from observing future outcomes under policy $\pi$. It's the expected reduction in uncertainty about states after seeing outcomes, quantified by the Mutual Information $I(\tilde s; \tilde o \mid \pi)$.
Minimizing EFE thus balances achieving preferred outcomes (Instrumental Value) with choosing actions that lead to informative observations that reduce uncertainty about the world (Epistemic Value).
Digging into Epistemic Value
Epistemic value, or expected information gain, motivates exploration. It quantifies how much a policy is expected to reduce uncertainty about the hidden states $\tilde s$ by observing the resulting outcomes $\tilde o$. It is the mutual information between predicted states and predicted observations under the policy:
$$I(\tilde s; \tilde o \mid \pi) \;=\; \mathbb{E}_{q(\tilde o \mid \pi)}\Big[D_{KL}\big[q(\tilde s \mid \tilde o, \pi)\,\|\,q(\tilde s \mid \pi)\big]\Big]$$
It can also be expressed using entropies ($H$):
$$I(\tilde s; \tilde o \mid \pi) \;=\; H\big[q(\tilde s \mid \pi)\big] \;-\; \mathbb{E}_{q(\tilde o \mid \pi)}\Big[H\big[q(\tilde s \mid \tilde o, \pi)\big]\Big]$$
This shows it's the difference between the uncertainty about states before seeing the outcome ($H[q(\tilde s \mid \pi)]$) and the expected uncertainty after seeing the outcome ($\mathbb{E}_{q(\tilde o \mid \pi)}[H[q(\tilde s \mid \tilde o, \pi)]]$). Policies that lead to large reductions in uncertainty have high epistemic value.
Alternatively, using the KL divergence formulation for Mutual Information:
$$I(\tilde s; \tilde o \mid \pi) \;=\; D_{KL}\big[q(\tilde s, \tilde o \mid \pi)\,\|\,q(\tilde s \mid \pi)\, q(\tilde o \mid \pi)\big]$$
This measures how much the joint distribution $q(\tilde s, \tilde o \mid \pi)$ under the policy differs from the product of the marginals $q(\tilde s \mid \pi)\, q(\tilde o \mid \pi)$ (what you'd expect if states and observations were independent). High epistemic value means states and observations are strongly coupled under that policy.
Practically, epistemic value is high when the agent is uncertain about the state ($H[q(\tilde s \mid \pi)]$ is high) but expects the outcomes under a policy to resolve that uncertainty (the expected posterior entropy $\mathbb{E}_{q(\tilde o \mid \pi)}[H[q(\tilde s \mid \tilde o, \pi)]]$ is low). If the agent is already certain, or if observations don't distinguish between states, epistemic value is low.
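The sketch below computes this mutual information for an invented likelihood model, comparing an informative likelihood, an uninformative one, and a nearly certain prior belief (all numbers are arbitrary and assumed strictly positive):

```python
import numpy as np

A = np.array([[0.9, 0.2],          # p(o | s): informative (outcomes distinguish the states)
              [0.1, 0.8]])
A_flat = np.array([[0.5, 0.5],     # uninformative (outcomes say nothing about the states)
                   [0.5, 0.5]])

def epistemic_value(A, q_s):
    """Mutual information I(s; o) under a policy's predicted state distribution q(s | pi)."""
    joint = A * q_s                          # q(o, s | pi) = p(o | s) q(s | pi)
    q_o = joint.sum(axis=1)                  # marginal q(o | pi)
    # I = Σ_{o,s} q(o, s) ln [ q(o, s) / (q(o) q(s)) ]
    return np.sum(joint * np.log(joint / (q_o[:, None] * q_s[None, :])))

q_uncertain = np.array([0.5, 0.5])                   # maximally uncertain belief about the state
print(epistemic_value(A, q_uncertain))               # ≈ 0.28: observing would reduce uncertainty
print(epistemic_value(A_flat, q_uncertain))          # 0: observations are uninformative
print(epistemic_value(A, np.array([0.99, 0.01])))    # ≈ 0.01: little uncertainty left to resolve
```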
Minimalistic Example: Hunger Games
Let's illustrate policy evaluation and selection with the simple hunger scenario described earlier. This interactive component walks through the calculations step-by-step.
Planning Horizon: One step ahead ($T = 1$).
Policies: Two possible policies: $\pi_1$ (Get food), $\pi_2$ (Do nothing).
Detailed Calculation Steps
The interactive visualization above demonstrates the core calculations. Here's the step-by-step mathematical breakdown for this example:
- Predict Future States ($q(s_1 \mid \pi)$): Calculate the expected state distribution at $t = 1$ for each policy, based on the transition matrix $B$ and initial belief $q(s_0)$.
- Under $\pi_1$ (Get food): $q(s_1 \mid \pi_1) = B(\text{get food})\, q(s_0) = [1, 0]$ (Predicts state 1: Food)
- Under $\pi_2$ (Do nothing): $q(s_1 \mid \pi_2) = B(\text{do nothing})\, q(s_0) = [0, 1]$ (Predicts state 2: Empty)
- Predict Future Observations ($q(o_1 \mid \pi)$): Project the predicted state distributions into expected observation distributions using the likelihood matrix $A$: $q(o_1 \mid \pi) = A\, q(s_1 \mid \pi)$.
- Under $\pi_1$: $q(o_1 \mid \pi_1) = [1, 0]$ (Predicts observation 1: Fed)
- Under $\pi_2$: $q(o_1 \mid \pi_2) = [0, 1]$ (Predicts observation 2: Hungry)
- Calculate Expected Free Energy ($G(\pi)$): Evaluate each policy based on how well its predicted outcomes align with preferences and how predictable those outcomes are.
- Risk = $D_{KL}\big[q(o_1 \mid \pi)\,\|\,p(o_1)\big]$ (Using preferences $p(o_1) = [1, 0]$, i.e., the agent prefers to be Fed). Measures divergence from preferred outcomes.
- Ambiguity = $\mathbb{E}_{q(s_1 \mid \pi)}\big[H[p(o_1 \mid s_1)]\big]$ (Expected entropy of likelihood columns). Measures expected outcome uncertainty given predicted states.
- Since $A$ is the identity matrix, the likelihood is certain for each state (columns are [1, 0] or [0, 1]), meaning the entropy of each column is 0. Thus, Ambiguity = 0.
- Risk($\pi_1$) = $D_{KL}\big[[1, 0]\,\|\,[1, 0]\big] = 0$ (Using the limit $0 \ln 0 = 0$). Policy 1 perfectly matches preferences.
- Risk($\pi_2$) = $D_{KL}\big[[0, 1]\,\|\,[1, 0]\big] = \infty$. Policy 2 predicts an outcome (Hungry) that has zero probability under preferences ($p(o_1{=}\text{Hungry}) = 0$).
- Therefore: $G(\pi_1) = \text{Risk}(\pi_1) + \text{Ambiguity}(\pi_1) = 0 + 0 = 0$.
- And: $G(\pi_2) = \text{Risk}(\pi_2) + \text{Ambiguity}(\pi_2) = \infty + 0 = \infty$.
- Calculate Policy Probabilities ($q(\pi)$): Convert EFE values into a probability distribution over policies using the softmax function, modulated by precision $\gamma$: $q(\pi) = \sigma\big(-\gamma\, G(\pi)\big)$.
- Assume precision $\gamma = 1$ for simplicity.
- Unnormalized probability for $\pi_1$: $\exp(-\gamma\, G(\pi_1)) = \exp(0) = 1$.
- Unnormalized probability for $\pi_2$: $\exp(-\gamma\, G(\pi_2)) = \exp(-\infty) = 0$.
- Normalizing gives $q(\pi) = [1, 0]$. The agent is certain it should select policy $\pi_1$ (Get food).
- Calculate Overall Expected State ($q(s_1)$): Average the state predictions ($q(s_1 \mid \pi)$ from Step 1) weighted by the policy probabilities ($q(\pi)$ from Step 4): $q(s_1) = \sum_\pi q(\pi)\, q(s_1 \mid \pi)$.
In our example (for $t = 1$): $q(s_1) = 1 \cdot [1, 0] + 0 \cdot [0, 1] = [1, 0]$.
The agent expects, overall, to be in state 1 (Food) at $t = 1$.
- Calculate Overall Expected Outcome ($q(o_1)$): Project the overall expected state ($q(s_1)$ from Step 5) into the overall expected observation using the likelihood: $q(o_1) = A\, q(s_1)$.
In our example (for $t = 1$): $q(o_1) = A\, [1, 0] = [1, 0]$.
The agent expects, overall, to observe 'Fed' ($o = 1$) at $t = 1$.
- Evaluate Immediate Action Consequences ($q(o_1 \mid a)$): For each possible immediate action $a$ available now (at $t = 0$), calculate the specific outcome distribution ($q(o_1 \mid a) = A\, B(a)\, q(s_0)$) that would result if that action were taken, starting from the current belief $q(s_0)$.
- If action $a_1$ (Get food) is taken now: $q(o_1 \mid a_1) = [1, 0]$ (Outcome would be 'Fed')
- If action $a_2$ (Do nothing) is taken now: $q(o_1 \mid a_2) = [0, 1]$ (Outcome would be 'Hungry')
- Select Action ($a_t$): Choose the immediate action that minimizes the KL divergence between the outcome predicted if that specific action is taken ($q(o_1 \mid a)$ from Step 7) and the overall expected outcome averaged over policies ($q(o_1)$ from Step 6). This means selecting the action that best fulfills the agent's overall expectations.
- Divergence for $a_1$: $D_{KL}\big[q(o_1 \mid a_1)\,\|\,q(o_1)\big] = D_{KL}\big[[1, 0]\,\|\,[1, 0]\big] = 0$.
- Divergence for $a_2$: $D_{KL}\big[q(o_1 \mid a_2)\,\|\,q(o_1)\big] = D_{KL}\big[[0, 1]\,\|\,[1, 0]\big] = \infty$.
- The action $a_1$ (Get food) minimizes the KL divergence (0).
Therefore, the agent selects action $a_1$ (Get food) because taking that action leads precisely to the outcome it expects ($q(o_1)$), based on its policy evaluation. The environment (generative process) then yields the next observation $o_1$, the agent updates its belief $q(s_1)$ using Bayesian inference (perception), and the cycle repeats. The agent acts to make its (policy-informed) predictions come true.
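The whole worked example can be reproduced in a few lines of Python. One caveat: to keep the KL divergences finite, the sketch below softens the preference vector to [0.999, 0.001] rather than the exact [1, 0] used in the limit argument above; everything else follows the matrices implied by the text (identity likelihood, deterministic transitions):

```python
import numpy as np

def kl(q, p):
    return np.sum(q * np.log((q + 1e-16) / (p + 1e-16)))

def entropy(p):
    return -np.sum(p * np.log(p + 1e-16))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# States: [Food, Empty]; observations: [Fed, Hungry]; actions: [Get food, Do nothing].
A = np.eye(2)                                  # likelihood p(o | s): identity, fully unambiguous
B = {0: np.array([[1., 1.],                    # "Get food": any state -> Food
                  [0., 0.]]),
     1: np.array([[0., 0.],                    # "Do nothing": any state -> Empty
                  [1., 1.]])}
C = np.array([0.999, 0.001])                   # preferences p(o), softened to stay finite
q_s0 = np.array([0., 1.])                      # current belief: stomach is Empty
gamma = 1.0                                    # precision

policies = [0, 1]                              # one-step policies = single actions

# Steps 1-3: predict states/outcomes and score each policy by its expected free energy.
G = np.zeros(len(policies))
q_s1_pi = []
for i, a in enumerate(policies):
    q_s = B[a] @ q_s0                          # predicted state q(s_1 | pi)
    q_o = A @ q_s                              # predicted outcome q(o_1 | pi)
    G[i] = kl(q_o, C) + sum(q_s[s] * entropy(A[:, s]) for s in range(2))  # risk + ambiguity
    q_s1_pi.append(q_s)

# Step 4: policy probabilities via a softmax over (negative) EFE.
q_pi = softmax(-gamma * G)

# Steps 5-6: overall expected state and outcome, averaged over policies.
q_s1 = sum(q_pi[i] * q_s1_pi[i] for i in range(len(policies)))
q_o1 = A @ q_s1

# Steps 7-8: pick the action whose predicted outcome best matches the overall expectation.
divergences = [kl(A @ (B[a] @ q_s0), q_o1) for a in policies]
action = int(np.argmin(divergences))

print("G =", G.round(3), "  q(pi) =", q_pi.round(3))
print("chosen action:", ["Get food", "Do nothing"][action])
```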
Interactive Grid World Simulation
Now let's explore a slightly more complex scenario. In this interactive grid world, you can place the agent, food sources, predators, and shelters. You can also toggle the weather conditions (good / bad weather).
How the Simulation Works:
The agent's goal is to minimize its Expected Free Energy (EFE) over a short planning horizon (here, just one step ahead).
At each time step, the agent performs the following loop:
- Evaluates Policies: It considers possible actions (Stay, North, East, South, West). For each action (policy), it calculates the EFE based on its current belief about its state, the environment's likelihood model, and its preferences.
- Selects Action: It converts the EFE values into probabilities using a softmax function (influenced by Precision γ) and selects an action stochastically based on these probabilities. Policies with lower EFE are more likely to be chosen.
- Updates Belief: It executes the chosen action and updates its belief about its new state based on the predictable consequences of its action (using the transition model). (Note: A full Active Inference agent would also incorporate a new observation here to refine its belief via perception, but this simulation simplifies that step.)
Use the Play / Pause button to pause and inspect the detailed calculations, including visualizations of beliefs and predictions.
Observe how the agent's preferences (seeking food, avoiding predators, seeking shelter in bad weather) influence the EFE calculated for different actions. The agent selects actions based on minimizing this EFE, balancing the desire to reach preferred states (low Risk) with the need to reduce uncertainty about outcomes (low Ambiguity).
Try different arrangements of items and weather conditions. Adjust the agent's precision ($\gamma$) and likelihood noise, and see how its behavior changes. Notice how the belief distribution (purple circles in main grid, mini-grid when paused) and the agent's most likely position (black circle) evolve.
Summary & The Big Picture
Active Inference proposes a unified mechanism for perception, learning, planning, and action, all driven by the imperative to minimize Free Energy (a proxy for surprise). The core loop involves:
- Perception (State Estimation): Update beliefs about the current hidden state ($q(s_t)$) by minimizing Free Energy given the latest observation ($o_t$). (Minimizing $F$ with respect to $q(s_t)$)
- Learning (Model Update): Adjust model parameters (e.g., likelihood $p(o \mid s)$, transitions $p(s_t \mid s_{t-1}, a_{t-1})$, precision $\gamma$, preferences $p(o)$) to minimize Free Energy over time. (Making $p(o)$ higher, reducing surprise)
- Planning (Policy Evaluation):
- Consider possible future policies ($\pi$).
- For each policy, predict future states ($q(\tilde s \mid \pi)$) and observations ($q(\tilde o \mid \pi)$).
- Calculate the Expected Free Energy ($G(\pi)$) for each policy, balancing expected instrumental value (reaching preferred $\tilde o$) and epistemic value (reducing uncertainty about $\tilde s$).
- Compute the probability of each policy ($q(\pi) = \sigma(-\gamma\, G(\pi))$) based on its EFE and precision ($\gamma$).
- Action Selection:
- Compute the marginal predicted state for the next step ($q(s_{t+1}) = \sum_\pi q(\pi)\, q(s_{t+1} \mid \pi)$) by averaging over policies.
- Compute the overall expected outcome ($q(o_{t+1})$).
- Select the action ($a_t$) that minimizes the divergence between the outcome predicted specifically for that action ($q(o_{t+1} \mid a)$) and the overall expected outcome ($q(o_{t+1})$). (Acting to fulfill expectations)
Minimize Free Energy → Improve Model & Beliefs → Reduce Surprise → Fulfill Preferences → Maintain Homeostasis → Stay Alive
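As a schematic wrap-up of this loop, the sketch below packs perception, one-step planning, and action selection into a single function. It is a simplified illustration only (no learning, a one-step horizon, and the toy hunger matrices reused as placeholders), not a full implementation:

```python
import numpy as np

def kl(q, p):
    return np.sum(q * np.log((q + 1e-16) / (p + 1e-16)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def active_inference_step(A, B, C, gamma, q_s, o):
    """One cycle: perceive the new observation, evaluate one-step policies, select an action."""
    # 1. Perception: update the belief about the current state with the new observation.
    q_s = A[o] * q_s
    q_s = q_s / q_s.sum()

    # 2. Planning: score every one-step policy (here, each single action) by its EFE.
    actions = list(B.keys())
    col_entropies = -np.sum(A * np.log(A + 1e-16), axis=0)      # H[p(o | s)] per state
    G = []
    for a in actions:
        q_s_next = B[a] @ q_s                                   # predicted state under this action
        risk = kl(A @ q_s_next, C)
        ambiguity = np.sum(q_s_next * col_entropies)
        G.append(risk + ambiguity)
    q_pi = softmax(-gamma * np.array(G))

    # 3. Action selection: act so the predicted outcome matches the policy-averaged expectation.
    q_o_overall = A @ sum(q_pi[i] * (B[a] @ q_s) for i, a in enumerate(actions))
    a_t = actions[int(np.argmin([kl(A @ (B[a] @ q_s), q_o_overall) for a in actions]))]
    return q_s, a_t

# Usage with the hunger-style matrices from the worked example (placeholders, not a real task).
A = np.eye(2)
B = {0: np.array([[1., 1.], [0., 0.]]), 1: np.array([[0., 0.], [1., 1.]])}
C = np.array([0.999, 0.001])
belief, action = active_inference_step(A, B, C, gamma=1.0, q_s=np.array([0.5, 0.5]), o=1)
print("posterior belief:", belief, " chosen action:", ["Get food", "Do nothing"][action])
```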
Learning & Limitations
As mentioned, learning involves adjusting the parameters (likelihood $p(o \mid s)$, transitions $p(s_t \mid s_{t-1}, a_{t-1})$, etc.) of the generative model to minimize Free Energy. This allows the model ($p$) to better approximate the true generative process ($P$) and make more accurate predictions, thereby reducing future surprise. More advanced formulations perform Bayesian inference over parameters themselves.
Limitations:
- Scalability: The explicit enumeration and evaluation of all policies becomes computationally infeasible for complex problems with large state/action spaces or long planning horizons. Hierarchical approaches and function approximators (like deep learning) are areas of research to address this.
- Model Specification: Defining the structure of the generative model (states, observations, dependencies) and prior preferences ($p(o)$) can be challenging for complex real-world scenarios.
- Assumptions: The framework often makes simplifying assumptions (e.g., mean-field approximations for $q(s)$, factorization of beliefs) which might not hold in all cases.
Despite these challenges, Active Inference provides a comprehensive, mathematically principled framework for understanding perception, action, and learning.
Applications
Active inference is influential in computational neuroscience and psychiatry. It offers potential explanations for phenomena like perceptual inference (e.g., explaining illusions), motor control (viewed as fulfilling proprioceptive predictions), and decision-making under uncertainty.
Variations in parameters like precision ($\gamma$) or prior beliefs (including preferences $p(o)$) are used to model neurological and psychiatric conditions. For instance, altered precision (linked to dopamine function) might relate to symptoms in Parkinson's disease or schizophrenia. Aberrant priors might contribute to delusions or anxiety. It provides a formal way to simulate and test hypotheses about brain function and dysfunction.
Further Reading
This tutorial is heavily based on the excellent tutorial provided in:
- Solopchuk, O. (2021). Tutorial on Active Inference. Medium. https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc
- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.
- Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: A process theory. Neural Computation, 29(1), 1-49.
- Parr, T., & Friston, K. J. (2019). Generalised free energy and active inference. Biological Cybernetics, 113(5-6), 495-513.
- Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., & Friston, K. (2020). Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99, 102447.
- Friston, K., Parr, T., & de Vries, B. (2017). The graphical structure of balanced inference. Entropy, 19(11), 575. (Discusses Factor Graphs)
- Schwartenbeck, P., FitzGerald, T., Mathys, C., Dolan, R., Wulfsohn, N., & Friston, K. J. (2015). Optimal inference: the value of belief updating? PLOS Computational Biology, 11(10), e1004564. (Discusses EFE decomposition)