Overview of the Thousand Brains Project
Numenta, 2024
Authors: Viviane Clay, Niels Leadholm, and Jeff Hawkins
Contents
1 Overview
  1.1 Goals of the Thousand Brains Project
      Long Term Goals
      Short Term Goals
  1.2 Opening Statements
      1.2.1 Document Structure
      1.2.2 Challenging Preconceptions
      1.2.3 Capabilities of the System
  1.3 Overview Of The Architecture
      1.3.1 Three Core Components
            Sensor Modules
            Learning Modules
            Common Communication Protocol
      1.3.2 Further Details
            Learning on Different Spatial Scales (Hierarchy)
            Learning on Different Time-Scales
            Voting/Consensus
            Motor Input
            Motor Output
            Sub-Cortical Motor Areas
            Predictions
      1.3.3 Bringing it Together
      1.3.4 Conclusion
2 Resources
  2.1 Glossary
Overview
up and handled by distinct architectures, but play a crucial role at every
point in the Architecture where information is processed.
A second differentiator is that our sensorimotor systems learn structured
models, using reference frames, coordinate systems within which locations
and rotations can be represented. The models keep track of where their sen-
sors are relative to things in the world. They are learned by assigning sensory
observations to locations in reference frames. In this way, the models learned
by sensorimotor systems are structured, similar to CAD models in a com-
puter. This allows the system to quickly learn the structure of the world
and how to manipulate objects to achieve a variety of goals, what is some-
times referred to as a ‘world model’. As with sensorimotor learning, reference
frames are used throughout all levels of information processing, including the
representations of not only environments, but also physical objects and ab-
stract concepts - even the simplest representations in the Architecture are
represented within a reference frame.
There are numerous advantages to sensorimotor learning and reference
frames. At a high level, you can think about all the ways humans are different
from today’s AI. We learn quickly and continuously, constantly updating our
knowledge of the world as we go about our day. We do not have to undergo
a lengthy and expensive training phase to learn something new. We interact
with the world and manipulate tools and objects in sophisticated ways that
leverage our knowledge of how things are structured. For example, we can
explore a new app on our phone and quickly figure out what it does and how
it works based on other apps we know. We actively test hypotheses to fill
in the gaps in our knowledge. We also learn from multiple sensors and our
different sensors work together seamlessly. For example, we may learn what
a new tool looks like with a few glances and then immediately know how to
grab and interact with the object via touch.
One of the most important discoveries about the brain is that most of
what we think of as intelligence, from seeing, to touching, to hearing, to
conceptual thinking, to language, is created by a common neural algorithm.
All aspects of intelligence are created by the same sensorimotor mechanism.
In the neocortex, this mechanism is implemented in each of the thousands of
cortical columns. This means we can create many different types of intelli-
gent systems using a set of common building blocks. The Architecture we are
creating is built on this premise. The SDK will provide the core components
and developers will then be able to assemble widely varying AI and robotics
applications using these components in different numbers and arrangements.
Any engineer will be able to create AI applications using the Platform with-
out requiring huge computational resources or background knowledge.
1.1. GOALS OF THE THOUSAND BRAINS PROJECT
and grid cells (Ahmad and Scheinkman, 2019; Hawkins and Ahmad, 2016;
Hawkins, Ahmad, and Cui, 2017; Hawkins et al., 2019; Lewis et al., 2019).
Finally, it will be important to showcase the capabilities of our SDK. We
will work towards creating a non-trivial demo where the Architecture can be
used to showcase some capability that would be hard to demonstrate any
other way. This may not be one specific task but could play to the strength
of this system to tackle a wide variety of tasks. We will also work on an
easy-to-use open-source SDK that other practitioners can apply and test on
their applications. We want this to be a platform for all kinds of applications
and not just a specific technology showcase.
We have a set of guiding principles that steer the Thousand Brains
Project. Throughout the life of the project there may be several different
implementations and within each implementation there may be different ver-
sions of the core building blocks but everything we work on should follow
these core principles:
Short Term Goals
Our goal in the near term is to continue building an Architecture based on
the principles listed above with a general set of abilities for modeling and
interacting with the world. We want to understand and flesh out some of the
key issues and mechanisms of learning in such a modular, sensorimotor setup.
Two key issues we will focus on next are learning compositional objects using
hierarchy, and using learned object models to enable sophisticated (‘model-
based’) action policies.
In the current stage of building up the code framework, we are focusing
on the two basic components: learning modules and sensor modules, and the
communication between them. In the initial implementation, many compo-
nents are deliberately not biologically constrained, and/or simplified, so as
to support visualizing, debugging, and understanding the system as a whole.
For example, object models are currently based on explicit graphs in 3D
Cartesian space. In the future, these elements will be substituted with more
powerful, albeit more inscrutable, neural components.
Another goal for the coming months is to open-source and communicate
our progress and achievements so far. We want to make it easy for others
to join the project and contribute to the Platform. We will provide access
to the simple SDK and examples to get started. We also want to spread the
ideas of the Thousand Brains Theory and the corresponding Architecture to
a wider audience. We aim to do this by writing blog posts, releasing videos,
open-sourcing our code, and creating a community around the project.
1.2. OPENING STATEMENTS
• Despite the previous point, each modeling system can learn complete
models of objects and recognize them on its own. A single modeling
unit should be able to perform all basic tasks of object recognition
and manipulation. Using more modeling units makes the system faster
and more efficient, and supports compositional and abstract represen-
tations, but a single learning module is itself a powerful system. In the
single model scenario, inference always requires movement to collect a
series of observations, in the same way that recognizing a coffee cup
with one of your fingers requires moving across its surface.
• All models are structured by reference frames. An object is not just a
bag of features. It is a collection of features at locations. The locations
of features relative to each other are more important than the features
themselves.
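To make the 'features at locations' point concrete, here is a minimal sketch. It assumes a toy model format (a list of feature labels with 3D locations) and hypothetical helper names that are not part of the Architecture. Two objects with identical features but different arrangements look the same as a bag of features, yet differ once relative locations are compared.

```python
from itertools import combinations


def bag_of_features(model):
    """Ignore locations: keep only the sorted feature labels."""
    return sorted(feature for feature, _ in model)


def pairwise_displacements(model):
    """Describe a model by the displacement between every pair of features."""
    items = sorted(model)
    result = set()
    for (f1, p1), (f2, p2) in combinations(items, 2):
        disp = tuple(round(b - a, 6) for a, b in zip(p1, p2))
        result.add((f1, f2, disp))
    return result


# Two toy objects built from the same features in different arrangements
# (labels and coordinates are illustrative assumptions).
mug = [("handle", (1.0, 0.0, 0.0)), ("rim", (0.0, 0.0, 1.0)), ("base", (0.0, 0.0, 0.0))]
odd = [("handle", (0.0, 0.0, 2.0)), ("rim", (0.0, 0.0, 0.0)), ("base", (1.0, 0.0, 0.0))]
# The same mug moved elsewhere: the arrangement is unchanged.
shifted_mug = [(f, (x + 5.0, y, z)) for f, (x, y, z) in mug]

same_bag = bag_of_features(mug) == bag_of_features(odd)
same_structure = pairwise_displacements(mug) == pairwise_displacements(odd)
shift_invariant = pairwise_displacements(mug) == pairwise_displacements(shifted_mug)
```

In this sketch `same_bag` is true while `same_structure` is false, and the relative description is unaffected by where the object sits in space.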
• Using learned models to alter the world and achieve goals, including
goals that require decomposition into simpler tasks. The highest-level,
overarching goals can be set externally.
1.3 Overview Of The Architecture
There are three major components that play a role in the Architecture:
sensors, learning modules, and actuators[1]. These three components are tied
together by a common communication protocol (CCP). Due to the unified
communication protocol, the inner workings of each individual component
can be quite varied as long as they leverage the appropriate interfaces.[2]
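As a rough illustration of a shared message format, the sketch below defines one message type that both a sensor module and a learning module emit. The class names, field names, and quaternion convention are assumptions made for this example. The document only states that CCP messages contain features and a pose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    location: tuple  # (x, y, z) relative to the body
    rotation: tuple  # unit quaternion (w, x, y, z), an assumed convention


@dataclass(frozen=True)
class CCPMessage:
    sender_id: str   # hypothetical bookkeeping field
    features: dict   # e.g. {"hue": 0.02} or {"object_id": "mug"}
    pose: Pose


def sensor_module_output(sensor_id, hue, location, rotation):
    """A sensor module reports a sensed feature at a pose."""
    return CCPMessage(sensor_id, {"hue": hue}, Pose(location, rotation))


def learning_module_output(lm_id, object_id, location, rotation):
    """A learning module reports a recognized object at a pose."""
    return CCPMessage(lm_id, {"object_id": object_id}, Pose(location, rotation))


def receive(message):
    """A receiver relies only on the shared fields, not on the sender type."""
    return sorted(message.features), message.pose.location


identity = (1.0, 0.0, 0.0, 0.0)
from_sensor = sensor_module_output("finger_0", 0.02, (0.1, 0.2, 0.3), identity)
from_lm = learning_module_output("lm_0", "mug", (0.0, 0.0, 0.5), identity)
```

Because both outputs share one type, a receiving component can process either without knowing which kind of module produced it.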
Figure 1.1: Sensor modules receive and process the raw sensory input. This
is then communicated via a common communication protocol to a learning
module which uses this to learn and recognize models of anything in the
environment.
Learning Modules
The basic building block for sensorimotor pro-
cessing and modeling the output from the sen-
sor module is a learning module. These are re-
peating elements, each with the same input and
output information format. Each learning mod-
ule should function as a stand-alone unit and be
able to recognize objects on its own. Combining
multiple learning modules can speed up recog-
nition (e.g. recognizing a cup using five fingers
vs. one), allows for learning modules to focus on
storing only some objects, and enables learning
compositional objects.
Learning modules receive either feature IDs from a sensor or estimated
object IDs (also interpreted as features) from a lower-level learning
module[3]. The feature or object representation might be in the form of a
discrete ID (e.g. the color red, a cylinder), or could be represented in a
more high dimensional space (e.g. an SDR representing hue or corresponding
to a fork-like object). Additionally, learning modules receive the
feature’s or object’s pose relative to the body, where the pose includes
location and rotation. In this way, pose relative to the body serves as a
common reference frame for spatial computations, as opposed to the pose of
features relative to each individual sensor. From this information higher
level learning modules can build up graphs of compositional objects (e.g.
large objects or scenes) and vote on the ID of the currently visible
object(s).
[3] By object, we mean a discrete entity composed of a collection of one or
more other objects, each with their own associated pose. As such, an object
could also be a scene or any other composition of sub-objects. At the lowest
level of object hierarchy, an object is composed of ‘proto-objects’
(commonly thought of as features), which are also discrete entities with a
location and orientation in space, but which are output by the sensor
modules; as such, these cannot be further decomposed into constituent
objects. Wherever an object (or proto-object) is being processed at a higher
level, it can also be referred to as a feature.
Figure 1.2: Learning modules learn structured models through sensorimotor
interaction, using reference frames. They model how incoming features are
arranged relative to each other in space and time.
The features and relative poses are incorporated into a model of the
object. All models have an inductive bias towards learning the world as based
in 3-dimensional space with an additional temporal dimension. However, the
exact structure of space can potentially be learned, such that the lower-
dimensional space of a melody, or the abstract space of a family tree, can
be represented. When interacting with the physical world, the 3D inductive
bias is used to place features in internal models accordingly.
The learning module therefore encompasses two major principles of the
TBT: Sensorimotor learning, and building models using reference frames (see
figure 1.2). Both ideas are motivated by studies of cortical columns in the
neocortex (see figure 1.3).
Figure 1.3: Conceptual sketch of how the learning module could be imple-
menting possible mechanisms of cortical columns.
Besides learning new models, the learning module also tries to match
the observed features and relative poses to already learned models stored in
memory. In addition to performing such inference within a single LM, an
LM’s current hypotheses can be sent through lateral connections to other
learning modules using the common communication protocol. We note again
that the CCP is independent of modality, and as such, LMs that have learned
objects in different modalities (e.g. vision vs. touch), can still ‘vote’ with
each other to quickly reach a consensus. This voting process is inspired by the
voting process described in Hawkins, Ahmad, and Cui (2017). Unlike when
the CCP is used for the input and output of an LM, votes consist of multiple
CCP-compliant messages, representing the union of multiple possible object
hypotheses.
To generate the LM’s output, we need to get the pose of the sensed object
relative to the body. We can calculate this from the current incoming pose
(pose of the sensed feature relative to the body) and the poses stored in the
model of the object. This pose of the object can then be passed hierarchically
to another learning module in the same format as the sensory input (features
at a pose relative to the body).
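The pose calculation described above can be sketched as a composition of rigid transforms. This is a simplified 2D version (pose = x, y, angle) with hypothetical function names, whereas the Architecture works in 3D. If the stored model gives the feature's pose in the object's reference frame, then the object's pose relative to the body is the sensed feature pose composed with the inverse of the stored pose.

```python
import math


def compose(a, b):
    """Express pose b (given in frame a) in the parent frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    x = ax + math.cos(at) * bx - math.sin(at) * by
    y = ay + math.sin(at) * bx + math.cos(at) * by
    return (x, y, at + bt)


def invert(p):
    """Inverse rigid transform: rotate back, then undo the translation."""
    px, py, pt = p
    x = -(math.cos(pt) * px + math.sin(pt) * py)
    y = -(-math.sin(pt) * px + math.cos(pt) * py)
    return (x, y, -pt)


# Ground truth used only to generate the toy observation.
object_in_body = (2.0, 1.0, math.pi / 2)

# Stored in the learned model: where the feature sits on the object.
feature_in_object = (0.5, 0.0, 0.0)

# Incoming observation: pose of the sensed feature relative to the body.
feature_in_body = compose(object_in_body, feature_in_object)

# LM output: pose of the object relative to the body.
recovered = compose(feature_in_body, invert(feature_in_object))
```

Here `recovered` matches `object_in_body` up to floating-point error, and it has the same format as the incoming feature pose, so it could be passed on to another learning module.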
Once the learning module has determined the ID of an object and its
pose, it can take the most recent observations (and possibly collect more)
to update its model of this object. We can therefore continually learn more
about the world, and learning and inference are two intertwined processes.
possible to detect objects independently of their pose.
The common reference frame also supports voting operations accounting
for the relative displacement of sensors, and therefore LM models. For ex-
ample, when two fingers touch a coffee mug in two different parts, one might
sense the rim, while the other senses the handle. As such, ‘coffee mug’ will be
in both of their working hypotheses about the current object. When voting,
however, they do not simply communicate ‘coffee mug’, but also where on
the coffee mug other learning modules should be sensing it, according to their
relative displacements. As a result, voting is not simply a ‘bag-of-features’
operation, but is dependent on the relative arrangement of features in the
world.
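The coffee mug example can be sketched as follows. This toy version assumes that both sensors already agree on the object's orientation, so a sensor displacement can be added directly to a hypothesized location. All names, coordinates, and the tolerance are illustrative assumptions rather than the Architecture's actual voting mechanism.

```python
import math


def shift_vote(vote, displacement):
    """Translate 'where I am on the object' into 'where you should be'."""
    object_id, location = vote
    return object_id, tuple(l + d for l, d in zip(location, displacement))


def consistent(hypotheses, incoming_votes, tol=0.01):
    """Keep hypotheses that match a vote in both object ID and location."""
    kept = []
    for object_id, location in hypotheses:
        for vote_id, vote_location in incoming_votes:
            if object_id == vote_id and math.dist(location, vote_location) <= tol:
                kept.append((object_id, location))
                break
    return kept


# Finger A senses a rim: it could be on a mug or on a bowl.
votes_from_a = [("mug", (0.0, 0.0, 0.1)), ("bowl", (0.0, 0.0, 0.05))]

# Finger B is displaced from finger A by this amount.
a_to_b = (0.06, 0.0, -0.05)
shifted = [shift_vote(v, a_to_b) for v in votes_from_a]

# Finger B's own hypotheses before voting.
hypotheses_b = [
    ("mug", (0.06, 0.0, 0.05)),   # on the mug's handle
    ("bowl", (0.0, 0.0, 0.0)),    # on the bowl's base
    ("pan", (0.2, 0.0, 0.0)),     # on a pan's handle
]
kept = consistent(hypotheses_b, shifted)
```

Only the mug hypothesis survives: the bowl is ruled out because its location disagrees with the vote, even though both fingers had 'bowl' among their candidate objects.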
Voting/Consensus
Learning modules have lateral connections to each other to communicate
their estimates of the current object ID and pose. For voting, we use a
similar feature-pose communication as we use to communicate to higher-
level modules. However, in this case we communicate a union of all possible
objects and poses under the current evidence (multiple messages adhering to
the CCP). Through the lateral voting connections, the modules try
to reach a consensus on which object they are sensing at the moment and
its pose (see figure 1.5). This helps to recognize objects faster than a single
module could.
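One simple way to picture consensus over a union of hypotheses is additive evidence, sketched below. The evidence scores, pose labels, and update rule are assumptions chosen for illustration and are not taken from the Architecture.

```python
def vote_round(local_evidence):
    """local_evidence maps lm_id -> {(object_id, pose_label): evidence}."""
    updated = {}
    for lm_id, own in local_evidence.items():
        combined = dict(own)
        for other_id, votes in local_evidence.items():
            if other_id == lm_id:
                continue
            for hypothesis, evidence in votes.items():
                # Votes only support hypotheses the receiver still holds.
                if hypothesis in combined:
                    combined[hypothesis] += evidence
        updated[lm_id] = combined
    return updated


def consensus(evidence):
    """Return the shared best hypothesis, or None if the modules disagree."""
    winners = {max(h, key=h.get) for h in evidence.values()}
    return winners.pop() if len(winners) == 1 else None


local = {
    "touch_lm": {("mug", "upright"): 2.0, ("bowl", "upright"): 2.5},
    "vision_lm": {("mug", "upright"): 3.0, ("can", "upright"): 1.0},
}
before = consensus(local)
after = consensus(vote_round(local))
```

Before voting the two modules prefer different objects, so `before` is `None`. After one round both favor the upright mug.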
Motor Input
The movement information (pose displacement) can be a copy of the selected
action command (efference copy) or deduced from the sensory input. With-
out the efference copy, movement can for example be detected from optical
flow or proprioception. Sensor modules use movement information to update
their pose relative to the body. Learning modules use it to update their
hypothesized location within an object’s reference frame.
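The learning-module update is a form of path integration. A minimal 2D sketch, with a hypothetical function name: a displacement measured relative to the body is rotated into the object's reference frame using the hypothesized object rotation, then added to the hypothesized location.

```python
import math


def update_hypothesis(location, object_rotation, body_displacement):
    """Path-integrate one movement for one (location, rotation) hypothesis.

    object_rotation is the hypothesized rotation of the object relative to
    the body, so body-frame vectors are rotated by its inverse.
    """
    c, s = math.cos(-object_rotation), math.sin(-object_rotation)
    dx, dy = body_displacement
    return (location[0] + c * dx - s * dy, location[1] + s * dx + c * dy)


# Efference copy or estimated movement: one unit along the body's x-axis.
movement = (1.0, 0.0)

# The same movement lands in different places under different hypotheses.
upright = update_hypothesis((0.0, 0.0), 0.0, movement)
rotated = update_hypothesis((0.0, 0.0), math.pi / 2, movement)
```

Under the upright hypothesis the sensor is now at roughly (1, 0) on the object, and under the 90-degree hypothesis at roughly (0, -1). Each hypothesis therefore predicts a different feature at the new location.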
Motor Output
Each learning module produces a motor output. The motor output is for-
malized as a goal state and also adheres to the common communication
protocol. The goal state could for example be generated using the learned
models and current hypotheses by calculating a sensor state which would
resolve the most uncertainty between different possible object models. It can
also help to guide directed and more efficient exploration to known features
in a reference frame stored in memory. Different policies can be leveraged
depending on whether we are trying to recognize an object or trying to learn
new information about an object.
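A goal state that resolves uncertainty can be illustrated with a toy selection rule: among the locations known to the candidate models, pick the one where their predictions differ most. The model format and function name are assumptions for this sketch. A real goal state would be a CCP-compliant message rather than a bare location.

```python
def most_informative_location(models):
    """models maps object_id -> {location: expected feature}."""
    candidates = set().union(*(m.keys() for m in models.values()))

    def distinct_predictions(location):
        # A model with nothing stored at this location predicts None.
        return len({m.get(location) for m in models.values()})

    # Sorting makes the choice deterministic when several locations tie.
    return max(sorted(candidates), key=distinct_predictions)


# Two remaining hypotheses that agree everywhere except at the handle.
models = {
    "mug": {(0, 0, 0): "curved", (0, 0, 1): "rim", (1, 0, 0): "handle"},
    "can": {(0, 0, 0): "curved", (0, 0, 1): "rim"},
}
goal_location = most_informative_location(models)
```

Moving the sensor to `goal_location` (where a mug would have a handle) separates the two hypotheses with a single observation.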
Hierarchy can also be leveraged for goal-states, where a more abstract
goal-state in a high-level learning module can be achieved by decomposing
it into simpler goal-states for lower-level learning modules. Importantly, the
same learning modules that learn models of objects are used to generate
goal-states, enabling hierarchical, model-based policies, no matter how novel
the task.
are learned in the learning modules and therefore needs to receive useful
model-based motor commands from the LMs.
Predictions
Learned models in the memory of the learning module can be used to make
predictions about future observations. If there are multiple models that
match the current observations, the predictions would have more uncertainty
attached to them. The prediction error can be used as a learning signal to
update models or as a criterion for matching during object recognition.
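As a sketch of prediction error used as a matching criterion, the example below compares one observed feature value with what each stored model predicts at the current location and adjusts a per-object evidence score. The scalar feature, the tolerance, and the plus or minus one update are assumptions for illustration only.

```python
def prediction_error(model, location, observed):
    """Absolute error, or a fixed penalty if the model stores nothing here."""
    predicted = model.get(location)
    if predicted is None:
        return 1.0
    return abs(predicted - observed)


def update_evidence(evidence, models, location, observed, tolerance=0.1):
    """Raise evidence for models that predicted well, lower it otherwise."""
    new_evidence = {}
    for object_id, model in models.items():
        error = prediction_error(model, location, observed)
        step = 1.0 if error <= tolerance else -1.0
        new_evidence[object_id] = evidence[object_id] + step
    return new_evidence


# Toy models: a scalar surface property stored at 2D locations.
models = {
    "smooth_cup": {(0, 0): 0.0, (0, 1): 0.0},
    "ridged_cup": {(0, 0): 0.6, (0, 1): 0.6},
}
evidence = {"smooth_cup": 0.0, "ridged_cup": 0.0}
evidence = update_evidence(evidence, models, (0, 0), observed=0.05)
```

After this single observation the smooth cup gains evidence and the ridged cup loses it. The same error could instead be used as a learning signal to adjust the stored value.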
Currently there is no prediction in time, although in the future such
capabilities will be added via the inclusion of a temporal dimension. This
will help support encoding behaviors of objects, as well as predictions that
can be used for motor-policy planning. For example, the long-term aim is for
the Architecture to be able to predict how a simple object such as a stapler
evolves as it is opened or closed, or to coarsely model the physical properties
of common materials.
Figure 1.6: High-level architecture of our Platform with all the main concep-
tual components mirroring figure 1.4 applied to a concrete example. Black
lines indicate the main flow of information up the hierarchy. Grey lines
show top-down connections, biasing the lower-level learning modules. Pink
lines show lateral voting connections. Blue lines show communication of goal
states which eventually translate into motor commands in the motor system.
Information communicated along solid lines follows the CCP (contains fea-
tures and pose). Dashed lines are the interface of the system with the world
and subcortical compute units. Black dashed lines communicate raw sensory
input from sensors. Blue dashed lines communicate motor commands to the
actuators. The light blue dashed lines send sensory information directly to
the motor system and implement a fast reflex loop for purely input-driven
policies.
1.3.4 Conclusion
We have provided a high-level description of an in-development Platform,
consisting of an Architecture for intelligent sensorimotor learning and ac-
tion, and an SDK to support the development of applications that leverage
that Architecture. We look forward to sharing further details about the Plat-
form in the coming months, and hope that you will consider joining in its
development and growth.
Resources
2.1 Glossary
This section aims to provide concise definitions of terms commonly used in
the Thousand Brains Project.
reference frame: A specific coordinate system within which locations
and rotations can be represented. For instance, a location may be repre-
sented relative to the body (body/ego-centric reference frame) or relative to
some point in the world (world/allo-centric reference frame) or relative to an
object’s center (object-centric reference frame).
pose: An object’s location and orientation (in a given reference frame).
The location can for example be x, y, z coordinates and the orientation can
be represented as a quaternion, Euler angles, or a rotation matrix.
displacement: The spatial difference between two locations. In 3D
space, this would be a 3D vector.
transformation: Applies a displacement/translation and a rotation to
a point.
path integration: Updating an agent’s location by using an estimate of
its current location, together with an estimate of its own movement.
sensorimotor/embodied: Learning or inference through interaction
with an environment using a closed loop between action and perception.
This means observations depend on actions and, in turn, the choice of these
actions depends on the observations.
policy: Defines the function used to select actions. Selected actions can
be dependent on a model’s internal state and on external inputs.
sparse distributed representation (SDR): A binary vector with sig-
nificantly more 0 bits than 1 bits. Significant overlap between the bit assign-
ments in different SDRs captures similarity in representational space (e.g.
similar features).
dendrites: Implement pattern recognizers to identify patterns such as
a specific SDR. One neuron is typically associated with multiple dendrites
such that it can identify multiple patterns.
In biology, dendrites of a postsynaptic cell receive information from the
axons of other presynaptic cells. The axons of these presynaptic cells connect
to the dendrites of postsynaptic cells at a junction called a “synapse”. An
SDR can be thought of as a pattern which is represented by a set of synapses
that are collocated on a single dendritic segment.
sensor module: A computational unit that turns raw sensory input into
a common communication protocol output. The structure of the output of
a sensor module is independent of the sensory modality and represents a set
of features at poses.
learning module: A computational unit that takes features at poses as
input and uses this information to learn models of the world. It is also able to
recognize objects and their poses from the input if an object has been learned
already. Finally, it can output actions in the form of target goal-states for
other modules.
reference frame graph: A set of nodes that are connected to each other
with edges. Both nodes and edges can have features associated with them.
For instance, all graphs used in the Architecture have a location associated
with each node and a variable list of features. An edge can, for example,
have a displacement associated with it.
efference copy: A copy of the motor command that was output by the
policy and sent to the actuators. This copy can be used by learning modules
to update their states or make predictions.
inductive bias: An assumption about the nature of the world that is
built into an algorithm/model. If the assumption holds, this can enable the
model to learn more efficiently than without the inductive bias. However, it
will result in limitations of the model if the assumption does not hold or is
overly restrictive.
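To illustrate the SDR entry above, the sketch below represents each SDR by the set of indices of its 1 bits and measures similarity as the number of shared active bits. The vector size, number of active bits, and the hand-picked patterns are assumptions for the example.

```python
SDR_SIZE = 2048     # total number of bits (assumed)
ACTIVE_BITS = 40    # number of 1 bits per SDR (assumed)


def overlap(a, b):
    """Number of bit positions that are active in both SDRs."""
    return len(a & b)


def to_dense(active, size=SDR_SIZE):
    """Expand a set of active indices into an explicit binary vector."""
    return [1 if i in active else 0 for i in range(size)]


# Hand-built patterns: similar features share many active bits.
red = set(range(0, 40))
orange = set(range(10, 50))
blue = set(range(1000, 1040))

red_orange = overlap(red, orange)
red_blue = overlap(red, blue)
sparsity = sum(to_dense(red)) / SDR_SIZE
```

In this toy encoding red and orange share 30 of their 40 active bits while red and blue share none, and fewer than 2% of the bits in each vector are 1.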
Bibliography