TL;DR
Neuroscientists are building models of sensory processing for different areas of the brain, e.g. using feedforward CNNs or transformers acting on images or sounds. These models are built using either task-optimized (proxy tasks on images & sounds) or direct-fit (directly fit to brain data) approaches. However, there are pain points around usability and performance, such as difficulty in using and sharing models, and bad inductive biases. This proposal is to create Foundation Brain Models that can be easily downloaded. These foundation models will be pre-aligned to the brain, starting with unimodal, image-based models.
Based on the proposal, three artifacts are to be created:
- A compilation of heterogeneous datasets containing relevant brain data
- A library that facilitates sharing existing and future foundation brain models that can be easily downloaded, starting with existing models trained on image data only.
- A pretrained foundation brain model trained on multiple large-scale datasets, including brain data
Context
- Neuroscientists are building in silico models of sensory processing. It started out with visual processing (ventral stream), but these models now cover far more (dorsal visual stream, auditory, language, motor, hippocampus, etc.).
- Some important components of these models:
- They are image- (or input-) computable. That makes them amenable to benchmarking to determine which model best recapitulates brain function (e.g. Brain-Score).
- The most common model types take sequences of images as input and process each image independently. For auditory data, mel cepstrums are used to transform sounds into 2D planes. 3D CNNs have been used for movie clips. Some models may use more abstract representations. Occasionally, models use auxiliary behavioural variables, e.g. eye position, pupil dilation, etc.
- They could be RNNs, CNNs, transformers, GNNs, etc.
- Generally speaking, they’re DNNs composed of a number of layers. They may come with an intrinsic alignment to specific brain areas (e.g. CORnet), or it may be up to the experimentalist to figure out the mapping between DNN and brain. This mapping is usually computed through some form of (penalized) linear regression, perhaps through a weighted Gaussian. Sometimes an explicit alignment need not be computed, e.g. for RSA-based comparisons.
- These models are, by and large, deterministic.
- The purpose of the models is to recapitulate the brain’s processing, not necessarily to perfectly emulate the brain. It is rarely the case that one subunit of the DNN = one neuron in the brain. Perhaps one might aim for one subunit = one column (ensemble of neurons), or one subunit = one voxel, but most of the time the mapping is looser, e.g. all subunits within a layer approximate the subspace of all neurons within a brain area.
- These models are typically built using one of two core approaches:
- Task-optimized: DNNs are fit to solve problems that the brain attempts to solve, e.g. object recognition or self-supervision on an appropriate stimulus ensemble. When the task is similar to the brain’s task, and the architecture is not too dissimilar to the brain’s, it’s often been observed that the network converges to a solution similar to the brain’s.
- Direct fit: models are directly optimized (from scratch) to explain neurons/voxels in different brain areas (e.g. Cadena et al. 2019).
- This type of modelling is a very common paradigm for studying sensory systems in animals and humans, for encoding/decoding in fMRI, and for brain-computer interfaces. I’ve outlined some potential use cases for these models in this blog post. They’re an essential tool for neuroAI.

On the NeuroAI continuum, AIs as models of the brain are on the lower right.
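The RSA-based comparison mentioned above can be sketched in a few lines. This is a minimal, standard-library-only illustration, not the pipeline of any particular benchmark: it builds a representational dissimilarity matrix (RDM) for a model layer and for a set of recorded responses, then rank-correlates the two. The function names (`pearson`, `rdm_upper`, `ranks`, `rsa_score`) and the choice of correlation distance plus Spearman correlation are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rdm_upper(responses):
    """Upper triangle of the RDM: 1 - Pearson r for every stimulus pair.

    `responses` is a list with one response vector per stimulus
    (DNN subunits for a model, neurons or voxels for the brain).
    """
    n = len(responses)
    return [
        1.0 - pearson(responses[i], responses[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def rsa_score(model_acts, brain_resps):
    """Spearman correlation between the model RDM and the brain RDM."""
    return pearson(ranks(rdm_upper(model_acts)), ranks(rdm_upper(brain_resps)))
```

Because the score depends only on pairwise dissimilarities between stimuli, the model and the brain data can have different numbers of units, and no unit-to-neuron mapping has to be fit. Both inputs only need to be ordered by the same stimuli.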
Pain points
Usability
- As a user, it’s a pain to use somebody else’s model. I can’t download models of brains off of Brain-Score. Everybody has their own GitHub repo with weights stored “somewhere”, out-of-date PyTorch versions, etc. There are no plug-and-play weekend projects for our community. A notable exception is thingsvision, which allows one to extract intermediate activations out of pretrained image models easily. However, it doesn’t host weights by itself, nor does it have intrinsic alignments to different brain areas; it would be nice to tap into the existing thingsvision framework and expand it towards this use case.
- As an author, it’s a pain to share my model. I have to worry about AWS/GCS bills to store the weights. Because my model is a pain to use, not many people use it, so it is not kept up-to-date or patched by the community. I still have to answer emails about my NeurIPS 2021 paper from people trying to use the models and getting silly errors. My model has a bus factor of 1. Thingsvision doesn’t host weights by itself, but that’s something that is easily doable with the HuggingFace Hub (especially for transformer-based models).
Performance
- Bad inductive biases. Most neuroAI models which are considered state-of-the-art are, in the grand scheme of things, pretty tiny. They seem to get around the need for large datasets largely by leveraging powerful inductive biases, which may or may not be warranted. Although this has helped us jump-start the field in the absence of large enough datasets, it’s probably holding us back as a field, forcing us to use models which underperform in explaining the brain.