What tool representation, intuitive physics, and action have in common: The brain’s first-person physics engine

ABSTRACT An overlapping set of brain regions in parietal and frontal cortex are engaged by different types of tasks and stimuli: (i) making inferences about the physical structure and dynamics of the world, (ii) passively viewing, or actively interacting with, manipulable objects, and (iii) planning and execution of reaching and grasping actions. We suggest the observed neural overlap is because a common superordinate computation is engaged by each of those different tasks: A forward model of physical reasoning about how first-person actions will affect the world and be affected by unfolding physical events. This perspective offers an account of why some physical predictions are systematically incorrect – there can be a mismatch between how physical scenarios are experimentally framed and the native format of the inferences generated by the brain’s first-person physics engine. This perspective generates new empirical expectations about the conditions under which physical reasoning may exhibit systematic biases.


Three tasks that engage a common network
The motivation for this paper is the empirical observation summarized in Figure 1: An overlapping set of brain regions in parietal and frontal cortex are engaged when (Panel 1A) making inferences about the physical structure and dynamics of the world (Fischer et al., 2016; Schwettmann et al., 2019); (Panel 1B) viewing, naming or pantomiming the use of manipulable objects (Chao & Martin, 2000); and (Panel 1C) planning reach-to-grasp actions (Culham et al., 2003). Those three findings have each been broadly replicated, and the argument of this paper takes them at face value.
The goal of this paper is to sketch a proposal about why those three tasks seem to co-locate in the brain. We argue that the parieto-frontal network shown in Figure 1, centred on the supramarginal gyrus, supports a common computation that is engaged across physical reasoning, manipulable object representation, and action planning. That common computation is a forward model of physical reasoning about how first-person actions will effect changes in the world, and how actions will in turn be constrained by unfolding physical events.
This proposal is based on the idea that the brain has a "physics engine" which supports inferences about what will happen next in a scene (Pramod et al., 2022). Within the framework of thinking of the brain's physics engine as a type of forward model, we emphasize the first-person reference frame that supports action.

Specialization of function – for what?
A dominant paradigm in cognitive neuroscience involves identifying and studying brain areas that show differential activity for a particular type of stimulus or process, compared with "theoretically relevant" control conditions. What qualifies as "theoretically relevant" depends on the hypothesis being tested and on assumptions about what the brain region(s) in question do(es). For instance, the same stimulus rotated 90 degrees in its orientation may be a relevant baseline for an orientation-tuned cell in early visual cortex, while the appropriate control in high-level visual areas may be a stimulus from a completely different semantic category (e.g., the relevant baseline for an image of a cat, in the fusiform gyrus, may be a picture of a house).
The regions in Figure 1 were defined using different types of experimental stimuli and control conditions. The contrast for physical reasoning (Panel 1A) compared judgments about where the majority of blocks would land if unstable block towers were to fall, to difficulty-matched judgments of the visual features of the same stimuli (whether there were more blue or yellow blocks present in the tower). The comparison for manipulable objects (Panel 1B) contrasted naming handheld graspable and manipulable objects (fork, glass, hammer) against naming faces, animals and places. The grasp-related areas (Panel 1C) were identified by subtracting activity for reach-to-touch actions from activity during reach-to-grasp actions. The notable heterogeneity of stimuli and tasks across the experiments summarized in Figure 1 motivates the present proposal, which seeks to "zoom out" in thinking about what those regions are doing. This point has been made before in many other contexts: Specificity of a region for a given computation is a theoretical hypothesis that must be inferred from a pattern of responses. Our proposal, for the brain regions discussed herein, is that the "theoretically relevant pattern of responses" is broader than what has been considered in any of the respective sub-fields that generated the findings summarized in Figure 1.

Figure 1. A common network of brain regions supports manipulable object representation, intuitive physics inferences, and action planning. (A) Regions in parietal and frontal cortex that are engaged during intuitive physics inferences (Fischer, 2020; Navarro-Cebrian & Fischer, 2022). Those regions are more active during physical prediction than during difficulty-matched tasks requiring prediction in other domains (Fischer et al., 2016). (B) The network that is more active during viewing of manipulable objects compared to animals, places and faces (for the original observation, see Chao & Martin, 2000; data from Kristensen et al., 2016). (C) Regions engaged in reaching and grasping, from Culham et al. (2003; see also Gallivan & Culham, 2015). While the task demands and stimuli used to localize the three networks are markedly different on their surface, they engage overlapping parietal and frontal areas. Of particular note is the role of the supramarginal gyrus across intuitive physics, manipulable object representation, action planning and execution, and (not shown) phonological processing (Oberhuber et al., 2016).
Perception in the service of predicting the future
Visual processing comprises separable operations over distinct dimensions of the visual input – form, motion, colour, depth, location, orientation, axis correspondence, and so on. A high-level organizing principle within the visual system – orthogonal to many of those dimensions – distinguishes "vision for perception" from "vision for action". The occipito-temporal pathway, or ventral stream, supports detailed perception and visual recognition, while a subcortical and dorsal occipito-parietal pathway supports object localization and visual analysis in the service of concurrently unfolding actions (Goodale et al., 1991; Goodale & Milner, 1992; Schneider, 1969; Ungerleider & Mishkin, 1982; for further discussion on how best to characterize the visual streams see Freud et al., 2020; Livingstone & Hubel, 1988; Mahon, in press; Merigan & Maunsell, 1993; Pisella et al., 2006; Schenk, 2006; Xu, 2018).
Everyday interactions with objects involve processing of visual form, surface-texture and material properties, object location in various reference frames, action goals, object identity, object function, lexical semantics, linguistic forms, and learned motor competencies and skills. Functional object use involves the orchestration of that entire diversity of representations in the service of behaviour. Those separable processes are supported by brain regions across the ventral and dorsal visual streams. A subset of the regions engaged by these processes overlaps with regions involved in action planning: in particular, the left ventral precentral gyrus (premotor cortex), the left supramarginal gyrus, and the anterior IPS bilaterally (but stronger in the left).
The first wave of empirical reports describing the neural representation of manipulable objects emphasized that the relevant parietal and frontal areas were also involved in action planning and execution (Chao & Martin, 2000; Culham et al., 2003; Mahon et al., 2007; Noppeney et al., 2006). Indeed, it has been debated whether manipulable object representations obligatorily involve a form of motor simulation. Generalizing over much discussion, there is broad agreement that object-directed actions are a type of knowledge ("knowledge" understood broadly) that is automatically engaged when thinking about manipulable objects (Martin, 2016). The key point is that past discussions have been premised on the view that "manipulable objects ('tools') engage the action system". In other words, with respect to parietal-frontal areas, and in particular the supramarginal gyrus, it has been assumed that the processes indexed by activity in those regions have to do with action or motor-relevant processes.
An alternative view is that processing of manipulable objects and action planning both drive inferences about the next state of the world. The brain does not wait until "after perception is over" to determine how to interact with the environment – and it does not wait until after action is over to understand how an action may affect the world. The default posture of the system is to be continuously inferring what will happen next in the environment. Those inferences inform action and perception. Which is to say, implicit physical reasoning is constantly evaluating how potential actions will interact with the world, and what the next state of the world is likely to be. The process is ongoing and iterative. Physical inferences drive further perceptual analysis, which drives further physical reasoning. On this view, the computations supported by the supramarginal gyrus (and other regions in the network) may be better thought of as implementing a forward model that supports first-person inferences about future states of the world.
There has been much discussion about whether parietal and frontal areas are specialized for "tools" as a category. But to say that a region is specialized for processing tools amounts to little more than a redescription of the data – i.e., the experimental conditions under which disproportionately high levels of activity are observed. "Tools", or manipulable objects, describes a set (or category) of graspable objects for which the function and manner of manipulation and visual structure are all tightly related. The system is not specialized for manipulable objects per se. Rather, there are certain computations that are demanded of successful processing of manipulable objects and which are not demanded by processing of animals, faces or places.
Physical reasoning is often tested in contexts in which the layout is unfamiliar – using novel objects to push the system into a state where it has to reason, de novo, about the behaviours of objects based on their structure and dynamics. This de novo reasoning itself may be a key component of the underlying processes. For instance, Weisberg, van Turennout, and Martin (2007) showed that learning form-function relations for novel objects drove activity in the same regions highlighted in Figure 1 (see also Martin & Weisberg, 2003, for a study using Michotte causality driving parts of the network).
There is considerable neuropsychological evidence that lesions involving the fronto-parietal network highlighted in Figure 1 are associated with specific grasping or object manipulation impairments. In particular, patients with upper limb apraxia have an impairment in using objects that cannot be attributed to basic sensory or motor deficits (Rothi et al., 1991; Rumiati et al., 2001). Upper limb apraxia is classically associated with damage to the left supramarginal gyrus, in the inferior parietal lobule (Goldenberg, 2014; Gonzalez-Rothi & Heilman, 1996). By contrast, aIPS supports the computation of hand postures for object-directed grasping (Binkofski et al., 1998; Culham et al., 2003). Some authors have argued that upper limb apraxia is fundamentally a disruption to "de novo mechanical problem solving abilities" (Goldenberg & Spatt, 2009); such accounts emphasize that complex object-directed actions are not stored as integral representations but are rather built on the fly.
The "errors" of intuitive physics result from a feature, not a glitch, in the system
In certain circumstances, naïve observers make dramatic errors in their judgments about the physical contents and dynamics of seemingly simple scenarios (Caramazza et al., 1981; Gilden & Proffitt, 1994; Ludwin-Peery et al., 2020; McCloskey et al., 1980). These errors in physical reasoning have generally been viewed as revealing a glitch in the system – an incorrect or incomplete mental model of physics that leads to misconceptions about the latent physical structure of a scene or the way its physical events will unfold. However, these errors might instead be viewed as a consequence of a mismatch between the native format of the computations that support intuitive physics reasoning in everyday life, and the format of the hypothetical scenarios used to cue explicit responses and descriptions of how physical interactions will unfold. The "errors" of intuitive physics judgements could result from what is otherwise a "feature" – a dedicated system that supports in situ and first-person inferences about how the state of the world may change.
A classic paradigm asks participants to diagram the trajectory of a moving object in the absence of external forces – for instance, a ball launched through a curved tube seen from directly overhead. People often draw a curved path as though the tube would impart a persisting curvilinear trajectory to the ball (McCloskey et al., 1980). Diagrams of curvilinear motion in this scenario are at odds with natural physical behaviour, where the ball would maintain a straight path in the absence of external forces. The prevalence and magnitude of errors in people's reports has presented a puzzle, especially alongside other circumstances in which physical predictions are accurate and highly precise. The format of the scenario might be key to understanding why it leads to incorrect responses – the scenario forces the observer to represent the movement in a specific allocentric reference frame. By hypothesis, that reference frame is disconnected from the format of the inferences the system has available.
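The physically correct answer in the curved-tube scenario can be made concrete with a minimal sketch. It assumes, purely for illustration, a circular tube viewed from overhead, a ball moving counter-clockwise at constant speed, and no horizontal forces after exit; the function names, tube radius, and speed are hypothetical and do not come from the original experiments.

```python
import math

def position_after_exit(radius_m, speed_mps, exit_angle_rad, t_s):
    """Top-down position of the ball t_s seconds after it leaves the tube.

    Assumption: inside the tube the ball moves counter-clockwise on a circle
    of radius `radius_m` centred on the origin. After exit no horizontal
    force acts, so it continues along the tangent at constant velocity.
    """
    exit_x = radius_m * math.cos(exit_angle_rad)
    exit_y = radius_m * math.sin(exit_angle_rad)
    # Unit tangent for counter-clockwise motion at the exit point.
    tangent_x = -math.sin(exit_angle_rad)
    tangent_y = math.cos(exit_angle_rad)
    return (exit_x + speed_mps * t_s * tangent_x,
            exit_y + speed_mps * t_s * tangent_y)

def is_collinear(p0, p1, p2, tol=1e-9):
    """True if three points lie on one straight line (near-zero cross product)."""
    cross = ((p1[0] - p0[0]) * (p2[1] - p0[1])
             - (p1[1] - p0[1]) * (p2[0] - p0[0]))
    return abs(cross) < tol

# Hypothetical values: 0.5 m tube radius, 2 m/s ball speed, exit at 90 degrees.
points = [position_after_exit(0.5, 2.0, math.pi / 2, t) for t in (0.0, 0.5, 1.0)]
```

The sampled positions lie on a single straight line along the exit tangent, which is the path that the curvilinear diagrams fail to capture.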
Consider what would be involved in observing the Newtonian dynamics of a large, faraway object – for example, a boulder tumbling down a mountain, or a tree falling after it has been chopped down. The retinal sizes of such objects can be the same as those of nearby objects on one's desk, and the same physical laws apply to both the distant and nearby objects. Yet the observed physical dynamics are quite different in their visual patterns. The boulder appears to fall in slow motion because its visual acceleration profile does not match the acceleration of falling objects that are close to the observer. These are the types of divergent cues that may be generated by separable systems about what will happen as the event unfolds.
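The difference in visual acceleration can be illustrated with a small sketch. It rests on simplifying assumptions that are ours rather than the cited studies': the object starts at eye level, falls from rest without air resistance, and is viewed side-on. The viewing distances (1 m and 1 km) and the function name are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def angular_drop_deg(viewing_distance_m, elapsed_s):
    """Visual angle (degrees) swept by an object that starts at eye level,
    falls from rest, and is viewed side-on from `viewing_distance_m` away."""
    drop_m = 0.5 * G * elapsed_s ** 2
    return math.degrees(math.atan2(drop_m, viewing_distance_m))

# Same physical fall, same elapsed time, very different retinal motion.
near = angular_drop_deg(1.0, 0.1)     # hypothetical object on a desk, 1 m away
far = angular_drop_deg(1000.0, 0.1)   # hypothetical boulder, 1 km away
```

Under these assumptions the physical drop is identical in both cases, but the angle swept on the retina is roughly a thousand times smaller for the distant object, which is one way to see why it appears to fall in slow motion.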
Another classic finding is the "straight down belief": people diagram the path of an object after it is released by a walking person as falling straight down. In a similar manner to the scenario above, these errors in prediction could be due to the requirement to report the results of the event from the third-person perspective – a format that may not align with how the system represents the event internally. If physical predictions are informed by a system that sees the world in the first person, then participants' answers are actually correct: The ball does fall straight down from the perspective of the person walking. The "erroneous" judgement is a product of the mental model that participants have for completing the task, which operates first and foremost in the service of physical inference for first-person interactions.
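The reference-frame point can be sketched numerically. The example below assumes no air resistance and a walker moving at constant velocity; the walking speed, release height, and function names are hypothetical values chosen for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def fall_time(height_m):
    """Time for a released object to reach the ground, ignoring air resistance."""
    return math.sqrt(2.0 * height_m / G)

def landing_offset(walker_speed_mps, height_m, frame):
    """Horizontal distance between release point and landing point.

    frame="ground": allocentric frame fixed to the ground (third person).
    frame="walker": egocentric frame moving with the walker (first person).
    """
    t = fall_time(height_m)
    if frame == "ground":
        # The ball keeps the walker's horizontal velocity while it falls.
        return walker_speed_mps * t
    if frame == "walker":
        # Walker and ball share the same horizontal velocity, so the ball
        # stays directly below the hand throughout the fall.
        return 0.0
    raise ValueError("frame must be 'ground' or 'walker'")

# Hypothetical values: walking at 1.4 m/s, releasing the ball from 1.0 m.
third_person = landing_offset(1.4, 1.0, "ground")
first_person = landing_offset(1.4, 1.0, "walker")
```

In the walker's frame the horizontal offset is zero, so "straight down" is the correct description there; in the ground frame the same event has a forward displacement equal to walking speed multiplied by fall time.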
On this view, the intuitions of intuitive physics cannot help but be influenced by the "suggestions" generated from a first-person perspective on how the state of the world may change. The computations of the dorsal stream are inherently in the first person – and the dorsal stream is one source for the generation of such inferences. The "suggestions" that the dorsal stream may make to the rest of the brain are in the service of fitting actions to the current state of the world.
The tasks used in intuitive physics studies often ask for declarative and explicit judgements about how the world will be. By hypothesis, it is the inescapability of our physical reasoning from the suggestions made by the dorsal stream that causes some physical judgments to be systematically incorrect when queried in scenarios that run counter to the expectations of the system. Physical reasoning itself is not a dorsal process – the physical reasoning task is an explicit and declarative perceptual task. The physical reasoning task is equivalent to asking about the perceptual consequences of what will happen in the world.
There is an intriguing and instructive analogy between participants' erroneous judgements in physical reasoning tasks and some visual illusions, such as the Ebbinghaus/Titchener size constancy illusion (Titchener, 1901). Size constancy illusions are not a glitch in the system – they are a result of a feature (size constancy) that is a key element of a stable perceptual experience of the world. Errors in intuitive physical reasoning, like the Ebbinghaus/Titchener visual illusion, expose an aspect of how the system works: Some computations are applied in a compulsory manner. "Compulsory" in this context implies both "automatic" and "in a manner that is relatively encapsulated" from declarative knowledge that is also held about the world. For instance, knowing about the Ebbinghaus/Titchener illusion, even as one is staring at it, does not make it stop.
This framing of the source of some erroneous intuitive physics judgements generates several predictions about the experimental conditions under which people will display accurate or mistaken physical inferences. Some of these find preliminary support and some, we hope, will spur future research:
(1) The accuracy of some physical predictions will be modulated by the reference frame in which the scenario is depicted. The directional prediction is that scenarios that are presented in egocentric frames will yield more accurate predictions than those presented in allocentric frames. Some of the examples above align with that expectation, and other studies have also highlighted cases where physical reasoning can fail for scenarios that are depicted as being outside peri-personal space, or the space in which visually guided actions typically operate (Ludwin-Peery et al., 2020).
(2) Some systematically erroneous physical inferences should become accurate when the judgement is rendered by the participant via an action. Returning to the analogy to the Ebbinghaus/Titchener illusion, an important finding is that the illusory effect is less pronounced when the size judgment is rendered implicitly via spontaneous grip aperture during a reach-to-grasp action. The same illusory stimulus that generates a size constancy illusion in perception tricks the hand less during a visually guided grasp (Aglioti et al., 1995; Goodale, 2011). In this example and others, a task that asks participants to report on their perceptual experience can yield responses that are systematically incorrect. Similarly, we suggest, physical reasoning performance should improve if judgements about the unfolding physics of an event are collected via actions on the part of participants.
Prediction 2 is in line with an argument put forward by Smith and colleagues (Smith et al., 2018), and finds some preliminary empirical support from several studies, including one in this special issue (Neupärtl et al., 2022). When sliding a puck toward a target (Neupärtl et al., 2022) or moving a bin to catch a falling object (Smith et al., 2013), participants produce actions that are in line with Newtonian physics even when their explicit reports or categorical judgments about the scenarios are erroneous. Even for judgments about the behaviour of liquids in containers, which are notoriously challenging, pantomiming an action on the container can lead to substantially improved predictions (Schwartz & Black, 1999).
It is important to note that not all studies that used natural actions to probe physical predictions found improvements in accuracy, compared to explicit report (McCloskey & Kohl, 1983). Understanding the boundary conditions that allow the system to generate the correct judgements for a given physical event will provide important constraints on hypotheses about the causes of systematic biases.

What we are (and are not) arguing
We assume that the dorsal stream is not cognitively penetrable. Our argument is not in conflict with the view that the dorsal stream is (relatively) informationally encapsulated. Indeed, we would subscribe to a robust version of the view that declarative knowledge systems can intervene only on inputs and outputs to the dorsal stream but not internal machinations, and that dorsal stream computations do not have access to semantic interpretations of the world. The dorsal stream is focused on processing visual information in support of real-time actions by the hand, eye, body – it is a type of "lidar" for the body (see discussion in Mahon & Wu, 2015; Mahon, in press).
As noted, it is important that the same naïve observers who make incorrect physical inferences in certain scenarios (e.g., "the ball will drop straight down") will make the correct actions in the first person (e.g., catch the same ball). Intuitive physics inferences are shaped by information being generated by the dorsal stream; performing the correct action in the first person to catch the ball, and thus displaying a veridical underlying representation of the ball's trajectory, is a dorsal process. When the task used to query physical inference is aligned with the suggestions of the dorsal stream, there is no conflict, and performance is not systematically wrong.
While the "physics engine in the brain" might be informed by dorsal stream processing, intuitive physics judgements as they are typically studied are not "dorsal stream inferences". The explicit, declarative nature of participants' reports in most intuitive physics tasks (e.g., drawing the path a falling object will take, or predicting the position where it will land) cannot draw directly on dorsal processes because those processes are siloed from broader declarative knowledge. This disconnect is (we suggest) precisely why many intuitive physics judgements can be systematically wrong. One simply cannot intervene to stop the information from being generated in a particular way, just like one cannot stop the perceptual (ventral) visual system from generating the Ebbinghaus/Titchener illusion. The physical "inferences" that guide a reach to an object's centre of mass, or help one dodge a snowball, play out in the dorsal stream; however, declarative tasks that ask for reports about those inferences cannot access the contents of the computations for explicit report. Explicit reports might be garden-pathed by dorsal stream contributions, but the garden-pathing can be more of a hindrance than a benefit for explicit tasks in an allocentric format. On this view, some intuitive physics errors reflect a type of illusion that, ironically, is made possible by a system (the dorsal stream) that specializes in representing reality veridically.
Physical reasoning is not just visual mental imagery. It feels natural to think about physical prediction as a movie playing out in one's head – a form of visual mental imagery. And, of course, we would not disagree that visual mental imagery can be engaged in some physical reasoning tasks. However, does having more vivid mental imagery allow for better predictions? Or at the very least, is a lack of mental imagery an impediment for accurately forecasting physical events? Recent findings (Washington & Fischer, 2021) suggest that vivid visual imagery does not predict good performance on intuitive physics tasks, and, if anything, can be slightly detrimental. Perhaps the vividly "seen" outcomes in mental imagery can lead judgements astray, as some sources of physical intuitions do not manifest as images (e.g., a catching action of the ball whose trajectory is being predicted). Similarly, recent work has shown that intuitive physics is not simply a special case of spatial cognition (Mitko & Fischer, 2020) – the two domains make distinct contributions to individual differences in performance on physical reasoning tasks. This is not to say that imagery itself is the source of errors in physical reasoning. The source of the miscalculation could be generated by systems having nothing to do with imagery per se. Imagery may be where the error is detected in intuitive physics tasks; where an error is detected is not always where it arises.
The role of visual mental imagery in supporting intuitive physics reasoning is of direct relevance to our proposal – as visual mental imagery engages bilateral posterior parietal regions. The regions that are engaged in visual mental imagery are posterior and superior to the parietal regions highlighted in Figure 1. Xu (2018) has argued for a new framework to understand the nature and source of posterior parietal visual representations. On Xu's proposal, while ventral stream representations achieve invariance in order to provide a stable basis for perception, the role of the posterior parietal cortex is to process visual representations in a manner that is tuned to the current task – "adaptive" visual processing. A component of Xu's proposal is that posterior parietal visual representations are dependent on connectivity with occipitotemporal areas, potentially via the vertical occipital fasciculus (Kravitz et al., 2013; Yeatman et al., 2014).
Thus, and to be perfectly speculative, when the task used to query the intuitive physics judgements imposes an allocentric frame, the system is pushed toward having to use its mechanisms for adaptive visual manipulation to solve the task. That system of adaptive visual manipulation (mental imagery) is like a white board for physical reasoning. The task solution is disconnected from some of the first-person inferences that are being generated.
Indeed, we can process and make predictions about events that are not first person – it's not that we cannot perceive, think about, and make predictions in allocentric frames. A growing literature has implicated regions of lateral posterior temporo-occipital cortex that process event structure driven by biological agents, and event structure driven by mechanical interactions among inanimate entities (Beauchamp et al., 2002; Beauchamp et al., 2003; Wurm et al., 2017). It is an open question as to why those lateral temporo-occipital regions are not able to support veridical physical reasoning in scenarios that yield systematically wrong intuitive physics judgements. Nonetheless, such systems live alongside the endogenous first-person physics engine that we have argued is supported by the fronto-parietal network highlighted in Figure 1.
What does "overlap" even mean?
Our argument is (unapologetically) rooted in a form of "reverse inference" (Poldrack, 2011), and based on a simple-minded construal of "neural overlap". "Neural overlap" is in turn based on empirical observations that have not (yet) been demonstrated within the same individuals (across tasks). The empirical generalization on which we have premised our proposal is that the supramarginal gyrus, together with other parietal and frontal areas, is reliably engaged across different tasks (Figure 1). We have argued for the strongest form of our proposal: There is a superordinate computation shared by those tasks. Here we unpack some challenges faced by this proposal. The intention is not to put such concerns to rest, but to expose the vulnerabilities of our proposal as clearly as possible, with the goal of motivating empirical studies that might resolve the issues.
There is precedent in the field for the general structure of our argument: Observations of neural overlap are used to support inferences of common computations. There are also clear examples where prior claims of computational overlap, motivated by neural overlap, have been empirically disconfirmed. Briefly reviewing a few examples will help frame expectations for evaluating the current proposal.
Perhaps the most widely discussed example comes from the class of proposals that are broadly descended from Motor Theories of Perception (Liberman et al., 1967): Observations that motor production processes (and their brain regions) are automatically active during perception have been taken as evidence that production processes are (constitutively) part of perception. The "mirroring" hypothesis argues that a motor computation is involved in both action (by definition, as it is the motor system) and perception and recognition (Rizzolatti & Fogassi, 2014). The "mirroring nature" of some neural responses in some motor areas has been invoked as an explanation of how we recognize speech sounds, the sounds of bodily actions, and visual observations of hand actions (among other applications; di Pellegrino et al., 1992; Galantucci et al., 2006; Pazzaglia et al., 2008; for broader and critical discussion, see Dinstein et al., 2008; Hauk, 2016; Hickok, 2009; Negri et al., 2007).
Stepping back from Motor Theories of Perception: There are "motor simulation" approaches to meaning representation that emphasize the role of motor processing in knowledge representation (Gallese & Lakoff, 2005; Glenberg, 2015; Pulvermuller, 2013). Those proposals assume that, for instance, the meanings of words that refer to actions (kick, punch, etc.) depend on the concurrent simulation of the corresponding motor processes. The evidence for such simulationist accounts of meaning representation is that the motor system is automatically active during the processing of the meaning of those words (Hauk et al., 2004). A third group of proposals emphasizes the role of motor processes and regions in motor imagery (see review and discussion in Hetu et al., 2013).
There are thus three related groups of "motor simulationist" proposals that all start with an observation of overlap – namely that motor processes/regions are active during action perception | understanding | imagery. All three proposals assume a common superordinate (in this case motor) computation is drawn upon across tasks.
"Motor simulationist" theories are vulnerable to empirical disconfirmation in at least two ways: (i) show that transient or long-term disruption of the motor region in question does not disrupt perception | meaning representation | imagery; and/or (ii) disconfirm the empirical premise of "overlap", for instance with methods or techniques with greater spatial sensitivity. We briefly consider two examples here to illustrate how such tests could be applied to our proposal and some implications.
An example test using lesion evidence. The observation that listening to speech sounds leads to activity of speech motor areas has been argued to support the claim that motor production processes (in speech) are constitutively involved in perception of speech (D'Ausilio et al., 2009; Galantucci et al., 2006). Stasenko and colleagues (2015) found that a brain lesion to the speech motor system can disrupt speech motor ability while completely sparing the ability to perceive speech sounds (see also Rogalsky et al., 2011). Such patients cannot reliably produce "pear" versus "bear", but have no difficulty discriminating those minimal pairs (p/b). "Motor simulationist" theories predict that perceptual processes will (necessarily) be disrupted in the measure to which the primary motor processes are disrupted. Thus, such observations are incompatible with the claim that motor simulation is a necessary part of perception (see discussion in Lotto et al., 2009; Stasenko et al., 2013).
An example test of the claim of overlap: Persichetti and colleagues (Persichetti et al., 2020) applied a new functional MRI method called vascular space occupancy (VASO) to the question of whether there is overlap between motor production and motor imagery within primary motor cortex. VASO measures cerebral blood volume and has higher contrast-to-noise at high spatial resolution than conventional BOLD fMRI (Huber et al., 2017). VASO has the sensitivity to distinguish superficial laminar activity, associated with afferent inputs, from deep laminar activity, associated with efferent outputs (Huber et al., 2017). The technique is thus suited to test for intra-cortical overlap. Using that technique, Persichetti and colleagues found that overt hand actions (finger tapping) led to activity in both superficial and deep layers, while motor imagery engaged only superficial layers. Thus, the theoretically predicted overlap was not found. By comparison, conventional BOLD fMRI with the same experiment produces robust "overlap", but only because the technique does not distinguish the different cortical layers where overlap is not present. While Persichetti and colleagues focused on motor imagery, their findings motivate similar tests of other motor simulationist theories (see discussion in .
The bottom line of this short discussion cautions against inferring computational overlap from neural overlap – that gambit is bound to fail as new technology dissects the functional organization of the brain at finer and finer scales of spatial resolution. Let us (perhaps safely) assume, for the sake of argument, that a future empirical study will disconfirm the assumption of "overlap" on which our proposal is premised.
For instance, let us assume that a traditional (e.g., 3 mm voxel size) BOLD fMRI study confirms our core prediction that there will be overlap, within individual brains, in the supramarginal gyrus for tool representation, intuitive physics, and grasping. Imagine then that a subsequent study, using higher resolution BOLD fMRI (or, for instance, VASO), finds that such areas of "overlap" can be separated into smaller subregions that are task specific. Imagine this hypothetical study finds one subregion (or single unit, or cortical layer) involved in tool representation, while another subregion is involved in intuitive physics judgements (and so on). Empirically, it would be clear that the "overlap" was only apparent, and due to the (comparatively) low spatial resolution of conventional BOLD fMRI. How would such an outcome relate to the hypothesis that there is a superordinate computation that is common across tasks, and which is implemented by the supramarginal gyrus, together with the broader network in Figure 1?
Consider the following analogy. Gambling is legal in some locations in the United States. One of those locations is the entire state of Nevada. In this example, Nevada is analogous to the supramarginal gyrus, and the Nevada law that makes gambling legal is analogous to the "superordinate computation" that is common across the tasks highlighted in Figure 1. One could further treat the different contexts (cards, roulette, sports, etc.) in which "gambling" is implemented as analogous to the different tasks that activate the supramarginal gyrus. Now let's run the overlap experiment on the analogy: There could be two ways in which gambling is implemented by task (i.e., by cards, roulette, sports) geographically across Nevada. In one world, only a single type of gambling (cards OR roulette OR sports games) occurs at any given physical establishment or business. In the other world, any given establishment, at any given physical location, has all types of gambling occurring under one roof. Both worlds, to our intuition, are compatible with the idea that a superordinate computation (the Nevada law) is the reason why all of the different forms of gambling (cards, sports, roulette) can occur in Nevada (and not in nearby states).
The point of the gambling analogy is illustrative (as opposed to demonstrative) about what granular physical overlap, or lack thereof, might mean for a theory of how processing unfolds across different tasks. In the end, the core issue reduces to specifying the relevant spatial granularity at which overlap is expected; "relevant", because this raises the hard problem of specifying which parts of the biology match up with which parts of a computational theory of how the process works (the "granularity mismatch problem"; Krakauer et al., 2017; Poeppel, 2012). Perhaps "brain region", as identified functionally using BOLD fMRI, picks out areas for which a condition is met that allows a certain type of computation to occur (the way the laws of Nevada are the condition for gambling, in all its forms, to co-localize in Nevada).
One difference between our interpretation of "overlap", compared to the "motor simulationist" examples discussed above, is that our proposal posits a "superordinate" computation that applies to all tasks. By contrast, motor simulationist theories started with the assumption that motor-relevant areas implement motor processes; thus, any task that activates those areas is assumed to involve a motor process. That assumption was licensed, within the framework of reverse inference, because the effects were observed in motor- or peri-motor areas (localized by a "motor task"). The common computation that we have proposed is not perfectly aligned with any one task. Indeed, one way in which the view that we have proposed can be further challenged and tested is to consider other tasks that may also drive activity in the same regions/network.
Notably, there is a well-attested role of the supramarginal gyrus in phonological processing (for review and empirical investigation, see Oberhuber et al., 2016). Is this a counterexample to our proposal? Or does it further triangulate a computational refrain common to the tasks that engage the supramarginal gyrus? Phonemes are perceptual categories (in speech perception) and action categories (in production). As high-level action categories, phonemes must be implemented and linearized, with accommodation (e.g., co-articulation) to the current and future states of the speech system. Hickok and Poeppel (2007) argued that the dorsal language pathway, which is supported by the long fibres of the arcuate fasciculus, maps sound categories to motor categories, and that the supramarginal gyrus is a key hub integrating the "long" segment of the arcuate fasciculus (connected to frontal speech motor areas) with the descending segment of the arcuate (connecting to temporal lobe perceptual representations). The expectation on our proposal would be that what the supramarginal gyrus is "doing" in the context of phoneme processing is fundamentally about prediction.
To conclude this discussion about "overlap", we return to another potential take-away from the analogy to gambling. The Nevada law that makes gambling possible need not describe every form that gambling may take. We have emphasized the idea of a common superordinate computation across tasks that seeks to explain why there is neural overlap. The neural overlap does not necessarily tell us what the computation is (although we have suggested it can be triangulated by studying the different tasks and contexts that lead to activity in that region). Perhaps what makes the supramarginal gyrus play host to such a diverse set of tasks is not a computation, but a condition that makes certain types of computations possible. Each of those computations might be inherently task specific (like different forms of gambling), but they share (like gambling) a common local condition that allows them to co-localize there (while other brain regions do not have such conditions). One intriguing possibility is that such 'conditions' can be quantified in terms of patterns of structural connectivity.

Next steps
Why do the representation of manipulable objects, action planning, and intuitive physics co-localize in fronto-parietal areas? The premise of this review has been that it is because of a common computation: Physical reasoning about the future state of the world from a first-person perspective. Actions are first person; perception is also first person. Thus, by arguing for a computational stance that is inherently first person, we are not arguing that the relevant superordinate computation is necessarily motor- or action-based. Both perceptual and motor constraints shape the inferences that are generated by the brain's first-person physics engine.
Of the regions that emerge across studies, the supramarginal gyrus is a structure that we would (speculatively) propose as the basis for such predictive first-person inferences. The supramarginal gyrus is a structure that anatomically and functionally integrates perception (visual, auditory, proprioceptive, somatosensory) with actions (by the hands, mouth, limbs, and eyes). Presumably, the conditions (or computations) that drive co-localization of those tasks to the supramarginal gyrus and its network are innately specified. When thinking about why there are innate constraints in the brain, it is easy to gin up stories of selective pressures operating to encourage the system toward its current organization. Such "just so" stories offer no hard constraints on theories. This is because the pressures that led to the current universal organization may not be the same as those associated with current use, and the pressures may not even have had to do with current function (Dehaene & Cohen, 2007). "Innate constraints" does not imply "selected for current use"; the current organization could have been a spandrel of other constraints (Gould & Lewontin, 1979).
Actions, manipulable object representation, and intuitive physics reasoning are tasks and groupings of stimuli. Those tasks and groupings of stimuli neatly capture variance in neural responses in the common fronto-parietal network shown in Figure 1 because each of those tasks, by hypothesis, engages a computation that is enabled by that network. The merit of the approach we have proposed will be weighed in whether it generates expectations that organize available evidence and future studies.

Notes
1. We are very grateful to Alex Martin for raising the conceptual issues, and some of the empirical examples, to be discussed in this section.
2. It should be noted that the original Motor Theory of Speech Perception predates all forms of functional brain imaging.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
Preparation of this manuscript was supported, in part, by NIH Grant R01EY028535 to BZM.