Figure 1: A new framework that formulates vision as mainly looking and seeing through a bottleneck. Traditional ideas view vision as mainly seeing, i.e., recognizing what is where in the visual field. However, due to a processing bottleneck, humans consciously recognize only 40 bits out of 10^7 bits of retinal input information each second [3, 4]. Only scene fragments around our gaze are seen clearly, due to visual crowding [5]. Hence, vision must select which fraction of the input to see. This selection is looking, largely by deciding where to shift our gaze or attention. Humans make saccades (ballistic gaze shifts) three times per second. When saccades target destinations outside the central zone of clear visibility, the destinations are largely non-random [6]. These observations and considerations motivate this central-peripheral dichotomy [7]: vision in the peripheral visual field is specialized for looking, i.e., selecting a peripheral visual location for the next gaze or attentional shift, whereas vision in the central visual field is specialized for seeing. Looking and seeing are the selection and decoding stages when vision is viewed as having three stages: encoding, selection, and decoding, with encoding mainly sampling and efficiently representing visual inputs before selection [4] for complex computations.
For example, when we direct our gaze to (i.e., fixate on) the first letter of a word in this sentence, individual letters in the next word are typically illegible [9]. Likewise, in a visual scene (Figure 1), only a small region around our gaze is clearly visible. Vision must decide where in the scene to direct the next gaze shift.
Hence, as a first approximation, vision should encompass looking and seeing: looking selects the fraction of visual input for entry into the bottleneck, largely by gaze shifts that center the selected content on our fovea, and seeing recognizes the selected content.
Previous formulations that divided vision into low-, mid-, and high-level stages [10] are imprecise and thus difficult to test. Marr’s influential formulation notably omitted the selection (looking) stage by viewing vision as successively building primal sketch, 2.5D, and 3D representations of the scene [11]. My recent textbook [4] characterizes vision as comprising three stages: encoding, selection, and decoding. Looking and seeing correspond to selection and decoding. Encoding, that is, efficiently sampling and representing visual information prior to selection, largely characterizes what retinal (and partly V1) neural receptive fields do [4].
Blind to our own blindness, we have the illusion of seeing everything clearly because our scene appears clear wherever we direct our gaze — like assuming the refrigerator light is always on because it is on whenever we open the door [12, 13]. This illusion has impeded progress.
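To give a sense of the scale of the bottleneck, the figures quoted in the Figure 1 caption can be combined in a back-of-the-envelope calculation. The sketch below is illustrative only. It assumes a retinal input rate of 10^7 bits per second and a recognized rate of 40 bits per second, as quoted in the caption. The per-fixation budget is a hypothetical derived quantity that assumes, simplistically, that the recognized bits are spread evenly over three saccades per second; the source does not make that claim. The function and variable names are likewise assumptions introduced for illustration.

```python
# Back-of-the-envelope arithmetic for the processing bottleneck.
# All default values are the figures quoted in the Figure 1 caption;
# the even split across fixations is a simplifying assumption.

RETINAL_INPUT_BITS_PER_S = 10**7   # retinal input information per second
RECOGNIZED_BITS_PER_S = 40         # consciously recognized bits per second
SACCADES_PER_S = 3                 # ballistic gaze shifts per second


def bottleneck_summary(input_rate=RETINAL_INPUT_BITS_PER_S,
                       recognized_rate=RECOGNIZED_BITS_PER_S,
                       saccades_per_s=SACCADES_PER_S):
    """Return simple ratios describing how narrow the bottleneck is."""
    return {
        # Fraction of retinal input that passes the bottleneck.
        "fraction_recognized": recognized_rate / input_rate,
        # How many input bits compete for each recognized bit.
        "reduction_factor": input_rate / recognized_rate,
        # Hypothetical budget if recognition were split evenly per fixation.
        "bits_per_fixation": recognized_rate / saccades_per_s,
    }


summary = bottleneck_summary()
for name, value in summary.items():
    print(f"{name}: {value:g}")
```

Under these assumptions, only a few parts per million of the retinal input are recognized, which is why the choice of where to look next carries so much weight.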
"Vision as looking and seeing through a bottleneck"
in press in Current Opinion in Neurobiology