You may recall from high school physics that light exhibits characteristics of both particles and waves. Different light attributes may be better thought of in terms of one or the other. For example, brightness is a function of the number of light particles — photons — arriving per unit time, whereas colour is a wave-like phenomenon, depending on the wavelength of those photons. For a given colour, increased brightness means having more photons arriving, not that the individual photons are themselves brighter.
(Note that I am being incredibly slipshod with the terminology here. As we’ll discuss in a little while, colour and brightness are not really attributes of the light at all, they are perceptual properties. But they’re a useful shorthand precisely because of that subjective familiarity.)
The energy of an individual photon depends on its wavelength or (reciprocally) frequency. Short wavelength photons have high frequencies and high energies; longer wavelengths have lower frequencies and lower energies. High energy photons can be damaging to biological tissues; lower energies can be harder to detect and broadly require larger sensory apparatus to distinguish from noise. The roughly 400 to 700 nm wavelength range of visible light represents a sort of Goldilocks zone of relatively harmless detectability in the context of human physiology. Visibility is not some physical property of light; it's just how things are for us. Other animals have different sensitivity ranges.
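To put rough numbers on that relationship, here's a minimal Python sketch of the Planck relation E = hc/λ (the function and constant names are my own, not anything from the text):

```python
# A rough numerical sketch of the Planck relation E = h*c / wavelength.
H = 6.62607015e-34    # Planck constant, J*s
C = 2.99792458e8      # speed of light in vacuum, m/s
EV = 1.602176634e-19  # joules per electronvolt

def photon_energy_ev(wavelength_nm):
    """Energy of a single photon at the given wavelength, in electronvolts."""
    return H * C / (wavelength_nm * 1e-9) / EV

# A violet-ish 400 nm photon carries about 3.1 eV; a red 700 nm photon
# about 1.8 eV -- the same colour at higher brightness just means more
# of these identical-energy photons per unit time.
print(photon_energy_ev(400))
print(photon_energy_ev(700))
```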
Light travels in straight lines at — famously — a constant speed. In a vacuum, that is. Matter complicates the picture in various ways. Usefully, we can think of the speed as varying according to the medium through which the light is passing. Changes in medium induce speed changes, and these lead to path changes. We can see this in effects such as heat haze, where fluctuating air density makes the scene appear to wobble. In particular, crossing a distinct boundary between different media — say between the air and a piece of glass — can alter the direction of the light, an effect we call refraction. This turns out to be super important.
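The direction change at such a boundary is governed by Snell's law, n₁ sin θ₁ = n₂ sin θ₂, where the n values are the refractive indices of the two media. A small illustrative sketch (the function is hypothetical; the indices are typical textbook values for air and glass):

```python
import math

def refraction_angle(theta_deg, n1=1.0003, n2=1.52):
    """Angle of the refracted ray (degrees) for light crossing from a
    medium with refractive index n1 into one with index n2, via
    Snell's law: n1*sin(t1) = n2*sin(t2). Returns None when the ray
    undergoes total internal reflection (only possible when n1 > n2)."""
    s = n1 * math.sin(math.radians(theta_deg)) / n2
    if abs(s) > 1.0:
        return None
    return math.degrees(math.asin(s))

# Air into glass: a ray hitting the surface at 30 degrees from the
# normal bends towards the normal, to roughly 19 degrees.
print(refraction_angle(30))
```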
If we want to use light as a medium for gathering information about the outside world, it’s helpful to know where it’s coming from. It’s not absolutely essential — there may still be some value in getting a general impression of aggregate lightness or darkness, and there are animals that do just that — but that only really scratches the surface of what we tend to think of as “vision”.
Assuming we have some kind of light detector — maybe some silver halide emulsion on a film or plate, a photodiode, or a nerve cell containing photosensitive pigments, whatever — if light reaches it simultaneously from different locations in the world, the information from each location will just pile up together in the detector’s response as an indistinguishable blur, and we won’t be able to unpick which bit relates to what. What we would like is some form of spatial organisation, so that the light from one place all goes to the same detection point, while light from other places goes to other points.
One way to do this is to restrict the possible straight line paths of the light by forcing it all through the same point in space — a pinhole aperture. All the light arriving at any detector must be coming from exactly the direction of the pinhole, because there’s nowhere else it could have come from. This kind of setup is nice and simple and does occur in nature, especially in simpler organisms, requiring a lot less physiological machinery to evolve and build. But it has the disadvantage of capturing only a tiny fraction of the available light, discarding a lot of potentially useful information and reducing the signal-to-noise ratio.
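The geometry here is just straight lines through a point, so a pinhole "camera" can be sketched in a couple of lines of Python (an idealised model with names of my own choosing):

```python
def pinhole_projection(x, y, z, d):
    """Project a world point through an idealised pinhole at the origin
    onto a detector plane a distance d behind it. x and y are offsets
    from the optical axis, z is the distance in front of the pinhole.
    The straight-line path through the hole inverts the image."""
    return (-x * d / z, -y * d / z)

# A point 1 m above the axis and 10 m away, detector 20 mm behind the
# pinhole: its image lands about 2 mm *below* the axis, upside down.
ix, iy = pinhole_projection(0.0, 1.0, 10.0, 0.020)
print(ix, iy)
```

Note that the aperture diameter appears nowhere in the model: every detector position maps to exactly one direction, which is precisely the property we wanted.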
An alternative is to have a larger aperture and use refraction through the curved surface boundaries of a lens to differentially bend the light passing through it, bringing the light to focus at the detector. Lens systems are harder to build than pinholes, both for human engineers and for biology, but they allow the capture of more light and more information.
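For an idealised thin lens, where that focus point lands is given by the thin-lens equation, 1/f = 1/d_object + 1/d_image, with f the focal length. A sketch under that idealisation (function name and example numbers are mine):

```python
def image_distance(f, d_object):
    """Distance behind an idealised thin lens of focal length f at
    which a point d_object in front of it comes to focus, from the
    thin-lens equation 1/f = 1/d_object + 1/d_image."""
    return 1.0 / (1.0 / f - 1.0 / d_object)

# For a 50 mm lens: a point 10 m away focuses ~50.3 mm behind the
# lens, one 2 m away focuses ~51.3 mm behind it -- points at
# different distances converge at (slightly) different planes.
print(image_distance(0.05, 10.0))
print(image_distance(0.05, 2.0))
```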
The detector at a focused point has good information about one specific external position. If we have many detectors at many locations, that gives us a spatial map of optical information over a whole visual field — which is to say, an image.
Of course, focus is not perfect and not the same everywhere. Lenses will only bring to focus some parts of the external world — some range of distances from the lens. Detectors at well focused points will get strongly localised information, those where the focus is less good will receive light from a wider range of external regions and be correspondingly more imprecise, more blurry.
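This defocus blur can be sketched with the thin-lens equation plus a bit of similar-triangle geometry (again an idealised model, ignoring diffraction; all names and numbers here are illustrative):

```python
def blur_diameter(f, aperture, d_focused, d_point):
    """Diameter of the blur circle that a point at distance d_point
    casts on a detector positioned to focus points at d_focused, for
    an idealised thin lens of focal length f with the given aperture
    diameter. Ignores diffraction and lens imperfections."""
    def image_dist(d):  # thin-lens equation: 1/f = 1/d_o + 1/d_i
        return 1.0 / (1.0 / f - 1.0 / d)
    s = image_dist(d_focused)   # where the detector sits
    di = image_dist(d_point)    # where this point actually converges
    # Similar triangles between the aperture and the defocused cone:
    return aperture * abs(s - di) / di

# 50 mm lens, 10 mm aperture, detector focused at 2 m: a point at 2 m
# is sharp, one at 10 m smears over a disc a fraction of a mm across.
print(blur_diameter(0.05, 0.01, 2.0, 2.0))
print(blur_diameter(0.05, 0.01, 2.0, 10.0))
```

Note also that the blur scales with the aperture diameter, which is why the pinhole — effectively a near-zero aperture — never has this problem.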
We are probably all familiar with this kind of locally focused imaging from using cameras, which combine a lens system for light gathering and focusing with some kind of spatial array of detectors — like a film frame or CCD — to perform the actual image capture.
Human vision is quite unlike photography in various important ways, some of which we’ll get into. But there are also enough structural similarities in the early stages that cameras are at least analogically useful as a model for starting to think about eyes.