I think it has to do with the type of information. Sound is basically a one-dimensional signal (pressure over time), which means all you need is a fast Fourier transform to separate it into its individual frequency components. Add in two receivers for spatial placement and you're set.
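To make the "one-dimensional signal" point concrete, here is a rough sketch of pulling two tones back out of a mixed signal. It uses a naive O(n^2) discrete Fourier transform rather than a real FFT, just to stay dependency-free. The function names, the sample rate, and the 5 Hz / 12 Hz tones are all made up for illustration, and this is not a claim about how the ear or brain actually does it.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns the magnitude of each non-negative frequency bin."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc) / n)
    return mags


def dominant_frequencies(samples, sample_rate, count):
    """Return the `count` strongest frequencies (in Hz), sorted ascending."""
    mags = dft_magnitudes(samples)
    n = len(samples)
    top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in top_bins)


# Hypothetical test signal: one second of a 5 Hz tone mixed with a quieter 12 Hz tone.
SAMPLE_RATE = 256  # samples per second (assumed)
signal = [
    math.sin(2 * math.pi * 5 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 12 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]

peaks = dominant_frequencies(signal, SAMPLE_RATE, 2)
print(peaks)
```

A single list of numbers goes in and the component frequencies come out, which is the sense in which a 1D signal is cheap to take apart. For real audio you would reach for an actual FFT implementation instead of this quadratic loop.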
Sight just carries a lot more information. There's color, contrast, movement, and spatial relationships, and every object constantly reflects or emits light, while only a few objects reflect or emit sound. Hence a 2D visual field, with two receivers for 3D placement.