Extracting audio from video information

This story is a bit old, but it was orphaned in one of my browser tabs. This is some grade-A sci-fi hocus pocus:

Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.

In other experiments, they extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and even the leaves of a potted plant.


In the experiments reported in the Siggraph paper, the researchers also measured the mechanical properties of the objects they were filming and determined that the motions they were measuring were about a tenth of a micrometer. That corresponds to five thousandths of a pixel in a close-up image, but from the change of a single pixel’s color value over time, it’s possible to infer motions smaller than a pixel.

Suppose, for instance, that an image has a clear boundary between two regions: Everything on one side of the boundary is blue; everything on the other is red. But at the boundary itself, the camera’s sensor receives both red and blue light, so it averages them out to produce purple. If, over successive frames of video, the blue region encroaches into the red region — even less than the width of a pixel — the purple will grow slightly bluer. That color shift contains information about the degree of encroachment.
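The boundary-pixel argument is easy to put into numbers. What follows is only a toy sketch, not the researchers' algorithm: it assumes an idealized sensor that averages pure red and pure blue linearly by the area each covers, and the function names are made up for this illustration.

```python
def boundary_pixel_color(blue_fraction):
    """Color of a pixel straddling a blue/red edge, as (r, g, b) values in 0..1.

    Assumption: the sensor averages light linearly by covered area.
    """
    return (1.0 - blue_fraction, 0.0, blue_fraction)


def edge_shift_in_pixels(color_before, color_after):
    """Infer how far the blue region advanced, in pixel widths, from the blue channel."""
    return color_after[2] - color_before[2]


# Scale implied by the quoted figures: 0.1 micrometer of motion spans 0.005 pixel
micrometers_per_pixel = 0.1 / 0.005
print(f"implied scale: {micrometers_per_pixel:.0f} micrometers per pixel")

before = boundary_pixel_color(0.500)
after = boundary_pixel_color(0.505)  # blue advances by five thousandths of a pixel
shift = edge_shift_in_pixels(before, after)
print(f"estimated shift: {shift:.4f} pixel")

# With 8 bits per channel, a full-contrast edge changes by 255 levels per pixel of motion
levels_changed = shift * 255
print(f"change in 8-bit blue value: {levels_changed:.2f} levels")
```

Even in this best case of a full-contrast edge, the shift moves the blue channel by only about one 8-bit level, so a real system would presumably need to combine many pixels to pull the signal out of sensor noise.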

In recent years, researchers have developed methods for detecting people's heart rates purely from video, using small fluctuations in the color of their skin.
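To get a feel for how a pulse could be read out of color fluctuations, here is a minimal sketch on synthetic data. It is not any published method: the signal stands in for the average green value of a patch of skin, and the frame rate, amplitude, frequency band, and the helper name `dominant_frequency_hz` are all assumptions chosen for the example.

```python
import math


def dominant_frequency_hz(samples, fps, low_hz=0.7, high_hz=4.0, step_hz=0.01):
    """Return the frequency in [low_hz, high_hz] carrying the most spectral power."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]  # remove the constant skin tone
    best_freq, best_power = low_hz, -1.0
    steps = int(round((high_hz - low_hz) / step_hz))
    for k in range(steps + 1):
        freq = low_hz + k * step_hz
        # Correlate the signal with a cosine and a sine at this frequency
        re = sum(x * math.cos(2 * math.pi * freq * n / fps) for n, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * freq * n / fps) for n, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq


# Synthetic stand-in: ten seconds of 30 fps video with a faint 72 bpm pulse
fps = 30
true_bpm = 72
signal = [
    120.0 + 0.5 * math.sin(2 * math.pi * (true_bpm / 60) * n / fps)
    for n in range(10 * fps)
]

estimated_bpm = dominant_frequency_hz(signal, fps) * 60
print(f"estimated heart rate: {estimated_bpm:.0f} bpm")
```

Restricting the search to roughly 0.7 to 4 Hz (about 42 to 240 beats per minute) is a design choice that keeps slow lighting drift and fast noise from winning the peak search. Real footage would also need face tracking and noise handling that this sketch leaves out.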

What will happen when we are naked to computer vision? What if we can no longer hide when our heart starts racing, our skin flushes, or our hand quivers ever so subtly? We always thought we'd be the ones administering the Voight-Kampff test to cull the replicants from the humans, but maybe it's the reverse that arrives first. Machines just sitting there motionless, staring at us, and seeing everything.

The 130 million pixel camera

We all have them. Forget Apple's; the original retina display is still the best: the human eye.

The article is fascinating throughout. For example, the focal length of the lens that best approximates human vision is not 50mm, as is commonly supposed, but 43mm. The eye's aperture is roughly f/3.2 to f/3.5. Because the human retina is curved, it is sharper in the corners than a camera sensor, which is flat, so its corners sit farther from the lens than its center does. Of the human eye's roughly 130 million pixels, only 6 million see color.

We are still waiting for some new type of connector or bus that will allow us to use retina displays larger than those on MacBook Pros today. The amount of data to transmit exceeds what the existing Thunderbolt connectors can carry.
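A back-of-the-envelope calculation shows the scale of the problem. This is a rough sketch under stated assumptions: 24 bits per pixel, a 60 Hz refresh, no blanking or protocol overhead, nominal link rates of 10 and 20 Gbit/s for Thunderbolt and Thunderbolt 2, and a 5K desktop panel that I picked purely as an example of a larger retina display.

```python
def raw_video_gbps(width, height, bits_per_pixel=24, refresh_hz=60):
    """Uncompressed pixel data rate in Gbit/s, ignoring blanking and protocol overhead."""
    return width * height * bits_per_pixel * refresh_hz / 1e9


# Illustrative panel sizes (the 5K panel is a hypothetical example)
panels = {
    "15-inch Retina MacBook Pro (2880x1800)": (2880, 1800),
    "hypothetical 5K desktop panel (5120x2880)": (5120, 2880),
}

# Nominal link rates in Gbit/s, assumed here for comparison
links = {"Thunderbolt": 10, "Thunderbolt 2": 20}

for name, (width, height) in panels.items():
    rate = raw_video_gbps(width, height)
    fits = [link for link, capacity in links.items() if rate <= capacity]
    print(f"{name}: {rate:.2f} Gbit/s, fits on: {', '.join(fits) or 'neither'}")
```

Under these assumptions the laptop panel needs about 7.5 Gbit/s, while the 5K panel needs a little over 21 Gbit/s, which is more than either nominal link rate before any overhead is counted.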

So how does your brain deal with 130 million pixels of information being thrown at it in a constant stream? The answer is it doesn't.

The subconscious brain also rejects a lot of the incoming bandwidth, sending only a small fraction of its data on to the conscious brain. You can control this to some extent: for example, right now your conscious brain is telling the lateral geniculate nucleus “send me information from the central vision only, focus on those typed words in the center of the field of vision, move from left to right so I can read them”. Stop reading for a second and without moving your eyes try to see what’s in your peripheral field of view. A second ago you didn’t “see” that object to the right or left of the computer monitor because the peripheral vision wasn’t getting passed on to the conscious brain.

If you concentrate, even without moving your eyes, you can at least tell the object is there. If you want to see it clearly, though, you’ll have to send another brain signal to the eye, shifting the cone of visual attention over to that object. Notice also that you can’t both read the text and see the peripheral objects — the brain can’t process that much data.

The brain isn’t done when the image has reached the conscious part (called the visual cortex). This area connects strongly with the memory portions of the brain, allowing you to ‘recognize’ objects in the image. We’ve all experienced that moment when we see something, but don’t recognize what it is for a second or two. After we’ve recognized it, we wonder why in the world it wasn’t obvious immediately. It’s because it took the brain a split second to access the memory files for image recognition. (If you haven’t experienced this yet, just wait a few years. You will.)

ADDENDUM: The way human vision works, always putting the center of your vision in focus and blurring the edges so as to avoid overwhelming your brain with data, is somewhat replicated in form by these hyperphotos. That is, you are presented a photo with some baseline of resolution, but as you drill in on particular sections, the photo zooms and increases the resolution.