From Neuroscience to Computer Vision

A 50-year look at human and computer vision

Ben Rogojan
Better Programming


Photo by David Travis on Unsplash

Vision is a complex task that the human brain (and now computer brains) has to take on. We take much of what our brains do for granted: depth perception, object tracking, adjusting for differences in lighting, edge detection, and many other features our brains keep track of. Scanning the environment and localizing where we are in space is something our brain is constantly doing.

At some point in the past, researchers may never have thought it possible to create systems that perform tasks similar to those our own brains handle. Yet, in the last 50 years, we have gone from what might seem like small steps in neuroscience to computers being able to describe the scenes in pictures.

There are plenty of anecdotes taught in neuroscience courses to help students understand how the brain functions. It may be Phineas Gage surviving a railroad tamping iron through his left frontal lobe, or Britten’s paper on when the brain can detect a coherent signal in a chaotic mess of moving dots. All of these bits and pieces of research build up an understanding of how our brain operates.

One such example that laid the foundation for vast amounts of research in human vision, as well as in computer vision, was the research of Hubel and Wiesel.

Hubel and Wiesel were awarded the Nobel Prize in Physiology or Medicine in 1981 for their work in neurophysiology. They had made groundbreaking discoveries concerning information processing in the visual system.

They brought us the original snap, crackle, and pop — and no, not the cereal. By connecting an electrode to a neuron, they were able to listen to the neuron respond to the stimulus of a bar of light.

They opened up a new understanding of how cortical neurons in V1, the primary visual cortex, operate, and it was mind-blowing. Their research helped lay out the mapping and function of neurons in V1.

In the video below, the pair demonstrates how neurons in V1 respond only to bars of light at certain locations and angles. As the bar of light is moved, there is a crackle. You are hearing the neurons of a cat respond to the stimulus.

With this experiment, they demonstrated that different types of neurons were activated only under specific stimulation. Another fascinating feature was that the cells seemed to naturally map to different angles. As the picture below demonstrates, each section of V1 contains a very specific set of neurons that respond mostly to bars of light at a particular angle.

Image source: Hubel and Wiesel

These cell responses, when combined, were theorized to somehow build up a bottom-up image of the natural world. That is to say, by combining the responses of many neurons to various bars of light, the human brain begins to draw a picture of the world around it.

Fast forward nearly 30 years to Bruno Olshausen and David Field, two researchers in computational neuroscience, the study of how the brain encodes and decodes information. Olshausen and Field took this work a step further, and they explicitly reference the work done by Hubel and Wiesel 30 years prior.

Instead of just focusing on single bars of light, their team took pictures and started to look at how algorithms could recognize and encode features inside of those images.

One of their papers is called “Natural Image Statistics and Efficient Coding,” and it was written back in 1996, more than 20 years ago.

The purpose of this paper was to discuss the failures of Hebbian learning models in image coding, specifically the Hebbian learning algorithms that relied on principal component analysis. The issue was that those models could not, at the time, learn the localized, oriented, bandpass structures that make up natural images.

This is, theoretically, a few layers up from what Hubel and Wiesel started to demonstrate in their research with real neurons. Except now they were modeling the output of 192 units (or nodes) in a network rather than listening to individual cells. Their research showed that models that emphasized sparseness when coding the regularities found in natural images were far more effective.

A sparse model focuses on limiting the number of coefficients, one per basis function, required to represent the various features in an image.

This is demonstrated by the formula below.

Image source: Olshausen and Field, “Natural Image Statistics and Efficient Coding”
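For reference, in case the figure does not come through, the model in the 1996 paper boils down to a simple linear code (this is a paraphrase, not a reproduction of the figure): each image patch is written as a weighted sum of basis functions,

$$I(x, y) = \sum_i a_i \, \phi_i(x, y)$$

where the $\phi_i$ are the basis functions and the $a_i$ are the coefficients the sparse model tries to keep to a minimum.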

Essentially, the term below captures the search for the lowest mean error between the actual image and the reconstruction built from the basis functions.

Image source: Olshausen and Field, “Natural Image Statistics and Efficient Coding”
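Roughly, that error term can be written as

$$\sum_{x, y} \Big[ I(x, y) - \sum_i a_i \, \phi_i(x, y) \Big]^2$$

i.e., the squared difference between the actual image and its reconstruction from the basis functions, summed over every pixel.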

It was then partnered with a cost function that forced the algorithm to limit the number of coefficients required to represent the image.

Image source: Olshausen and Field, “Natural Image Statistics and Efficient Coding”
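Putting the two pieces together, the energy being minimized in the paper looks roughly like

$$E = \sum_{x, y} \Big[ I(x, y) - \sum_i a_i \, \phi_i(x, y) \Big]^2 + \lambda \sum_i S\!\left(\frac{a_i}{\sigma}\right)$$

where $S$ is a sparseness-inducing cost such as $\log(1 + x^2)$, $\sigma$ is a scaling constant, and $\lambda$ controls how heavily sparseness is weighted against reconstruction error.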

Gradient descent is then used to minimize this combined energy, which keeps the reconstruction accurate while using as few coefficients as possible to represent the image.
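As a concrete illustration, here is a minimal sketch (a toy in NumPy, not the authors’ code) of inferring the coefficients by gradient descent on that energy, assuming a fixed random dictionary in place of learned basis functions:

```python
# Toy sketch of sparse coding inference, not Olshausen and Field's original code.
# A fixed random dictionary stands in for learned basis functions; the
# coefficients `a` are inferred by gradient descent on reconstruction error
# plus a log(1 + a^2) sparseness penalty.
import numpy as np

rng = np.random.default_rng(0)

patch_size = 256        # a 16x16 image patch, flattened
n_basis = 192           # same number of basis functions as the 1996 model

phi = rng.standard_normal((patch_size, n_basis))  # dictionary of basis functions
phi /= np.linalg.norm(phi, axis=0)                # unit-norm basis functions
image = rng.standard_normal(patch_size)           # stand-in for a real image patch

a = np.zeros(n_basis)   # coefficients to infer
lam, step = 0.2, 0.01

for _ in range(1000):
    residual = image - phi @ a                           # I - sum_i a_i * phi_i
    grad = -phi.T @ residual + lam * 2 * a / (1 + a**2)  # grad of 0.5*error^2 + penalty
    a -= step * grad

print("coefficients doing real work:", int(np.sum(np.abs(a) > 0.1)))
```

In the full model the basis functions themselves are also learned, which is what produces the localized, oriented filters shown in the paper.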

This paper in itself did not yet show a neural network capable of interpreting an entire image.

Give them a break, this was 1996! The World Wide Web had only become public in 1991!

By this point, science had gone from detecting bars of light with a cat’s neurons to a mathematical model of a network that outputs actual features from images.

The last line of the 1996 paper was “An important and exciting future challenge will be to extrapolate these principles into higher cortical visual areas to provide predictions.” This was the challenge: to take the lower-level features that computational researchers were already modeling and create a bottom-up network that could actually predict the content of an image.

Image source: Olshausen and Field, “Natural Image Statistics and Efficient Coding”

The outputs of the Olshausen and Field model were similar to the ones shown above.

Does this matrix of learned lower-level features look familiar? It should, especially if you’re a deep learning fan!

Many papers in the last few years have utilized a very similar matrix of filters. These filters make up the convolutional layers in convolutional neural networks, and they are meant to mimic the way a single neuron responds to a visual stimulus.

Image source: Andrej Karpathy and Li Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”
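For readers who want to see the connection in code, here is a minimal sketch of a first convolutional layer (PyTorch is assumed here; this is an illustration, not code from any of the papers referenced). Each filter is slid across the image and produces one response map, loosely analogous to a V1 neuron responding to an oriented bar in its patch of the visual field:

```python
# Minimal sketch of a first convolutional layer (assumes PyTorch is installed).
import torch
import torch.nn as nn

# 64 learned filters, each 11x11 pixels, applied across an RGB image
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4)

image = torch.randn(1, 3, 224, 224)   # stand-in for one RGB photo
feature_maps = conv1(image)           # one response map per filter
print(feature_maps.shape)             # torch.Size([1, 64, 54, 54])
```

After training on natural images, the weights of a layer like this tend to end up looking strikingly similar to the basis functions Olshausen and Field derived.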

Except now, the idea of taking these lower-level features and predicting the actual content of the image is no longer the last line of a paper. It is no longer just a theory. It is reality.

This is how far things have come: from demonstrating that neurons can recognize bars of light to having neural networks that can take low-level features and predict what an image contains.

A great paper on the subject was written in 2015 by Andrej Karpathy and Li Fei-Fei at Stanford, called “Deep Visual-Semantic Alignments for Generating Image Descriptions.” They demonstrate a recurrent neural network that is capable of providing a detailed description of an image: not just pointing out a cat or a dog in a picture, but describing the whole scene, like “boy is doing a backflip on a wakeboard” (as in the picture below).

Image source: Andrej Karpathy and Li Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”

Now, it’s not perfect. However, it’s still leaps and bounds from 1968!
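For the curious, here is a highly simplified sketch of the encoder-decoder idea behind this kind of captioning model (a hypothetical toy in PyTorch, not Karpathy and Fei-Fei’s actual architecture): a CNN summarizes the image into a feature vector, and a recurrent network unrolls that vector into scores over words.

```python
# Toy encoder-decoder captioner: a CNN encodes the image, an LSTM decodes words.
# This is a simplified illustration, not the architecture from the 2015 paper.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 256, 512

cnn = nn.Sequential(                                  # stand-in image encoder
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, hidden_dim),
)
embed = nn.Embedding(vocab_size, embed_dim)           # word embeddings
rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)          # hidden state -> word scores

image = torch.randn(1, 3, 224, 224)                   # stand-in photo
caption = torch.randint(0, vocab_size, (1, 12))       # 12 word ids (training input)

h0 = cnn(image).unsqueeze(0)                          # image features seed the RNN
out, _ = rnn(embed(caption), (h0, torch.zeros_like(h0)))
word_scores = to_vocab(out)                           # (1, 12, vocab_size) logits
print(word_scores.shape)
```

Trained on pairs of images and sentences, a model in this family learns to emit a caption like the one above word by word.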

It has been a long process to get here. The papers referenced in this article alone span roughly 50 years from start to finish. Yet, in the grand scheme of things, that’s fast, and it’s only getting faster. Neural networks are going beyond simply recognizing an image: they are being used for cancer detection in medical images, predicting the emotional expressions shown by a human, self-driving cars, and more.

What do the next 50 years have in store for computer vision?
