14.2.5.3 Computer Vision

Broadly speaking, computer vision is the branch of computer science that aims to give computers the ability to see. Achieving computer vision has long been one of the goals of artificial intelligence researchers. In the early days it was thought that this problem shouldn’t be too difficult, as one could simply attach a camera to a computer. There were even some early successes, such as the perceptrons that could distinguish male faces from female faces (which we discussed in Section 14.2.4). Unfortunately, teaching a computer to recognize various objects in the real world from different angles and under different lighting conditions is far, far harder than distinguishing men from women based solely on still photographs of front-facing ‘head shots’.

One of the first computer vision projects, called Freddy, was conducted at the University of Edinburgh under the direction of Donald Michie from 1969 through 1976. In its initial version (1969-1971), Freddy consisted of a camera overlooking a rotating platform on which common objects, such as a cup, were placed. The camera was hooked to a computer that attempted to recognize the objects on the platform. The system was later expanded in the 1973 to 1976 timeframe as Freddy II. This second iteration of Freddy included a robot arm that could manipulate objects and be programmed to perform simple tasks, such as assembling a toy car from its various parts: the car body, four wheels, and two axles. While Freddy was able to recognize, under carefully controlled conditions, the objects it had been programmed to detect, the process was quite slow, taking several minutes for some objects. The methods used to program Freddy couldn’t be effectively scaled up to larger numbers of objects or to ‘real world’ lighting and viewing conditions.

Figure 14.19: A video of Freddy in action circa 1971. This scene is taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

While we humans are very good at recognizing an immense number of objects under widely varying lighting conditions and viewing angles, this problem is still beyond the capabilities of today’s computer systems. Nonetheless, computer vision systems have slowly improved over the years and, in specialized domains, are beginning to become rather impressive.

One computer vision application you might be familiar with is Google Goggles. Google Goggles is an application that uses the camera on your smartphone (Android or iPhone) to take a picture and then send it to Google’s servers, where computer vision algorithms search the image to determine whether it contains any recognizable objects. The system works best on product logos, works of art, text, and famous landmarks. Food, cars, plants, and animals are examples of objects that Goggles can’t recognize.

Figure 14.20: An advertisement for Google Goggles.

Another application of computer vision that you may have heard of is facial recognition. Facial recognition systems attempt to automatically identify an individual from a digital image by extracting relevant features from that image and then comparing those features against a database of known individuals.
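
To make the comparison step concrete, here is a minimal sketch in Python. It is not the code of any real system: the feature vectors, names, and the `identify` helper are invented for illustration. Each known individual is stored as a short vector of numbers, and a new face is identified by finding the stored vector closest to it, provided the match is close enough.

```python
import numpy as np

# Toy illustration of the matching step in facial recognition.
# Assume some feature extractor has already reduced each face image to a
# fixed-length vector of numbers; real systems use far richer features.

# Hypothetical database of known individuals and their feature vectors.
known_faces = {
    "Alice": np.array([0.11, 0.83, 0.42, 0.95]),
    "Bob":   np.array([0.72, 0.15, 0.60, 0.33]),
    "Carol": np.array([0.25, 0.78, 0.50, 0.88]),
}

def identify(probe, database, threshold=0.35):
    """Return the name of the closest known face, or None if no stored
    face is within `threshold` distance of the probe vector."""
    best_name, best_dist = None, float("inf")
    for name, features in database.items():
        dist = np.linalg.norm(probe - features)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

# Features extracted from a new, well-lit, front-facing photograph.
probe_vector = np.array([0.14, 0.80, 0.45, 0.91])
print(identify(probe_vector, known_faces))  # -> Alice
```

Real systems extract far more features per face and must cope with the lighting, pose, and image-quality problems discussed below, but the basic match-against-a-database structure is the same.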

While these systems are used by counter-terrorism and law enforcement officials, they do not (yet) work nearly as well as they are generally portrayed in movies and on television. Today’s facial recognition systems tend to give the most accurate results when presented with high-resolution images in which the subject is evenly lit and looking directly toward the camera. Thus, their effectiveness can be quite limited in ‘real world’ situations where the lighting, viewing angle, and video quality cannot be easily controlled. Despite recent advances, facial recognition systems were unable to determine the identity of the suspects in the April 15, 2013 Boston Marathon bombing, even after authorities spotted the individuals in video surveillance footage.

Today the most widespread consumer-facing implementation of computer vision is found in Microsoft’s Kinect. Kinect is a sensor system that allows Xbox game machines to, in a limited sense, see: to recognize and identify players from their facial features and to track their movements in real time.

Kinect was first announced in June of 2009 at E3 (the Electronic Entertainment Expo) under the name “Project Natal”. Project Natal promised to combine motion tracking of individual players, facial recognition, voice control (speech recognition), and video conferencing into a mass-produced, inexpensive (under $200) consumer product.

Figure 14.21: A “product vision” video of “Project Natal” from 2009.

About a year and a half later, in November 2010, the first version of Kinect for the Xbox 360 was released. While the initial version of Kinect was a revolutionary product that achieved many of the goals set out in the original 2009 “product vision” announcement, it did come with a number of limitations: it could motion-track only two individuals at a time, its facial recognition often didn’t work well in the typical lighting environments found in a home, and its voice control wasn’t very reliable in noisy environments.

As an early adopter of Kinect, this author can confirm that the original Kinect had problems recognizing and logging me in under many common lighting and distance situations (e.g., sitting on my couch when lit from behind), and the voice controls were much less reliable than Siri or Google Voice Search (circa 2012-2013). “Waking up” the system with a wave of the hand also felt awkward in many situations. Of course, making these complaints about a first-generation product, the first to bring a working implementation of computer vision technology to a mass audience, seems somewhat petty.

In late 2013 Microsoft released the second iteration of Kinect as an integral part of their Xbox One launch.

Figure 14.22: A Wired magazine video that demonstrates the technical capabilities of the Kinect for Xbox One.

The second version of Kinect overcame many of the problems associated with the original: the 3-D depth sensor has a higher resolution, the 2-D color camera now supports 1080p resolution (instead of 640 by 480 in the original version), and the system includes an active IR (infrared) camera that allows Kinect to see in the dark, overcoming the illumination problems the original Kinect suffered from. Skeletal tracking has also improved: up to six individuals can be tracked simultaneously (instead of only two), with more articulation points per person (25 joints, including wrist and thumb tracking, up from 20 joints per person in the original version), and a new physics model allows the system to estimate the amount of force behind a movement. Version 2 of Kinect can also perform attention and mood estimation by looking at a player’s head orientation and facial features: it tracks whether a player’s head and eyes are directed toward the Kinect and whether he or she is smiling or frowning. The system is even able to estimate the heart rate of the individuals it is tracking by detecting the very slight flushing that our faces undergo with each heartbeat.
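
The heart-rate trick may sound like magic, but the general idea behind this kind of measurement (often called remote photoplethysmography) can be sketched simply: average the color of the skin pixels of the face in each video frame, then look for the dominant periodic component of that signal. The toy Python example below simulates such a per-frame brightness signal rather than reading real video; the numbers are invented for illustration, and this is not a description of Kinect’s actual algorithm.

```python
import numpy as np

# Sketch of remote photoplethysmography (rPPG): the face flushes slightly
# with each heartbeat, so the average skin color in a video oscillates at
# the heart rate. Illustration of the general idea only.

fps = 30.0                          # video frame rate (frames per second)
t = np.arange(0, 10, 1.0 / fps)     # 10 seconds of frames

# Simulated per-frame average green-channel brightness of the face region:
# a tiny 1.2 Hz pulse (72 beats per minute) buried in sensor noise.
signal = 0.002 * np.sin(2 * np.pi * 1.2 * t) + 0.001 * np.random.randn(t.size)

# Find the dominant frequency in a plausible heart-rate band (0.7-4 Hz).
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)
band = (freqs >= 0.7) & (freqs <= 4.0)
dominant_hz = freqs[band][np.argmax(spectrum[band])]

print(f"Estimated heart rate: {dominant_hz * 60:.0f} beats per minute")
```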

While Kinect is a truly impressive innovation, it is important to clearly understand that it doesn’t “see” in the same sense that humans do. Essentially, Kinect is blind to most things in range of its sensors other than people’s faces and bodies. What I mean by this is that Kinect possesses an internal ‘skeletal model’ of people’s joints and how those joints can move. It then looks for groupings of pixels in the images it is recording that appear to be moving in ways that match its internal model of how humans move. When it finds pixels moving in this way, it tracks them. The result is a system that tracks human movement. Likewise, the system has a model of the way human faces are constructed and how they can move (smile, frown, close their eyes, and so on). As with full-body motion tracking, the system looks for pixels in configurations that match human faces; it can then crudely track a person’s facial features and even identify individuals it has been trained to recognize. Kinect doesn’t recognize cats, or coffee tables, or cups, or couches, or much of anything else: it is limited to people’s faces and their bodies.
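
The following toy sketch illustrates the flavor of this kind of model-based tracking. It is not Kinect’s actual algorithm; the joint names, bone lengths, and candidate positions are invented for illustration. A candidate grouping of ‘joint’ positions is accepted as a person only if the distances between connected joints are consistent with a stored skeletal model.

```python
import math

# Toy illustration of model-based body tracking: a stored "skeleton" records
# which joints connect to which and roughly how long each bone should be.
# A candidate detection is kept only if its joint positions respect those
# bone lengths. All numbers here are made up for illustration.

# (joint_a, joint_b, expected_length_in_meters)
SKELETON_MODEL = [
    ("shoulder", "elbow", 0.30),
    ("elbow", "wrist", 0.27),
    ("shoulder", "hip", 0.55),
]

def matches_skeleton(joints, model=SKELETON_MODEL, tolerance=0.20):
    """Return True if every bone in the model is within `tolerance`
    (as a fraction) of its expected length for these joint positions."""
    for a, b, expected in model:
        if a not in joints or b not in joints:
            return False
        (xa, ya), (xb, yb) = joints[a], joints[b]
        length = math.hypot(xa - xb, ya - yb)
        if abs(length - expected) > tolerance * expected:
            return False
    return True

# Candidate joint positions (in meters) found in the current frame.
person_like = {"shoulder": (0.0, 1.4), "elbow": (0.05, 1.11),
               "wrist": (0.10, 0.85), "hip": (0.02, 0.86)}
lamp_like   = {"shoulder": (0.0, 1.4), "elbow": (0.0, 0.2),
               "wrist": (0.0, 0.1), "hip": (0.0, 0.05)}

print(matches_skeleton(person_like))  # True  -> track it as a person
print(matches_skeleton(lamp_like))    # False -> ignore it
```

A real tracker also has to follow how these joints move from frame to frame, which is why the model of how joints *can* move matters as much as the bone lengths themselves.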

