Watson

While it is often difficult to precisely define the moment that a new field of endeavor springs into existence, a defining event in the emergence of AI as a formal academic discipline was certainly the Dartmouth Conference of 1956. This conference is credited with coining the term “artificial intelligence”. It was attended by a number of individuals, such as John McCarthy, Marvin Minsky, and Claude Shannon, who went on to become the ‘founding fathers’ of the field.

The Dartmouth Conference is also noteworthy for giving rise to what has to be one of the worst time and effort estimates ever generated by a group of experts in human history. Essentially, the organizers of the conference believed that a group of ten carefully chosen scientists working together over the summer of 1956 could accomplish the goals now associated with AI. More than half a century later, working with machines that are billions of times more powerful, we have yet to achieve many of the goals these visionaries first thought they would be able to tackle in a two-month summer project.

As evidenced by the Dartmouth Conference, in the early days of computing, there was much optimism that computers would attain general intelligence in a reasonably short period of time.

A 1961 prediction about the future of AI made by Claude Shannon five years after the Dartmouth Conference. This scene is taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

This optimism was bolstered by an impressive array of accomplishments that were rapidly achieved. In the late 1950’s, Arthur Samuel wrote a checkers-playing program that could learn from its mistakes and thus, over time, become better at playing the game. There was much excitement when the program eventually became better than its creator at the game. Other AI programs developed in the late 1950’s and 1960’s include Logic Theorist (1956) by Allen Newell and Herbert Simon, which could prove simple mathematical theorems; SAINT (1963) by Jim Slagle, which could solve calculus problems; and STUDENT (1967) by Daniel Bobrow, which could solve word-based algebra problems.

A retrospective of some of the early successes made by AI researchers in the late 1950’s and early 1960’s. These scenes are taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

Much of the work in this period was based on variations of the search techniques and formal logics discussed in and . What researchers soon discovered, however, was that the techniques they used to achieve their impressive results on small, limited problems could not be scaled up to attack more substantial problems. This was due to the underlying exponential nature of the search spaces associated with these problems. When one is dealing with a very small problem, such as searching a maze for an exit, the “brute force” approach of exhaustively testing every possible solution path is feasible. This approach, however, cannot be applied to larger problems, such as communicating in a natural language, where there are billions upon billions of potential solution paths.
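To make the exponential growth concrete, the following minimal sketch counts the solution paths an exhaustive search would have to test. It assumes a uniform search tree; the branching factor of 4 and the testing rate of one billion paths per second are illustrative assumptions, not figures from the text.

```python
def count_paths(branching_factor: int, depth: int) -> int:
    """Paths in a uniform search tree: `branching_factor` choices
    at each of `depth` steps."""
    return branching_factor ** depth


def seconds_to_test_all(paths: int, paths_per_second: float = 1e9) -> float:
    """Time to test every path at an assumed (hypothetical) testing rate."""
    return paths / paths_per_second


if __name__ == "__main__":
    # A branching factor of 4 is an illustrative assumption, e.g. the four
    # directions one might move at each step in a grid maze.
    for depth in (5, 10, 20, 40):
        paths = count_paths(4, depth)
        print(f"depth {depth:>2}: {paths:,} paths, "
              f"{seconds_to_test_all(paths):.3g} s at 1e9 paths/s")
```

Under these assumptions a 5-step problem has only 1,024 paths, while a 40-step problem has roughly 1.2 × 10^24, which would take tens of millions of years to enumerate at the assumed rate.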

For these reasons, the methods that were explored in the late 1950’s and 1960’s have become known as weak general methods. They are “general” in that they can be applied to a wide variety of problems. Yet they are “weak” in the sense that they cannot directly solve problems of substantial size in any area.

The way around the problem of exponential search spaces appeared to be the application of knowledge to limit the amount of searching required to reach a solution. In other words, if a program had a way to recognize which solution paths looked most promising, substantial amounts of time could be saved by exploring these paths first. This idea was formalized as a heuristic. A heuristic is a “rule of thumb” which is usually true. For example, a heuristic for the game of checkers is: “it is better to have more pieces on the board than your opponent.” While this is not always true (e.g., you have more pieces but they are in terrible positions and will be lost on the next move), it is usually true. Other heuristics drawn from everyday life are: bright red apples are tasty, milk with a more distant expiration date is fresher than milk with a closer expiration date, jobs with higher salaries are more desirable than jobs with lower salaries.
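The checkers heuristic above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Position` type, its field names, and the function names are hypothetical, and a real checkers program would weigh far more than piece counts.

```python
from typing import List, NamedTuple


class Position(NamedTuple):
    """A hypothetical summary of a checkers position."""
    name: str
    my_pieces: int
    opponent_pieces: int


def piece_advantage(position: Position) -> int:
    """Heuristic score: more pieces than the opponent is usually better."""
    return position.my_pieces - position.opponent_pieces


def most_promising_first(positions: List[Position]) -> List[Position]:
    """Order candidate positions so the search explores the best-looking first."""
    return sorted(positions, key=piece_advantage, reverse=True)


if __name__ == "__main__":
    candidates = [
        Position("even trade", 8, 8),
        Position("capture", 8, 7),
        Position("blunder", 7, 8),
    ]
    for p in most_promising_first(candidates):
        print(p.name, piece_advantage(p))
```

The heuristic does not eliminate search; it only decides which paths are examined first, which is why it can still be wrong in positions like the one described above.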

Originally, AI programs had very little knowledge. They depended primarily on searching through a huge number of alternatives to find an acceptable solution. Over time, however, the emphasis switched from developing better search strategies to developing structures for encoding and applying knowledge to solve problems. One such structure, called a script, is used for representing knowledge about stereotypical situations, such as what happens at a birthday party or when one goes out to see a movie. A similar structure, called a frame, is used for representing knowledge about objects.

During the 1970’s many AI researchers began developing and testing strategies for knowledge representation and manipulation. As was the case in the 1950’s and 1960’s, their implementation efforts generally focused on small, limited problems, which came to be known as micro-worlds.

One of the most impressive early examples of a micro-world was Terry Winograd’s program SHRDLU. SHRDLU, developed in 1972, simulated a world filled with blocks, pyramids, and boxes together with a robot arm that could manipulate those objects. What made SHRDLU so unusual for its time was that a human could tell the robot arm what to do by typing commands in ordinary English, such as “Find a block which is taller than the one you are holding and put it in the box.” As this example illustrates, the system was capable of understanding some of the more subtle features of English, such as pronoun reference. SHRDLU also understood simple “physics” concepts, such as: “two different objects cannot occupy the same physical space at the same time” and “if you release an object it will fall to the ground unless some other object is directly supporting it” and even “a cube cannot be supported by a pyramid”.

Terry Winograd interacting with SHRDLU. This scene is taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

SHRDLU showed that programs could converse “intelligently” with humans in English, if they were provided with a thorough knowledge of the world being discussed and the rules that governed what was possible in such a world. Although SHRDLU-like natural language interfaces were used in the text-based adventure games of the 1980’s and early 1990’s, the approaches employed by SHRDLU and similar programs to reason about their micro-worlds could not be easily scaled up to handle the vast amounts of knowledge necessary to make sense of the real world.

As AI researchers began to recognize these obstacles, the early optimism of the 1950’s and 60’s gave way to a period of retrenchment in the 1970’s. Much of the government support and funding that had been lavished on AI in the early days dwindled during this time. The frustration of the AI researchers was that while they could create programs capable of impressive feats in very limited and focused areas, their techniques did not work well on the large, poorly defined problems that characterize the real world.

One of the breakthroughs of the early 1980’s was the widespread realization that micro-worlds could be constructed to represent small but important aspects of human knowledge. Much of the knowledge that is valued in our world is specialized knowledge – knowledge that is possessed by only a small fraction of the human population – such as knowledge of finance, medicine, and law. Even within these fields people further specialize into subfields such as corporate tax law or neuromedicine.

Many of these highly specialized, highly valuable areas of human knowledge can be modeled by expert systems. Expert systems are computer programs that behave at or near (or even above) the level of a human expert in a narrowly defined problem domain, such as certain kinds of stock market transactions or credit card fraud detection. Expert systems focus on micro-worlds. But instead of being micro-worlds about simple domains of little intellectual or monetary value, they concern scarce and valuable information.

While the 1980’s was the decade in which expert systems began to become commercially viable, their roots extend as far back as the late 1960’s. In 1969, Ed Feigenbaum, Bruce Buchanan, and Joshua Lederberg developed the DENDRAL program at Stanford. DENDRAL could determine the structure of molecules from their chemical formula (e.g., H2O) and the results of a mass spectrometer reading. The first rule-based expert system was MYCIN by Feigenbaum, Buchanan, and Edward Shortliffe. MYCIN, which was developed in the mid-1970’s, was capable of diagnosing blood infections based on the results of various medical tests. The MYCIN system was able to perform as well as some infectious blood disease experts and better than non-specialist doctors.

With the emergence of expert systems in the 1980’s, there was renewed interest in artificial intelligence. Governments and corporations once again began heavily funding AI research. This time the goal was not to construct a general-purpose intelligence, but instead to develop practical commercial systems.

A large number of expert systems in fields ranging from financial services to medicine were implemented during this period. While many of these systems proved practical and cost effective, many did not. AI researchers tended to underestimate the difficulty of acquiring and encoding human expertise into the rule-based form required by expert system software. In some fields, it was found that the costs of acquiring and encoding expert knowledge could not be justified due to factors such as the rate at which knowledge in certain fields changed or the size of the potential target market. In other fields, such as credit card fraud detection, expert systems have become fully integrated into the way organizations do business and are no longer even thought of as “an AI application”.

Expert systems, by their very nature, are limited to small problems. They focus on capturing what has come to be called surface knowledge. Surface knowledge consists of the simple rules that generally characterize reasoning in a particular area. For example, the rule “If mild fever and nasal discharge then cold” could be part of a basic medical expert system. The type of knowledge captured by these rules is far different from a deep understanding of the field, which requires knowledge of anatomy, physiology, and the germ theory of disease. While MYCIN was excellent at diagnosing blood infections, it had no understanding of this deeper knowledge. In fact, MYCIN didn’t even know what a patient was or what it means to be alive or dead.
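A surface-knowledge rule of this kind can be sketched as plain data plus a matching loop. The sketch below is a toy under stated assumptions: the `RULES` table and the `diagnose` function are hypothetical names, it holds only the single rule quoted above, and it is not how MYCIN or any particular expert system was implemented.

```python
from typing import FrozenSet, List, Set, Tuple

# Each rule pairs the findings that must all be present with a conclusion.
# The one rule here is the example from the text.
RULES: List[Tuple[FrozenSet[str], str]] = [
    (frozenset({"mild fever", "nasal discharge"}), "cold"),
]


def diagnose(findings: Set[str]) -> List[str]:
    """Return the conclusion of every rule whose conditions are all satisfied."""
    return [conclusion
            for conditions, conclusion in RULES
            if conditions <= findings]  # subset test: all conditions observed


if __name__ == "__main__":
    print(diagnose({"mild fever", "nasal discharge", "cough"}))
    print(diagnose({"mild fever"}))
```

Nothing in the program knows what a fever is or why it matters; it only checks whether the listed strings are present, which is what makes the knowledge “surface” knowledge.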

In addition to being constrained to surface knowledge, expert systems suffer from two other major limitations. One is that they do not learn. Humans must hand-code each of the rules that form the knowledge base of an expert system. These rules are fixed. They do not change over time. Hence, when particular rules are no longer valid, humans must recognize this situation and update the knowledge base. Otherwise, the expert system will provide dated advice.

Expert systems also tend to be quite brittle. AI researchers use the word brittle to mean that when presented with a problem that does not fit neatly into their area of expertise, an expert system will often give completely inappropriate advice. This problem is made more vexing by the fact that few expert systems are able to determine when they are being asked to give advice on problems that are outside their area of expertise.

Artificial intelligence research went through yet another sea change in the late 1980’s and early 1990’s. As the limitations of expert systems began to become apparent, attention turned to the much-neglected topic of machine learning – particularly neural networks.

Neural networks, or connectionist architectures, consist of a large number of very simple processors which are interconnected via a network of communications channels, where each channel has an adjustable strength or weight associated with it. Neural networks are modeled (very loosely) on the brain – each processor represents a neuron (a brain cell), each weighted interconnection a synapse (the connections between brain cells). The knowledge contained in a network is encoded in the weights associated with the connections. The most astonishing feature of neural networks is that humans do not directly program them; instead, they learn by example.

The neural network model of computing is discussed in . In this section we confine ourselves to a few words about the history of this subject.

In 1943, Warren McCulloch and Walter Pitts proposed a simple model of artificial neurons in which each neuron would be either “on” or “off”. The artificial neurons had two kinds of inputs: “excitatory” and “inhibitory.” The neuron would “fire” (switch to “on”) when the total number of excitatory inputs exceeded the number of inhibitory inputs.
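The firing rule just described can be written directly as a comparison of two counts. This minimal sketch follows the simplified description in the text rather than any particular published formulation; the function name and the use of 0/1 lists for inputs are assumptions made for illustration.

```python
from typing import Sequence


def neuron_fires(excitatory: Sequence[int], inhibitory: Sequence[int]) -> bool:
    """Return True ("on") when the number of active excitatory inputs
    exceeds the number of active inhibitory inputs.

    Each input is 1 if that input line is active and 0 otherwise.
    """
    return sum(excitatory) > sum(inhibitory)


if __name__ == "__main__":
    print(neuron_fires([1, 1, 0], [1]))  # two excitatory vs. one inhibitory
    print(neuron_fires([1], [1]))        # a tie does not fire
```

Note that nothing here is learned: which inputs are wired as excitatory or inhibitory must be decided by hand.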

McCulloch and Pitts were able to prove that their networks were equivalent in computational power to general-purpose computers. John von Neumann showed that redundancies could be added to McCulloch-Pitts networks to enable them to continue to function reliably in spite of the malfunction of individual elements. While these early neural networks were certainly interesting, they were of limited practical value, since the network interconnections had to be set by hand – no one knew how they could be made to learn.

Work by Frank Rosenblatt in the 1950’s and early 1960’s overcame the learning problem to some degree. Rosenblatt proposed the perceptron, a single-layer, feed-forward network of artificial neurons together with a learning algorithm. He was able to prove that his learning algorithm could be used to teach a perceptron to recognize anything it was capable of representing simply by presenting it with a sufficient number of examples.
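The idea of learning from examples can be sketched with a small error-correction rule for a single threshold unit. This is a minimal sketch, not Rosenblatt’s original formulation: the function names, the integer learning rate, the epoch limit, and the choice of the logical AND function as training data are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], int]  # (inputs, desired output)


def predict(weights: Sequence[float], bias: float, inputs: Sequence[int]) -> int:
    """Output 1 when the weighted sum of the inputs plus the bias is positive."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_perceptron(examples: List[Example], epochs: int = 100,
                     learning_rate: float = 1) -> Tuple[List[float], float]:
    """Nudge the weights toward the desired output after every mistake."""
    weights: List[float] = [0] * len(examples[0][0])
    bias: float = 0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every example is classified correctly; stop early
            break
    return weights, bias


# Logical AND is something a single unit can represent, so it can be learned.
AND_EXAMPLES: List[Example] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

if __name__ == "__main__":
    w, b = train_perceptron(AND_EXAMPLES)
    print([predict(w, b, x) for x, _ in AND_EXAMPLES])
```

The weights start at zero and are never set by hand; they are adjusted only in response to misclassified examples.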

This was a very important result, and led to much excitement. Perceptrons were built and trained to recognize all manner of things during the 1960’s. One famous example was a perceptron for distinguishing “males” from “females” by examining photos of people’s faces.

This is a perceptron learning to distinguish male faces from female faces taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

Research into artificial neural networks came to an abrupt halt in the early 1970’s. Many people credit[6] Marvin Minsky and Seymour Papert’s 1969 book Perceptrons with bringing about the end of research into neural network based machine learning for nearly a decade and a half. Minsky and Papert proved that while it is true that perceptrons can learn anything they are capable of representing, the fact is that they are actually capable of representing very little. The most often cited example is that a perceptron with two inputs cannot learn to distinguish when its inputs are the same from when they are different. In other words, a perceptron cannot represent the exclusive-or, xor, operation of .
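The limitation can be illustrated, though not proved, by searching over candidate weights for a single two-input threshold unit. In the sketch below, the grid of weights from -4 to 4 in steps of 0.5 and all function names are assumptions chosen for illustration; failing to find weights on a grid is only suggestive, whereas Minsky and Papert’s result is a proof that no weights exist.

```python
from itertools import product
from typing import Sequence

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GRID = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0 (assumed range)


def unit_output(w1: float, w2: float, bias: float, x1: int, x2: int) -> int:
    """A single threshold unit with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def found_on_grid(targets: Sequence[int]) -> bool:
    """True if some (w1, w2, bias) on the grid reproduces `targets`
    for the four input pairs in INPUTS."""
    for w1, w2, bias in product(GRID, repeat=3):
        if all(unit_output(w1, w2, bias, x1, x2) == t
               for (x1, x2), t in zip(INPUTS, targets)):
            return True
    return False


if __name__ == "__main__":
    print("AND:", found_on_grid((0, 0, 0, 1)))
    print("OR: ", found_on_grid((0, 1, 1, 1)))
    print("XOR:", found_on_grid((0, 1, 1, 0)))
```

Weights for AND and OR turn up immediately, while none of the 4,913 combinations on the grid reproduces xor.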

Minsky and Papert’s proofs applied only to perceptrons, that is, to single-layered neural networks, and not to multiple-layer networks. However, at the time, single-layered networks were the only kind of neural networks that computer scientists had a learning algorithm for. In other words, the situation in the early 1970’s was that AI researchers knew that neural networks could, in theory, compute anything that a general-purpose computer could. But the only kind of neural networks that they had a learning algorithm for (perceptrons) could not compute many basic functions.
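To see why multiple layers escape the limitation, consider a two-layer network whose weights are set by hand rather than learned. The sketch below is one possible wiring chosen for illustration (the weights and function names are assumptions): one hidden unit computes “either input is on”, another computes “both inputs are on”, and the output unit fires when the first is on and the second is off.

```python
def threshold_unit(w1: float, w2: float, bias: float, x1: int, x2: int) -> int:
    """A single threshold unit with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def xor_network(x1: int, x2: int) -> int:
    """A hand-wired two-layer network that outputs 1 when its inputs differ."""
    either = threshold_unit(1, 1, -0.5, x1, x2)      # hidden unit: x1 OR x2
    both = threshold_unit(1, 1, -1.5, x1, x2)        # hidden unit: x1 AND x2
    return threshold_unit(1, -1, -0.5, either, both)  # either AND NOT both


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", xor_network(a, b))
```

The difficulty of the period was not that such weights did not exist, but that no algorithm was known for finding them automatically from examples.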

In the late 1980’s and 1990’s many AI researchers began returning attention to neural networks. A number of groups independently developed the “back-propagation” learning algorithm. This algorithm allows multiple-layer feed-forward networks to be trained. The good news was that these networks, given a sufficient number of processing units, could represent any computable function. The bad news was that back-propagation, unlike the learning rule for perceptrons, does not guarantee an answer will be found simply because it exists. In other words, back-propagation does not guarantee that a multilayer network will be able to learn a particular concept regardless of how many examples it is given or how much time is spent going over these examples. In many cases, however, the algorithm does enable the network to successfully learn the concept.

Neural networks and other approaches to machine learning also suffer in comparison to the symbolic approach to machine intelligence in that they are often unable to explain or justify their conclusions. The practical result is that one can never really be sure that the network has been trained to recognize what was intended.

There is a humorous story from the early days of machine learning about a network that was supposed to be trained to recognize tanks hidden in forest regions. The network was trained on a large set of photographs – some with tanks and some without tanks. After learning was complete, the system appeared to work well when “shown” additional photographs from the original set. As a final test, a new group of photos was taken to see if the network could recognize tanks in a slightly different setting. The results were extremely disappointing. No one was sure why the network failed on this new group of photos. Eventually, someone noticed that in the original set of photos the network had been trained on, all of the photos with tanks had been taken on a cloudy day, while all of the photos without tanks were taken on a sunny day. The network had not learned to detect the difference between scenes with tanks and without tanks; it had instead learned to distinguish photos taken on cloudy days from photos taken on sunny days!

Footnotes

[6] Or blame, depending on their point of view.