
14.2.5.1 Big data, statistical machine learning techniques, and machine translation

Computer scientists use the term big data to describe the extremely large datasets that are becoming available in a wide variety of disciplines, such as genomics, social networks, astronomy, Internet text and documents, and atmospheric science – to name but a few. Big data is often characterized by high volume (very large datasets), high velocity (the data changes rapidly), and/or high variety (a large range of data types and sources).

Statistical machine learning techniques are methods that apply statistical models to large amounts of data so that machines can automatically infer relationships from the patterns found in that data. A number of statistical inference models are used in machine learning, but one of the most popular is Bayesian inference.
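To make Bayesian inference concrete, the short Python sketch below applies Bayes' rule to a single observation. The scenario and all of the numbers are purely illustrative (they are not drawn from any real dataset): it updates a prior belief about whether an email is spam after seeing the word "free".

```python
# A minimal sketch of Bayesian inference with made-up numbers.
# Bayes' rule:  P(H | data) = P(data | H) * P(H) / P(data)

# Prior beliefs about two hypotheses: "this email is spam" vs "not spam".
priors = {"spam": 0.2, "not_spam": 0.8}

# Likelihood of observing the word "free" under each hypothesis
# (illustrative values only).
likelihoods = {"spam": 0.6, "not_spam": 0.05}

# Evidence: total probability of the observation across both hypotheses.
evidence = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior: updated belief about each hypothesis after seeing the data.
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

print(posteriors)  # roughly {'spam': 0.75, 'not_spam': 0.25}
```

The same update rule, applied over millions of observations, is what allows a statistical system to refine its predictions as more data arrives.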

Note that while statistical machine learning techniques differ from the neural networks mentioned in Section 14.2.4 above, both are sub-disciplines of machine learning – both are trained on large datasets and learn to recognize patterns present in the data. The major difference between the two is that neural networks are inspired by biological neural systems and the feedback cycles that appear to be inherent to those systems, while statistical machine learning techniques have grown out of mathematics and statistics rather than biology. Neural networks were once the most popular machine learning model, but since the turn of the century statistical methods have become dominant.

Until relatively recently, progress had been quite limited on a number of important problems in AI. For example, the traditional symbolic AI approach to translating documents from one language to another proved ineffective. There are simply too many exceptions to the ‘rules’ governing how humans use languages for it to be practical to capture them all. Many computer scientists thought that computers would need to master common-sense knowledge, in order to ‘understand’ what was being said, before substantial progress could be made. What big data and statistical machine learning techniques have shown us is that, given enough data, many of these problems can be solved to a large degree, absent deep understanding, by looking for patterns in the data.

Figure 14.9: An early attempt at machine translation. This scene is taken from the 1992 PBS series “The Machine That Changed The World”, episode four “The Thinking Machine”.

To use an example you may be familiar with, let’s look at Google Translate. Google Translate learns to translate documents between two languages, say English and French, by first comparing millions of documents, such as books and websites, that humans have already translated. These document pairs are scanned for statistically significant patterns. The billions of patterns generated by this process can then be used to translate new documents between the two languages. Given enough source data, the resulting translations, while not perfect, are usually “good enough” for the reader to understand the ideas the writer is attempting to communicate.
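As a rough illustration of what scanning document pairs for statistically significant patterns might look like, the toy Python sketch below counts how often French words co-occur with English words across a handful of hand-made sentence pairs. The corpus, the words, and the counting scheme are all invented for illustration; real systems rely on far more sophisticated alignment models and vastly more data.

```python
from collections import defaultdict

# A toy parallel "corpus": each entry pairs a French sentence with its
# human-made English translation. Real systems use millions of documents.
corpus = [
    ("le chat noir", "the black cat"),
    ("un chat blanc", "a white cat"),
    ("le chien noir", "the black dog"),
]

# Count how often each French word appears alongside each English word.
# This crude co-occurrence count stands in for the much more sophisticated
# statistical models used in real machine translation systems.
counts = defaultdict(lambda: defaultdict(int))
for french, english in corpus:
    for f_word in french.split():
        for e_word in english.split():
            counts[f_word][e_word] += 1

# "chat" co-occurs with "cat" in both sentences that contain it, so "cat"
# emerges as its most likely translation.
print(max(counts["chat"], key=counts["chat"].get))  # cat
```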

Figure 14.10: A Google video that provides a brief overview of the technology underlying Google Translate.

In the Google approach, humans are not directly teaching the computers how to translate documents. Instead, humans have taught the computers how to compare documents and look for statistically significant patterns. We humans then give these computers millions of pairs of already-translated documents. The computers then figure out which input patterns (say, in French) are most likely to be associated with which output patterns (say, in English). When presented with some new input text in French, the system produces the most likely English output pattern associated with that text.
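The toy sketch below (again, an invented illustration rather than Google's actual implementation) shows that last step: given a hypothetical "pattern table" recording which English patterns each French phrase was most often paired with during training, translating new text amounts to choosing the most probable output pattern for each input pattern.

```python
# Hypothetical "pattern table" learned from translated documents:
# French phrase -> {English candidate: relative frequency}.
phrase_table = {
    "le chat": {"the cat": 0.9, "the hat": 0.1},
    "est sur": {"is on": 0.8, "sits on": 0.2},
    "la table": {"the table": 0.95, "the board": 0.05},
}

def translate(french_phrases):
    """Replace each known French phrase with its most probable English
    pattern; unknown phrases are passed through unchanged."""
    output = []
    for phrase in french_phrases:
        candidates = phrase_table.get(phrase)
        if candidates:
            output.append(max(candidates, key=candidates.get))
        else:
            output.append(phrase)  # no pattern was learned for this phrase
    return " ".join(output)

print(translate(["le chat", "est sur", "la table"]))  # the cat is on the table
```

Notice that nothing in this lookup involves understanding what the phrases mean – a point the next paragraph makes explicit.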

It is very important to understand that Google Translate has no understanding of what the words and sentences it translates actually mean; it simply replaces one string of text with another based on statistical patterns derived from human-translated documents.

As we approach the midpoint of the second decade of the 21st century, we appear to be on the verge of achieving real-time natural language translation. Real-time natural language translation occurs when human speech is translated from one language to another as it is spoken, with little or no apparent delay. While we are not there yet, apps such as Google Translate with Conversation Mode for Android – where two people can take turns speaking and the smartphone acts as translator – hint at what may soon become possible.

Figure 14.11: A Google video demonstrating how Google Translate with Conversation Mode works.

