European Affairs

The authors, who claim to be the “messengers” of big data and not its “evangelists”, note that a half century after computers entered the mainstream, “data has begun to accumulate to the point where something new and special is taking place.” And that new and special thing has to do with the scale of data now available to us and our ability to sort through it. Scale at some point changes essence.

How much data is there in big data? The authors quote University of Southern California expert Martin Hilbert, who estimates that as of last year there existed 1,200 “exabytes” of data in the world, of which less than 2 per cent is non-digital. The authors concede there is no good way to think about what this size means.

But they try. First, a full-length feature film in digital form can be compressed into one gigabyte (a billion bytes). An exabyte is one billion gigabytes. If burned onto CD-ROM discs and stacked up, the 2012 data would stretch from Earth to the Moon in five separate piles.
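For readers who want to check the image, the arithmetic can be sketched in a few lines of Python. The 1,200-exabyte total and the byte definitions come from the text above; the CD capacity (700 MB), the disc thickness (1.2 mm, without cases) and the average Earth-Moon distance (384,400 km) are assumptions supplied here, not figures from the book.

```python
# Back-of-the-envelope check of the "five piles to the Moon" image.
# The 1,200-exabyte total and the byte definitions come from the review;
# the CD capacity, disc thickness and Earth-Moon distance are assumptions.

GIGABYTE = 10**9                 # bytes, as defined in the text
EXABYTE = 10**9 * GIGABYTE       # one billion gigabytes
TOTAL_DATA = 1_200 * EXABYTE     # Hilbert's estimate quoted in the text

CD_CAPACITY = 700 * 10**6        # assumed: 700 MB per CD-ROM
CD_THICKNESS_M = 0.0012          # assumed: 1.2 mm per disc, no cases
EARTH_MOON_M = 384_400_000       # assumed: average distance in metres


def piles_to_moon(total_bytes: int) -> float:
    """Return how many Earth-to-Moon stacks of CDs the data would fill."""
    discs = total_bytes / CD_CAPACITY
    stack_height_m = discs * CD_THICKNESS_M
    return stack_height_m / EARTH_MOON_M


if __name__ == "__main__":
    print(f"{piles_to_moon(TOTAL_DATA):.1f} piles")
```

Under these assumptions the result comes out a little over five, which is consistent with the authors' picture.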

And the amount of data is doubling every three years, a trend analogous to Moore’s law about the growth of computer chip capabilities.

The arrival of big data has some startling implications that, as the authors contend, will change the way we perceive the world.

The biggest “aha” moment of this work is the notion that big data will, to some degree, diminish the importance of “causation” in favor of “correlation.” In other words, in many situations, knowing “what” is likely to be more useful than knowing “why.” Let the data speak, say the authors. In an information-rich world, sampling increasingly becomes obsolete and is neither useful nor necessary. When it is possible to work with huge numbers of data points (sometimes all the data points, N=all), results are more accurate and less biased. Hence the geek in the movie “Moneyball” got more accurate results about baseball players by crunching the data than the old-fashioned scouts did through limited observation and personal instinct. “Causality is nice when you can get it,” say the authors. “The problem is that it’s often hard to get, and when we think we’ve found it, we’re often deluding ourselves.” The idea that causality is “nice when you can get it” understates the value of causality in many situations, and that is one of the real downsides of a world of big data.

Analyzing data such as Google search queries can turn up unexpected and unusual correlations. For example, cars of what color are less likely to have mechanical problems? Yellow. Or people who type in either all caps or all lower case are more likely to be bad credit risks than those who type in the standard fashion. Or the places where flu is likely to break out can be pinpointed from where Google searches for certain flu-related terms are concentrated.

Certain predictions based on personal data highlight the danger of big data. The movie “Minority Report,” starring Tom Cruise, dealt with a society that arrested and punished people for what they were likely to do before they had done it. The authors caution that big data is likely to provide the temptation to judge on the basis of propensity, a temptation that society must be careful to squelch. The issue already arises with insurance policies, where predicting who is likely to come down with which illness is more and more prevalent.

Another dark side of big data is its potential to undermine whatever the information age has left of personal privacy. With big data, where personal data is reused for innumerable purposes not foreseen at the point of data collection, the notions of notice and consent, opt-in and even anonymity become both anachronistic and ineffective. The authors contend that there is even increasing value in “data exhaust,” the data that is shed as a byproduct of people’s actions in a digital world. For example, hitting the “opt out” tab still provides valuable information to data miners. And it is these “algorithmists,” whom the authors expect to constitute a new class of computer professional, who are helping companies, governments and others exploit the riches of this new and ever-increasing resource: big data.

Another useful concept explored by the authors is “datafication,” the process through which data is accumulated and rendered useful in the quantities that characterize “big data.” When Google set out to scan the pages of millions of books, it not only digitized the pages but also datafied the text so that letters, words and paragraphs could be read, indexed and searched. The authors estimate that 130 million unique books have been published since the invention of the printing press. As of 2012, Google had scanned over 20 million titles, more than 15 percent of the world’s books. This data has multiple uses, only one of which is actually reading a book. For example, the project allows scholars to discover when certain words or phrases were used for the first time. The Google project has also been used to improve the accuracy of Google’s language translation algorithms. Other key areas where datafication is changing our world are location, through GPS and cell phone signals, and relationships, as in Facebook’s one billion users and 100 billion “friendships.”

“Big Data” will not be the last word on the topic, by a long shot, but it is a highly instructive and, at times, cautionary analysis of trends that are truly changing the way we live, interact and think in today’s digital world.


“Big Data: A Revolution That Will Transform How We Live, Work, and Think,” by Kenneth Cukier and Viktor Mayer-Schönberger, Houghton Mifflin Harcourt, 242 pages.