Big Data: Hope and Hype, Part 1

Stephen DeAngelis

February 21, 2012

According to McKinsey analysts, “‘Big data’ is the ‘next frontier for innovation, competition and productivity.’” [“Big data is ‘the next frontier’,” by Jessica Twentyman, Financial Times, 14 November 2011] Daniel W. Rasmus isn’t quite so sanguine about the future of Big Data. “If 2012 is the year of Big Data,” he writes, “it will likely be the year vendors and consultants start to over-promise, under-deliver, and put processes in motion that will generate insights and potential risks for years to come.” [“Why Big Data Won’t Make You Smart, Rich, Or Pretty,” Fast Company, 27 January 2012] As President and CEO of a company that analyzes Big Data, I believe that both points of view have merit. I know that sounds like a waffle, but historically, the “next big thing” has always been over-hyped before proving itself to have lasting value.

First, let’s examine why McKinsey analysts believe that Big Data is the next frontier. Twentyman reports:

“A recent report by the management consultancy argued that the successful companies of tomorrow, whether they are market leaders or feisty start-ups, will be those that are able to capture, analyse and draw meaningful insight from large stores of corporate and customer information. The implication is that businesses that cannot do so will struggle. For that reason, McKinsey argues, ‘all companies need to take big data seriously’.”

The key words in that paragraph are “meaningful insight.” Mountains of data are useless unless some actionable insights can be drawn from them. The challenge, of course, is that so much data is being generated that it is impossible to glean anything from it manually. That is why Twentyman reports that IT companies enthusiastically agree with the conclusions of the McKinsey report. The message (i.e., data analysis is important for businesses), she writes, “helps sell information management systems and software.” She continues:

“[Big] data stands out in four ways, according to James Kobielus, analyst with Forrester Research: for its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (encompassing structured, unstructured and semi-structured formats); and its volatility (where scores to hundreds of new data sources come and go from new apps, web services, social networks and so on).”

For his part, Rasmus is uncomfortable with all these data sources. “The vast hordes of data [collected] during e-commerce transactions, from loyalty programs, employment records, supply chain and ERP systems are, or are about to get, cozy,” he writes. “Uncomfortably cozy.” He continues:

“Let me start by saying there is nothing inherently wrong with Big Data. Big Data is a thing, and like anything, it can be used for good or for evil. It can be used appropriately given known limitations, or stretched wantonly until its principles fray. … The meaningful use of Big Data lies somewhere between these two extremes. For Big Data to move from anything more than an instantiation of databases running in logical or physical proximity, to data that can be meaningfully mined for insight, requires new skills, new perspectives, and new cautions.”

He’s afraid that new cautions are being ignored. As an example, he points to Dirk Helbing of the Swiss Federal Institute of Technology in Zurich, who is spending more than €1-billion on a project whose aim is “nothing less than [foretelling] the future.” Rasmus writes that Helbing’s project hopes to “anticipate the future by linking social, scientific, and economic data.” If it succeeds, Rasmus writes, “This system could be used to help advise world governments on the most salient choices to make.” He continues:

“Given the woes of Europe, spending €1-billion on such a project will likely prove to be wasted money. We, of course, don’t have a mechanical futurist to evaluate that position, but we do have history. Whenever there is an existential problem facing the world, charlatans appear to dazzle the masses with feats of magic and wonder. I don’t see this proposal being anything more than the latest version of apocalyptic sorcery.”

In a post entitled Artificial Intelligence: The Quest for Machines that Think Like Humans, Part 1, I cited an article that discussed a DARPA-supported IBM project involving cognitive computing. The head of the project hopes to develop a cognitive computing system that can do things like monitor the world’s oceans and “constantly record and report metrics such as temperature, pressure, wave height, acoustics and ocean tide, and issue tsunami warnings based on its decision making.” That’s approaching the grandiose level that concerns Rasmus. The head of the IBM project also admits, however, that the system could be used for much more modest activities, like monitoring the freshness of produce on a grocer’s shelves. While I agree in principle with Rasmus that Big Data can be used for both good and ill, I believe the good far outweighs the bad.

In the post cited above, I identified several technologists who believe we are a long way off from developing computers that think like humans. Rasmus counts himself among that number. Since Enterra Solutions uses a Cyc ontology, I found it interesting that Rasmus mentions Cyc. He writes:

“Cyc [is] a system conceived at the beginning of the computer era, [whose aim was] to combat Japan’s Fifth Generation Project as it supposedly threatened to out-innovate America’s nascent lead in computer technology. Although Cyc has yielded some use, it has not yet become the artificial human mind it was intended to be, able to converse naturally with anyone about the events, concepts, and objects in the world. And artificial intelligence, as imagined in the 1980s, has yet to transform the human condition.”

I agree that Cyc has not resulted in computer systems that think like humans. I also agree that it has been used to create some very useful artificial intelligence systems that are more nuanced than some other applications. Cyc ontologies help add common sense into AI systems that are notorious for lacking it. Rasmus’ bottom line: “As Big Data becomes the next great savior of business and humanity, we need to remain skeptical of its promises as well as its applications and aspirations.”

As president of a company that analyzes Big Data, I agree with Rasmus that we shouldn’t let the hype get ahead of the reality. Big Data allows us to dream big, but those dreams must be anchored in a cold business reality that can provide a solid return on investment. The reason that analysts and IT companies are enthusiastic about Big Data is that the tools necessary to gain insights from it are not very old. That means we are only beginning to understand what can be done with Big Data. Rasmus, however, sees “a number of existential threats to the success of Big Data and its applications.” The first threat is the flip side of hype: overconfidence. Rasmus writes:

“Many managers creating a project plan, drawing up a budget, or managing a hedge fund trust their forecasts based on personal abilities and confidence in their knowledge and experience. As University of Chicago professor Richard H. Thaler recently pointed out in the New York Times (“The Overconfidence Problem in Forecasting”), most managers are overconfident and miscalibrated. In other words, they don’t recognize their own inability to forecast the future, nor do they recognize the inherent volatility of markets. Both of these traits portend big problems for Big Data as humans code their assumptions about the world into algorithms: people don’t understand their personal limitations, nor do they recognize if a model is good or not.”

Rasmus’ concern is valid to a point. One of the reasons that researchers are trying to develop artificial intelligence systems is to eliminate bias. By equipping systems with a few simple rules, researchers are letting systems “learn” on their own. Enterra’s supply chain optimization solutions, for example, use a Sense, Think/Learn, Act™ system. We believe that machine learning is an extremely valuable tool. However, when decision makers are presented with information upon which they must act, AI systems can’t completely eliminate decision maker bias or overconfidence. Rasmus’ next concern is a graver one. Are Big Data solutions going to get so large that no one is going to be able to understand and challenge all of the assumptions used to generate their algorithms? It’s a good question. Rasmus writes:

“Even in a field as seemingly physical and visceral as fossil hunting, Big Data is playing a role. Geologic data has been fed into a model that helps pinpoint good fossil-hunting fields. On the surface that appears a useful discovery, but if you dig a bit deeper, you find a lesson for would-be Big Data modelers. As technology and data sophistication increases, the underlying assumptions in the model must change. Current data, derived from the analysis of Landsat photos, can direct field workers toward a fairly large, but promising area with multiple types of rock exposures. Eventually the team hopes to increase their 15-meter resolution to 15-centimeter resolution by acquiring higher-resolution data. As they examine the new data, they will need to change their analysis approach to recognize features not previously available (for more see “Artificial intelligence joins the fossil hunt” in New Scientist). Learning will mean reinterpreting the model.”

Anytime you change the parameters of a query you are likely to get different results. Rasmus’ concern is that if you change enough parameters in a large system you might not really know some of the underlying assumptions the model is now making. Rasmus’ next example underscores that point. He continues:

“On a more abstract level, recent work conducted by ETH Zurich looked at 43,000 transnational companies seeking to understand the relationships between those companies and their potential for influence. This analysis found that 1,318 companies were tightly connected, with an average of 20 connections, representing about 60 percent of global revenues. Deeper analysis revealed a ‘super-entity’ of 147 firms that accounts for about 40 percent of the wealth in the network. This type of analysis has been conducted before, but the Zurich team included indirect ownership, which changed the outcome significantly (for more see “The network of global control” by Vitali, Glattfelder, and Battiston). If organizations rely on Big Data to connect far-ranging databases–well beyond corporate ownership or maps of certain geologies–who, it must be asked, will understand enough of the model to challenge its underlying assumptions, and re-craft those assumptions when the world, and the data that reflects it, changes?”

That’s a good question on which to end the first post of this two-part series. Companies can avoid the dilemma Rasmus has identified by setting more modest goals than forecasting the future with precision. At Enterra Solutions, we believe that Big Data applications are used best in management-by-exception situations (i.e., where decision makers have the final say, but are only involved when the system identifies a situation that is abnormal). Monitoring Big Data for abnormalities can be just as important as mining it for deep insights. Tomorrow I’ll look at the remainder of Rasmus’ existential threats to the success of Big Data and its applications.