Data Science, Data Scientists and the Evolution of Analytics

Stephen DeAngelis

June 30, 2014

“In a generation,” write Mikkel Krenchel and Christian Madsbjerg, “the relationship between the ‘tech genius’ and society has been transformed: from shut-in to savior, from antisocial to society’s best hope. Many now seem convinced that the best way to make sense of our world is by sitting behind a screen analyzing the vast troves of information we call ‘big data’.” [“Your Big Data Is Worthless if You Don’t Bring It Into the Real World,” Wired, 11 April 2014] Scientists, as a group, have never been lauded for their social skills, and tech geniuses are just the latest breed of “nerd.” It appears that each new generation must learn for itself how important scientists and mathematicians are for addressing global and societal challenges. Data scientists are the most recent group to achieve social recognition. Robert Williamson notes, “The crucial thing about ‘big data’ is the data. ‘Big’ is relative, and while size often matters, real disruption can come from data of any size. This is not a new idea; it is several hundred years old. The key advance of the scientific revolution (and the associated industrial revolution) was that in order to understand something you had to measure it – that is, gather the data.” [“Why Big Data Is Mostly A Matter Of Science,” Lifehacker, 22 November 2013]

Irving Wladawsky-Berger, a former vice-president of technical strategy and innovation at IBM and an advisor to Citigroup, notes that one reason data scientists have been in the news is because there are not enough of them. “Data Science is emerging as one of the hottest new professions and academic disciplines in these early years of the 21st century,” he writes. “A number of articles have noted that the demand for data scientists is racing ahead of supply. People with the necessary skills are scarce, primarily because the discipline is so new.” [“Why Do We Need Data Science When We’ve Had Statistics for Centuries?,” Wall Street Journal, 2 May 2014] The headline of Wladawsky-Berger’s article asks an interesting question. As Williamson notes, scientists have been analyzing data for hundreds (if not thousands) of years, so what’s different today? He begins to answer that question by asking another one: “What is data science all about?” He continues:

“One of the best papers on the subject is Data Science and Prediction by Vasant Dhar, professor in NYU’s Stern School of Business and Director of NYU’s Center for Business Analytics. The paper was published in the Communications of the ACM in December 2013. ‘Use of the term data science is increasingly common, as is big data,’ Mr. Dhar writes in the opening paragraph. ‘But what does it mean? Is there something unique about it? What skills do data scientists need to be productive in a world deluged by data? What are the implications for scientific inquiry?’ He defines data science as being essentially the systematic study of the extraction of knowledge from data. But analyzing data is something people have been doing with statistics and related methods for a while. ‘Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.’ In short, it’s all about the difference between explaining and predicting. Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.”
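Dhar's distinction between explaining and predicting can be made concrete with a small sketch. In the hypothetical example below (the sales figures are invented for illustration), the "explaining" step is a well-formulated query over an existing data set, while the "predicting" step fits a simple least-squares trend line and extrapolates it forward — a toy stand-in for the kind of actionable, forward-looking output data science aims at:

```python
# Hypothetical monthly sales data: (month, units sold). Invented for illustration.
monthly_sales = [(1, 100), (2, 110), (3, 125), (4, 138), (5, 152)]

# Explaining: a descriptive query that summarizes what already happened.
total = sum(units for _, units in monthly_sales)
average = total / len(monthly_sales)

# Predicting: fit a least-squares line through the data and extrapolate
# to month 6 -- knowledge that can inform a decision, not just describe.
n = len(monthly_sales)
sx = sum(m for m, _ in monthly_sales)
sy = sum(u for _, u in monthly_sales)
sxx = sum(m * m for m, _ in monthly_sales)
sxy = sum(m * u for m, u in monthly_sales)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
forecast_month_6 = slope * 6 + intercept

print(f"average monthly sales: {average:.1f}")   # explanation
print(f"forecast for month 6: {forecast_month_6:.1f}")  # prediction
```

Real predictive work involves far richer models and validation, of course; the point of the sketch is only that the second computation answers a question no descriptive query can: what is likely to happen next?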

That is an excellent explanation of why we need data science and data scientists; but a little more detail might be helpful. Thomas H. Davenport, the President’s Distinguished Professor of IT and Management at Babson College, explains that analytics have evolved through three stages (Analytics 1.0, 2.0, and 3.0). [“Analytics 3.0,” Harvard Business Review, December 2013] Concerning Analytics 1.0, he writes:

“What we are here calling Analytics 1.0 was a time of real progress. … New computing technologies were key. Information systems were at first custom-built by companies whose large scale justified the investment; later, they were commercialized by outside vendors in more-generic forms. This was the era of the enterprise data warehouse, used to capture information, and of business intelligence software, used to query and report it. New competencies were required as well, beginning with the ability to manage data. Data sets were small enough in volume and static enough in velocity to be segregated in warehouses for analysis. However, readying a data set for inclusion in a warehouse was difficult. Analysts spent much of their time preparing data for analysis and relatively little time on the analysis itself. More than anything else, it was vital to figure out the right few questions on which to focus, because analysis was painstaking and slow, often taking weeks or months to perform. And reporting processes — the great majority of business intelligence activity — addressed only what had happened in the past; they offered no explanations or predictions.”

Davenport notes that the Analytics 1.0 era lasted for nearly half a century. A new form of analytics was required after the Internet and social media companies started collecting massive amounts of data. Davenport explains:

“Although the term ‘big data’ wasn’t coined immediately, the new reality it signified very quickly changed the role of data and analytics in those firms. Big data also came to be distinguished from small data because it was not generated purely by a firm’s internal transaction systems. It was externally sourced as well, coming from the internet, sensors of various types, public data initiatives such as the human genome project, and captures of audio and video recordings. As analytics entered the 2.0 phase, the need for powerful new tools — and the opportunity to profit by providing them — quickly became apparent. … Innovative technologies of many kinds had to be created, acquired, and mastered. Big data couldn’t fit or be analyzed fast enough on a single server, so it was processed with Hadoop, an open source software framework for fast batch data processing across parallel servers. To deal with relatively unstructured data, companies turned to a new class of databases known as NoSQL. Much information was stored and analyzed in public or private cloud-computing environments. Other technologies introduced during this period include ‘in memory’ and ‘in database’ analytics for fast number crunching. Machine-learning methods (semiautomated model development and testing) were used to rapidly generate models from the fast-moving data. Black-and-white reports gave way to colorful, complex visuals.”
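The map/shuffle/reduce pattern that Hadoop popularized for batch processing across parallel servers can be illustrated with a toy, single-process sketch. The log records and field names below are invented; real Hadoop distributes each phase across a cluster, but the three phases are the same:

```python
# Toy illustration of the map/shuffle/reduce pattern behind Hadoop-style
# batch processing. Runs in one process; records are invented.
from collections import defaultdict

log_lines = [
    "us click", "eu click", "us purchase",
    "us click", "eu purchase", "eu click",
]

# Map phase: each record is turned into a (key, value) pair independently,
# so this step parallelizes trivially across machines.
mapped = [(line.split()[0], 1) for line in log_lines]

# Shuffle phase: intermediate pairs are grouped by key. In a real cluster
# this is where data moves between nodes.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values to produce the final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # events per region
```

Each phase touches every record exactly once, which is what made the model a fit for the batch-oriented, too-big-for-one-server workloads Davenport describes.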

The new era required new skills, and the era of the data scientist was born. Unlike the Analytics 1.0 era, the Analytics 2.0 era didn’t last decades. Companies of all kinds began to realize that Big Data analytics could provide them with benefits and insights to which they weren’t previously privy. To serve their needs, Analytics 3.0 was developed. Davenport explains:

“Analytics 3.0 marks the point when other large organizations started to follow suit. Today it’s not just information firms and online companies that can create products and services from analyses of data. It’s every firm in every industry. If your company makes things, moves things, consumes things, or works with customers, you have increasing amounts of data on those activities. Every device, shipment, and consumer leaves a trail. You have the ability to analyze those sets of data for the benefit of customers and markets. You also have the ability to embed analytics and optimization into every business decision made at the front lines of your operations.”

The Analytics 3.0 era probably won’t last much longer than the 2.0 era. At Enterra Solutions®, we believe that Analytics 4.0 will involve cognitive computing. Analytics at that point will go beyond explanations, prescriptions, and predictions to real understanding, thanks to the reasoning capabilities of the systems providing the analysis. Analytics 4.0 systems will also democratize analytics by making it easy for non-data scientists to conduct computational analysis without the need for a subject matter expert.