Big Data Requires Big Talent, Part 2

Stephen DeAngelis

May 23, 2012

In Part 1 of this two-part series focusing on the talent and skills required to make Big Data analytics the next big thing, I discussed an article by Ben Rooney in which he insisted that what Big Data companies need is a lot more data scientists. [“Big Data’s Big Problem: Little Talent,” Wall Street Journal, 29 April 2012] Carole-Ann Matignon agrees that data scientists are necessary but not sufficient to meet future Big Data needs. She believes that knowledge engineers also play a valuable role. [“Data versus Expertise Dilemma,” SmartData Collective, 21 March 2012] She writes:

“In the decade (or two) I have spent in Decision Management, and Artificial Intelligence at large, I have seen first-hand the war raging between knowledge engineers and data scientists. Each defending its approach to supporting ultimately better decisions. So what is more valuable? Insight from data? Or knowledge from the expert?”

Before proceeding further, let me provide a couple of definitions that will help frame the debate. Here is one definition of a knowledge engineer:

“A knowledge engineer is a computer systems expert who is trained in the field of expert systems. Receiving information from domain experts, the knowledge engineers interpret the presented information and relay it to computer programmers who code the information in to systems databases to be accessed by end-users. Knowledge engineers are used primarily in the construction process of computer systems.” [Bultman, Arne; Kuipers, Joris; van Harmelen, Frank (2000), “Maintenance of KBS’s by Domain Experts: The Holy Grail in Practice”, in Logananthara, R.; Ali, M., Thirteenth International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems IEA/AIE’00, Lecture Notes in Artificial Intelligence, 1821, Heidelberg: Springer Verlag, ISBN 978-3-540-67689-8]

Here is a definition of a Data Scientist:

“A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. … A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others.” [“Data Scientist,” Business Analytics, September 2011]

I’m not sure those definitions will help you keep the players straight, but they give you some idea of the differences in focus between the two disciplines.

Matignon sees the fight for supremacy between data scientists and knowledge engineers as a political battle for prestige between combatants who are really on the same side. She says that companies should want both kinds of professionals and points to an article written by Mike Loukides, in which he writes, “Experts make the leap from correct results to understood results.” [“The unreasonable necessity of subject experts,” O’Reilly radar, 20 March 2012] Loukides wrote his article after having attended an event that included a debate on the following topic: “In data science, domain expertise is more important than machine learning skill.” He said the “Con” side won the debate (i.e., machine learning was judged to be more important than domain expertise). That should make all you data scientists happy.

He claims the results were not surprising “given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge.” That’s certainly strong empirical evidence for the value of machine learning. Loukides, however, doesn’t believe that an impromptu debate can capture all of the nuanced reasons that subject matter expertise remains important. He continues:

“Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article, “The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out. By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.”
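For readers who want to see how easily such a “prediction” can be manufactured, here is a minimal sketch in Python. It is my own illustration, not code from Loukides’ article or from “Stupid Data Miner Tricks,” and every name and number in it (the function names, 1,000 candidate series, 10 data points, the random seed) is an assumption chosen for the example. It builds one random “target” series, then searches a pile of unrelated random series for the one that happens to correlate best with it.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series of pure noise: each value is the previous one plus a random step."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def best_spurious_match(n_candidates=1000, length=10, seed=0):
    """Return the strongest correlation found between a random target
    series and n_candidates unrelated random series (hypothetical setup)."""
    rng = random.Random(seed)
    target = random_walk(rng, length)  # stand-in for "the index"
    best = 0.0
    for _ in range(n_candidates):
        candidate = random_walk(rng, length)  # stand-in for "butter production"
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print(f"Best correlation among unrelated series: {best_spurious_match():+.3f}")
```

None of the candidate series has anything to do with the target, yet with enough candidates and few enough data points the search will usually turn up one that looks like a strong predictor. That is the statistical trap a subject matter expert is well placed to catch.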

Matignon insists that Loukides’ article “provides a clear picture as to why and how we would want both [knowledge engineers and data scientists].” She continues her discussion by first concentrating on the plusses offered by data scientists and their algorithms:

“In the world of uncertainty that surrounds us, experts can’t compete with the sophisticated algorithms we have refined over the years. Their computational capabilities goes way above and beyond the ability of the human brain. Algorithms can crunch data in relatively little time and uncover correlations that did not suspect. Adding to Mike’s numerous examples, the typical diaper shopping use case comes to mind. Retail transaction analysis uncovered that buyers of diapers at night were very likely to buy beer as well. The rationale is that husbands help the new mom with shopping, when diapers run low at the most inconvenient time of the day: inevitably at night. The new dad wandering in the grocery store at night ends up getting ‘his’ own supplies: beer.”

The beer probably helps the new father sleep better as well! Matignon reiterates the point made by Loukides that data can be manipulated. She notes, “A hidden bias can surface in a big way in data samples, whether it over-emphasizes some trends or cleans up traces of unwanted behavior.” She continues:

“If your data is not clean and unbiased, value of the data insight becomes doubtful. Skilled data scientists work hard to remove as much bias as they can from the data sample they work on, uncovering valuable correlations. … When algorithms find expected correlations, … analytics can validate intuition and confirm fact we knew. When algorithms find unexpected correlations, things become interesting! With insight that is ‘not so obvious’, you are at an advantage to market more targeted messages. Marketing campaigns can yield much better results than ‘shooting darts in the dark’. Mike raises an important set of issues: Can we trust the correlation? How to interpret the correlation?”

To underscore her point, Matignon points out some correlations that have been made in the past:

  • People who dislike licorice are more likely to understand HTML
  • People who like scooped ice cream are more likely to enjoy roller coasters than those that prefer soft serve ice cream
  • People who have never ridden a motorcycle are less likely to be multilingual
  • People who can’t type without looking at the keyboard are more likely to prefer thin-crust pizza to deep-dish

She writes, “There may be some interesting tidbit of insight in there that you could leverage. but unless you *understand* the correlation, you may be misled by your data and make some premature conclusions.” It is in the field of understanding that Matignon believes that both data scientists and knowledge engineers add value. She concludes:

“Mike makes a compelling argument that the role of the expert is to interpret the data insight and sort through the red herrings. This illustrates very well what we have seen in the Decision Management industry with the increased interplay between the ‘factual’ insight and the ‘logic’ that leverages that insight. Capturing expert-driven business rules is a good thing. Extracting data insight is a good thing. But the real value is in combining them. I think the interplay is much more intimate than purely throwing the insight on the other side of the fence. You need to ask the right questions as you are building your decisioning logic, and use the available data samples to infer, validate or refine your assumptions. As Mike concludes, the value resides in the conversation that is raised by experts on top of data. Being able to bring those to light, and enable further conversations, is how we will be able to understand and improve our systems.”

Since Matignon makes so many references to Loukides’ article, I felt it only fair to provide his conclusions as well. He writes:

“Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes ‘unreasonably effective’ through the conversation that takes place after the numbers have been crunched. … That’s the territory we’re entering here: data-driven results we would never have expected. We can only take our inexplicable results at face value if we’re just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they’re based. And that’s the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can’t forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.”

Companies don’t need the drama and tension created by pride-based conflicts about which discipline brings the most value to the table. In the era of Big Data, there is plenty of work and recognition for both data scientists and knowledge engineers.