Big Data Requires Big Talent, Part 1

Stephen DeAngelis

May 22, 2012

Noting how well the IPO for Splunk Inc., a Big Data company, did on opening day (its shares “soared 109% on its first day of trading”), Ben Rooney remarked, “It seems that the markets are as much in love with ‘Big Data’ — the ability to acquire, process and sort vast quantities of data in real time — as the technology industry.” There is, however, a fly in the ointment. Rooney reports, “According to a report published last year by McKinsey, there is a problem. ‘A significant constraint on realizing value from Big Data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from Big Data,’ the report said.” [“Big Data’s Big Problem: Little Talent,” Wall Street Journal, 29 April 2012] As president of a company that offers Big Data services, I know that Rooney and the McKinsey report speak the truth. Rooney continues:

“Big Data refers to the idea that an enterprise can mine all the data it collects right across its operations to unlock golden nuggets of business intelligence. And whereas companies in the past have had to rely on sampling, Big Data, or so the promise goes, means you can use your entire corpus of digitized corporate knowledge. It is, by all accounts, the next big thing.”

Unlocking those golden nuggets from mountains of unanalyzed data requires real expertise and skill. Authors of the McKinsey report wrote, “We project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of Big Data effectively.” When you look at recent anemic job numbers, 1.5 million new jobs is significant and welcome — if you can find the right people to fill them. That’s the rub. Rooney continues:

“What the industry needs is a new type of person: the data scientist. According to Pat Gelsinger, president and chief operating officer of EMC Corp., the giant U.S. data company, this isn’t an unprecedented problem. ‘IBM started a generation of Cobol programmers,’ he said, referring to one of the first dominant programming languages. ‘Thirty years ago we didn’t have computer-science departments; now every quality school on the planet has a CS department. Now nobody has a data-science department; in 30 years every school on the planet will have one.’ Hilary Mason, chief scientist for the URL shortening service bit.ly, says a data scientist must have three key skills. ‘They can take a data set and model it mathematically and understand the math required to build those models; they can actually do that, which means they have the engineering skills … and finally they are someone who can find insights and tell stories from their data. That means asking the right questions, and that is usually the hardest piece.'”

Mason’s last point is absolutely correct. Good solutions always start with good questions. Over the years, I’ve found that the best analysts are the ones who know how to ask the best questions. Yet that doesn’t seem to be a skill that we stress very much. In fact, too many parents tell their inquisitive children, “Stop asking so many questions.” Rooney continues:

“It is this ability to turn data into information into action that presents the most challenges. It requires a deep understanding of the business to know the questions to ask. The problem that a lot of companies face is that they don’t know what they don’t know, as former U.S. Defense Secretary Donald Rumsfeld would say. The job of the data scientist isn’t simply to uncover lost nuggets, but discover new ones and more importantly, turn them into actions. Providing ever-larger screeds of information doesn’t help anyone.”

Rooney needed to go a step further and state that once lost nuggets are uncovered, and other nuggets are newly discovered, they need to be presented in a user-friendly interface to decision makers. Every link in that chain must be in place to turn data into actionable intelligence. Rooney notes that during a London conference in April, “the data scientist was called, only half-jokingly, ‘a caped superhero.'” He continues:

“So where can companies find these superheros? Not from universities, it seems. Nigel Shadbolt, who doubles up as the professor of artificial intelligence at the University of Southampton as well as co-director (along with Tim Berners-Lee) of the U.K.’s Open Data Institute, said the courses don’t yet exist. ‘Bits of it do exist in various departments around the country, and also in businesses, but as an integrated discipline it is only just starting to emerge.’ Nor can they be found in recruitment agencies. Rob Grimsey, a director of IT recruitment agency Harvey Nash, said they had limited experience in recruiting data scientists — ‘which might be a statement in itself about how common these kind of roles are,’ he added.”

I find it encouraging that people are starting to talk about creating a new discipline that will help fill the talent void in Big Data technology. For companies like mine, a curriculum for the discipline can’t be developed and implemented fast enough. Rooney continues:

“One of the problems with Big Data is the fact that it has to deal with real data from the real world, which tends to be messy and difficult to represent. Conventional relational databases are excellent at handling stuff that comes in discreet packets, such as your social security number or a stock price. They are less useful when it comes to, say, the content of a phone call, a video, or an email. Out in the real world, most data is unstructured. Handling this sort of real, messy, scrappy data, isn’t so simple. ‘People have been doing data mining for years, but that was on the premise that the data was quite well behaved and lived in big relational databases,’ said Mr. Shadbolt. ‘How do you deal with data sets that might be very ragged, unreliable, with missing data?'”

One answer to Shadbolt’s question about how to deal with messy data is to use a rules-based ontology like Enterra Solutions’ Sense, Think/Learn, Act™ system. The beauty of a rules-based ontology is that it can discover obscure relationships and make inferences that could escape other kinds of analysis. It’s particularly useful when analyzing natural language and unstructured data. But, like every other aspect of business, such a system relies on process, technology, and people. Nick Halstead, CEO of DataSift, a U.K. start-up actually doing Big Data, reiterated to Rooney the importance of finding people who can ask the right questions. Halstead told him “that the ability to ask questions about the data is the key, not mathematical prowess.” Nevertheless, math prowess is eventually a necessity. “You have to be confident at the math,” Halstead said, “but one of our top people used to be an architect.”

Rooney reports that not everyone is as concerned about the looming shortage of talent. He writes that “Fernando Lucini, chief architect for Autonomy Corp., a U.K. software maker recently acquired by Hewlett-Packard Co., is much more optimistic.” He explains:

“Mr. Lucini said the industry is fretting unnecessarily and should have more confidence in its own abilities. Most of these problems can be tackled through algorithms, he said, which coincidentally is the promise of Autonomy. ‘The problem can be solved by better tools. The tools need to help you understand the data. They can do the heavy lifting for you so that anyone in a business can use them and ask the questions they need to answer.'”

While I agree that good algorithms can do a lot of the heavy lifting and that artificial intelligence technologies can help systems learn, those algorithms must be devised by clever mathematicians. People, processes, and technology must all work together to uncover deep insights that are contained in the mountains of data that businesses are now collecting. Although Rooney insists that what Big Data companies need are data scientists, in Part 2 of this two-part series, I’ll present the views of Carole-Ann Matignon, who argues that knowledge engineers are just as important as data scientists.

I want to close this post by noting that, when it comes to Big Data, simply hiring people with the correct skills is no longer enough. People with access to large (and often sensitive) sets of data potentially have the ability to create havoc or do harm. To protect clients from such events, companies are scrutinizing new employees more closely and monitoring their activities once they are on the job. [“The Enemy Within,” by Shara Tibken, Wall Street Journal, 2 April 2012] Although these are necessary precautions, they add to the complexity and cost of hiring people with the right skills. Tibken reports:

“Companies put IT professionals under the microscope even before they’ve joined the outfit. Many organizations perform tougher background checks on potential IT employees than on others, making sure the job candidates can be trusted to carry out critical security tasks. And once candidates are hired, their actions typically are scrutinized more closely than those of others on the network. Many companies do this using technology that analyzes network traffic and alerts them to anything abnormal—such as employees opening files they don’t normally access or going on the network at odd hours. … Companies are also employing a newer class of technology that allows them to examine how the language used in communications among IT staff changes over time. That helps the organization figure out who might have motivation for stealing data or sabotaging the network.”
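To make the kind of monitoring Tibken describes a little more concrete, here is a minimal sketch in Python of the two checks she mentions: an employee opening a file he or she doesn't normally access, and activity at odd hours. Everything in it is an assumption made for illustration: the working-hours window, the function names, the users, and the file paths are all hypothetical, and commercial tools that analyze network traffic are far more sophisticated than this.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical working hours; a real system would tune these per employee.
WORK_START_HOUR = 7
WORK_END_HOUR = 19


def build_baseline(history):
    """Map each user to the set of files that user has accessed before."""
    baseline = defaultdict(set)
    for user, path, _timestamp in history:
        baseline[user].add(path)
    return baseline


def flag_event(baseline, user, path, timestamp):
    """Return the reasons an access looks unusual (an empty list if none)."""
    reasons = []
    # Check 1: the user has never opened this file before.
    if path not in baseline.get(user, set()):
        reasons.append("unfamiliar file")
    # Check 2: the access falls outside the assumed working hours.
    if not WORK_START_HOUR <= timestamp.hour < WORK_END_HOUR:
        reasons.append("odd hour")
    return reasons


# Made-up access history: (user, file path, time of access).
history = [
    ("alice", "/reports/q1.xlsx", datetime(2012, 4, 2, 10, 15)),
    ("alice", "/reports/q2.xlsx", datetime(2012, 4, 3, 14, 40)),
]
baseline = build_baseline(history)

# A familiar file during the day raises no flags.
print(flag_event(baseline, "alice", "/reports/q1.xlsx",
                 datetime(2012, 4, 4, 11, 0)))
# An unfamiliar file at 2:30 a.m. raises both flags.
print(flag_event(baseline, "alice", "/payroll/salaries.csv",
                 datetime(2012, 4, 5, 2, 30)))
```

Even a toy like this shows why such monitoring adds cost: someone has to decide what counts as normal, keep the baseline current, and review every alert the rules produce.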

With data breaches routinely reported in the press, Big Data companies understand that their future depends on maintaining their reputations. In other words, Big Data requires big talent, and it comes with big responsibility.