The Search for Data Scientists
February 26, 2013
“Big Data can present companies with big challenges,” write Elyse Dupré and Melissa Mazza, from Direct Marketing News, “especially when there’s a gap between what Big Data is and how it can be applied.” [“Infographic: Mind the Gap,” 8 February 2013] One of the “gaps” noted in the attached infographic created by Dupré and Mazza is people. There simply aren’t enough of them with the right skills. According to Dupré and Mazza, nearly half of all companies that deal with big data report a shortage of people with the right skills.
One of the most sought-after individuals is the data scientist. “Outside of companies like Google that have long made use of rich rosters of PhDs,” writes Connie Loizos, “there are nowhere near enough ‘data scientists’ — graduate-level candidates with backgrounds in machine learning or statistics — to analyze the massive streams of information that are being produced, and that gap is growing by the day.” [“As Startups Produce More Data, the Search for Data Scientists Grows Frantic,” PE Hub, 7 February 2013] “Data science [is] a crucial facet to the Big Data world,” writes Doug Turnbull. That being the case, he asks the obvious question, “What is a data scientist?” He writes:
“The traditional scientist builds an experimental design based on a hypothesis. … How does a data scientist differ? Well it’s still hypothesis driven research. The difference is in the nature of the experimental design. Instead of running an experiment in a sterile lab where we are carefully controlling everything from the humidity to the temperature to the expression on the experimenters face, we instead have a giant mass of data potentially from uncontrolled conditions. The experiment becomes a matter of combing through massive amounts of data after the fact. For example, finding enough cases where rats ran through the maze at a given temperature, light level, and all other variables constant, except for a varying humidity. Because we simply have massive and massive amounts of data, there are enough times of all things being equal except humidity, that we can go back and make a statistically significant assertion about what rats do when the humidity changes. This is particularly useful when there is simply no way to control all the independent variables, and notions of causality are weaker. For example, tracking children through education programs. There is simply no way to setup an experimental design where we force one set of children to undergo one set of circumstances and force another set of children to go through another. Moreover, we can’t ethically create an experiment that ensured each child had the exact same socioeconomic background, home life, cultural background, exercise, environment, and all the other dozens of factors that might influence their education outcome. So our only option is to collect tons and tons of data about kids, and see what shakes out. There may be enough times when certain variables are held steady except for one that a definite outcome could be measured.”
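Turnbull's rat example can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the records, the field layout, and the function name `humidity_effect` are hypothetical and do not come from Turnbull's article. It groups observational runs into strata where temperature and light level match, then compares maze times between low and high humidity within each stratum, which is one simple way to approximate "all things being equal except humidity."

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observational records (not real data):
# (temperature_c, light_level, humidity, maze_seconds)
RUNS = [
    (20, "dim", "low", 30.0),
    (20, "dim", "high", 36.0),
    (20, "dim", "low", 32.0),
    (20, "dim", "high", 38.0),
    (25, "bright", "low", 40.0),
    (25, "bright", "high", 44.0),
    (25, "dim", "low", 35.0),  # no matching high-humidity run in this stratum
]


def humidity_effect(runs):
    """Average within-stratum difference (high minus low humidity) in maze time.

    A stratum is a set of runs sharing the same temperature and light level,
    so humidity is the only recorded variable that differs inside it.
    """
    strata = defaultdict(lambda: defaultdict(list))
    for temperature, light, humidity, seconds in runs:
        strata[(temperature, light)][humidity].append(seconds)

    differences = []
    for by_humidity in strata.values():
        # Only strata containing both humidity levels allow a comparison.
        if by_humidity["low"] and by_humidity["high"]:
            differences.append(mean(by_humidity["high"]) - mean(by_humidity["low"]))

    return mean(differences) if differences else None


print(humidity_effect(RUNS))
```

With only a handful of rows the result means little. The approach depends on having enough data that many strata contain both humidity levels, and a real analysis would also need a significance test and some way to handle variables that were never recorded.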
Turnbull points out that getting to the hypothesis testing point is not a simple task. “Warehousing, and processing massive amounts of data efficiently to answer the data scientist’s questions is in itself a hard problem.” He continues:
“At the core of the problem is knowing what data structures are the best tool for the job. At OpenSource Connections, we have a pretty broad understanding of the strengths and weaknesses of various solutions such as distributed search, NoSQL databases, relational databases, and plain flat-file logs in Hadoop using Map Reduce. Matching up hard data problems with the right solution using the right data structure requires crucial collaboration between everyone. … For example, what are the similarities and differences between natural language processing and search at scale vs collecting raw numeric statistics? Are there things that the two groups could learn from each other when it comes to how data is stored and processed?”
The answer to Turnbull’s last question is a resounding, “Yes.” Many of the solutions we are developing at Enterra Solutions involve bringing together two distinct philosophical and technological computing camps (i.e., Mathematical Optimization and Reasoning). At Enterra Solutions, we believe in bringing together smart people from different disciplines to address vexing problems. They are constantly learning from one another and making our solutions better for it. Turnbull asserts that another tool that any good data scientist needs in his or her kit is data visualization. I certainly agree with him. Good insights that aren’t easily understood are little better than no insights. The end user of big data analytics must always be kept in mind.
Paige Roberts insists that data scientists play another important role in organizations. She claims they are bridge builders. [“Big Data Scientists Are Bridge Builders,” SmartData Collective, 6 February 2013] Roberts came to this conclusion while reading an article written by Kathyrn Kelly in which Kelly argued “that the hype that has been building around the data scientist is over.” [“Data Scientists Not Required: Big Data Is About Business Users,” SmartData Collective, 5 February 2013] To be fair, Kelly wasn’t dissing data scientists per se; she was arguing that the goal of vendors offering big data solutions should be to create systems that “use intuitive, interactive UIs to derive value from big data and avoid the dependency on data scientists.” In other words, business people need tools they can use without having to turn to a data scientist every time they want an answer. Roberts agrees with Kelly on that point. Does that mean that data scientists should be placed on the short list of jobs that could be going away? Hardly. Data scientists are critical in the creation of the kind of systems that Kelly envisions. Roberts writes:
“I don’t think that the data scientist/data analyst/statistician, whatever they’ve been called over the years, will disappear. The name ‘data scientist’ may be new, but the job isn’t. And there’s a good reason for that. Analytics, over time, becomes more and more accessible to the business person or average end consumer of that information, yes. The goal of most analytics projects and analytics software is to bridge the gap between the person who asks questions and the person who can find answers. Data scientists and the tools they use are the bridge for a time, then as the technology matures, they become the designers and builders and maintainers of the bridge. But once the bridge is built, anyone can walk across. When the person who asks the question and the person who can find the answer are the same person, the bridge is built, that area of questioning is mature.”
Since good solutions start with good questions, an infinite number of questions are waiting to be asked. As Roberts put it, “Once we’ve reached the point where we can find all the answers to the questions we’ve been asking, [we’ll] simply find tougher questions to ask.” With each new set of tough questions, she writes, “You’re right back where you were, needing a person with greater knowledge of analytics and analytic technology to dive in and find them, and be our bridge to that information.” As a result, she concludes, “As long as human beings continue to ask questions, to push the envelope, and to be hungry for more and more information, i.e., as long as they continue to be human, data scientists will have a job.”
Not only will they have a job, but they should have a well-paying job. According to Loizos, “Pay for data scientists has rocketed … even for those straight out of school.” Robert F. Morison, from the International Institute for Analytics (IIA), reports that the IIA predicts:
- “There will continue to be a shortage of data scientists. Companies will compensate by forming small analyst teams.”
- “The mystique of the data scientist will persist, but the lines between data scientists and other analytics professionals will blur.” [“Building Data Scientist Capability,” International Institute for Analytics, 1 February 2013]
If those predictions are correct, then, writes Morison, companies are faced with a real conundrum. He explains:
“Most businesses have difficulty attracting analytics professionals to begin with, let alone snag people in the top tier who possess multiple components of the data science skill set. Analyst talent supply is short, demand is high and growing, and competition can be fierce. Top data scientists tend to work in the hard sciences – physics, climatology, genomics – fields where they can immerse in new data and build original models. When they work for corporations, it’s usually for technology vendors or Wall Street firms or organizations with oversized data and computational challenges. Most companies must come to grips with the fact that they cannot hire enough data scientist caliber analysts, and that they can’t get all the skills they need in one person. So they must focus more on the composition, development, and deployment of analyst teams that combine the needed skills and experience as coherently as possible.”
If team building is really the answer to closing the data scientist skills gap, Morison asks, “How might the roles on analyst teams be delineated?” First, he notes that the work most teams will be asked to do “falls into four clusters, each with its own skills, demographic, and psychographic profile.” Those clusters are programmers, data preparers, generalists, and business experts. He concludes:
“As people work on teams and learn from one another, roles overlap and lines blur. And as organizations approach data science as a composite capability, not an individual role, we’ll probably see more ‘data scientist’ titles but view the holders as less rarefied. I can’t see a day, at least anytime soon, when we have a surplus rather than a shortage of data science and advanced analytics skills. But aggregate capability is building as more professionals migrate to the field. And more people will have the jump start of academic programs. They will not churn out full-fledged data scientists, because many of the key skills come with practice and experience. But current academic programs (and others to follow, no doubt) enlarge the pool and accelerate development of the next generation top-tier analytics professionals. Most of the organizations I work with know that they’re just scratching the surface of the enormous opportunities with big data and advanced analytics. But they can’t go deep and capitalize with simply a data scientist hire or two. They need to grab the best analysts they can, line up outside sources for skills they lack, and blend that talent on high-power analytics teams.”
If you have the skills discussed by Morison and are looking for an exciting position in a company on the cutting edge of big data analytics, I hope you will consider working for Enterra Solutions. If you are interested in being considered for a position at Enterra Solutions, please submit your current resume in Word format to email@example.com.