Big Data Analytics Continue to Improve

Stephen DeAngelis

August 22, 2018

Big data is not a new topic, but it remains a fresh topic of discussion because the amount of data being generated each day continues to grow. Without advanced analytics, finding actionable insights and gaining new knowledge from that data would be difficult and slow, too slow for today’s speed of business. Over the years, however, big data analytics has earned a reputation for under-delivering on its promises. Andrew Froehlich (@afroehlich), President and Lead Network Architect at West Gate Networks, explains, “There’s little doubt that the concept of big data analytics has been dragged through the mud multiple times over the years. Early adopters struggled in many areas that ultimately led to higher than expected failure rates — and ultimately — a poor return on investment. Yet, many of the mistakes of the past have long since been overcome. … Despite the less than stellar track record — or perception — big data remains a big deal.”[1]

Common mistakes add to big data’s less than stellar reputation

Like most new developments with great potential, big data had a rocky start. Providers of big data analytics often over-promised and under-delivered. Larry Alton (@LarryAlton3) suggests the problem can’t completely be laid at the feet of vendors. He explains that mistakes by “newcomers and non-experts” resulted in many of the poor results achieved in the early years of big data analytics. He lists nine common mistakes to avoid if big data analytics projects are to succeed.[2] They appear below under eight headings, with confirmation bias and self-serving bias discussed together under the fifth:

1. Allowing bad data to compromise your conclusions. “If your data is bad,” Alton writes, “even the best data analyst in the world can’t save it from leading to bad conclusions. Bad data comes in many varieties, such as unwittingly duplicated records, illegitimate sources, and old data that haven’t been sufficiently updated. If you want to make accurate conclusions, you have to start by studying the data at the foundation of those conclusions.” When we work with clients, we spend a great deal of time ensuring we are dealing with the correct data. The dictum “garbage in, garbage out” remains true.

2. Blindly trusting data visualizations. “Data visualization is reshaping the industry,” writes Alton, “putting powerful and intuitive pattern-recognition tools in the hands of everyday users. At a glance, you can get a feel for how a trend is manifesting or get a quick answer to your question — but blindly trusting these data visualizations can blind you to variables that are getting skipped over, and skew your perceptions.” Visualization is important, because not everyone in your company is, or needs to be, a data scientist. Bill Franks (@billfranksga), Chief Analytics Officer for The International Institute for Analytics, explains, “I get concerned when I hear the suggestion that everyone in the organization needs to create, use, and understand analytics. Many people don’t (and shouldn’t!) understand analytics at all. There are absolutely people within an organization who must understand how analytics work. … However, in most cases, it is still a relatively small number as a percentage of all employees.”[3] His observation is followed by a big “but”. Even though most employees don’t need to be data scientists, they do need to understand how analytics can improve their job performance. Visualizations can help — but ensure the data scientists approve them.

3. Studying limited variables. Many, if not most, companies operate in high-dimensional environments (i.e., they must contend with a lot of variables). Alton explains, “If you only focus on one example or one variable, your conclusions will be off the mark. … If you judge everything based on one or two key metrics, you’ll prevent yourself from seeing the big picture.”

4. Incorporating all variables. “Conversely,” Alton explains, “it’s a bad idea to try and look at every variable or metric you collect — especially in today’s era of big data. Taking this broad approach can make you focus on the noise, instead of the signal, and prevent you from seeing the most important patterns in your data set.” To ensure the right variables are in play, Enterra Solutions® leverages the Representational Learning Machine™ (RLM) created by Massive Dynamics™. The RLM can help determine what type of analysis is best suited for the data involved in a high-dimensional environment. This high-dimensional model representation (HDMR) analysis relies on global variance-based sensitivity analysis to characterize the relationships among variables. It is therefore of particular benefit for applications with a large number of input variables, because it permits the practitioner to focus principally on those variables that prove important in driving pertinent outcomes.

5. Falling victim to bias. Bias is currently a hot topic in artificial intelligence discussions. Alton notes, “Confirmation bias is one of the most prominent cognitive biases in data analytics, and one of the hardest to compensate for. Essentially, the idea is that if you have a preconceived idea of what your conclusions will be, you’ll disproportionately favor evidence that supports those conclusions. Guard against this by specifically challenging your assumptions and prioritizing objective evidence.” Easier said than done. Alton also notes you can fall victim to self-serving bias. He explains, “Self-serving bias is our natural tendency to credit ourselves with our successes, and blame our failures on external variables beyond our control. If you apply this faulty reasoning to data analysis, you may mistakenly credit your company or team for ‘good’ things, and disproportionately label the ‘bad’ things as being random hiccups or someone else’s fault.”

6. Neglecting outliers. “Outliers,” explains Alton, “are pieces of data that don’t fit with the rest of your set. It’s easy to write these off as an insignificant blip on the radar — such as a surveyed customer who didn’t take the survey seriously or a flaw in your data recording. However, this can be a crucial mistake; outliers often lead to important conclusions you’d miss by just studying the averages.”

7. Prioritizing outliers. Just as with variables, one can go too far, as well as not far enough, with outliers. Alton explains, “If you zoom in to your outliers too far, you could favor an individual over the group, skewing your conclusions the other way. You need to retain a balanced approach if you want to truly understand what’s going on.”

8. Staying isolated. Alton writes, “You aren’t the only person analyzing and studying data. One of the worst mistakes you can make is to stay isolated and not learn from those around you. … Stay up on the latest trends, and challenge yourself to learn more about data science every day.”
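
To make the variance-based idea mentioned under the fourth mistake more concrete, here is a minimal Python sketch. It is not the RLM or an HDMR implementation; it only estimates a first-order sensitivity index, Var(E[Y | X_i]) / Var(Y), for each input of a made-up model by binning random samples. The function names, the toy model, the sample size, and the bin count are all assumptions chosen for illustration.

```python
import random
from statistics import fmean, pvariance


def first_order_indices(inputs, outputs, bins=20):
    """Estimate Var(E[Y | X_i]) / Var(Y) for each input column by binning."""
    n = len(outputs)
    overall_mean = fmean(outputs)
    total_var = pvariance(outputs, overall_mean)
    indices = []
    for col in range(len(inputs[0])):
        # Order the samples by this input and cut them into equal-count bins.
        order = sorted(range(n), key=lambda row: inputs[row][col])
        size = n // bins
        between = 0.0
        for b in range(bins):
            stop = (b + 1) * size if b < bins - 1 else n
            chunk = order[b * size:stop]
            bin_mean = fmean(outputs[row] for row in chunk)
            # Weighted squared distance of the bin mean from the overall mean.
            between += len(chunk) * (bin_mean - overall_mean) ** 2
        indices.append(between / n / total_var)
    return indices


def toy_model(x1, x2, x3):
    """Hypothetical outcome: x1 matters a lot, x2 a little, x3 not at all."""
    return 4.0 * x1 + 1.0 * x2 + 0.0 * x3


rng = random.Random(0)  # fixed seed so the sketch is repeatable
inputs = [[rng.random() for _ in range(3)] for _ in range(20000)]
outputs = [toy_model(*row) for row in inputs]

scores = first_order_indices(inputs, outputs)
for name, score in zip(("x1", "x2", "x3"), scores):
    print(f"{name}: {score:.3f}")
```

For this toy model the exact indices are 16/17, 1/17, and 0, so the estimates should land near those values. A ranking like this is one way to steer between studying too few variables and incorporating all of them: x1 carries the signal, while x3 is noise that can be set aside.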

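Several of the mistakes above (bad data, neglecting outliers, and prioritizing outliers) lend themselves to simple automated checks. The sketch below is an illustration only, using invented survey records and hypothetical helper names. It drops exact duplicates and stale rows, then flags outliers with Tukey's interquartile-range rule so they can be examined alongside the averages rather than discarded or overemphasized.

```python
from datetime import date
from statistics import mean, median, quantiles

# Hypothetical survey records: (customer_id, score, last_updated).
records = [
    ("c1", 7.0, date(2018, 6, 1)),
    ("c2", 8.0, date(2018, 6, 3)),
    ("c2", 8.0, date(2018, 6, 3)),   # unwittingly duplicated record
    ("c3", 6.5, date(2015, 1, 10)),  # old data that was never updated
    ("c4", 7.5, date(2018, 7, 12)),
    ("c5", 1.0, date(2018, 7, 15)),  # candidate outlier
    ("c6", 7.2, date(2018, 7, 20)),
    ("c7", 6.8, date(2018, 8, 2)),
]


def clean(rows, cutoff):
    """Drop exact duplicates and rows last updated before `cutoff`."""
    seen, kept = set(), []
    for row in rows:
        if row in seen or row[2] < cutoff:
            continue
        seen.add(row)
        kept.append(row)
    return kept


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


rows = clean(records, cutoff=date(2017, 1, 1))
scores = [score for _, score, _ in rows]
outliers = flag_outliers(scores)

print("rows kept:", len(rows))
print("mean:", mean(scores), "median:", median(scores))
print("flagged for review:", outliers)
```

In this sample the single low score pulls the mean down to 6.25 while the median stays at 7.1. Reporting both, along with the flagged value, keeps the outlier visible without letting it dominate the conclusion.
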
What are the latest trends?

Jane Thomson, Content Manager at GreyCampus, notes, “Big data analytics is one of the key pillars of enabling digital transformation and devising solutions to solve business issues across various industries globally. But big data management and analytics are evolving at a rapid pace.”[3] She identifies five trends continuing to drive analytics forward. They are:

1. Predictive Analytics. Most analysts discuss four types of analytics. In order of sophistication, they are: descriptive, diagnostic, predictive, and prescriptive. Writing about predictive analytics, Thomson notes, “Analysts can explore new behavioral data by combining big data and computational power. Traditional machine learning tends to use statistical analysis on the basis of a total dataset sample. By bringing in reasonable computational power to a problem, you can analyze a huge number of records consisting of an enormous number of attributes per record and this considerably increases predictability too.” Cognitive technologies are also getting better at prescriptive analytics, which prescribes actions to take to eliminate a future problem.

2. Cognitive technology. There are a number of machine learning techniques that can be used for analytic purposes. Thomson notes that deep learning is one of the techniques receiving a lot of attention. She explains, “Big data uses advanced analytic techniques such as deep learning methods to manage diverse and unstructured text.” There are other techniques, however, and you should ensure the method you use is the best one for the results you desire.

3. In-memory Database. The larger datasets become, the more challenging the task of accessing the data. Thomson notes, “To speed up analytical processing, the use of in-memory databases is increasing exponentially and this proves beneficial to many businesses. Various enterprises are encouraging the use of Hybrid Transaction/Analytical Processing (HTAP) by allowing analytic processing and transactions to stay in the same existing in-memory database. As HTAP is extremely popular, some organizations are using it repetitively on a massive scale by putting all transactions together from different systems onto one database only. By doing this they are able to manage, protect, and assess how to integrate diverse datasets.”

4. Use of Hadoop. Thomson writes, “Enterprise data operating systems and distributed big data analytics frameworks, such as MapReduce, are used to perform various analytical operations and data manipulations and store the files into Hadoop. As a practice, organizations use Hadoop as the distributed file storage system for holding in-memory data processing functions and other workloads. One key benefit of using Hadoop is that it is low-cost and is used as a general purpose tool for storing and processing huge datasets and is treated as an enterprise data hub by many organizations.”

5. Data Lakes. Data storage continues to get a lot of attention. Thomson explains, “A storage repository which holds an enormous amount of raw data in its native format is known as a data lake. There are hierarchical data warehouses which store data in files or folders, but a data lake tends to use a flat architecture to store data. A data lake is like an enterprise data hub. It not only stores a vital amount of data but also includes high-definition datasets which are incremental for building a large-scale database. In general, big data lakes are being used only by highly-skilled professionals for analysis purposes only.”
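
For readers unfamiliar with the MapReduce model Thomson mentions, the following Python sketch mimics its three phases (map, shuffle, reduce) in a single process. It is a toy under stated assumptions: the sales records and function names are invented, and a real Hadoop job would distribute these phases across a cluster rather than run them in one interpreter.

```python
from collections import defaultdict

# Hypothetical input records: (region, sale amount).
records = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 30.0),
    ("north", 55.5),
    ("west", 20.0),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    region, amount = record
    yield region, amount


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(totals)  # prints {'east': 150.0, 'west': 100.0, 'north': 55.5}
```

Because each call to map_phase and reduce_phase touches only its own record or key, the work can be split across many machines, which is what makes the model attractive for very large datasets.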

Summary

Big data analytics are essential for any business hoping to succeed in the Digital Age. Navigating the analytics landscape can be both challenging and confusing. Cognitive technologies, with embedded analytics suites, are helping ensure the right method is used to analyze structured and unstructured data populating today’s datasets. By avoiding common mistakes and keeping up with the latest trends, companies can improve their chances of successfully implementing big data projects.

Footnotes
[1] Andrew Froehlich, “Debunking 8 Big Data and Analytics Myths,” InformationWeek, 21 September 2017.
[2] Larry Alton, “9 key mistakes organizations make when analyzing data,” Information Management, 19 April 2018.
[3] Jane Thomson, “5 top trends driving big data analytics,” Information Management, 10 July 2018.