Going Long with Big Data

Stephen DeAngelis

February 19, 2013

Applied mathematician and network scientist Samuel Arbesman, a senior scholar at the Ewing Marion Kauffman Foundation and a fellow at the Institute for Quantitative Social Science at Harvard University, offers a unique take on the subject of big data. He claims that we should stop admiring how much new data we can accumulate and concentrate more of our attention on data sets that go back further in time. He wants us to appreciate history. “Our species can’t seem to escape big data,” he writes. “We have more data inputs, storage, and computing resources than ever, so Homo sapiens naturally does what it has always done when given new tools: It goes even bigger, higher, and bolder. We did it in buildings and now we’re doing it in data.” [“Stop Hyping Big Data and Start Paying Attention to ‘Long Data’,” Wired, 29 January 2013] It’s hard to argue with Arbesman that we humans have a tendency to super-size things. If you don’t believe it, read my post entitled The Supersized Supply Chain.

Arbesman isn’t arguing that we should downsize. In fact, he acknowledges, “Big data is a powerful lens — some would even argue a liberating one — for looking at our world. Despite its limitations and requirements, crunching big numbers can help us learn a lot about ourselves.” His concern is that we seem to be accepting the notion that big data represents all data. It doesn’t. “No matter how big that data is or what insights we glean from it,” he writes, “it is still just a snapshot: a moment in time. That’s why I think we need to stop getting stuck only on big data and start thinking about long data.” He explains:

“By ‘long’ data, I mean datasets that have massive historical sweep — taking you from the dawn of civilization to the present day. The kinds of datasets you see in Michael Kremer’s ‘Population growth and technological change: one million BC to 1990,’ which provides an economic model tied to the world’s population data for a million years; or in Tertius Chandler’s Four Thousand Years of Urban Growth, which contains an exhaustive dataset of city populations over millennia. These datasets can humble us and inspire wonder, but they also hold tremendous potential for learning about ourselves.”

In that respect, Arbesman agrees with futurist Peter Schwartz, who argues that we need to master “the art of the long view.” Whereas Schwartz insists that companies need to contemplate events that could take place well into the future, Arbesman argues that we can learn from things that occurred far in the past. He writes, “Because as beautiful as a snapshot is, how much richer is a moving picture, one that allows us to see how processes and interactions unfold over time?” Arbesman believes that the past is a prelude to the future. He explains:

“We’re a species that evolves over ages — not just short hype cycles — so we can’t ignore datasets of long timescale. They offer us much more information than traditional datasets of big data that only span several years or even shorter time periods. Why does the time dimension matter if we’re only interested in current or future phenomena? Because many of the things that affect us today and will affect us tomorrow have changed slowly over time: sometimes over the course of a single lifetime, and sometimes over generations or even eons. Datasets of long timescales not only help us understand how the world is changing, but how we, as humans, are changing it — without this awareness, we fall victim to shifting baseline syndrome. This is the tendency to shift our ‘baseline,’ or what is considered ‘normal’ — blinding us to shifts that occur across generations (since the generation we are born into is taken to be the norm).”

Before continuing with Arbesman’s arguments about why we need to spend more time analyzing long data, I need to point out the obvious. Some of the insights that businesses want to gain from big data have little to no history to draw upon (like insights that can be obtained from social media databases). Arbesman undoubtedly agrees that such insights are useful for business purposes, but he is concerned about larger social and environmental issues that could have a dramatic impact on the world as a whole. He continues:

“Shifting baselines have been cited, for example, as the reason why cod vanished off the coast of the Newfoundland: overfishing fishermen failed to see the slow, multi-generational loss of cod since the population decrease was too slow to notice in isolation. ‘It is blindness, stupidity, intergeneration data obliviousness,’ Paul Kedrosky, writing for Edge, argued, further noting that our ‘data inadequacy … provides dangerous cover for missing important longer-term changes in the world around us.’ So we need to add long data to our big data toolkit. But don’t assume that long data is solely for analyzing ‘slow’ changes. Fast changes should be seen through this lens, too — because long data provides context. Of course, big datasets provide some context too. We know for example if something is an aberration or is expected only after we understand the frequency distribution; doing that analysis well requires massive numbers of datapoints. Big data puts slices of knowledge in context. But to really understand the big picture, we need to place a phenomenon in its longer, more historical context.”

Even in a business setting, long data has its place. “Want to understand how the population of cities has changed?” Arbesman asks. “Use city population ranks over history along with some long datasets.” With the world becoming more urbanized every year, businesses know that their consumer base is going to be found in and around cities. That means that businesses need to know more about cities: how they grow, how ethnic groups tend to congregate, how trading systems between cities work, and so on. Arbesman continues:

“The general idea of long data is not really new. Fields such as geology and astronomy or evolutionary biology — where data spans millions of years — rely on long timescales to explain the world today. History itself is being given the long data treatment, with scientists attempting to use a quantitative framework to understand social processes through cliodynamics, as part of digital history. Examples range from understanding the lifespans of empires (does the U.S. as an ‘empire’ have a time limit that policy makers should be aware of?) to mathematical equations of how religions spread (it’s not that different from how non-religious ideas spread today).”

It doesn’t take much imagination to grasp that it is important for marketing purposes to understand how ideas spread. Arbesman argues that we are so focused on change that we lose sight of the fact that there may be some “constants we can rely on for longer stretches of time.” He also argues that taking the long view allows us to make educated decisions about “what efforts to invest in if we care about our future.” Arbesman then gets a bit more technical. He writes:

“If we’re going to move beyond long data as a mindset, however — and treat it as a serious application — we need to connect … intellectual approaches across fields. We need to connect professional and academic disciplines, ranging from data scientists and researchers to business leaders and policy makers. We also need to build better tools. Just as big data scientists require skills and tools like Hadoop, long data scientists will need special skillsets. Statistics are essential, but so are subtle, even seemingly arbitrary pieces of knowledge such as how our calendar has changed over time. Depending on the dataset, one might need to know when different countries adopted the Gregorian calendar over the older Julian calendar. England for example adopted the Gregorian calendar nearly two hundred years after other parts of Europe did.”
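To make the calendar point concrete, here is a minimal sketch of the kind of housekeeping a long data analyst might need: converting a date recorded under the Julian calendar into its Gregorian equivalent. It is my illustration, not something from Arbesman’s article. The function name julian_to_gregorian is hypothetical, the arithmetic is the standard Julian Day Number formula, and the sketch assumes CE years only. It also assumes that you already know which calendar a given record used, which is exactly the "seemingly arbitrary" knowledge Arbesman describes.

```python
from datetime import date


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian-calendar date (CE years only) to the Gregorian calendar.

    Hypothetical helper for illustration: it computes the Julian Day Number
    (a running count of days) for the Julian-calendar date, then asks
    Python's proleptic Gregorian `date` type for the same day.
    """
    a = (14 - month) // 12        # 1 for January and February, else 0
    y = year + 4800 - a           # shift so the counting year starts in March
    m = month + 12 * a - 3        # March = 0, ..., February = 11
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Ordinal 1 in Python's date type (0001-01-01) is Julian Day Number 1721426.
    return date.fromordinal(jdn - 1721425)


# Britain's last day on the Julian calendar was 2 September 1752.
print(julian_to_gregorian(1752, 9, 2))  # 1752-09-13
```

The eleven-day gap in the example is why a British record and a French record bearing the same written date in the early 1700s do not describe the same day.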

In other words, there is still a need for liberal arts in a technologically advanced world. In fact, the art and design crowd is making a concerted effort to get those subjects back on the priority list of educators and policymakers. “In this current moment of economic uncertainty, America is once again turning to innovation as the way to ensure a prosperous future,” states a website dedicated to the subject. “Yet innovation remains tightly coupled with Science, Technology, Engineering, and Math – the STEM subjects. Art + Design are poised to transform our economy in the 21st century just as Science and Technology did in the last century. We need to add Art + Design to the equation — to transform STEM into STEAM.” If historians, sociologists, and political scientists want to get in on the movement, they could argue that you just need to place liberal arts in front of STEM to create LA STEM (but that may be too French for American tastes!). Arbesman hopes that we don’t lose sight of history as we position ourselves for the future. He concludes:

“Long data shows us how our species has changed, revealing especially its youth and recency. Want data on the number of countries every half-century since the fall of the Roman Empire? That’s only about thirty data points. But insights from long data can also be brought to bear today — on everything from how markets change to how our current policies can affect the world over the really long term. Big data may tell us what we need to know for hype cycles today. But long data can reach into our past … and help us lay a path to our future.”
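Arbesman’s "about thirty data points" is easy to check with back-of-the-envelope arithmetic. The sketch below is mine, and both endpoints are assumptions rather than figures from his article: 476 CE as the conventional date for the fall of the Western Roman Empire, and 2013 as the year he was writing.

```python
# Assumed endpoints (not stated in the article): the conventional 476 CE date
# for the fall of the Western Roman Empire, and 2013 as "today".
FALL_OF_ROME = 476
ARTICLE_YEAR = 2013
STEP_YEARS = 50  # one observation every half-century

# One sample year per half-century, starting at the fall of Rome.
sample_years = list(range(FALL_OF_ROME, ARTICLE_YEAR + 1, STEP_YEARS))

print(len(sample_years))                  # 31
print(sample_years[0], sample_years[-1])  # 476 1976
```

A dataset covering more than fifteen centuries fits in a single short list, which underlines that long data and big data are different things.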

Long data is the polar opposite of real-time data. Both types of data are valuable and have their place. The key to gaining insights from them is knowing when, where, and how to analyze different datasets.