Big Data and Epidemics

Stephen DeAngelis

January 30, 2014

Matthew Braga reports, “There exists an organization that, using nothing more than the cellphone in your pocket, can track where you are – and, with a certain degree of certainty, where you soon will be. But it isn’t the National Security Agency. … Rather, it’s a Swedish project called Flowminder, designed to map population movements in the aftermath of natural disasters for the sake of emergency rescue and relief.” [“The rise of the digital epidemiologist: Using big data to track outbreaks and disaster,” Financial Post, 10 September 2013] People are naturally wary of the motives of any organization trying to track their movements. Nevertheless, when it comes to the spread of disease, tracking infected individuals is in all of our best interests. Braga continues:

“[Flowminder is] just one of many projects that is part of what some are referring to as digital epidemiology – a field of otherwise traditional scientific research that, increasingly, has come to rely on digital or computational methods to measure health and the spread of disease. But while epidemiology has always been about using big datasets – long before the somewhat ambiguous buzzword ‘big data’ came into vogue – what is novel about the field in these recent years is from where, exactly, such data is culled. In more traditional epidemiological studies, data might be gathered or generated by manual means, such as in-person interviews, voluntary surveys or other types of deliberate, purposeful data collection. Nowadays, researchers and computer scientists are turning to novel sources of data — including cellphone records, blog posts, Tweets and flight data – to draw new, highly experimental, though sometimes questionable, inferences about the world at large.”

Back in 2011, when Robert Munro was Chief Technology Officer of EpidemicIQ, he worked on “an initiative that aimed to track all the world’s disease outbreaks.” [“Tracking Epidemics with Natural Language Processing and Crowdsourcing,” Jungle Light Speed, July 2013] He writes, “The result was an order of magnitude larger and more sophisticated than any other disease-tracking system. It made me appreciate how unready the world is to respond to epidemics.” At the same time, Munro understood that identifying infected individuals often put them and their families at risk. “Even if you do not work in epidemiology,” he writes, “I’m sure it is obvious to you that there is great responsibility in managing information about at-risk individuals.” He notes that, in some African countries, individuals infected with potentially fatal diseases and their families are sometimes killed by villagers who fear the infection will spread and wipe everybody out. He writes about one young girl suspected of having contracted Ebola who was put at risk when health agencies published the name of the Ugandan village from which she came. Although her name wasn’t published, she was the only girl from the village who had been rushed to the hospital. “Her diagnosis was ultimately incorrect,” Munro writes, “which doesn’t really affect the anonymization issue, but it makes any identification/vilification even more disturbing.” Munro highlights the greatest conundrum faced by epidemiologists: how to manage Big Data responsibly and ethically. Munro believes that “the problem comes from open-data-idealists. They believe that so long as we can share information at scale, … the problem is treated as solved. This is misleading and dangerous.” Despite his concerns, Munro is convinced that Big Data is an important tool in tracking global diseases. He writes:

“Like every other type of organism, the majority of the world’s pathogens come from the band of land in the tropics. 90% of our (land-based) ecological diversity comes from 10% of the land. The same is true for our linguistic diversity, and it is the same 10%. So, the first signal about an outbreak can be reported in any one of 1000s of languages, and concentric animated dots on a map aren’t going to help reach those people to help with either reporting or response. We implemented adaptive machine-learning models that could detect information about disease outbreaks in dozens of languages. When the machine-learning was ambiguous, microtaskers (crowdsourced workers) employed via CrowdFlower would correct the machine-output in their native language. Their output would then update the machine-learning models. As a result, we could pull in information from across the world and process it in near real-time, dynamically adapting to changing information. One of the most promising outcomes was that we were ahead of any other organization in identifying outbreaks, including prominent solutions that used search engines and social media. This is also in contrast to many data collection methods favored by open-data-idealists.”
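The human-in-the-loop workflow Munro describes (a machine-learning model scores incoming reports, ambiguous items are routed to crowdsourced native speakers, and their answers feed back into the model) can be sketched in miniature. Everything below is illustrative: the class name, the crude keyword-based scoring, and the toy data are my assumptions, not EpidemicIQ's actual system.

```python
from collections import Counter, defaultdict

class OutbreakTriage:
    """Hypothetical sketch of a machine-plus-crowd triage loop."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        # token counts per (language, label); label is True for "outbreak-related"
        self.counts = defaultdict(Counter)

    def train(self, language, text, is_outbreak):
        # Update per-language token statistics with a labeled report.
        for token in text.lower().split():
            self.counts[(language, is_outbreak)][token] += 1

    def score(self, language, text):
        # Crude likelihood ratio with add-one smoothing; stands in for a
        # real classifier purely to make the routing logic concrete.
        pos, neg = 1.0, 1.0
        for token in text.lower().split():
            pos *= self.counts[(language, True)][token] + 1
            neg *= self.counts[(language, False)][token] + 1
        return pos / (pos + neg)

    def classify(self, language, text, ask_crowd):
        p = self.score(language, text)
        if p > self.threshold or p < 1 - self.threshold:
            return p > 0.5                    # confident machine decision
        label = ask_crowd(language, text)     # route ambiguous item to a native speaker
        self.train(language, text, label)     # the worker's answer updates the model
        return label
```

The essential property Munro points to is the feedback arrow: each crowd correction becomes new training data, so the model adapts as the vocabulary of an unfolding outbreak changes.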

Like Munro, I’m a big supporter of natural language processing. The results that Munro and his colleagues achieved speak for themselves. Many of the products we offer at Enterra Solutions® involve the use of natural language processing because it can often provide relevant insights faster than other analytic techniques. As Munro notes, the faster outbreaks can be identified, the greater the chances are that they can be contained. According to Munro, using the natural language capabilities of EpidemicIQ, analysts can know “not just how many people [are] infected, but what kind of transport they [take] when they [go] from their village to the hospital in the nearest main town.” He concludes, however, that such information must be provided only to professionals who understand the sensitivity of such data. Obviously, different infections involve different sensitivities. Despite Munro’s distaste for open-source and social-media data, several such programs exist. Braga reports:

“Google Flu Trends [was] released to the public in 2008; the tool monitors influenza outbreaks worldwide. But rather than rely on doctor and patient survey data provided by, say, the U.S. Centers for Disease Control and Prevention (CDC), Google’s team uses the frequency of certain search queries as a basis for predicting flu hotspots – and does so weeks faster than organizations such as the CDC. Another project, Toronto-based BioDiaspora, models the spread of infection in a different way, using global airline data to predict and track the spread of diseases based on the origins, travel routes, and destinations of commercial flights. And, of course, there is the unparalleled value of vast, sprawling data sources such as Twitter or blogs, which researchers have used to mine everything from location data to user sentiment for a wide range of study.”
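The query-based approach Braga attributes to Google Flu Trends can be illustrated with a toy nowcast: fit a linear relationship between the log-odds of the weekly flu-related query fraction and the log-odds of the officially reported illness rate, then apply the fit to the current week's queries before official numbers arrive. The function names and the sample figures below are hypothetical, a sketch of the general idea rather than Google's actual model or data.

```python
import math

def logit(p):
    # Log-odds transform; assumes 0 < p < 1.
    return math.log(p / (1 - p))

def fit(query_fractions, ili_rates):
    """Ordinary least squares on the logit-transformed series."""
    xs = [logit(q) for q in query_fractions]
    ys = [logit(r) for r in ili_rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

def nowcast(alpha, beta, query_fraction):
    # Map the current week's query fraction back to an estimated rate.
    z = alpha + beta * logit(query_fraction)
    return 1 / (1 + math.exp(-z))

# Hypothetical history: fraction of searches that were flu-related each week,
# and the illness rate officially reported for those same weeks.
history_q = [0.010, 0.015, 0.022, 0.030]
history_r = [0.011, 0.016, 0.024, 0.033]
a, b = fit(history_q, history_r)
print(round(nowcast(a, b, 0.025), 3))  # estimate for a new, not-yet-reported week
```

The speed advantage Braga mentions comes entirely from the input: query volumes are available immediately, while survey-based surveillance lags by days or weeks.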

In the end, however, Braga agrees with Munro that open source data is probably not the best source of information when it comes to tracking infections and helping establish a strategy for containing them. He reports:

“Observing these interactions take place in real-time can be invaluable to epidemiologists who have grown accustomed to data that is often retrospective – surveys and interviews that are conducted after an outbreak or disaster has occurred, and often subject to the faults and limits of human memory. But there are downsides to a reliance on data from networks such as Twitter, too. It can [be] hard to glean historical data in a real-time stream, or context about when, why, or how something has been posted or shared. The sample can also be biased, epidemiologists say – in part due to unreliable or absent information on gender, medical history, economic status and more. It’s hardly a problem unique to Twitter, but also affects data culled by Google Flu Trends and cultivated from blogs; the population segment that tends to use these online services and networks must be affluent enough to have access to the internet and internet-enabled devices, which isn’t necessarily the most representative sample. Or, as Queen’s University associate professor and Master of Public Health Program director Dr. Duncan Hunter says, rather succinctly, ‘Twitter is never going to collect information on the number of persons with intellectual disabilities.’ This doesn’t mean the data that can be gleaned from Twitter posts or Google search queries is necessarily wrong, or can’t be useful, but rather, shouldn’t be taken on its own. ‘It’s easy enough to make extrapolations from social media and stuff like that which might not actually be accurate,’ said Dr. Norman. … So while so-called big data sources and analysis techniques are helping epidemiologists ask more questions – and often better ones, at that – big data itself isn’t necessarily the answer.”

Big Data may play a more important role when looking back on an event. Munro points out that post-event analysis and visualization of Big Data can help provide insights into an outbreak, but useful, real-time data can be difficult to obtain and must be handled ethically by professionals who understand all of the ramifications.