ProPublica borrowed machine learning methods from academic research to better understand links between forest loss and spillover risk. The results were surprising, but led us to a story we wouldn’t have found otherwise.
This year at ProPublica, we’ve paired computer modeling with traditional reporting to explore questions around viral outbreaks: What causes them and what can be done to prevent the next big one?
One of the most feared diseases is Ebola, which kills about half the people it infects and has shown that it can pop up in unexpected countries such as Guinea. The virus jumped from a wild animal to a human there in 2013, leading to an epidemic that ultimately left 11,000 dead around the globe.
Researchers studying how outbreaks begin have learned that deforestation can increase the chances for pathogens to leap from wildlife to humans. Jesús Olivero, a professor in the department of animal biology at the University of Malaga, Spain, found that seven Ebola outbreaks, including the one that started in Meliandou, Guinea, were significantly linked to forest loss. We found that, around five of those outbreak locations, forests had been cleared in a telltale pattern, increasing the chances that humans could share space with animals that might harbor the disease.
We wondered: Could we use what we learned about these locations to find places that had not yet experienced outbreaks but could be at risk for one? Were there places where Ebola could emerge that look a lot like Meliandou did in 2013?
With the help of epidemiologists and forest-loss experts, along with one of ProPublica’s data science advisers, Heather Lynch, professor of ecology and evolution at Stony Brook University, we developed a machine-learning model designed to detect locations that bore striking similarity to places that had experienced outbreaks.
The result? Out of a random sample of nearly 1,000 locations across 17 countries, ProPublica’s model identified 51 areas that, in 2021 (the most recent year for which satellite image data on forest loss was available at the time of our analysis), looked a lot like places that had experienced outbreaks driven by forest changes.
These locations fell within forested zones in Africa that have wildlife believed to be carrying Ebola; that had recently experienced extensive forest fragmentation (that is, clearing of forests in many small, disconnected patches); and that have a population baseline that could sustain an outbreak if one emerged. To our surprise, 27 of the locations were in Nigeria, where an Ebola outbreak has never started.
After reviewing our findings, one of the researchers we consulted, Christina Faust, a research fellow at the University of Glasgow, Scotland, called the analysis a “best estimate of risk,” in light of the many outstanding questions about how Ebola arises.
“You’ve clearly identified ecological features that are consistent across the spillover locations,” Faust said. “And these ecological conditions and human conditions are cropping up in other places. And given that we don’t know so much about the reservoirs, I think this is our kind of best ability to do a risk analysis.”
Why Random Forests
This model grew out of an earlier analysis we published in February. We used satellite imagery and epidemiological modeling to show that villages where five previous Ebola outbreaks occurred, including Meliandou, Guinea, the site of the worst Ebola outbreak in history, are at a greater risk of spillover happening today.
In five locations where outbreaks had occurred, we found a distinctive pattern in how forests erode over time. At the highest level of fragmentation, the areas where humans and virus-carrying animals might interact, or “mixing zones,” are largest, and risk is at its peak. But after the forest becomes so eroded by human activity that it can’t sustain wildlife anymore, risk decreases.
That analysis focused on the research led by Olivero and an epidemiological model created by Faust and her colleagues that tracked how spillover risk changes as forests become increasingly fragmented. But there was also other intriguing research on the link between land use and Ebola spillover that caught our attention.
As part of the first project, we created a data set of ecological characteristics from satellite imagery. We were curious if some of the factors, like the number of forest patches or proportion of mixing zones around those patches, could shed additional light on how susceptible a location could be to disease spillover.
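To make those two characteristics concrete, here is a minimal Python sketch. It is not ProPublica’s actual pipeline. It assumes a satellite image has already been reduced to a small grid in which 1 marks forest and 0 marks cleared land. The function names, the toy grids and the definition of a “mixing zone” cell (a cleared cell that directly touches forest) are illustrative assumptions, not details taken from the analysis.

```python
from collections import deque

# Up, down, left, right: two cells are "connected" only if they share a side.
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def count_forest_patches(grid):
    """Count groups of side-connected forest cells (value 1) in a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    patches = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Found a forest cell not yet assigned to a patch: flood-fill it.
            patches += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return patches


def mixing_zone_fraction(grid):
    """Share of all cells that are cleared (0) and touch at least one forest cell.

    This is a stand-in definition of a mixing zone, assumed for illustration.
    """
    rows, cols = len(grid), len(grid[0])
    mixing = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0:
                continue
            for dy, dx in NEIGHBORS:
                ny, nx = r + dy, c + dx
                if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 1:
                    mixing += 1
                    break
    return mixing / (rows * cols)


# Hypothetical 4x4 landscapes: one nearly intact, one heavily fragmented.
INTACT = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]
FRAGMENTED = [
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
]

for name, grid in (("intact", INTACT), ("fragmented", FRAGMENTED)):
    print(name, count_forest_patches(grid), mixing_zone_fraction(grid))
```

In this toy example, the fragmented grid has four forest patches instead of one, and half of its cells are cleared land bordering forest, compared with about 6% in the nearly intact grid.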
Months in, we asked ourselves: Could we combine the 23 environmental and population characteristics and what we learned from work by Olivero, Faust and Rulli into a single model? Could such a model reveal new insights into the conditions related to forest change that make it possible for Ebola to jump from animals to humans?
On the advice of Lynch, our science adviser, we started by looking for any clear patterns or clusters among the characteristics.
But after squinting at lots of tiny scatter plots, nothing jumped out. This wasn’t entirely unexpected, because we had only seven outbreaks to compare. When the characteristics far outnumber the events you’re interested in, it can be hard to tease out clear relationships. So Lynch suggested something straight from her own research playbook: decision trees and random forests.
Decision trees, Lynch explained, are machine learning algorithms that create chains of binary decisions to help distinguish groups from one another. We hoped they could help us find places that looked a lot like locations where Ebola outbreaks had occurred. These trees — not to be confused with the leafy trees in our forest data — are useful because they can sort and cluster data based on combinations of characteristics that might not be obvious when considering each individually, and flag potential matches.
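The idea of a chain of binary decisions can be sketched in a few lines of Python. What follows is a simplified illustration of a single decision tree, not the model ProPublica built. The eight sites, their two characteristics (a patch count and a mixing-zone share), the labels and every function name are made up for the example. The tree picks each yes-or-no question by minimizing Gini impurity, a common splitting rule that we are assuming here.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Find the (feature, threshold) question that best separates the labels.

    Returns None when no question improves on leaving the group unsplit.
    """
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds sit halfway between neighboring observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf (0 or 1) or a node (feature, threshold, left, right)."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return int(2 * sum(labels) >= len(labels))  # majority vote at the leaf
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Follow the chain of binary decisions until reaching a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical sites: [forest patch count, mixing-zone share]. Invented data.
SITES = [
    [2, 0.10], [3, 0.52], [12, 0.15], [14, 0.50],
    [11, 0.55], [4, 0.20], [13, 0.12], [15, 0.48],
]
# 1 = resembles an outbreak location, 0 = does not (labels are made up).
LABELS = [0, 0, 0, 1, 1, 0, 0, 1]

tree = build_tree(SITES, LABELS)
print(tree)
print(predict(tree, [13, 0.47]))  # a new, unseen hypothetical site
```

In this made-up data, neither characteristic separates the two groups on its own. The tree first asks about the mixing-zone share and then about the patch count, and only that combination sorts every site correctly. That is the kind of interaction that is hard to spot by looking at one scatter plot at a time.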