Jenna Wiens isn’t a medical doctor. But someday, her work might save your life. It won’t be because she developed a new medicine or invented a revolutionary surgical procedure. Instead, you might owe your extra years to one of her algorithms.
Wiens, an assistant professor of computer science and engineering at U-M, is one of an army of data scientists and other engineers who are descending on healthcare to tackle what could be the most massive data science puzzle the world has ever seen: a movement to transform medicine by harnessing information about patients much more effectively. The effort, known broadly as precision medicine, is expected to help doctors customize treatments to individual patients’ genetic makeup, lifestyle and risk factors, and predict outcomes with significantly higher accuracy.
One major branch of precision medicine is the development of big-data tools to customize treatments. Experts envision a future in which doctors and hospitals can draw on a web of constantly churning analytical tools that mash up data from a huge variety of sources in real time – for instance, your electronic health record, genomic profile, vital signs and other up-to-the-moment information collected during a hospital stay or via a wearable monitor. It could give doctors and hospitals the ability to make meticulously informed decisions based on an analysis of your entire medical history, from birth to right now.
Researchers like Julia Adler-Milstein, an assistant professor at the U-M School of Information and School of Public Health, say that in some ways, today’s move to data-driven medicine is similar to the move to data-driven retailing that took hold over a decade ago. Online sellers like Amazon have compiled exhaustive data stockpiles, analyzing years of browsing and purchase history for millions of customers, then using that past data to predict what you’ll buy next. It’s obsessively detailed, computationally advanced, and sometimes a little creepy – and it has revolutionized how consumer goods are sold. If a computer can analyze your purchase data to predict what you’ll buy next, why can’t it analyze your medical data to predict whether you’ll get sick? It’s an enticing question for doctors and data scientists alike. But, as with most things in healthcare, it’s complicated.
“Amazon’s decisions are tightly intertwined with data. But healthcare has only started to evolve the model of decisions based on physician expertise,” Adler-Milstein explains. “That has always been exciting to me, and learning how to integrate data and information technology pieces into the complexity of healthcare is especially fascinating.”
Perhaps it’s no surprise that bringing healthcare data into the 21st century is tougher – far tougher – than building an algorithm that suggests new socks to go with your new shoes. The stakes are higher. The regulations are tighter. The costs are greater. And there’s more data. So much more data.
Eric Michielssen, the Louise Ganiard Johnson Professor of Engineering at U-M, predicts that the amount of healthcare data generated annually worldwide will rise to 2,300 exabytes (2.3 trillion gigabytes) by 2020. And new data sources are coming online all the time, from wearable sensors to new kinds of imaging and video data. It’s predicted that healthcare data will eventually push past traditional scientific data hogs like astronomy and particle physics. Much of this is due to genomics data, which gobbles up so much space that there isn’t a cloud big enough to hold it – scientists are largely limited to on-site data storage. One researcher even joked that “genomical” might soon overtake “astronomical” as a term for incredibly large things. And much of this data is piling up across a fragmented hodgepodge of systems that were never meant to work together, at hospitals and other healthcare players that are often reluctant to share it.
Syncing up an ocean of fragmented, inconsistent data with the advanced analytics and databases that will drive precision medicine knowledge seems like an impossible problem. But maybe that’s what makes it so attractive to engineers. Today, more of them are working in healthcare than ever before, at U-M and elsewhere. Biomedical experts, data scientists, and electrical and computer engineers are dedicating their careers to it, and healthcare providers, tech companies and the government are investing massive amounts of resources. With the challenge ahead of them, they’ll need it.
Fighting Infection with Data
Wiens, for her part, is working on front-line tools – the system of analytics and other digital machinery that will turn raw data into knowledge that doctors can use to make better decisions. Among her projects is a tool that predicts which hospital patients are at risk of developing a life-threatening intestinal infection called Clostridium difficile, or C. diff. The disease has evolved into an antibiotic-resistant superbug at hospitals, where it affects an estimated 500,000 patients per year in the United States alone.
The actions that can slow the spread of C. diff can be surprisingly simple, like moving high-risk patients to private rooms or limiting their movement around the hospital. The trouble is, doctors don’t have a good way of figuring out who’s at risk.
Wiens’ team is solving the problem with machine learning, a technology that’s already widely used in online marketing and retailing and is gaining ground in precision medicine. It enables computers to “learn” by combing through vast pools of data, using elaborate mathematical algorithms to compare pieces of information and look for obscure connections. They then use those connections from past data to make predictions about the future.
Data is the raw material that makes tools like this possible. And the team gained access to a lot of it at the project’s outset: the entire electronic health record for nearly 50,000 hospital admissions at a large urban hospital. The data also included demographic information and detailed records of each hospital stay: vital signs, medications, lab test results, even each patient’s location in the hospital and how prevalent C. diff was in the hospital during the stay.
Armed with this cache of data, they set out to build a tool that could estimate a patient’s risk of developing C. diff by going far beyond known risk factors and analyzing thousands of variables in a way that humans can’t. It would look for relationships between variables, calculate how those relationships change during the course of a patient’s stay, and turn it all into a numerical score that estimates an individual patient’s probability of becoming infected with C. diff during their hospital stay.
It was a tall order, particularly because of the complex way the risk factors change during the course of a hospital stay. So the team used what are called multi-task learning techniques. Multi-task learning breaks a single task into several individual problems, looks for common threads and connections between each problem, then combines them into a single model.
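To give a feel for the idea, here is a minimal sketch of one common way to do multi-task learning: every task gets its own parameters, but those parameters are written as a shared part plus a small task-specific offset, so the tasks borrow strength from each other. This is an illustration under stated assumptions, not the team's actual method. The function names, the toy data, the penalty weights and the choice of plain gradient descent are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(shared, offsets, k, x):
    # Probability for task k: its parameters are shared + offsets[k].
    z = sum((s + o) * xi for s, o, xi in zip(shared, offsets[k], x))
    return sigmoid(z)


def train_multitask(tasks, n_features, lam_shared=0.01, lam_task=0.1,
                    lr=0.1, epochs=200):
    """Fit logistic models for several related tasks at once.

    tasks: list of datasets, one per task; each dataset is a list of
    (x, y) pairs with x a list of 0/1 features and y in {0, 1}.
    The penalties are (lam/2) * squared norm, so their gradient is lam * w.
    """
    shared = [0.0] * n_features
    offsets = [[0.0] * n_features for _ in tasks]
    for _ in range(epochs):
        grad_shared = [lam_shared * w for w in shared]
        grad_offsets = [[lam_task * v for v in off] for off in offsets]
        for k, data in enumerate(tasks):
            for x, y in data:
                err = (predict(shared, offsets, k, x) - y) / len(data)
                for j in range(n_features):
                    g = err * x[j]
                    grad_shared[j] += g      # every task updates the shared part
                    grad_offsets[k][j] += g  # only task k updates its offset
        for j in range(n_features):
            shared[j] -= lr * grad_shared[j]
            for k in range(len(tasks)):
                offsets[k][j] -= lr * grad_offsets[k][j]
    return shared, offsets


# Hypothetical toy data: two related tasks, two binary features.
toy_tasks = [
    [([1, 0], 1), ([1, 0], 1), ([0, 1], 0), ([0, 1], 0)],
    [([1, 0], 1), ([1, 1], 1), ([0, 1], 0), ([0, 0], 0)],
]
shared, offsets = train_multitask(toy_tasks, n_features=2)
print(predict(shared, offsets, 0, [1, 0]), predict(shared, offsets, 0, [0, 1]))
```

Penalizing the offsets more heavily than the shared part nudges the model to explain as much as it can with what the tasks have in common, which is the "common threads" intuition described above.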
The team includes dozens of experts on infectious disease and machine learning; its founding members include John Guttag, a professor in the MIT Department of Electrical Engineering and Computer Science, and Eric Horvitz, Technical Fellow and Managing Director at Microsoft Research. They started by crunching the patient data into binary variables that a computer can understand, ending up with around 10,000 binary variables per patient, per day. They then broke the task into six individual machine learning problems (see equation).
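As a rough illustration of what "crunching data into binary variables" can look like, the sketch below turns one patient-day record into 0/1 indicators: one per possible category value and one per numeric range. The field names, category lists and bin edges are invented for the example, and the team's real encoding may well differ.

```python
def binarize_patient_day(record, categorical_vocab, numeric_bins):
    """Convert one patient-day record into a dict of 0/1 indicator features.

    categorical_vocab: field -> list of possible values (one indicator each).
    numeric_bins: field -> sorted bin edges; one indicator per [low, high) range.
    All field names used here are hypothetical.
    """
    features = {}
    for field, values in categorical_vocab.items():
        for value in values:
            features[f"{field}={value}"] = 1 if record.get(field) == value else 0
    for field, edges in numeric_bins.items():
        reading = record.get(field)
        for low, high in zip(edges[:-1], edges[1:]):
            inside = reading is not None and low <= reading < high
            features[f"{low}<={field}<{high}"] = 1 if inside else 0
    return features


# Hypothetical record for a single day of a hospital stay.
record = {"ward": "ICU", "heart_rate": 104}
vocab = {"ward": ["ICU", "ER", "general"]}
bins = {"heart_rate": [0, 60, 100, 250]}

features = binarize_patient_day(record, vocab, bins)
print(features)
```

Two fields become six binary variables here, which hints at how a full health record could expand into thousands of variables per patient, per day.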
Doing the Math
Wiens’ team used the optimization problem above to learn a predictive model that calculates a patient’s daily likelihood of contracting a C. diff infection during a hospital stay.
They used multi-task learning to calculate a set of risk parameters (θ) by analyzing the electronic health records from a large set of hospital stays. Patient data (x) included a variety of clinical information — some of which may change over time, such as patient location, vitals and procedures — and some of which remains the same, such as patient demographics and admission details. C. diff infection status was represented by y. The expression considers each day of the stay (t), taking into account that a person’s status at admission matters less as the patient spends more time in the hospital. To reflect this, the risk parameters vary over the course of a hospital stay.
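One plausible way to write an objective that matches this description is shown below. It is a sketch under assumptions, not the equation the team published: it assumes a logistic loss with y coded as -1 or +1, splits the stay into K time periods (six, if each of the learning problems mentioned earlier corresponds to one period), and uses two hypothetical penalty weights, λ1 and λ2. The set D_k is assumed to hold the patient-days (i, t) that fall in period k.

```latex
\min_{\theta_1,\dots,\theta_K}\;
\sum_{k=1}^{K} \sum_{(i,t)\in\mathcal{D}_k}
  \log\!\left(1 + \exp\!\left(-y_i\,\theta_k^{\top} x_{i,t}\right)\right)
\;+\; \lambda_1 \sum_{k=1}^{K} \lVert \theta_k \rVert_2^2
\;+\; \lambda_2 \sum_{k=1}^{K-1} \lVert \theta_{k+1} - \theta_k \rVert_2^2
```

In this form, the first term rewards parameters that fit the observed infections, the second keeps each period's parameters small, and the third ties neighboring periods together so that the risk parameters can drift over the stay without jumping arbitrarily from one period to the next.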
Wiens’ group calculated the set of risk parameters (θ) for different time periods simultaneously by finding the set of values that minimize the objective function given above. Then, they used these parameters to build a model that produces a daily risk score, estimating a patient’s probability of acquiring C. diff during a particular hospital stay.
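To make that last step concrete, here is a minimal sketch of how a set of already-learned, period-specific parameters could be turned into a daily risk score. The parameter values, the two-period split and the function names are made up for illustration, and a logistic model is assumed.

```python
import math


def daily_risk(theta_by_period, period_of_day, x_by_day):
    """Return one risk score (a probability between 0 and 1) per day of a stay.

    theta_by_period: list of parameter vectors, one per time period.
    period_of_day: function mapping a day index to a period index.
    x_by_day: list of binary feature vectors, one per day.
    """
    scores = []
    for day, x in enumerate(x_by_day):
        theta = theta_by_period[period_of_day(day)]
        z = sum(w * xi for w, xi in zip(theta, x))
        scores.append(1.0 / (1.0 + math.exp(-z)))  # logistic link
    return scores


# Hypothetical parameters for an "early" and a "late" period of the stay.
theta_by_period = [[1.0, -0.5], [0.2, 0.8]]


def period_of_day(day):
    # Assumed split: days 0-1 are "early", later days are "late".
    return 0 if day < 2 else 1


x_by_day = [[1, 0], [1, 1], [0, 1]]
scores = daily_risk(theta_by_period, period_of_day, x_by_day)
print([round(s, 3) for s in scores])
```

Because the parameters switch between periods, the same feature can count for more or less depending on how long the patient has been in the hospital, echoing the point that admission status matters less as the stay goes on.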
Finally, they set the computer to work trawling through the data to build (or “learn”) a model. When the dust cleared, their learning algorithm found connections between C. diff and everything from patients’ specific medication history to their location in the hospital. It was a model that no human could have come up with, and a far cry from the quick bedside analysis that doctors rely on today.
Testing showed that their model was more effective at predicting which patients would get C. diff than current methods, correctly classifying over 3,000 more patients per year in a single hospital. Perhaps most importantly, it predicted who was at risk nearly a week earlier, providing more time to identify high-risk patients and take potentially life-saving action.
Wiens says the computing power needed to run the model at a hospital is minimal. It’s already being integrated into one major hospital’s operations and could be rolled out at others in as little as a year, crunching actual patient data in real time to identify high-risk patients and alert doctors.
Wiens’ model is just one of many data-driven analytical tools that doctors may one day use to make better decisions and tailor treatments and medications to individual patients. Similar tools could predict which patients will suffer complications from heart surgery, more precisely target medications based on genomic and lifestyle data, and even predict the progression of complex diseases like Alzheimer’s and cancer.
“Health data is going to be valuable in ways that we don’t even understand yet,” Wiens said. “It’s going to move us away from a one-size-fits-all healthcare system and toward a model where physicians make decisions based on data collected from you and millions of others like you.”
But getting there isn’t just a matter of doing the math. It’ll take a new level of collaboration between data scientists, hospitals and others in the healthcare community. And in the world’s most fragmented healthcare system, that could be even tougher than it sounds.
Good Data is Hard to Find
To build the kind of health system Wiens and others envision, we’ll first need a better health data system. And that’s an area where the go-go world of computer science collides messily with the more cautious culture of medicine. Hospitals are collecting more data than ever, but most of it is sitting idle on proprietary record systems that weren’t designed to talk to each other. And for healthcare providers, sharing comes with risks: they worry about giving away secrets to competitors, angering patients, running afoul of vague privacy regulations and a variety of other pitfalls.
But that data is the lifeblood of the work that researchers like Wiens are doing. There are some large storehouses of data, and in fact U-M has one of the largest stores of genomic data in the world. But there’s no central source of broad, widely accessible data. And that limits what researchers can do.
“Sometimes scientists and doctors don’t know what to look for, and I think that’s this millennium’s challenge.” – Barzan Mozafari, Assistant Professor of Computer Science and Engineering, U-M
Wiens believes that the pace of discovery could increase dramatically if more data were publicly available. It would enable multiple researchers to use the same set of data, leading to more consistent research results and making it easier for researchers to verify each other’s findings. It would also mean that research topics would less often be limited by the types of data available.
“My work is about taking data and turning it into knowledge, and sharing data publicly would be such a game changer for the field,” she said. “There’s so much data out there, but we don’t have access to the vast majority of it.”