Data lakes vs data commons in healthcare data liquidity

Big data expert Pam Baker discusses the need to make data multipurpose and readily available for a wide variety of analyses and researchers.

posted: Tuesday 23rd of December 2014 by Pam Baker

One of the biggest obstacles to using big data in healthcare and elsewhere is making the data multipurpose. When data is filed in folders and organized in storage in the traditional manner, it is effectively assigned a specific purpose or purposes. While that generally helps in finding the data again for those uses, it’s not particularly helpful if we come up with some other purpose for that same data later. How then do we know about, or find, this data for the new use? How do we make it liquid enough to flow to a variety of analyses?

Thus the search is on to find ways to make data multipurpose, meaning highly sharable, useful for many purposes, and readily available for a wide variety of analyses and researchers.

Among the more promising ways of making data effectively multipurpose and liquid is to form a data lake. But there’s another good way to solve this data management problem too, and we’ll compare the two in a moment.

The data lake approach

Essentially, a data lake is a repository where raw data is stored in its native format and sits until called up for analysis by authorized researchers. Each data element is named, i.e. given an identifier, and given a large set of extended metadata tags to help determine its relevance to any given question.
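To make the idea concrete, here is a minimal sketch in Python of the pattern just described: raw data kept untouched, an identifier per element, and metadata tags used to find it later. It is an illustration only, not how Partners HealthCare, EMC, or any real product implements a data lake. The names `LakeEntry`, `DataLake`, `ingest`, and `find`, along with the sample tags, are hypothetical.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class LakeEntry:
    """One raw data element plus the identifier and tags that describe it."""
    raw: bytes            # stored exactly as received, never reshaped
    native_format: str    # illustrative labels such as "csv" or "vcf"
    tags: dict = field(default_factory=dict)
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class DataLake:
    """Hypothetical in-memory lake: one flat pool of entries, no folders."""

    def __init__(self):
        self._entries = {}

    def ingest(self, raw, native_format, **tags):
        """Store raw bytes as-is and return the identifier assigned to them."""
        entry = LakeEntry(raw=raw, native_format=native_format, tags=dict(tags))
        self._entries[entry.entry_id] = entry
        return entry.entry_id

    def find(self, **wanted):
        """Return every entry whose tags match all of the requested values."""
        return [
            entry for entry in self._entries.values()
            if all(entry.tags.get(key) == value for key, value in wanted.items())
        ]


# Assumed sample data: two elements in different formats, tagged on the way in.
lake = DataLake()
lab_id = lake.ingest(b"glucose,140", "csv", domain="lab", condition="diabetes")
seq_id = lake.ingest(b"chr17 7675088 C T", "vcf", domain="genomics", condition="cancer")

# A question nobody planned for at filing time is answered through the tags.
cancer_entries = lake.find(condition="cancer")
```

Because nothing is filed under a single folder or purpose, a new question only needs a new combination of tags. The trade-off is that the lake is only as searchable as its tagging is thorough.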

One example of this approach is the recent development of Partners Data Lake by Partners HealthCare in conjunction with EMC Corporation.

“Realizing the vision of Big Science that is delivered from our extensive research and clinical development efforts, from basic research to cancer genomics, requires new technology collaboration,” said Dr. Shawn Murphy, Corporate Director of Research IS and Computing, Partners HealthCare in a statement.

“The Partners Data Lake offers us the chance to lead the world in novel treatments and diagnostics,” he continued. “Our ability to lead innovation will be greatly enhanced through our collaboration with EMC by having valuable data available for our researchers to perform analysis and develop applications more dynamically than ever before.”

Note that this particular data lake is available only to researchers in the Partners HealthCare network. This adds fluidity to the data amassed or imported by members of that network so that all members can use the information to best effect both for patient outcomes and members’ competitive advantage.

Members of Partners HealthCare include two academic medical centers, community and specialty hospitals, a managed care organization, a physician network, community health centers, home health and long-term care services, and other health care entities. They each stand to gain considerably from pooling their data in this way.

The Genomic Data Commons

Meanwhile, the University of Chicago and the National Cancer Institute (NCI) are teaming up on a different approach to sharing medical research data. They will create the Genomic Data Commons (GDC), a database in which they will harmonize and standardize genomic research data on cancers of all types.

“The Genomic Data Commons has the potential to transform the study of cancer at all scales,” said Robert Grossman, director of the GDC project and professor of medicine at UChicago, in a statement. “It supplies the data so that any researcher can test their ideas, from comprehensive ‘big-data’ studies to genetic comparisons of individual tumors to identify the best potential therapies for a single patient.”

Further, this data will be available to researchers throughout the U.S. Officials say it will be “an important foundation for the push toward personalized treatments for cancer.”

The pace of new discoveries is expected to increase substantially.

“With the GDC, the pace of discovery shifts from slow and sequential to fast and parallel,” said Conrad Gilliam, Ph.D., dean for basic science at the University of Chicago Biological Sciences Division in that same statement. “Discovery processes that today would require many years, millions of dollars, and the coordination of multiple research teams could literally be performed in days, or even hours.”

The reusability of the data is extremely important to speeding breakthroughs in diagnostics, treatments and cures.

“The availability of high-quality genomic data and associated clinical annotations is extremely important because this information can be combined and mined repeatedly to make new discoveries,” said Louis Staudt, Ph.D., M.D., director of NCI’s Center for Cancer Genomics.

The two approaches compared

The data lake approach dumps all data of any kind and in any format in a single repository whereas this “commons” approach standardizes the format of data related to a single subject, in this case cancer genomics data.
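The “commons” side of that contrast can be sketched as a harmonization step: records arriving in different shapes are mapped onto one standard schema before they are stored. The Python below is a minimal illustration under stated assumptions, not the GDC's actual software or schema. The source names (`site_a`, `site_b`), the field names, and the `harmonize` function are all hypothetical.

```python
# Hypothetical per-source maps from each contributor's field names
# to one shared schema: patient_id, gene, variant.
FIELD_MAPS = {
    "site_a": {"pid": "patient_id", "gene_symbol": "gene", "mutation": "variant"},
    "site_b": {"PatientID": "patient_id", "Gene": "gene", "Variant": "variant"},
}


def harmonize(record, source):
    """Translate one source record into the shared schema, or fail loudly."""
    mapping = FIELD_MAPS[source]
    standardized = {}
    for source_field, standard_field in mapping.items():
        if source_field not in record:
            raise ValueError(f"{source} record is missing field {source_field!r}")
        # Normalize every value to trimmed text so records compare cleanly.
        standardized[standard_field] = str(record[source_field]).strip()
    # Assumed convention for this sketch: gene symbols are stored upper-case.
    standardized["gene"] = standardized["gene"].upper()
    return standardized


# Two differently shaped inputs end up in the same form.
from_a = harmonize({"pid": 7, "gene_symbol": " tp53 ", "mutation": "R175H"}, "site_a")
from_b = harmonize({"PatientID": "12", "Gene": "TP53", "Variant": "R248Q"}, "site_b")
```

The work a data lake defers to analysis time is done here at ingestion time. That costs effort up front, but every later analysis can treat the records as directly comparable.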

Data lakes continue to grow as data is amassed over time, but the data can also become more disparate as new types and formats are added to the lake. Data management and fluidity issues thus become increasingly complex. Even so, protecting data fluidity by breaking data free of silos and limited organization is essential, and data lakes generally achieve this with varying degrees of success.

The GDC approach will also become increasingly complex as more data flows into it, yet it will largely remain effectively organized given the repository’s focus on one broad subject. One could argue that it is a data silo, albeit a huge one open to many researchers to mine. This model can also be easily and endlessly repeated as needed.

“The open-source software being developed by the GDC has the potential to become a model for data-intensive research efforts for other diseases such as Alzheimer’s and diabetes, which desperately need similar large-scale, data-driven approaches to develop cures,” explains the University of Chicago’s statement.

Which model you decide to use to make data more open and readily accessible for analysis is, of course, dependent on your own goals and strategies. The important thing is to begin using big data now rather than waiting until all your data is perfectly organized. If you wait, you’ll never be done with the housekeeping and patients will die in the meantime.

Pam Baker is a regular nuviun contributor, the editor of FierceBigData and author of Data Divination: Big Data Strategies. For more expert insights from Pam, follow her on Twitter @bakercom1 and at FierceBigData.

The nuviun blog is intended to contribute to discussion and stimulate debate on important issues in global digital health. The views are solely those of the author.