ARPA-H Biomedical Data Fabric Toolbox

The Big Question

What if new data integration tools made it possible to extract more value out of data?

The Problem

Each time a health research study is conducted, data is collected and analyzed to find ways to improve health. These datasets, while powerful individually, could be far more useful when pooled with research from across diseases and diverse populations. However, building a large, common data pool in which datasets can be reasonably compared is difficult: datasets are stored on different platforms, access to them is limited, and sharing data while preserving privacy remains a challenge. These barriers stymie researchers trying to analyze critical biomedical data and make it difficult to leverage data from thousands of labs, hospitals, and centers, because each entity tends to organize and manage data using incompatible biomedical dialects.

The Current State

Today, many data science efforts seek to leverage established technologies and operationalize data infrastructure platforms to make data findable, accessible, interoperable, and reusable (“FAIR”). However, these technologies often fail to improve the quality, standardization, and timely availability of data collected across thousands of labs and hospitals.

Current software for experimental research falls short of consistently capturing data provenance, calibration information, and protocol specifications with the fidelity needed to reliably test experimental reproducibility across different labs. Established technologies are also limited in their ability to integrate data from multiple sources and to support intuitive multi-source exploration and analysis by a range of users, including analysis that relies on artificial intelligence/machine learning (AI/ML).

The Challenge

The ARPA-H Biomedical Data Fabric (BDF) Toolbox seeks to make it easier to connect biomedical research data from thousands of sources and overcome barriers caused by incompatible data dialects. The BDF Toolbox effort will seek to advance capabilities in five areas: (1) lowering barriers to high-fidelity, timely data collection in computer-readable forms, (2) preparing data for multi-source analysis at scale, (3) enabling advanced and intuitive data exploration, (4) improving stakeholder access while maintaining privacy and security measures, and (5) generalizing biomedical data fabric tools across disease types. Together, these novel data fabric capabilities will lower the barriers associated with data collection, reduce the time needed to integrate new data sources, and improve data usability by community members across disciplines and biomedical literacy levels.

The Solution

The ARPA-H BDF Toolbox Combined Module Announcement called for innovative proposals for research and development (R&D) in data integration and usability technologies. Proposed R&D will investigate innovative software approaches that enable revolutionary advances in the collection and usability of biomedical datasets originating from thousands of research labs, clinical care centers, and other data sources, with the aim of accelerating technical innovation across the health ecosystem.

ARPA-H has partnered with several institutes and centers at the National Institutes of Health to tackle this problem, with funding available through ARPA-H and the National Cancer Institute in partnership with Frederick National Lab (FNL).

Why ARPA-H

The BDF Toolbox builds on National Cancer Institute efforts to operationalize data infrastructure platforms. Through aggressive program goals, ARPA-H aims to develop new capabilities for collecting, sharing, and analyzing biomedical data.