Bill Van Etten discusses the data dictionary and how BioTeam has been working with Bristol-Myers Squibb to build their new data commons. This article originally appeared in Bio-IT World in March 2021.
The problems are common in any large, data-centric organization. We don’t know what we have. We don’t know where it is. We need to be able to clean, combine, and search our data assets. A favorite solution is a data commons, an architecture for holding all of an organization’s data in common with well-defined connections. The idea is to make the data within an organization FAIR: findable, accessible, interoperable, and reusable.
The foundation of a data commons is the data dictionary: the map or model populated with all of the data within an organization and the relationships between them. The hardest part of FAIR is the “I” (interoperability), says Bill Van Etten, Senior Scientific Consultant at BioTeam. The data dictionary, he says, is what brings you from FAR to FAIR.
The data dictionary includes both technical information (for instance, an object’s S3 URL, who owns the data, and when they were created) and research metadata (project names, study names, analyses, demographic data, clinical data, etc.), Van Etten explained last year. Since 2019, he and BioTeam have been working with Bristol-Myers Squibb to build their new data commons, the BMS Genomics Data Hub.
That’s a lot of content to gather, organize, and arrange. And in the past, building such data dictionaries took a great deal of time, Van Etten said. Early on in the process of building the BMS Genomics Data Hub, there would be lots of meetings where user groups discussed the data dictionary, which is defined by node.yaml files, then ran tests and circled back, he said. Eventually the YAML files would be transpiled to JSON Schema files and sent to an Amazon S3 bucket. The microservices of Gen3, the platform on which the BMS data commons is built, would need to be restarted to ingest the new dictionary, and the process would start over for any new changes.
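As a rough illustration of the transpile step described above, the sketch below represents two dictionary nodes as Python dicts (standing in for parsed node.yaml files) and bundles them into a single JSON document. The node names, the field names (`id`, `links`, `required`, `properties`), and the bundle layout are assumptions made for illustration, not the actual Gen3 or BMS schema, and the upload to S3 is left out.

```python
import json

# Hypothetical nodes, standing in for the contents of parsed node.yaml files.
NODES = {
    "study": {
        "id": "study",
        "description": "A research study.",
        "links": [],
        "required": ["study_name"],
        "properties": {"study_name": {"type": "string"}},
    },
    "sample": {
        "id": "sample",
        "description": "A sample collected within a study.",
        "links": [{"name": "studies", "target_type": "study"}],
        "required": ["sample_id"],
        "properties": {
            "sample_id": {"type": "string"},
            "s3_url": {"type": "string"},
        },
    },
}


def bundle_dictionary(nodes):
    """Return one JSON string keyed by '<node id>.yaml' (an assumed layout)."""
    bundle = {f"{node['id']}.yaml": node for node in nodes.values()}
    return json.dumps(bundle, indent=2, sort_keys=True)


schema_json = bundle_dictionary(NODES)
```

In a real pipeline the resulting string would be written to a file and pushed to the bucket the platform reads from; that part depends on the deployment and is not shown here.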
It was a tedious development cycle, slowed by meetings between individuals from unrelated groups.
But Van Etten and John Jacquay, Scientific Systems Engineer at BioTeam, have been developing something better: collaborative dictionary authoring with GitHub and Travis integration.
We’re applying standard software development techniques to data dictionary development, Van Etten explained. Instead of everyone having to agree on a common dictionary from the start, research groups can branch the dictionary and add or modify dictionary nodes specific to their research domain. Git and continuous integration and continuous delivery (CI/CD) practices automate the building, testing, and deploying of data dictionaries. Now, Van Etten said, the process takes about 30 seconds.
GitHub and Travis bring different functionalities to the new process. Users can modify the data dictionary directly through GitHub, Van Etten explained. But even better is the continuous integration Travis offers. Within a few seconds of committing changes, Travis will test the dictionary, deploy it to S3, and offer a visualization of the new dictionary (using React components from Gen3) as a serverless web application from a static S3 website.
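The kind of automated test a CI job might run on each commit can be sketched in a few lines. The checks below (every required property is defined, every link points to a node that exists) and the node layout are assumptions chosen for illustration; the article does not say which tests the actual Travis job performs.

```python
def check_dictionary(nodes):
    """Return a list of problems found; an empty list means the dictionary passes."""
    problems = []
    for name, node in nodes.items():
        properties = node.get("properties", {})
        # Every property listed as required must actually be defined.
        for required in node.get("required", []):
            if required not in properties:
                problems.append(f"{name}: required property '{required}' is not defined")
        # Every link must point to a node that exists in the dictionary.
        for link in node.get("links", []):
            target = link.get("target_type")
            if target not in nodes:
                problems.append(f"{name}: link points to unknown node '{target}'")
    return problems


# Hypothetical dictionaries used only to exercise the checks.
GOOD = {
    "study": {"required": ["study_name"], "properties": {"study_name": {}}, "links": []},
    "sample": {
        "required": ["sample_id"],
        "properties": {"sample_id": {}},
        "links": [{"target_type": "study"}],
    },
}
BROKEN = {
    "sample": {
        "required": ["sample_id"],
        "properties": {},                      # sample_id is missing
        "links": [{"target_type": "study"}],   # no 'study' node defined
    },
}

good_problems = check_dictionary(GOOD)
broken_problems = check_dictionary(BROKEN)
```

A CI script would typically print the problems and fail the build when the list is not empty, so a broken dictionary never reaches the deploy step.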
Data Dictionaries For All
While the functionality was built specifically for collaborative dictionary authoring for the Gen3 platform, Van Etten emphasized that there is value here for any organization. Every group needs to understand how their data relate, he said.
Even without all of the Gen3 microservices, this open-source, serverless, collaborative data dictionary development tool allows users to visualize their data relationships graphically, revealing how all the data are interrelated.
Anyone can build a data dictionary, Van Etten contends. The only skill needed is the ability to author a YAML file. At BMS, some research groups are authoring the YAML files themselves; other groups are getting help from BioTeam.
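To make the idea of a graph view concrete, here is a minimal sketch that turns a dictionary's links into edges and emits Graphviz DOT text. It is not the Gen3 React visualization; the node names and the `links` / `target_type` fields are hypothetical, as is the choice of which node links to which.

```python
# Hypothetical nodes: each one lists the nodes it links to.
NODES = {
    "study": {"links": []},
    "subject": {"links": [{"target_type": "study"}]},
    "sample": {"links": [{"target_type": "subject"}]},
}


def dictionary_edges(nodes):
    """Return (source, target) pairs, one per link, in sorted node order."""
    edges = []
    for name in sorted(nodes):
        for link in nodes[name].get("links", []):
            edges.append((name, link["target_type"]))
    return edges


def to_dot(nodes):
    """Render the dictionary as a Graphviz DOT digraph."""
    lines = ["digraph dictionary {"]
    for name in sorted(nodes):
        lines.append(f'  "{name}";')
    for source, target in dictionary_edges(nodes):
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(NODES)
```

The DOT text can be handed to any Graphviz renderer to draw the node-and-edge picture.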
There will be foundational terms that must be defined, Van Etten said. For instance, at BMS, “We are calling a study a study. We agreed what ‘biospecimen’ vs ‘sample’ vs ‘patient’ means. We agreed on the definition of ‘subject.’” But in general, Van Etten advocates for just letting people get started. Any connections can be fixed after the fact. “Like in software engineering: carve out a part of the problem for yourself, fix that problem, and submit a pull request.”
BMS has made the feature open source. Anyone can clone the repository and use it to develop their own dictionaries. In fact, Van Etten pointed out, anyone can look at the data dictionaries of the Gen3 data commons listed at stats.gen3.org. Append /DD to the end of the URLs to explore each group’s dictionary. Toggle between graph and table view in the left column. Anyone can see (and use!) the nodes each commons has defined.
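The URL step is simple enough to script. The helper below is a sketch that appends /DD to a commons base URL as the text describes; the function name and the example address are placeholders, not a real commons.

```python
def dictionary_url(commons_url):
    """Append /DD to a commons base URL, tolerating a trailing slash."""
    return commons_url.rstrip("/") + "/DD"


# Placeholder address used only for illustration.
example = dictionary_url("https://commons.example.org/")
```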
Of course, that raises the question: is there any security risk inherent in exposing your data organization model this way? Absolutely not, Van Etten says. No one is sharing their data; they are only sharing the schema. In fact, Van Etten and others hope that all of pharma will coalesce around a common dictionary. “A dictionary is not a competitive advantage,” he said. “Two pharma could share the same dictionary and still be competitors.”