Our founder Stan Gloss and Matthew Trunnel, data-commoner at large, discuss trends in data science, genomics, COVID-19, and the future of interdisciplinary research. This article is part of the Trends from the Trenches series.
Stan Gloss: Let’s go back 20 years to when we first met each other. Data didn’t seem like an issue back then. Why not?
Matthew Trunnell: Twenty years ago, the volume of data was not tremendous. GenBank was still being distributed on CDs. The challenge was the kind of computation that people wanted to do. BLAST was the big consumer of CPUs, doing all-versus-all kinds of comparison searches. Scientists were outgrowing their desktop workstations, and that was really the opportunity at the time.
We were still looking at results more than data. When I got involved in this pre-Blackstone (pre-2000), working at Genome Therapeutics Corporation, there was a data product, this pathogen database. Celera and others were getting into the data product space. But the idea of generalized data reuse was just not there really.
We had reference data and we had primary data, but we weren’t actually making secondary use of primary data.
Stan Gloss: The NCBI provided a centralized resource for genomic data. Was some data being reused?
Matthew Trunnell: I think genomic data was the leading edge of data reuse because genomic data—even if it wasn’t collected for the purpose that you are interested in—could provide a reference that would increase the statistical power of your analysis. And in many ways the introduction of short-read genomic data was the beginning of data science and data engineering in our space.
Stan Gloss: Why do you say that?
Matthew Trunnell: That was when we started seeing the Bayesian modelers come in, along with the tools that were developed for analyzing short-read data, like GATK. GATK is a statistical tool package, very different from what we were using to analyze capillary data. So you have the biostatisticians doing conventional statistics, while GATK and most of the modeling that continues to be done to analyze genomic data, whether it’s GWAS or the primary upstream analysis associated with alignment, is probabilistic modeling.
So when we started seeing a lot of short-read sequence data, we suddenly needed not just software engineers in the lab to make the data coming off the instrument usable, but we actually needed data engineers and data scientists in the lab. We hadn’t needed that before. I would argue that data engineering was just not very big for most organizations. The Broad was out in front of it and kind of hit it in 2007, but it was several years after that before other organizations started to feel that same organizational gap: the gap between IT, which knew how to store and manage data and the associated compute, and research, which knew about the data.
That was a fine separation for a long time. IT didn’t need to understand the contents of the data in order to run the investigators’ tools. But we came to a point where investigators began to outgrow their familiar tools (because of data size/complexity), and it wasn’t going to be IT that solved that, at least not the IT organizations of yesterday.
And that’s when data, for me, started becoming a thing. There were various aspects of data that became more important. The first is a very practical thing, the storage of genomic data. In this case we were spending $6M/year on new storage hardware to store data in a form we knew had not been optimized for storage. This was not a problem IT could solve by itself because it required a research-level understanding of the data. Yet it was not only a research problem either.
Stan Gloss: When high-throughput short-read DNA sequencers started coming online, was that a defining moment in data production?
Matthew Trunnell: Yes. On the life sciences side and the research side two things happened: one is that it removed a bottleneck in data generation. One of the rate-limiting factors in genomic analysis certainly was data generation. People would spend their whole careers isolating a single gene. Suddenly we had this ability to generate data with tremendous velocity. But the second thing is that the data being spewed out of these machines weren’t usable by the average researcher.
The whole introduction of data engineering and data science into this space was how to distill that data coming off the sequencers into something that could be usable.
Stan Gloss: These sequencers first landed in the big sequencing centers like the Broad and then eventually proliferated everywhere else.
Matthew Trunnell: The NIH-funded sequencing started with just three sequencing centers. It was pretty capital intensive, so in order for a site to be competitive in that, there was a tremendous capital investment. But being able to do it efficiently meant doing it at scale. Eventually the cost came down to the point that individual sequencers were manageable by labs.
Stan Gloss: Was the explosion of data driven by Moore’s law and the ability to put more and more powerful compute processors into the laboratory instruments?
Matthew Trunnell: Short-read sequencing was really about biochemistry and molecular biology. In fact, the engineering part and the computing part were really not very good for the first couple of years. We used to joke that when the Solexa machine stopped running, it was because the rubber bands broke, which was actually literally true. They weren’t good mechanical engineers, but they were biochemical engineers.
Stan Gloss: It seems like what you’ve described is kind of still where we are today. There’s a lot of data out there with many different data formats from many different types of instruments that are still pumping out tons and tons of data. Now more and more people want to ask better questions of more data, but they find themselves stuck in some ways.
Matthew Trunnell: Yes, it’s not just the sequencing. We’re seeing now advances in microscopy and CryoEM. It’s the advances in the laboratory technology that are driving the volumes of data that are overwhelming the existing infrastructure for data management and analysis. Microscopy has just been continuing to drive technology. The thing that advanced even faster than Moore’s law for CPUs was the performance and reduced cost of CCDs, the sensors in digital cameras. And so that CCD technology has also had an impact in this space. Flow cytometry is another great example. And then in the last two years there has been this rise of single cell, which of course is all of this tied together.
So, where we find ourselves now is A, we have a ton of data and B, we’re generating large scale multimodal data.
And so now we find ourselves back in this place of, okay, I have high resolution imaging data and I have RNASeq data. How do I analyze these together?
Stan Gloss: One of the things that I’ve noticed in traveling around and talking to people is they still have a problem with their data being locked up in silos. The way in which organizations are structured by dividing people up by specialties has created a culture of silo building. You’ve been part of the Broad and the Hutch and all these places. What do you think about the culture of scientists? Do we need to start thinking also about people engineering, not just necessarily technological engineering or data engineering? Where do people come into this?
Matthew Trunnell: The technology systems we’ve seen are a reflection of the organizations. That’s the way the technology evolves. There are two things that drive this in my mind in this space. One is that biology itself has been traditionally a very reductionistic practice, right? You’ve got your cell biologists and your molecular biologists, and you look at the NIH and you divide everything up by disease area. There’s no single center or institute at NIH, for example, that deals with inflammation or remodeling. And yet we know that spans so many different diseases, but it doesn’t fit into the NIH organizational structure. This is also how academia works. Academia is built around individual labs and all the incentives are built around individual labs.
So, we had a culture that still is to a large degree emphasizing silos. It’s hard to lead that silo busting with technology because you still run into all the cultural and social issues around data sharing and the sort of fundamental imbalance around the cost benefit of data sharing: to make data shareable requires more effort on the data producer’s side, but the benefit is realized by the data consumer and not the data producer. And that fundamental imbalance will continue to drive things toward silos without some investment of effort on the part of organizations.
Stan Gloss: You actually talk quite a bit about silos in the chapter on Democratizing the Future of Data in Brad Smith’s book “Tools and Weapons”.
Matthew Trunnell: That conversation has been going on for a long time. The incentives are misaligned to make that change and there are organizational barriers. My focus has been almost entirely organizational for the last five years. I have no technical capabilities anymore; I am an organizational engineer.
Depending on what field you’re in, I think there’s more and more awareness of the value of cross-disciplinary research. Data science has been a great example of this.
Typically, data science is about joining domain expertise with the computational and statistical expertise, which is less common to find in a single person. So, data science institutes are about collaboration.
If you look at the Moore Foundation’s report on the three data science institutes they funded—Berkeley, NYU, and University of Washington—they stress the value of collaboration and having physical collaborative spaces for interdisciplinary groups to come together.
Last week Berkeley announced a $252 million gift, the largest ever. They’re instantiating a new data science department joining together stats and computer science and some others. That is the future: it’s cross-disciplinary. And I think in all honesty that’s going to happen in some places reasonably quickly, and then we on the data technology side are going to be scrambling to catch up again.
Stan Gloss: Is the way in which scientists gain attribution for their research a barrier to sharing data?
Matthew Trunnell: Absolutely, that’s because attribution is the currency of academia. So the entire academic system is built around attribution. It’s an interesting time right now with COVID-19, where organizations will make a big deal about sharing data immediately. Like, oh, we’re doing this for the public good, and I don’t understand why we can’t push people to the realization that that’s exactly the same thing with cancer. Cancer is a bigger epidemic than COVID-19. It’s just not spreading the same way. The value of putting data out into the public is exactly the same and I find it frustrating that we’ll pat ourselves on the back for sharing some data while we continue to go on and hoard other data.
Stan Gloss: Do you think we’ll ever get to a time where datasets, clean datasets, kind of look like movies that we get from Netflix and the ability to have a system of streaming high-quality data?
Matthew Trunnell: The streaming is an interesting point. I believe that it is true that in some areas we will benefit from thinking about data not as sets but as time-dependent streams. And there’s a lot of reason for that. I mean, our ability to produce data is going up so fast, but we keep looking at things as static datasets, which is the NIH view. That’s thinking in a backwards way rather than thinking in a forward way. We’re going to have data spewing out of these various sources, and we do now. I think the thing that’s going to keep slowing us down in biomedicine is our relative inability to deal effectively with clinical data.
Stan Gloss: Why is clinical data a problem?
Matthew Trunnell: There’s traditionally been a pretty big gap between the research side and the clinical side. There’s great interest in bringing those areas closer together and building more of a “learning health system.” And this is why the position of Chief Research Information Officer was created: to try to bridge those two domains.
The challenge is that as a data producer, the health system sees very little value in participating in a broader ecosystem.
That’s a general statement, but hospital leadership is focused on business metrics on a five-year horizon, and there hasn’t been sufficient demonstration of the value of a learning health system.
Some are starting to see opportunities with Google and others to look at their data as an alternative source of revenue. And that’s of direct interest to the hospital leadership. Hospitals are crazy businesses with razor-thin margins, so if they can find novel sources of revenue, that’s a win. When a group like Google says, “Give us your data, we’ll do other work on it. You don’t have to do any extra work. Just give it to us and we’ll give you some money,” that’s appealing to some. I believe that until hospital systems and healthcare enterprises get better about data, we are going to be hampered in our ability to drive all of the great advances on the research side into the clinical side.
Stan Gloss: I’d contend that the Coronavirus infection rate is a prime demonstration of the gap between research and clinical.
Matthew Trunnell: I would argue that too. Trevor Bedford, who is at Fred Hutch and has been leading COVID-19 analysis efforts since the first deaths here, has become a statistical spokesperson for coronavirus. He was already relatively geared up because this is the second year of the Seattle Flu Study, an effort to study the propagation and mutation of the flu virus. All of the analysis pipelines, sequencing, and sample collection are already in place, which is great.
But we still don’t have any of the clinical data. How can we expand beyond our local health system to be pulling in the data we need from all of the other Seattle healthcare systems? And that’s a big problem.
Healthcare data is the prime example of data that was not collected for secondary use, right? You collect data in a clinical setting for the sole purpose of treating that patient and that’s reasonable. That’s the mission of the health system.
Apart from efforts around quality improvement, there’s very little systematic reuse of those data and I think that’s a huge opportunity.
That’s one of the areas that I’m pretty excited about.
Stan Gloss: So we have researchers and clinicians that work on one question or patient at a time rather than thinking holistically?
Matthew Trunnell: I think that’s true and that’s not an unreasonable research approach. The thing that we really haven’t seen come to penetrate healthcare and life sciences is this so-called fourth paradigm of science, which is data-driven hypothesis generation. And I think it will have a big impact, but you have to have the data in order to do it, and it’s a really interesting challenge to try to talk to an ethics review board about the value of data-driven discovery when our whole system of human subjects research is focused on protecting individuals. Data-driven discovery is a really hard sell.
Stan Gloss: If you could go back 15 years and advise yourself with the knowledge that you have now, what advice would you give yourself?
Matthew Trunnell: Good question.
I think 15 years ago I wasn’t thinking data. I was thinking storage. I was thinking of data as nothing more than a collection of objects and I didn’t care what the objects were, and if I had been on this data bandwagon 15 years ago, I think I could have had more impact.
Certainly we would have gotten the Broad to a place closer to where it is now much sooner.
Stan Gloss: Right, I know. It’s kind of like back then people thought of data as grains of sand, and it’s almost like we have to change the perception so that a piece of data is seen as a seed and not sand. It’s something that with nurturing could actually grow into something.
Matthew Trunnell: Yeah, I like that.