Big Data vs. Alzheimer’s

Stan Gloss, founding partner at BioTeam, interviewed Sudeshna Das about how MGH is collecting and working with enormous datasets in the fight against Alzheimer’s. This article was originally published by Bio-IT World in December 2021.

Sudeshna Das graduated with a degree in electrical engineering, paired it with computational biology graduate studies, and followed a trajectory via industry (Millennium, now Takeda Pharmaceuticals) into her current faculty position at Massachusetts General Hospital (MGH) and Harvard Medical School, where she directs the data core of the Massachusetts Alzheimer’s Disease Research Center (MADRC).

Das recently sat down with Stan Gloss, founding partner at BioTeam, to discuss how MGH is collecting and working with enormous datasets to push back against Alzheimer’s. Bio-IT World was invited to listen in.

Editor’s Note: Trends from the Trenches is a regular column from BioTeam, offering a peek into some of their most interesting case studies. A life science IT consulting firm at the intersection of science, data and technology, BioTeam builds innovative scientific data ecosystems that close the gap between what scientists want to do with data—and what they can do. Learn more at

Stan Gloss: Tell me a little bit more about your current role.

Sudeshna Das: My role as the data core leader at the MADRC, led by Dr. Bradley T. Hyman, is to make sure all the data is collected and transmitted to the national coordinating center, which includes data from over 30 US-based Alzheimer’s centers. Then we make sure all the data is available for researchers to work with. These data tend to be multimodal: there’s clinical data, biomarker data, imaging, brain autopsy data, and various other types. There are a lot of challenges to making sure that those different data types can be integrated and analyzed and varied. In my laboratory, our mission is to develop and apply integrative computational methods to advance biomedical and brain research. We see data science as a key element of interdisciplinary science.

Can you elaborate a bit more on what you mean by multimodal?

The data that’s generated as part of our Alzheimer’s center is just one glimpse of the different kinds of data that I work with. We work with a range of data types that largely falls under three categories. First, a lot of data is generated in the laboratory as part of basic neuroscience research. That data can be from cell lines, from animals, or from human brains, blood, or CSF. We have one of the largest brain banks. This laboratory data tends to be multidimensional. There’s a lot of genomics, transcriptomics, proteomics, and other kinds of -omics data. That’s the lab-generated, basic research data.

Then there are clinical data. Clinical data are collected from participants in a study. These may include determinants of health (their age, sex, socioeconomic status, race, and ethnicity) and information about their clinical diagnosis and comorbidities. Then we have laboratory tests, blood tests, imaging, etc. Many of our participants donate their brain for research, so some of the data could come from postmortem tissue. A lot of the clinical data comes from our Alzheimer’s center’s longitudinal study, where we’ve been following participants, sometimes for up to 15 years, and collecting their data. They may have joined when they had normal cognition and now have mild cognitive impairment or dementia. So, we have a lot of follow-up data on our participants. It’s very deeply phenotyped, rich data that we collect.

Finally, a lot of data also comes from our hospitals, and this is real-world data. Over six million patients have been through our Mass General Brigham hospital system, which provides a really rich data source for studying Alzheimer’s disease and other diseases. These datasets need a lot of work to transform them from their native state to features that can be used in studies.

Okay wow, you have lots of information from many different sources. Have you come up with some method of tagging all the data with metadata?

For the clinical data that comes from our structured longitudinal studies, all participants have, or should have, GUIDs so that we have a unique identifier. The GUID is generated by the National Institute on Aging (NIA), which funds our center. Using those GUIDs, we can join data collected from the physician’s evaluation to biomarker data and to imaging data. The goal is to have a global identifier so that if participants move from one study to another, we can track the information.
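To make the GUID join concrete, here is a minimal sketch in Python. It is not MADRC's pipeline: the function name `join_by_guid`, the `guid` field name, and the sample modalities are assumptions chosen for illustration. It groups rows from several sources under one participant identifier and sets aside rows that lack a GUID, since participants "have or should have" one.

```python
from collections import defaultdict


def join_by_guid(sources):
    """Group rows from several data sources by participant GUID.

    `sources` maps a modality name (hypothetical, e.g. "clinical") to a
    list of row dicts. Each row is expected to carry a "guid" key.
    Returns (merged, unlinked): merged[guid][modality] is a list of rows,
    and unlinked holds (modality, row) pairs that had no GUID.
    """
    merged = defaultdict(lambda: defaultdict(list))
    unlinked = []
    for modality, rows in sources.items():
        for row in rows:
            guid = row.get("guid")
            if not guid:
                # No identifier: keep the row aside for manual follow-up.
                unlinked.append((modality, row))
                continue
            record = {k: v for k, v in row.items() if k != "guid"}
            merged[guid][modality].append(record)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {g: dict(by_mod) for g, by_mod in merged.items()}, unlinked
```

Keeping a list per modality, instead of a single row, reflects the longitudinal design: one participant can have many visits, scans, or biomarker draws under the same GUID.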

Okay, you have genomic data, brain images, I assume digital pathology images from postmortem studies. Are all these data FAIR (findable, accessible, interoperable, and reusable) compliant?

I mentioned the three types of data. The laboratory research data are all very much under control; journals usually mandate that you share your data if you publish with them.

Everything that is collected by the NIA-funded Alzheimer’s center is FAIR compliant, because FAIR compliance is mandated by the funders. A lot of work goes into coming up with the structure of the data. We have to transmit the data to the National Alzheimer’s Coordinating Center (NACC) run out of the University of Washington. But the real-world data you’re not allowed to share. It’s very sensitive, protected health information. We don’t even know how to share that data beyond our organization. Even within our organization, investigators have to be part of a research protocol that allows access to the data. It needs training and so on. So, the real struggle is to be able to take that electronic health record (EHR) data and think, can we share it? If we share it, how? That’s really ongoing work.

Is the curation step for Alzheimer’s center data something that you do, or is the curation being performed by the national center?

Every center is responsible for their own curation, but it’s very structured. So, there are electronic data capture forms that we’ve implemented in REDCap. All the data gets entered in a structured way in fields. It’s designed to be structured and curated from the start.

It sounds like you really have focused on making sure all the data is structured.

That didn’t happen overnight! Actually, when I took over about three or four years back, that was the goal. I have a deep background in informatics, so I made sure that the data was linked from the different cores of the center. We made sure that for every datapoint, there’s a common identifier being used to track whether a participant is involved in another study. It took a lot of time to define that metadata. For example, to link our Alzheimer’s center’s clinical diagnosis and cognitive testing data with imaging, we curated metadata. The metadata included fields such as the person, the date the imaging was performed, and the kind of image—MRI or PET or other imaging—and who the PI of the study was so that we could work through data use and data sharing agreements if necessary.
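A rough sketch of what such an imaging metadata record could look like follows, in Python. The class name, field names, and the 180-day linking window are all assumptions for illustration; the interview names only the kinds of fields (person, imaging date, image type, PI), not a schema or a linking rule.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ImagingRecord:
    participant_id: str  # common identifier shared across the center's cores
    scan_date: date      # date the imaging was performed
    modality: str        # kind of image, e.g. "MRI" or "PET"
    study_pi: str        # PI to contact about data use and sharing agreements


def nearest_visit(
    scan: ImagingRecord,
    visits: dict[str, list[date]],
    max_days: int = 180,  # hypothetical window, not a value from the interview
) -> Optional[date]:
    """Return the clinical visit date closest to the scan, or None.

    `visits` maps participant_id to that participant's clinical visit dates.
    A visit further than `max_days` from the scan is not considered linked.
    """
    candidates = visits.get(scan.participant_id, [])
    if not candidates:
        return None
    best = min(candidates, key=lambda d: abs((d - scan.scan_date).days))
    return best if abs((best - scan.scan_date).days) <= max_days else None
```

Carrying the PI on every record means the question "whose agreement do we need to share this scan?" can be answered from the metadata alone.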

Would you say the data curation was a significant effort?

Yes. Right now, it’s all structured, and we’ve defined it, but the process of getting there was, yeah, painful. The retroactive data needed all this mapping, which was a lot of work, because we tried matching with names, for example. But we had to come up with fuzzy algorithms for matching names, because names are sometimes spelled differently… O’Donnell may or may not include an apostrophe, for example. So, we used a combination of name and date of birth, but again, we had to come up with rules for fuzzy matching, because there are lots of data entry errors. Then, we do a manual review of the whole set of approximate matches.
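The matching approach described above can be sketched roughly as follows in Python. This is not the center's actual algorithm: the similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, the one-character tolerance on date of birth, and all function names are assumptions made for illustration. Names are normalized so that "O'Donnell" and "ODonnell" compare equal, and near matches are routed to manual review.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip accents, and keep ASCII letters only."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(
        ch for ch in decomposed.lower() if ch.isalpha() and ch.isascii()
    )


def name_similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def dob_distance(a, b):
    """Count differing characters between two ISO dates (YYYY-MM-DD)."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def classify_pair(rec_a, rec_b, name_threshold=0.85):
    """Label a record pair as 'match', 'review', or 'no_match'.

    The threshold and the one-character date tolerance are illustrative
    assumptions, not values taken from the interview.
    """
    sim = name_similarity(rec_a["name"], rec_b["name"])
    dist = dob_distance(rec_a["dob"], rec_b["dob"])
    if sim == 1.0 and dist == 0:
        return "match"
    if sim >= name_threshold and dist <= 1:
        return "review"  # approximate match: send to manual review
    return "no_match"
```

Pairs labeled "review" correspond to the approximate matches that get checked by hand; tightening or loosening the threshold trades reviewer workload against missed links.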

If another project was starting similar to the one you’re involved in and they came to you for advice, how would you guide that researcher in the beginning phases?

People always jump straight to “How am I going to create my database?” I ask them, “What is it that you want to use later on?” Focus on that part. What elements of the data are you interested in capturing? Then, why is it that you’re interested? Really think about why you want to capture that piece of data and how you will use it, rather than going straight to “Am I going to do this in Access or REDCap or whatever?” Thinking deeply about what data you want to capture and why and how you’ll use it later should define what you want to capture.

There are a lot of problems when people generate datasets but only think about their initial use of the data. Do you have situations where you have to think downstream, because there are secondary and tertiary consumers of your data?

Definitely. Certainly the more you can think longer term about it, the better you are positioned to deal with what comes a year from now or two years from now. In terms of the basic research data, the ‘omics data, the National Library of Medicine has created these databases to deposit your molecular array data and your sequencing data and so on. This was not the case when I graduated. Of course, at that time there was no ‘omics data available, but as that data came along, a lot of thought was put into that. I would say that data is very well curated. People nowadays have to think about and document their experimental metadata before doing the experiment. The data has to be deposited to GEO or the other sequencing archives and follow that format. With other types of data, that is something that is not standardized at all. People just started generating digital pathology images. Now we have committees being formed to create standards for not just the file formats but also for the metadata for each of the images.

Right. So, in your research, you have the data all FAIR compliant and purpose-built. Are you using AI and machine learning to try to get insights into this data?

We are a data science group, and we use a lot of machine learning and AI techniques. In a recent example of work led by Dr. Steven E. Arnold, we developed a prognostic plasma biomarker model for patients with mild cognitive impairment to predict their decline in five years. These kinds of models may be helpful, because it’s a blood draw rather than getting a spinal tap for CSF. The input to the model is your APOE genotype, the biggest risk factor for Alzheimer’s disease, your demographics plus the measurement of 10 plasma biomarkers, and it tells you the probability of decline in the next few years. These kinds of prognostic biomarkers help both the patients and their families to plan their future and for providers to manage their care.

It sounds like you’ve got everything really under control and working. Are you facing any real challenges in the work that you’re doing right now?

First, this is not just my or even my data science team’s challenge. It’s a lot of multidisciplinary researchers working together: neurologists, neuropsychologists, neuropathologists, epidemiologists, data statisticians, and data scientists. As I mentioned, we’ve also started working with EHRs, i.e. real-world data. Most of our challenges are with that data because a majority of the time is spent cleaning and transforming the data into usable features that we can put into our model. Training and working with the algorithm itself is also very challenging, because any AI algorithm is only as good as its training set, and you want to make sure that your training set is not biased.

Do you generate all your own training sets?

Yeah, we’re working with our healthcare records, but I work a lot with epidemiologists. Dr. Deborah Blacker is a geriatric psychiatrist at MGH and an epidemiologist at the Harvard School of Public Health, who is helping us to select the people to include in our training set. Again, often with this training, you can get into this hole where it works really well on your training set but is not generalizable to other health systems. We started reaching out to researchers at other universities to see if they can run our algorithm in their healthcare system and see how generalizable it is. Then, we may work with the Kaiser health system to see if it’s different for patients in more of a closed health system where they get all their care in that setting versus a specialty hospital like ours where they might come for their surgery but receive primary care somewhere else.

Are there other examples where the AI is uncovering novel insights?

Another example of AI work is using deep learning techniques on electronic health records, natural language processing on the clinician notes, combining it with other structured data to detect cognitive concerns in patients who come to our healthcare system, taking into account their other comorbidities, such as diabetes or hypertension, which are risk factors for dementia. Some patients have genotypes in our biobank. We’re trying to combine all these data to detect a concern and also use features of their interaction with the healthcare system.

For instance, do they have a lot of missed appointments? This can often be a sign if it’s combined with certain notes from their physician, such as the patient complaining about short-term memory loss. Doctors in our memory clinic have also noticed that if you have diabetes for example, the blood work might get really out of whack because the patient forgets to take their medications. So, out-of-whack blood work is often a signal.

We are using all of these data together in a big deep learning model to help identify those with concerns. Dementia is under-diagnosed, and there are predictors of that under-diagnosis, such as education, socio-economic status, and social connections. The goal is to screen these electronic health records to detect patients with cognitive concerns so that they can be referred to specialist care to manage their dementia. Although we don’t have effective disease modifying treatments yet, sometimes there are modifiable risk factors like your blood pressure, and you can make sure those are under control.

Is it hard to get data from the electronic health record?

Oh, sure, yeah. The focus there is on the clinical care of the patient and not on producing data for us researchers. So, certainly there’s a contrast with what we get from our Alzheimer’s center where the sole purpose of the diagnosis is really research and education. We don’t want to change that practice, but perhaps a recognition of the potential utility of EHRs for research would be helpful. One ER physician, when she was working with the data and saw the other research side, said, “Oh, I’m going to do a little better work in recording these things, because it might be useful for someone’s research.” Recording is often driven by billing, and a diagnosis may be recorded for billing purposes. But, for dementia, a non-specialist may not even enter any diagnosis (ICD) code, which is the structured data. A good PCP will enter in the charts that a patient is complaining of memory issues though.

Based on where you see your Alzheimer’s research now, should we be hopeful that we’re really starting to use technologies to uncover and better understand this horrible condition?

Yeah. One of our studies that we are working on is using electronic health records in the repurposing of FDA-approved drugs for dementia and Alzheimer’s. The idea is that we are performing in silico trials on our electronic healthcare records. We follow patients from when they started taking a drug of interest to when they developed cognitive impairment. Controls are patients with a similar risk who may be on a different drug, and we are really comparing whether the patients taking the particular drug have a lower risk of getting dementia than those that were not taking that drug. This is a promising approach, because these drugs have been on the market, and their safety profiles are well known. So, the time to bring them to the patients is a lot shorter than a clinical trial for a new drug.
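The core comparison in such an in silico trial can be illustrated with a deliberately crude sketch in Python. Everything here is an assumption for illustration: the `Patient` fields, the function names, and the use of an unadjusted incidence rate ratio. A real analysis would match or adjust for risk, as the comparison with controls "with a similar risk" implies, and would typically use survival models, not raw rates.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    on_drug_of_interest: bool   # True if taking the drug being evaluated
    years_followed: float       # time from drug start to event or censoring
    developed_impairment: bool  # True if cognitive impairment was recorded


def incidence_rate(patients):
    """Events per person-year of follow-up."""
    person_years = sum(p.years_followed for p in patients)
    if person_years <= 0:
        raise ValueError("no follow-up time in this group")
    events = sum(1 for p in patients if p.developed_impairment)
    return events / person_years


def crude_rate_ratio(cohort):
    """Unadjusted rate in the drug group divided by the comparator rate.

    A value below 1 suggests fewer events per person-year on the drug,
    but without adjustment it says nothing about causation.
    """
    exposed = [p for p in cohort if p.on_drug_of_interest]
    comparator = [p for p in cohort if not p.on_drug_of_interest]
    comparator_rate = incidence_rate(comparator)
    if comparator_rate == 0:
        raise ValueError("no events in the comparator group")
    return incidence_rate(exposed) / comparator_rate
```

Using person-years in the denominator accounts for patients being followed for different lengths of time, which is the norm in health record data.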

In the next three to five years, could we see potential new treatments for Alzheimer’s based on repurposing some kind of compound that we already know about?

I’ll just give you some examples. We did one study on patients with diabetes, led by Dr. Mark Albers, comparing metformin with sulfonylureas. We found a reduced risk of Alzheimer’s for people who were on metformin compared to those taking sulfonylureas. Then in vitro studies showed that metformin changes Alzheimer’s-associated gene expression in a different way than sulfonylureas do. So, this is very early research, but one example. We’re also working with BCG, which is actually a tuberculosis vaccine. Again, it was repurposed, and now it’s used in the US for treating bladder cancer patients. It’s a pretty effective drug for bladder cancer patients. There is some initial data from Israel that shows that it might reduce the risk of Alzheimer’s. Then, we are looking at a lot of rheumatoid arthritis drugs and other chronic inflammatory disease treatments to see if they can be repurposed for Alzheimer’s disease. Yeah, so very promising.

Just changing the slope of decline for something like Alzheimer’s could make a big difference. Could it be like people taking daily aspirin to reduce the risk of heart attack?

It could be, and we would have to do clinical trials. Initial EHR studies of metformin show a hazard ratio of 0.8, a 20% reduction in risk, but it’s not a cure, for sure. Alzheimer’s disease is defined by the two key pathologies of amyloid beta plaques and tau tangles. But there is this underappreciated metric of the metabolic health of the patient and the body’s reaction to inflammatory stress, which is related to general cardiovascular and cerebrovascular health, and this is where current FDA-approved drugs, either alone or in combination with Alzheimer’s pathology-directed drugs, may help delay the age of onset or slow the rate of decline.

Fantastic. If one is interested in participating in these studies, can they get involved?

Sure. If you go to, you can reach out to participate, and study involvement usually means coming annually for blood work and evaluation.


