Applying AI to Real-World Data to Improve Pharmaceutical Development

Stan Gloss, founding partner at BioTeam, interviewed Sean Liu, Director at Novartis, about how artificial intelligence and machine learning will continue to impact pharma and healthcare, especially through application to real-world data. This article was originally published by Bio-IT World in January 2022.

Xiong Sean Liu is a Director at Novartis where he has been working to make inroads for AI in pharma. In Liu’s view, AI as a technology is already successful, but he also believes it is still in its early days for healthcare and there are still significant challenges for use. Data organization, for example, must be solved so that AI models can be put to best use. Liu and his team recently explored how electronic health records could be better populated and curated to facilitate clinical trials.

Liu recently sat down with Stan Gloss, founding partner at BioTeam, to discuss how AI and ML are going to continue impacting pharma and healthcare, especially through application to real-world data. Bio-IT World was invited to listen in.

Editor’s Note: Trends from the Trenches is a regular column from BioTeam, offering a peek into some of their most interesting case studies. A life sciences IT consulting company at the intersection of science, data and technology, BioTeam builds innovative scientific data ecosystems that close the gap between what scientists want to do with data—and what they can do. Learn more at

Stan Gloss: Tell us a little bit about the work you do at Novartis.

Sean Liu: I’m currently a director of data science and AI at Novartis and part of the Novartis AI Center. My focus is on developing AI techniques to enable the optimization of drug development, from early discovery to clinical trials and through real-world evidence and post-marketing research. We try to identify business opportunities but also develop new AI techniques that could help enable solutions to optimize the whole process. I’m quite interested in using real-world data to optimize clinical trials and the whole cycle.

What does an approach to AI analysis of real-world data look like? What’s the philosophy around making this transition?

I think there are two sides to the problem. One is a business problem. Before we apply AI, usually the business people have something in mind that they want to achieve. On the other side, the AI technology is booming. Every other day you hear about a new type of AI technology. How do we reconcile these two sides?

I always like to work with domain experts. I want somebody to complement my expertise. My background is in data science and AI, so it is no problem for me to keep track of trends, know what’s going on, and even do hands-on work collaborating with my team. That has been a natural thing for me over the past decade.

When I’m trying to solve the next big problem, I always want to know the issues. Real-world data is mainly anything outside the traditional randomized controlled trials, such as electronic health records, national health surveys, and claims data. Those are very familiar to outcome research people or epidemiology researchers, but historically, the pharma clinical trial and the epidemiology folks do not necessarily link together because they are in different settings. We need to form partnerships between different domain experts to facilitate this transition.

What kinds of AI applications are being impacted by real-world data?

We recently published an article (Drug Discovery Today, DOI: 10.1016/j.drudis.2020.12.013) reviewing AI in drug development. In the different stages of drug development—early discovery, clinical trials, and post-marketing research—we see more and more applications trying to target and use electronic health records (EHRs) to inform drug discovery. We surveyed the past 20 years of research and found that a lot of EHR data has been applied in the drug discovery stage for identifying biomarkers—not so much about novel targets but merely about the biomarkers.

One of my research topics is how to use EHR data to optimize clinical trials because there’s very little research on this. We found in our article that trial recruitment is a big use of AI and RWD, how to better recruit patients that could benefit from trials.

When it comes to our post-marketing research, a natural application is adverse events reporting, because FDA requires every drug developer to report adverse events in the post-marketing research stage. Historically that’s been very manual work: people reading a lot of documents and manually generating those kinds of reports. But now with techniques like natural language processing (NLP), AI can help humans scan tens of thousands of records to quickly find information about the adverse events and facilitate the timely generation of a very nice report. These are high-level applications, which are popular, of course.

Can you talk about the challenge of translating data from a system designed for patient records and not necessarily to support research? How do you make the jump from medical record to valuable data for drug discovery?

Let’s first talk about the content. In EHRs, there’s a lot of demographic information and prior history and medication information, and those could be in structured or unstructured formats. The unstructured format is usually referring to the doctor’s clinical notes, like discharge summaries. Those notes usually contain a lot of information about how people respond to the same type of treatment, but they’re not easily processed by humans because they are in a heterogeneous format. There are pros and cons: the EHRs have a lot of information about patients and their medical history, but the format is very limiting.

Now the trend is that there are certain EHR vendors, like Flatiron and Optima, that do a lot of basic work integrating different heterogeneous records from different hospitals or medical systems into one bigger system that we can use to do analysis. There are also national and international trends toward sharing real-world data. For example, the Observational Health Data Sciences and Informatics (OHDSI) consortium is working to define data standards for structuring EHR data. If we exchange data between different organizations, what’s the common data format? These kinds of efforts help resolve data formatting and data sharing issues.

EHRs have good content, and there are facilitators to enable sharing of the data. The timing is right for the establishment of these AI models that can be used to study the specifics of drug applications.

You are still dealing with humans in the process. How do you deal with inconsistencies around, say, ICD-9 codes being entered or not entered?

Those ICD codes are very useful to help identify patient cohorts for a disease indication. But in reality, you’re right, many times there is no ICD code there. Maybe in the doctor’s notes there’s mention of the diagnosis or medical history, but it simply was not coded. One research topic in the academic field now is to rely on NLP techniques to process medical notes and classify patients by disease conditions. This could potentially complement ICD-9 code inconsistency.
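To make the idea concrete, here is a minimal sketch of how free-text notes might be scanned for diagnoses that were never coded. It is an illustration only, not the method Liu's team uses: the keyword patterns, the simple negation rule, and the function names are all hypothetical, and a production system would rely on a trained clinical NLP model rather than regular expressions.

```python
import re

# Hypothetical keyword patterns per condition (illustrative assumptions,
# not a validated clinical vocabulary).
CONDITION_PATTERNS = {
    "type 2 diabetes": re.compile(r"\b(type\s*2\s*diabetes|t2dm)\b", re.IGNORECASE),
    "hypertension": re.compile(r"\b(hypertension|htn)\b", re.IGNORECASE),
}

# Very rough negation rule: a negation cue earlier in the same sentence.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b[^.]*$", re.IGNORECASE)


def conditions_in_note(note):
    """Return the set of conditions mentioned (and not negated) in a note."""
    found = set()
    # Split into sentences so a negation only affects its own sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        for condition, pattern in CONDITION_PATTERNS.items():
            match = pattern.search(sentence)
            if match and not NEGATION.search(sentence[:match.start()]):
                found.add(condition)
    return found


def uncoded_conditions(note, coded_conditions):
    """Conditions mentioned in the note that are missing from the coded list."""
    return conditions_in_note(note) - set(coded_conditions)
```

For example, calling `uncoded_conditions` on a note that mentions "HTN" for a patient with no coded diagnoses would surface hypertension as a candidate for review. The output is a list of candidates for a human to check, not a replacement for the codes themselves.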

Right. But again, clinicians take notes not to support research but to support patient care. There’s no real incentive for the clinician to make things compliant for your research. The onus is on you to extract the best information you can from the medical record.

Pharma is now emphasizing patient-centric research. Previously when people thought about the pharmaceutical industry, they thought about molecules and clinical trials, but probably less about the patients. Now there’s been a big shift. Every pharma now is talking about putting patients in the center of everything we do. Now everything is patient centric.

But you’re right. The health providers’ focus is on the patients, their specialty clinics, how to prescribe medicine, etc. But now there’s also, at a high level, collaboration between pharma, payers, and providers. There are organizations that do social networking and many physicians participating in clinical trials. I think from an educational perspective, those providers or physicians are quite aware of these kinds of collaborations with pharma.

But how do we enable AI? EHR vendors are helping. Of course it’s commercial with licensing, and there are limitations to data accessibility because there’s always personal confidential information, but because these vendors have internal data science teams to do these activities, they have taken care of much of the work.

The result is a very analyzable database with very structured data. Clinical notes are usually not very accessible. The vendors tend to keep that data to themselves because of confidentiality. Those databases could facilitate a lot of very basic research questions.

Can’t they go through and help you get the data that you need de-identified and strike a balance between the data you need and the privacy of the patient?

The providers definitely need to do de-identification following regulations. When it gets to the hands of the analyst, those data have already passed the ablation stage. In a sense, this is collaborative work. I mean the AI modelers and data scientists don’t necessarily do things from scratch because the providers have done their part. There are also open source EHRs like MIMIC-III, which is accessible as long as people have completed the necessary training. Those databases are also de-identified, so every data provider must have done de-identification to release their data.

One trend that I’m seeing in the pharmaceutical space is this concept of digital companions that help the patient track medication usage. You could imagine taking an antidepressant, and the companion asks how you are feeling. Or maybe with a psoriasis drug, it reminds you to take your medication and also snap a picture of a lesion on your forearm. You could imagine combining all of this data together to create a better digital experience, more than just taking a medication.

That’s absolutely true. That fits into the bigger trend called digital health. Digital health is also part of real-world data. Patients get all kinds of assistance. They get sensors, wearables, and apps to facilitate their medication adherence and monitoring. This is also a big trend, and I think every pharma is looking into this space.

So that’s another source of real-world data. It’s not just the electronic medical record, it’s the patient, the wearables, and the internet of things approach. This could be a very data rich area for work, right?

Yes, exactly. Along with that, there are digital biomarkers and patient-reported outcomes. This opens up a lot more opportunity for technical people to participate because traditionally those electrical and computer engineering scientists worked on mobiles or on apps, but not necessarily those linked to health applications. Now there are grand opportunities for them. Especially during the pandemic, a lot of things have needed to be done remotely, which has accelerated growth in this space.

It sounds like there’s the formation of these integrated multidisciplinary teams, that people aren’t just at the periphery of these initiatives, but they’re actually connected and work together as a team.

That’s absolutely true. Now there’s a multi-disciplinary research environment, and there are also more technical people willing to participate in the health domain. It makes recruiting or collaboration in healthcare even better than before.

Maybe we can talk a little bit about the applications and the technologies with AI. It sounds nice, but it can’t be as easy as it sounds. How would you characterize the state of AI from your perspective?

AI is still probably in the early stages for healthcare. There’s a lot of success and great news coming out, like with image recognition being better than the radiologists in diagnosing cancers. But there are also a lot of challenges: how do we generalize one finding across all patients, all disease areas? We are far from there. How do we even organize the data to enable these AI models? A lot of data engineering work has to be done first. People commonly say that making the data ready for machine learning probably takes up 80% of the time, and the other 20% is for the real work on algorithms, modeling, and model development. There is a very bright future because every industry sector has realized the potential for AI to optimize their processes, and there is a lot of capital funding in those areas.

I think one key is how to land AI. People are talking about landing AI in business settings. That’s a challenge because I think on the algorithm side, there’s been a lot of advancement since the start of big data. For example, deep learning is an advanced version of machine learning that can better recognize the patterns embedded in large data with higher accuracy, so it’s a better pattern recognizer.

If we feed the machine a sufficient amount of data with signals in it, then those deep learning algorithms are most likely to capture the patterns. But how do we interpret the results? How do people trust the results from one study and scale? Those are some of the challenges. People need to invest time and energy.

One of the other challenges with AI and machine learning is the potential to introduce errors. You can have batch effects, for instance if you only acquired your data from a wealthy population. What do you have to think about to prevent these errors?

That’s a great question. I think one way to do that is through causality modeling because many studies are based on a limited dataset to draw conclusions. Those are usually about correlation, not necessarily causality. In the real-world data space, people are thinking about using causality to integrate all kinds of heterogeneous data or possible conditions and try to infer drivers. They are integrating heterogeneous data types, whether it’s clinical data, imaging, omics data, or text-based data. Those could all fit into multi-modality models. Then people can elaborate causal inference techniques like causal diagrams that infer relationships among datasets.

You can imagine asking an algorithm to analyze samples from pathology imaging data, but maybe we didn’t catch that somebody was circling areas of the actual image or there was some metadata that leaked in, like a watermark. All these things have to be considered when you’re trying to set up your analysis. Someone told me that they did an analysis and achieved 99% results matching and knew immediately that something was very wrong; the too-good result was not a good result. If it’s so easily picked up, then we shouldn’t have been doing machine learning at all.
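One simple way to look for the kind of leak described above is to ask whether any single piece of metadata predicts the label almost perfectly on its own. The sketch below is a hypothetical illustration: the field names, the 0.95 threshold, and the helper functions are assumptions made for the example, not part of any study mentioned in this interview.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(values, labels):
    """Accuracy of guessing, for each feature value, its most common label."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    # For each distinct value, count the samples the majority guess gets right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


def flag_suspect_features(rows, labels, threshold=0.95):
    """Return metadata fields that alone reach the accuracy threshold.

    rows is a list of dicts sharing the same keys; threshold is an
    arbitrary illustrative cutoff.
    """
    suspects = []
    for field in sorted(rows[0]):
        accuracy = single_feature_accuracy([row[field] for row in rows], labels)
        if accuracy >= threshold:
            suspects.append(field)
    return suspects
```

If a field such as a hypothetical `has_annotation` flag is returned, the model may be learning the pathologist's markings rather than the tissue. The check is measured on the same data it is fitted to, so it only makes sense for fields with a handful of distinct values; a unique identifier would trivially score 100%.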

Yes. I’ve heard similar comments. Not everybody believes that AI and machine learning can solve everything. I think usually those people are very thoughtful and have very strong domain knowledge, and they like to think about how to build trustworthy AI. In fact, the topic of reducing bias in AI is being widely studied now. AI is like two sides of a coin: benefits and risks. I think it’s really up to the technology adopters and how they choose to pursue it, and also up to regulators, developers, and decision makers to set up the environment. I would suggest starting with something small. If you wanted to process clinical trial documents, previously a human read them page by page. Now if we apply NLP, we can quickly extract the key topics or trends within those documents, and humans do not have to read page by page.

I think those kinds of AI applications would be very helpful. As you can see AI also happens at many stages. It’s just like human cognition. The very basic one helps you to read, to see, and at the next level it helps you to reason. And then the third level is where there could be complete automation like introducing imaging robots into the scene. Another level is real human-like robots doing the work. I would encourage exploring at a very basic level to help humans read and see using images and text. Those are good starting places.

Could we fall back into data silos as we go into AI and machine learning? Trusting other people’s data in AI and machine learning is paramount. If you find datasets from the literature, but they didn’t capture certain information or perform the experiment a specific way, the data scientist may want to go back and generate a training set they believe in. How do we avoid that?

Great fundamental question. This probably requires several different perspectives. One is at the cultural level, promoting socialization between different research groups. Some groups, they do experiments, they have lots of data, but their capability to digest the data is probably limited versus the computer scientists, who know a lot of algorithms. The computer scientists know how to unlock the secrets, but they do not have access to the data or they do not have enough background to define a valid problem. Organizational leaders should facilitate the creation of this environment.

The second thing, when it comes to the trustability of data, I think we will always feel the need to do experiments for ourselves. If you’ve got open source data from the literature, from GitHub, then you can test on that. But probably once you get to the hands-on work, sooner or later, you will realize limitations, just like you mentioned. When that happens, we probably have to ask ourselves to what level we can leverage the existing data. Maybe this portion is exactly what I’m looking for. Then some data fits in the middle, and some is totally unusable. The next question once you’ve made this assessment, is whether it’s worth the cost to generate more trustable data. If you are able to generate data yourself, great, just do it. But if you can’t, then you’re back to collaboration, and you have to look for somebody to get this data for you.

So where do you think we’ll be three to five years from now? Are we going to be better off?

From my personal perspective, AI is already a big success, and it hasn’t been systematically applied yet. It’s just beginning in the healthcare domain. I would say in three to five years, if we are able to apply today’s AI technology algorithms on the data we could potentially collect, there are plenty of problems that could be solved. I would say it’s definitely worth a lot of AI investment to really get hands dirty. Let’s try existing techniques on the data we have, see what patterns we can find, and see how we can optimize existing problems.

I’m optimistic. On the other hand, I think there’s also more to be done with knowledge sharing and training the culture. Some people have misconceptions about AI; some are totally optimistic and think AI can simply work magic. We should also improve AI education and open up the black box to see what’s inside. Many people just read news articles and commercial success stories, and then they tend to believe AI is going to solve everything. That’s not an ideal way to think of AI. We need to find a balance and promote the public’s understanding of what AI can do and where it is being used. Make it a common practice. But think of it like an X-ray: it can help you diagnose, but it does not solve your disease problem. AI has this limitation as well. It helps you to better understand the data, diagnose, and see trends, but it’s not necessarily a panacea. We have to invent other major complementary tools.


