Stan Gloss, founding partner at BioTeam, interviewed Saira Kazmi, Executive Director of Enterprise Data Engineering and Analytics at CVS Health, about creating data assets and managing the data life cycle at the enterprise level for a major healthcare player. This article was originally published by Bio-IT World in March 2022.
As Executive Director of Enterprise Data Engineering and Analytics at CVS Health, Saira Kazmi draws on a background in computer science and engineering spanning both academic and industry positions. Industry taught her the importance of iterative testing under the tightest timelines, and her academic career gave her a first-hand view of the engineering rigor required to enable bioinformatics. Now she is focused on creating data assets and managing the data life cycle at the enterprise level for a major healthcare player.
Kazmi recently sat down with Stan Gloss, founding partner at BioTeam, to discuss her vision for data engineering and implementation in her new role. Bio-IT World was invited to listen in.
Editor’s Note: Trends from the Trenches is a regular column–and now podcast–from BioTeam, offering a peek into some of their most interesting case studies. A life science IT consulting firm at the intersection of science, data, and technology, BioTeam builds innovative scientific data ecosystems that close the gap between what scientists want to do with data—and what they can do. Learn more at www.bioteam.net.
Stan Gloss: How did you arrive at CVS Health? How has your background in both academia and industry shaped your philosophy around data engineering?
Saira Kazmi: My graduate studies involved bioinformatics research. The goal was to enable computational and molecular biologists to make sense of large amounts of data generated from multiple sources and extract biologically relevant insights. I loved the engineering work, but I realized I was not fond of the research publication process, so I started to explore opportunities in industry and joined Thomson Reuters as a lead engineer in a hands-on engineering and delivery role. The application was developed in Java, used the Hadoop MapReduce framework, and had a front-end visualization and search component. This full-stack experience helped me understand the complexity and the rigorous process behind a live production system with tens of thousands of users. I owned the product end to end, from design to production deployment, with the help of a cross-functional team with both onshore and offshore members.
The engineering work that makes sure an application works seamlessly for a large user base was exciting and rewarding. I learned about the processes that need to be in place to ensure that system issues and errors are tracked and recovered from gracefully. You have to make sure any system changes are tracked and reversible. Testing becomes an essential part of the software development process, and it becomes more and more rigorous as the user and business impact of the system in production grows. In some cases, you are managing SLAs of a few milliseconds and have only a few hours of downtime to recover and be back online after any changes or failures.
Since then, I’ve been in several roles in the data engineering space, ranging from data management and architecture at The Jackson Laboratory to building the computer vision platform that evolved into the Machine Learning Center of Excellence at The Hartford. These roles allowed me to think about the big picture and understand the value of data as a core asset for an enterprise.
Where do you see differences between academia and industry?
Industry is very different: pipelines are deployed in production rather than built, as in research, to prove a concept or a theory. The quality of the code, the testing, and the amount of investment are, as a result, vastly different. You have to think about the customers and how they will use the system. The big change from academia to industry is about building resilient systems and managing and recovering from failures. You can see the direct impact of your work on business value, which is very rewarding. In research, that feeling can be lost because engineering may not be the most essential or valued part of the work.
Have you seen that difference play out as academic research being more project-oriented, whereas on the commercial side data is the product?
Well, industry also works in a project space. All work is funded through a business need or a use case. But you’re right: in a large enterprise, foundational data assets are leveraged across multiple projects. These data assets must be managed as independent assets, or products. This avoids duplication of effort and investment and provides a single source of truth, or master data, for the organization.
For example, think about the customers of an organization. You need foundational data elements that are consistent across all touchpoints within the organization. The data matures, and the quality improves when used and validated across different use cases. A master data asset may be developed and matured, providing great ROI and business impact.
If we say data is a product, we’re ultimately saying data has a life cycle. What does that mean to you?
I think all data has a life cycle regardless of whether it is managed as part of a project or as a product. For a project, the data life cycle begins and ends with the narrow value proposition of the project. When you think about data as an asset, the life cycle encompasses the people, processes, and tools, starting from data generation: instruments, IoT devices, web interfaces, applications, and other users and systems, for example. The data can only be used effectively if data and metadata are tracked at the right grain across all storage systems and hops. For example, if data originated from an instrument, we would want to record who was running it. What was the date of collection? Which software version was installed and used? That information is important and must be kept for downstream analytics and business understanding.
Other important data points—reference data, glossaries, or business terms—give meaning to physical data assets. This information becomes the core driver for analytics, business process understanding, and decision making. Apart from these elements, tracking business feedback, operational metrics, and insights makes this a well-rounded asset resulting in reliable, trusted, validated business decisions.
Another management component is a strict metadata capture policy and data governance program. These policies and programs allow regulated industries to meet audit and compliance needs—for example, requirements to delete the data related to a customer. Metadata capture lets you track and know where your data lives, who has access to it, and the acceptable use cases, and it gives you the ability to archive or delete the information when needed for business and compliance reasons.
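To make this concrete, here is a minimal sketch of the kind of provenance and governance metadata Kazmi describes being captured alongside a dataset. The field names, structure, and values are illustrative assumptions for this article, not CVS Health's actual schema or tooling.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

# Illustrative only: a minimal metadata record for a dataset produced by an
# instrument run, combining provenance fields (who, when, which software)
# with governance fields (storage location, access, acceptable use, retention).
@dataclass
class DatasetMetadata:
    dataset_id: str
    source_system: str        # instrument or application that generated the data
    operator: str             # who was running the instrument
    collection_date: date     # when the data was collected
    software_version: str     # software installed and used at generation time
    storage_location: str     # where the data currently lives
    access_groups: List[str] = field(default_factory=list)        # who has access
    approved_use_cases: List[str] = field(default_factory=list)   # acceptable uses
    retention_policy: str = "archive-after-7-years"               # hypothetical default

# Example record with entirely made-up values:
record = DatasetMetadata(
    dataset_id="seq-run-0042",
    source_system="sequencer-lab-3",
    operator="j.doe",
    collection_date=date(2022, 1, 15),
    software_version="basecaller 2.4.1",
    storage_location="s3://raw-zone/seq/0042/",
    access_groups=["bioinformatics", "data-engineering"],
    approved_use_cases=["variant-analysis"],
)
```

A record like this, kept with the data at every hop, is what allows the downstream audit, access, and deletion questions described above to be answered.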
You talked about data as an asset. Is there a key philosophy around making data an asset for an organization?
It has to do with leadership philosophy and sponsorship. Some leaders focus on the immediate use of the asset, aligning it to a specific use case or project. Others take a long-term view, making additional investments early on to build foundations that scale and accelerate future needs. If you don’t have leadership buy-in and vision for the long term, you may end up in the project-focused space where assets are built for a single purpose. The project-focused view could be the better option in the innovation space or in the case of constrained funding.
How do you make that case?
Data science teams are very well positioned to help with that. Suppose you are a data science or research team leveraging the same data across multiple projects. You are very well positioned to make the case to the business to think about this holistically rather than managing duplicate copies from the source system and never contributing insights back to the core asset. Thinking about data as a product or an enterprise asset does require more funding, more governance, and more investment early on. Still, if it is tied to ROI or business value across the organization, it gets traction and, in return, accelerates the innovation and agility of the organization.
You’ve mentioned that there are five key enablers to making data an asset. Can you tell me about those five areas?
Sure. First is setting conventions and standards across your organization, defining data management best practices, setting minimum metadata capture requirements, and establishing standard naming and formatting practices through a governing body for each domain.
The second one is defining standards around data transformations. This allows for lineage mapping from raw to conformed, published, or reporting layers. Documentation and consistency of the process are essential.
Third, having a metadata management system will allow data discovery across the enterprise. It becomes easier for innovation work and new projects to find the existing assets and the subject matter experts.
Fourth is having governance around the ingestion of new and third-party data and building accelerators to harmonize with internal data assets.
Finally, publishing data validation and quality control metrics will allow the users to understand the value or relevance of data for their needs.
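As one hypothetical illustration of that last enabler, publishing quality control metrics can start as simply as computing and recording a few summary statistics when a dataset is published. The sketch below uses pandas with made-up column names and metrics; it is a generic example, not a description of any particular platform or of CVS Health's practice.

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame, key_columns: list) -> dict:
    """Compute a few simple, publishable quality metrics for a dataset.

    Illustrative only: real programs typically track many more dimensions
    (timeliness, conformance to reference data, schema drift, and so on).
    """
    return {
        "row_count": len(df),
        # Completeness: share of non-null values in each key column
        "completeness": {c: float(df[c].notna().mean()) for c in key_columns},
        # Uniqueness: share of rows that are distinct on the key columns
        "key_uniqueness": float(
            df.drop_duplicates(subset=key_columns).shape[0] / max(len(df), 1)
        ),
    }

# Hypothetical usage with a tiny, made-up extract:
df = pd.DataFrame({
    "member_id": [1, 2, 2, None],
    "zip_code": ["06103", "06103", None, "02139"],
})
print(quality_metrics(df, key_columns=["member_id", "zip_code"]))
```

Publishing numbers like these next to the dataset is one way to let consumers judge, as Kazmi puts it, the value or relevance of the data for their needs.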
Great. You’ve touched on one very important word that is central to transformation for any product: process. How do you turn raw data into a usable product? What is that process?
I think it’s good to understand the users and their use cases. Raw data itself may be used as a data product, especially by data science teams interested in validating predictions and finding answers when anomalies are found.
Data that has been processed into a consumable form is valuable, but it will only be trusted if the metadata around the transformations is captured and available along with the data. Without it, the information is essentially unusable.
The ability to search for and find data with standardized business glossaries and identified subject matter experts makes it a reliable, validated, and certified asset for the organization.
This touches on some of my big fears moving forward. Everybody is talking about AI and machine learning, but maybe not appreciating the need for trustworthy data. How do we meet the needs of an end user who may be using AI and machine learning algorithms?
In my experience, the data life cycle and the AI life cycle go hand in hand. AI is not usable without trusted, high-quality data. Organizations are responsible for ensuring that the algorithms are not biased and that enough data is captured and available to support the accuracy of each prediction the algorithms make. For example, if you have an imbalanced class distribution, you will need to capture enough data to allow for high certainty when predicting the lower-frequency classes.
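As a generic illustration of the class-imbalance point, one common mitigation while more data on the rare class is being collected is to weight classes by inverse frequency during training. The snippet below uses scikit-learn's compute_class_weight on made-up labels; it is a sketch of the general technique, not a description of any model Kazmi or CVS Health has deployed.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Made-up labels with a heavily imbalanced class distribution (95% vs 5%):
y = np.array([0] * 950 + [1] * 50)
X = np.random.default_rng(0).normal(size=(1000, 5))

# Inverse-frequency weights give the rare class more influence during training,
# but they do not substitute for capturing enough real examples of that class.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.53, 1: 10.0}

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```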
Having data available with its lineage is very critical. An AI strategy cannot exist without a solid data management strategy. I have never deployed an algorithm without understanding the complete data life cycle from the source system to consumable formats.
In practice, businesses often value an augmentation framework over fully automated prediction services. This way, business experts have the best data to support their workflows and decisions.
Are we at the point where AI is mainstream?
Yes. We are at a point where AI and ML can be reliably used in production use cases. In the last four or five years, I’ve seen a drastic change from AI/ML use in novelty use cases to supporting and optimizing critical business operations.
What gets you most excited about working in this space, and what makes you get up every morning and go at it?
I feel very lucky to be in this space because there’s so much innovation and creativity. I also see the opportunity to make an impact; in healthcare, for example, there is an opportunity to improve the quality of people’s everyday lives. I feel blessed to be on this path to make a difference.