Stan Gloss, founding partner at BioTeam, interviewed Tom Messina about leveraging cloud compute and storage resources to maximize innovation while minimizing costs. This article was originally published by Bio-IT World in December 2021.
Tom Messina has worked in various roles for Johnson & Johnson dating back to 1999 and is currently an IT Director in J&J’s Pharmaceutical R&D division, Janssen. BioTeam had a chance to speak with Tom recently to learn how J&J is leveraging cloud compute and storage resources to maximize innovation and growth while controlling costs.
Editor’s Note: Trends from the Trenches is a regular column from BioTeam, offering a peek into some of their most interesting case studies. A life science IT consulting firm at the intersection of science, data and technology, BioTeam builds innovative scientific data ecosystems that close the gap between what scientists want to do with data—and what they can do. Learn more at www.bioteam.net.
Stan Gloss: Tell me about what you do as IT Director at J&J.
Tom Messina: We call my team R&D Advanced Computing. We focus on a few different verticals. HPC is probably our flagship vertical, but we’ve grown across a number of other areas. We have a lot of work going on in all walks of Digital Imaging. We also have a vertical that we call Analytics Engineering, which is a bit more focused on modern-day data science activities: model management, development, training, collaboration, reproducibility. Basically, platforms that support a multitude of data science workbench activities and serve far-reaching organizational data science needs.
My team is responsible for driving the technology solutions in those spaces: providing the technology landscape, strategy, and road mapping; doing lots of solution architecture and design; and, more often than not, hands-on implementation work as well.
Does J&J leverage the cloud as part of that solution mix?
Yes. Through all those verticals, our focus is basically on building out platforms that are highly reusable. For my team, we do all of it in the cloud, 100%. My team probably represents a significant portion of J&J’s cloud usage and associated costs, which I hear about often. We also drive a lot of the influence as to how J&J is positioned in the cloud space.
What percentage of the compute do you think is being done in the cloud out of overall compute?
We are probably nearing 25% in-cloud versus 75% on-prem. The goal is to get to 50-50 by 2025.
What are the challenges of actually running that hybrid strategy?
Well, the hybrid strategy is a J&J strategy, whereas my team has gone all cloud. Because of what my team does, the platforms we’re building, the flexibility that we need, the growth that we experience, and the rapidly changing business landscape, going all cloud made a lot of sense. I can speak primarily from that perspective. As of four years ago, my team has been all cloud.
I do see the challenges that some of our other peer groups in the organization face. Things like “my data’s on-prem and I want to run my compute in the cloud,” or vice versa; I hear those the most. But we’ve overcome a lot of that with our partners, because we’ve built everything for the cloud. Even when you talk about networking or data challenges, we take an approach where we say, “let’s push it all there.” In cases where there are client applications, we create them to stream from the cloud and give that local feel, but everything’s running in the cloud. In a lot of cases, it’s on a more performant machine than what they have locally, because in the cloud we have much more flexibility. We can give them as big an environment as they need. And then if their data’s right there, it works out really well.
Are there specific use cases that you find map really nicely to the cloud or can every use case go there?
Anything that is a naturally elastic use case seems to work really well. This is part of the reason why 10 years ago we decided we should start putting our HPC solutions on the cloud because they ebb and flow so much. There’s so many peaks and valleys that it made a lot of sense and still does.
Now we’re probably getting to a point where we’ve got so much critical mass: 16 HPC clusters running in the cloud across a number of different business partners. Each one is very elastic. You might have one partner whose bill is $100,000 in a month, and then the next month it’s $10,000, just because of the nature of the experiments that they’re running.
It probably would be helpful to have a minimum set of compute that is just always there, bought and paid for, which is where your on-prem justification comes in. Of course for us, being all cloud, what we try to do most is to get that baseline as cost efficient as possible. We do everything in AWS. We haven’t really gone into Azure or GCP yet, at least in my group. We use AWS Savings Plans heavily. We use Spot as much as we possibly can.
Does that work out for you?
Most of the time, yes. We do run into situations where certain node types are not as available as others, especially things like GPUs. But we’re working on expanding the zones in which we are willing to run. This is part of the J&J challenge and how we’ve set up the networking. By default, you only get a few zones. You might only get two zones in Northern Virginia when there are actually six to use. So we’re working with our networking team to get access to all the zones. That way we can have the best chance of running in the deepest Spot pools possible.
The other thing is we’re being a little bit more flexible with what node types we’re using. Does it have to be exactly a p3.8xlarge, or can you run on a p3.2xlarge, p3.8xlarge, or p3.16xlarge and have more pools to tap into? We’re working through orchestrating all of that because we have a Spot-first goal until we find reasons we shouldn’t be using Spot. Then we’ll make sure we’ve got enough in our Savings Plans for those nodes. And if it doesn’t make sense at that point, we will do on-demand, but on-demand is always the last resort.
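To make the instance-type and zone flexibility Tom describes concrete, here is a minimal sketch using boto3’s EC2 create_fleet call. It is illustrative only, not J&J’s actual orchestration; the launch template name and subnet IDs are hypothetical placeholders.

```python
import boto3

# Sketch: ask EC2 for Spot capacity while offering several instance sizes and
# subnets (availability zones), so the fleet has more Spot pools to draw from.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_fleet(
    Type="instant",
    TargetCapacitySpecification={
        "TotalTargetCapacity": 4,
        "DefaultTargetCapacityType": "spot",
    },
    # Prefer the deepest Spot pools rather than the absolute cheapest ones,
    # which reduces the chance of interruption.
    SpotOptions={"AllocationStrategy": "capacity-optimized"},
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "hpc-gpu-node",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": itype, "SubnetId": subnet}
                for itype in ("p3.2xlarge", "p3.8xlarge", "p3.16xlarge")
                for subnet in ("subnet-aaaa1111", "subnet-bbbb2222")  # hypothetical
            ],
        }
    ],
)

for instance in response.get("Instances", []):
    print(instance["InstanceType"], instance["InstanceIds"])
```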
Do you use reserved instances?
Yeah. Savings Plans is Amazon’s new Reserved Instances. They’re phasing out Reserved Instances.
Do you like that? Is that a good thing?
I think so. They give you more flexibility than reservations. With reservations, a lot of times you have to lock yourself into a very specific situation; Savings Plans are more like just paying for compute in general. You don’t have to specify instance types or sizes or those sorts of things, so you have a lot more flexibility.
What’s J&J’s connectivity strategy to the cloud?
We have a direct connect with the cloud providers. There are some choke points still that we’re working through, but it is a direct connect and that creates that hardened gateway. That way we can launch with J&J IPs, and it looks like it’s an extension of the J&J network.
What are those choke points?
Sometimes you’ve got network transfers running fine, and then all of a sudden performance drops by 80%. We’ve got to go back and figure out what happened. It happens occasionally, and it can be painful and widespread. Sometimes you’re doing transfers to AWS and it happens. You test a data transfer within J&J and it’s much faster, so you know something changed in the connectivity between J&J and AWS.
Then when you get on AWS and are running on a node, there are guardrails in terms of the types of machine images that you can use. Typically, we can only use J&J created images. There’s a little flexibility around that, and then there’s also security and limitations around what we’re able to do. So we just need to make sure that we work within that and still have all the flexibilities we need to meet business needs.
How do you handle permissions and SSO?
That’s one of the nice things about having our Shared Services Organization create these guardrails. They’ve hooked up single sign-on for the whole enterprise, so when you go out to a portal, it automatically authenticates you, and then you can click and choose any of your AWS accounts. You’re automatically in, which is really nice. It’s all tied into Active Directory. When you deploy an application, you still have to make sure that application can work with Ping ID, or whatever it is, so that if you want single sign-on capabilities, the app can handle it. But we’ve been able to have those needs met for both internally deployed as well as externally facing applications. We’ve got some capabilities where even if an application’s externally facing, we can enable SSO. Sometimes it’s just for the J&J employees, and then the external partners need to go through another mechanism or they just have local authentication, and that works well.
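One common mechanism behind this kind of portal-based federation is exchanging a SAML assertion from the corporate identity provider for temporary AWS credentials. The sketch below shows that exchange with boto3; the role and provider ARNs are hypothetical, obtaining the assertion is specific to the IdP login flow, and this is not a description of J&J’s actual setup.

```python
import boto3


def federated_session(saml_assertion_b64: str) -> boto3.Session:
    """Trade a base64-encoded SAML assertion for a temporary AWS session."""
    sts = boto3.client("sts")
    resp = sts.assume_role_with_saml(
        RoleArn="arn:aws:iam::123456789012:role/ResearchComputing",           # hypothetical
        PrincipalArn="arn:aws:iam::123456789012:saml-provider/CorporateIdP",  # hypothetical
        SAMLAssertion=saml_assertion_b64,  # produced by the IdP login flow
        DurationSeconds=3600,
    )
    creds = resp["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```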
What are examples of elastic use cases?
Anything that has peaks and valleys of usage, jobs that can ebb and flow. We do a lot of our PK/PD modeling and simulation, computational chemistry, pharmacogenomics, data sciences and image analysis in the cloud. Just about any situation that requires large-scale HPC, we tend to get involved with.
What’s imaging to you?
We do a lot of Digital Pathology imaging and everything that comes along with it. The storage doesn’t go up and down, but it does grow. Then you have all of the apps on top of it to enable collaboration, annotations, peer reviews, primary reviews. And then there’s all the associated image analysis that has to happen with those images. They require both CPU and GPU capability to do advanced machine learning and deep learning image analysis. We’ve got a couple of platforms that facilitate those needs. Also for Radiology-based imaging, like CT scans, PET, MRIs, etc., we currently provide some solutions and are planning a large-scale platform to facilitate everything from storage through large-scale image analysis.
A lot of those imaging applications are heavy users of GPUs. I know that in Amazon, those are highly sought after nodes. Do you guys ever have a situation where you can’t get the nodes that you need and how do you deal with that?
Yeah, we’ve run into it a bit more recently just because they’re becoming much more sought after. Again, we’re trying to expand our flexibility. Amazon has a nice feature where if you provide flexibility, it will look into the deepest Spot pools for you and then automatically deploy to those pools. We’re trying to get smarter about our orchestration. But even as we’re doing that, you have to run some things on-demand because the business partners, the scientists, can’t handle too many of those failures. So we have created specific Spot queues where they can send short-running jobs, for example.
If you know your job’s going to run in six hours or less, submit it to the Spot queue; if not, you can go for on-demand, which could be a reservation as well. We’re trying to be as flexible as we can, but also as cost conscious as possible.
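Since J&J runs Altair Grid Engine (mentioned later in this conversation), a routing policy like the one Tom describes could be a thin wrapper around job submission. The sketch below is only an illustration of the idea; the queue names and the wrapper itself are hypothetical, with the six-hour cutoff taken from the policy above.

```python
import subprocess


def submit(job_script: str, est_runtime_hours: float) -> None:
    """Send short jobs to a Spot-backed queue, longer ones to on-demand."""
    # Hypothetical queue names; the threshold mirrors the six-hour rule above.
    queue = "spot.q" if est_runtime_hours <= 6 else "ondemand.q"
    subprocess.run(
        [
            "qsub",
            "-q", queue,                                    # target queue
            "-l", f"h_rt={int(est_runtime_hours * 3600)}",  # hard runtime limit (seconds)
            job_script,
        ],
        check=True,
    )


# Example: a four-hour simulation is short enough to tolerate Spot interruption.
# submit("run_simulation.sh", est_runtime_hours=4)
```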
Have you ever had Spot instances fail in the middle, and not gracefully? How do you deal with that?
Yeah. In some cases, it’s actually okay because the submission program may automatically resubmit. A computational modeler may submit 10,000 or 20,000 short-running R jobs or R tasks, which are good to run on Spot because even if you start losing some, they run so quickly that you just resubmit them back to the grid and they run somewhere else. It becomes an issue if you’ve got longer-running activities that you’re trying to run on Spot, and 22 hours into the job, you lose the node.
We try to resubmit everything, but you’ve got to keep a close eye on it because that could be a day lost waiting for results. Some of the applications have checkpointing and are graceful in terms of how they can resubmit, so you don’t have to restart from the beginning for example. We’re starting to investigate those aspects a little bit more.
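For applications that can checkpoint, one practical pattern is to watch for the two-minute Spot interruption notice exposed through the EC2 instance metadata service and save state before the node is reclaimed. The sketch below assumes the classic IMDSv1 endpoint and leaves the checkpoint and work functions as placeholders; a hardened version would also handle IMDSv2 session tokens.

```python
import time
import urllib.request

# EC2 metadata path that appears only once a Spot reclamation is scheduled.
INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(INTERRUPTION_URL, timeout=1):
            return True   # metadata present: the node is about to be reclaimed
    except OSError:
        return False      # 404 (or no metadata service): keep working


def run_with_checkpoints(step, save_checkpoint, poll_seconds=30):
    """Run work in bounded increments, checkpointing if the node is going away."""
    while True:
        if interruption_pending():
            # Persist state (e.g. to S3) so a resubmitted job can resume
            # instead of restarting from the beginning.
            save_checkpoint()
            return
        if not step():        # step() returns False when all work is done
            return
        time.sleep(poll_seconds)
```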
What are the real cost drivers? Is it compute? Is it storage? Is it egress? How are you optimizing?
I think it’s mostly compute. Based on the numbers I’ve seen I want to say it’s probably like 70% compute, 20% storage, and 10% egress.
We’ve got some specific cost optimization efforts underway: dedicated efforts looking at cost optimization in the largest accounts that we’re running and understanding what those teams are doing to be smart about good cloud hygiene. There’s a whole checklist of things we go through to make sure that we’re appropriately sizing our instances and that we’re being smart about things like reservations, Savings Plans, and Spot. And then we’ve got some recommendation engines that actually go through the accounts, look at what’s running, the CPU utilization, the memory utilization, the networking, and make recommendations on whether you are optimized: is your node too big for what you really need?
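The sketch below shows the kind of check such a recommendation engine performs, using CloudWatch metrics to flag running instances whose average CPU utilization over two weeks suggests the node is oversized. The 20% threshold and the single-metric view are illustrative assumptions, not J&J’s actual tooling.

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        # Hourly average CPU utilization for the past two weeks.
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
            StartTime=start,
            EndTime=end,
            Period=3600,
            Statistics=["Average"],
        )["Datapoints"]
        if not datapoints:
            continue
        avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
        if avg_cpu < 20:  # illustrative threshold for "node too big"
            print(f"{instance['InstanceId']} ({instance['InstanceType']}): "
                  f"avg CPU {avg_cpu:.1f}%, consider a smaller instance")
```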
Costs can run away in the cloud if you don’t really manage it well. You have users who think they can just have access to infinite resources.
Absolutely.
Are there other efforts you’re making to educate end users and avoid unnecessary cloud costs?
I think in the old days we had more of a problem with this. Now, as long as a node is under the proper automation, if it sits idle for 20 or 30 minutes, it just goes down on its own, and you no longer pay for it.
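One simple way to implement that kind of idle shutdown is a CloudWatch alarm with a built-in EC2 stop action, sketched below. The instance ID and 5% CPU threshold are placeholders, and an HPC scheduler’s own idle-node policy could accomplish the same thing.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

instance_id = "i-0123456789abcdef0"  # hypothetical instance

# Stop the instance after 30 minutes (six 5-minute periods) of very low CPU.
cloudwatch.put_metric_alarm(
    AlarmName=f"idle-stop-{instance_id}",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=6,
    Threshold=5.0,
    ComparisonOperator="LessThanThreshold",
    # Built-in EC2 action ARN: stop the instance when the alarm fires.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:stop"],
)
```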
In the areas that we’ve been more involved in as of the last four or five years, we’ve done a better job of getting control over those costs. We will do things like start out with some limits and quotas.
We can get a feel for growth and ask what the future looks like. How do we evolve to make sure that we’re being smart about efficiently using these resources and containing our costs?
We really haven’t gotten to the point where we’ve restricted people from growing. That’s not the idea. We’re actually at the point now, though, where some of the growth is so significant that we’re looking at the business to say, you have to start funding some of this. IT has funded it for many years, and it’s at a point where, because of the growth, it’s just not possible for IT to continue to fund it on its own. Last year, I think we established that wherever you ended up in 2020, that’s your baseline. If you’re going to grow beyond that for 2021, please put that money aside. We’re going to need it come the June or November update.
That’s one of the ways that we’re sharing the responsibility because it’s not intended to limit the size. It’s just more intended to say we all have a role here and to make sure we’re smart about how we’re using these resources.
Based on your experience today, what do you wish you knew about deploying and managing data in the cloud when you first started?
Think about cloud hygiene from the start. I know it’s tough because when you’re getting going with something new, all you’re thinking about is enablement. But I would say as early as possible start baking in the idea of optimization and clean up. Make that a habit because it’s just part of living in the cloud. Like you said, otherwise things can run away. I think it took us a while to get to that point of maturity.
You mentioned that you’re primarily using AWS, but you also have Azure and you have been looking into Google. Why are you looking at other cloud vendors besides AWS? How hard is it to juggle between cloud providers?
My group is pure AWS at the moment, but I can speak for J&J in general on this.
There are certain aspects of each cloud provider that differentiate them from each other in terms of capabilities. IT has partnered with the business across many different areas of J&J to understand these differentiated points and utilize areas of best fit. There’s also the aspect of not putting all your eggs in one basket. Have leverage points. Even though AWS has been a tremendous partner and has consistently worked with us to drive our costs down, we still want to be spread out a bit.
To your question of how to decide where to go, the way I’m looking at it, how do I scale a team for multi-cloud? I honestly don’t know the answer to that because I can’t even imagine having folks have to understand two clouds. There’s so much to understand in AWS alone that a lot of times things will come up and we’ll realize we weren’t even aware a capability was out there. Having to keep track of more than one provider is a little daunting. I think we will have to do it someday. But it is a real challenge with limited resources.
One of the problems can be taking your on-prem HPC people and getting them to transition to the cloud, because things are very different in the cloud than they would be on-prem. A lot of on-prem people are very resistant. Do you find that at J&J?
I think we do. With our HPC work, we started using the cloud a decade ago, so we started almost fresh as a team. We didn’t have those folks that were just so used to doing it on-prem. We had enough forward-looking individuals. But it totally is different, at least the infrastructure automation and orchestration.
Once you stand up a cluster, there’s a lot of similarities. Altair Grid Engine on-prem looks like Altair Grid Engine in the cloud. But to get to that point, you’ve got to do work in Altair NavOps Launch templates for auto scaling up and down, how you launch these nodes, etc. There’s this DevOps aspect that is hard for people to wrap their brains around. I can understand that. But there’s so much power there that it makes a lot of sense for us.
It’s amazing to me that in 2021 there are still companies that have not taken a full plunge on the cloud, who still view it as experimental.
Yeah, I think costs are probably the biggest concern for folks. That cost conversation is a tricky one when you really get down to it, but I think if you’re smart about how you contain your costs in the cloud, you can get a lot of compute for a really efficient price. You just have to be knowledgeable about how to do it.
For most clients that I talk to, it’s more a capability play than a cost play. I think the point you made very clear today is that costs can be kept reasonable with good hygiene.
That is exactly right. For us, from day one, it was always about capability. It was always about the value proposition back to the business. How much compute time or waiting time are we saving you? And about being able to run very elastically versus in a limited, on-prem capacity.
Is there any other advice you would give a new startup company thinking of going to the cloud?
I recommend significant iteration. Experimentation and iteration, trying things out, tearing things down frequently, and then getting to a point where you can start to design these things for longer-term scalability.
Cloud is the perfect place to experiment. On-prem, you don’t get the ability to quickly build up and tear down. With what we’ve been able to do, the experimentation has been amazing. Just launch, try something. Didn’t work out? Tear it down, revamp, go again. But keep an eye on those cloud hygiene aspects, because the earlier you apply them, the easier it will be to manage your situation as the growth comes, and a lot of times it comes quickly.