01 Jul 2013 Fun and Games with Genomic Analysis At Home
We have a lot of freedom as BioTeam employees. One of the biggest advantages is that we’re a virtual firm: we work from home, a coffee shop, or wherever we happen to be, with no offices to report to and no rush hour to deal with.
A large proportion of our recent work involves developing and working with Next Gen Sequencing (NGS) tools and infrastructure for researchers. Because this research platform pushes the boundaries of big data and requires very powerful computational environments to process and interpret the data, we’re often pushing current technology to its limits.
The confluence of these on-the-job parameters was highlighted recently while we were pressure testing the prototype of BioTeam’s latest research solution, the SlipStream Appliance: Galaxy Edition, to generate some benchmarks for the official product release at the Galaxy Community Conference in Oslo, Norway. SlipStream integrates years of BioTeam experience, knowledge, and best practices in the NGS research space with our significant experience in architecting HPC infrastructure for NGS data management and analysis. BioTeam tends to make products that solve problems we encounter repeatedly as consultants. SlipStream solves the many problems associated with installing, maintaining, and supporting the Galaxy analysis platform and the hardware needed to perform the analyses.
In this case, we noticed that many groups either had a hard time utilizing their institutional HPC resources or spent significant resources (monetary and personnel) maintaining their own solutions for running Galaxy instances. So we sought to provide a small, powerful, desktop-sized compute solution, complete with an easy-to-maintain IT management system, at a price well within reach of most small laboratories. SlipStream provides a very powerful compute infrastructure (16 cores, 16TB storage, 384GB RAM) along with a pre-installed full production instance of Galaxy and the majority of the Galaxy tools, all for less than $20,000. With this platform, scientists can get back to doing their research, do it quickly and efficiently, and not have to devote resources to managing IT infrastructure.
The prototype SlipStream appliance is currently in my home office, with Time Warner providing my high-speed internet connection (50Mb) to the rest of the SlipStream development team. From there, we set up the base machine, automated the software installation and configuration, and have been hitting the machine hard to generate the benchmarks mentioned above. The benchmarking (results to follow in a subsequent post) focused on the more computationally intensive analysis tools, primarily NGS mapping tools, on large datasets (5-125Gb).
After beating on SlipStream for a while, one of our collaborators in Boston wondered if the data transfers were killing my neighbors’ internet connections by saturating the neighborhood’s network. Out of curiosity, I looked up my bandwidth usage for June (see the result in the graph to the left). It turns out that my usage was crazy high: 623GB of data transferred last month (my apologies if you live near me and you’ve had a harder time streaming Funny Cat Videos this month)!
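For the curious, here is a rough back-of-the-envelope sketch of what 623GB in a month means for a connection like mine. It assumes that "50Mb" means 50 megabits per second, that a GB is 10^9 bytes, and that the usage was spread over a 30-day June; the function names are just illustrative.

```python
def average_mbps(total_gb: float, days: int) -> float:
    """Average throughput in megabits per second for a monthly total."""
    bits = total_gb * 1e9 * 8          # assumption: decimal GB, 8 bits per byte
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e6


def hours_at_line_rate(total_gb: float, link_mbps: float) -> float:
    """Hours needed to move the same total if the link ran flat out."""
    bits = total_gb * 1e9 * 8
    return bits / (link_mbps * 1e6) / 3600


if __name__ == "__main__":
    avg = average_mbps(623, 30)
    print(f"Average rate: {avg:.2f} Mbps ({avg / 50:.1%} of a 50 Mbps link)")
    print(f"Time at full line rate: {hours_at_line_rate(623, 50):.1f} hours")
```

Under those assumptions the month averages out to roughly 2 Mbps, a small slice of the link, or a bit over a day of transfers at full speed. The real traffic surely came in bursts rather than a steady trickle, so this says nothing about the peaks.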
The SlipStream appliance barely broke a sweat during the Galaxy benchmark analyses (though my electricity bill is up a bit this month), and so far I’ve had no phone calls from Time Warner complaining about the bandwidth.
In our experience, a small to mid-size NGS lab might generate tens of GBs of data in a month, far less than the SlipStream appliance handled while sitting in a home office. If I can use this great tool in this manner from my house, a small laboratory should be able to use it effectively in almost any research environment.
We firmly believe that IT infrastructure and management should not be a bottleneck for scientific progress. There are many important uses for centrally managed shared HPC resources, but they often represent generically managed systems that don’t support customized research environments. SlipStream can provide a bridging function between core high performance computing resources and the local laboratory. We’re very excited to see how this tool helps advance scientific discovery!