20 Jul 2013 Internet2 LifeSci Focused Tech Workshop Writeup
Internet2 Focused Technical Workshop:
Network Issues for Life Science Research
I’ve long known about the existence of the ultra-high-speed research networks that link “big science” sites, academic institutions and supercomputing installations, but to me personally they’ve always been a bit of a mystery – firstly because they tend to be installed only at big academic sites while I do most of my consulting work in biotech/pharma, and secondly because I believed (mistakenly) that these networks were off limits to commercial and industry people.
These super-high-speed networks are used and operated by people who think nothing of tossing around casual mentions of 40-gig and 100-gig optical network links. One of my immediate impressions at the meeting was how “bandwidth poor” I am in my own professional work! Even the big pharma and biotech companies I get exposure to are only now doing widespread deployment of 10-gig beyond the datacenter core, and they *might* occasionally have 40-gig links within a single line card or perhaps a 40-gig link between two redundant core switches at the heart of the datacenter.
The people at this meeting speak casually about linking buildings together with 40-gig and wiring their network cores at 100-gig. Well, maybe not ALL of them, but those speeds are absolutely mainstream within this community. I did feel jealous.
Image source: Kuang-Ching Wang, Clemson University, via his talk slides on “SDN Use Cases for Life Sciences Research”
The first Internet2 person I met was at this year’s Bio-IT World Conference & Expo where a few I2 technical and business development types were in attendance. Michael Sullivan and I had a nice conversation after one of my presentations and we swapped a few emails after the event.
Michael was also the first person to tip me off to the existence of the technical workshop I’m writing about in this post. He mentioned working with ESnet on putting together a life-science-specific meeting of a type they call “Focused Technical Workshops,” and he asked if I’d be willing to give a basic overview of the Bio-IT landscape for an audience expected to be fairly heavy on network experts with relatively little exposure to the biological sciences.
Image source: Steve Tuecke’s Globus Online talk
Why this post?
I’m not going to recap the meeting. The event site itself is pretty comprehensive and contains info on the attendees, the schedule and links to downloadable presentations. They will also be posting video of the session and some sort of official post-event writeup. I’ll update this post once those become available.
For me personally there were three main takeaways from this meeting that I thought deserved wider exposure in the professional community I associate with. Rather than spam a bunch of my email contacts with messages and big PDF attachments, I figured I’d toss up a blog post with some links and just let people know about it.
ScienceDMZ Best Practices & Reference Architectures
Over the past year or two I’ve had many conversations with IT staff, researchers and network engineers about how to move large amounts of scientific data within an organization (without trashing other users or critical services like VoIP) and into organizations (without melting down firewalls and IDS appliances).
It turns out the high energy physics people have been dealing with these issues for years and there is an emerging concept of how to construct “Science DMZ” infrastructures that allow for high-rate scientific data movement in ways that can still be policed and monitored for security, bad actors and intrusion attempts.
Eli Dart gave a great “ScienceDMZ 101” talk. At the time I wrote this post the link to his PDF was broken, but he has kindly supplied me with a larger slide deck taken from a 60-minute Science DMZ presentation he gave last month. CLICK HERE FOR THE SCIENCE DMZ SLIDE DECK (PDF).
The concept of a “ScienceDMZ” is covered well over at http://fasterdata.es.net/science-dmz/ and there are some links to recent presentations and tutorials over at http://fasterdata.es.net/science-dmz/learn-more/
OK, just go ahead right now and visit http://fasterdata.es.net/ and bookmark it. This site should be widely shared among Bio-IT HPC and enterprise networking types. It has tons of information about TCP and very specific advice on tuning hosts and even routers for faster performance on high-speed links.
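Much of the host-tuning advice on that site comes down to the bandwidth-delay product: a TCP connection can only keep a link full if its buffers can hold at least bandwidth × round-trip time of in-flight data. Here is a quick sketch of that arithmetic (my own illustration, not code from fasterdata.es.net):

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: the minimum TCP window/buffer (in bytes)
    needed to keep a link of the given speed full at the given RTT."""
    return bandwidth_bps * rtt_seconds / 8

# A 10-gig link with 50 ms of coast-to-coast latency needs far more
# buffer than most default OS settings provide:
window = bdp_bytes(10e9, 0.050)
print(f"Required TCP buffer: {window / 1e6:.1f} MB")  # 62.5 MB
```

That 62.5 MB figure is why the tuning guides raise kernel socket-buffer limits well beyond the defaults on hosts that do long-haul transfers.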
Of particular interest is the “Say No to SCP” page. We should be sharing this URL with anyone who has ever told us that genomes should be freighted around with scp or rsync. And those of us who can’t avoid SCP should check out the PSC site that hosts a patched version of OpenSSH called “HPN-SSH,” which engineers around a few of the more problematic issues.
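The core problem with stock SCP is the same window/RTT arithmetic: SSH enforces its own fixed-size flow-control window on top of TCP, so throughput is capped at window ÷ RTT no matter how fast the underlying link is. A hedged sketch of that math (the ~2 MB window is my assumption about OpenSSH’s default channel window; the exact value varies by version):

```python
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Ceiling on throughput when only a fixed window of unacknowledged
    data may be in flight: window / RTT, converted to Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6

# Assuming a ~2 MB application-level window over a 50 ms cross-country
# path, the cap lands well under 1 Gbit/s even on a 10-gig link:
print(f"{max_throughput_mbps(2 * 1024 * 1024, 0.050):.0f} Mbit/s")
```

This is the issue HPN-SSH addresses by growing the SSH channel window to match the bandwidth-delay product of the path.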
BRO Security Monitoring on very fast networks
I met Robin Sommer in the audience. He shared with me a few links about a project called the “Bro Network Security Monitor” that he is involved with. Bro is one of a class of technologies and methods emerging to deal with network security in the era of 100-gig optical networks. Robin provided a few more intro links, including a nice Ars Technica writeup called “Securing supercomputing networks (without disrupting 60Gbps data flows)” and “Using ICSI’s Open Source Bro Platform to Protect the Blue Waters Supercomputer.”
The NIH/NCBI is probably one of the leaders in this space, particularly given how much data NCBI has to publish and the fact that the NIH is undergoing a network refresh at the moment. The NCBI may be moving to more of a “remote I/O” access model now that people recognize how hard it is to move massive chunks of data just to get at the snippets a researcher is actually interested in. There were more than a few mentions of bioinformatics postdocs who had to download 100+ terabytes of data just so they could get at the few gigabytes about a particular chromosome or gene that they cared about. I enjoyed Don Preuss’s talk and his slides can be found here.
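That postdoc anecdote is easy to quantify with back-of-the-envelope arithmetic (my own illustration, assuming an idealized transfer running at full line rate with no protocol overhead):

```python
def transfer_days(terabytes, link_gbps):
    """Days needed to move the given number of terabytes over a link
    running flat-out at link_gbps gigabits per second."""
    bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # ideal line-rate transfer
    return seconds / 86400

# Pulling 100 TB just to reach a few gigabytes of interesting data:
print(f"{transfer_days(100, 1):.1f} days at a full 1 Gbit/s")   # ~9.3 days
print(f"{transfer_days(100, 10):.1f} days at a full 10 Gbit/s")
```

Real-world transfers rarely sustain full line rate, so the actual wait is longer still, which is exactly why remote-I/O access models look attractive.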
I’ve mentioned in conversation and on stage at events that “Anyone who is not using Aspera is using GridFTP” when it comes to high-scale data movement. It was great to see people from Globus and Asperasoft all in the audience (and presenting) at the workshop.
Michelle Munson, always a fantastic tech-heavy speaker, gave a good update on Aspera, but her slides are not yet online via the event pages.