29 Aug 2011 Why we built a Backblaze
BioTeam’s Backblaze 2.0 Project – 135 Terabytes for $12,000
- Part I – Why you should never build a Backblaze pod
- Part II – Why we built a Backblaze pod (this post)
- Part III – Our real-world Backblaze pod costs
- Part IV – Backblaze pod assembly & integration pictures
- Part V – Backblaze Initial Performance Data
- Part VI – Backblaze pod software & configuration (future post)
- Part VII – Backblaze pod ongoing impressions (future post)
Why we built a Backblaze pod
After the last blog post explaining all of the sensible reasons why you should never build a Backblaze pod, it’s time to talk about why we decided to build one anyway.
- Disruptive stuff. Anyone familiar with the price of Tier-1 and Tier-2 storage sold into the enterprise these days should recognize the potentially disruptive nature of “100TB usable for $12,000” storage devices. Even with all the potential downsides and negatives accounted for, it’s possible that systems like these could enable all sorts of previously unforeseen applications and use-cases.
- Hands-on is always better. BioTeam is a tiny company; we don’t really have a marketing budget or PR team. Sure we could pontificate from afar about systems like this but it’s always better to get hands-on with the stuff one talks about. We wanted to play with one of these pods ourselves for a number of different reasons.
- Ridiculous detail. The sheer amount of detail that Backblaze provides has already allowed many others to follow in their footsteps. From CAD designs to updated Hitachi drive model numbers, the data needed to “build your own” is all there. We knew that all the info we’d need was already published. The Backblaze folks are also quite responsive to the comment threads left on their blog, so we felt confident we could ask questions and get a timely response if needed.
- Protocase.com. The fact that protocase.com is now selling “everything but the drives” kits was the deciding factor for us. Even a little bit of pre-purchase research on the parts and components that Backblaze recommends indicates that many items might be hard to acquire or expensive in low-volume orders. Even with all the parts on hand, a certain amount of custom wiring looks to be required, and when it comes to IT products used in business/scientific settings we draw the line at DIY chassis wiring. The up-charge and/or profit margin that Protocase adds to the kit order is more than worth it, simply because it removes the need to rig custom wiring.
- Valid scientific/business use-case. We had an actual use case that was a good fit for what the Backblaze pod provides.
- Willing client. We had a local client who was willing to pay for all the hardware purchase and shipping costs. We worked out what was effectively a barter arrangement – the client would pay for all hardware costs and BioTeam would provide build, configuration and integration services in exchange for the “hands-on” “kick-the-tires” access we desired.
Our Use Case
Our use case involves very large amounts of public-domain, open-access scientific data that is published and freely available online. This data is downloaded from the internet and brought “inhouse” for analysis, data-mining and other activities. The volume of data is immense (think dozens and dozens of terabytes …).
This data is downloaded and stored on a more “traditional” enterprise storage platform which makes it available to various informatics and research pipelines and tools. This data plays a vital role in research activities. Since this data is public domain and very, very large, an intentional choice was made not to expend resources to back up, replicate or otherwise “protect” the non-unique and non-critical data.
However, we still have a “risk” in this scenario. Even though the scientific data is totally non-unique and easily available for re-download from the internet, the sheer volume involved presents a non-trivial risk of research disruption lasting for days, or more likely weeks, if the data were lost and had to be downloaded again.
A simple pencil-and-paper exercise: “how many days/weeks would it take to re-download all of this public domain data over our internet circuit?” was enough to justify this particular Backblaze storage project.
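For anyone who wants to run the same exercise, here is a rough sketch of the arithmetic in Python. The numbers in the example (50 TB of data, a 100 Mbit/s circuit, 80% sustained efficiency) are purely hypothetical placeholders and are not our client's actual figures; the function name and parameters are also just illustrative. Plug in your own data volume and circuit speed.

```python
def redownload_days(terabytes, link_mbps, efficiency=0.8):
    """Estimate days needed to re-download a dataset over an internet circuit.

    terabytes  -- dataset size in decimal terabytes (1 TB = 10**12 bytes)
    link_mbps  -- nominal circuit speed in megabits per second
    efficiency -- assumed fraction of the nominal speed actually sustained
                  (protocol overhead, shared use of the link, slow mirrors)
    """
    if terabytes < 0 or link_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    total_bits = terabytes * 10**12 * 8             # bytes -> bits
    usable_bits_per_sec = link_mbps * 10**6 * efficiency
    seconds = total_bits / usable_bits_per_sec
    return seconds / 86400                          # seconds per day


if __name__ == "__main__":
    # Hypothetical example: 50 TB over a 100 Mbit/s circuit at 80% efficiency.
    days = redownload_days(50, 100, 0.8)
    print(f"{days:.1f} days")  # prints "57.9 days"
```

Even with generous assumptions, the answer tends to land in the "weeks" range once the data volume reaches dozens of terabytes, and that is before accounting for the fact that the circuit is needed for other things too.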
That’s it basically (or at least all I can talk about in public). We are using the Backblaze pod plus NAS appliance software from www.openfiler.com to build a “last resort” storage pool for scientific data. This data is not valuable enough to justify spending lots of money on a more traditional storage solution, yet it is large enough in terabyte terms to represent a significant time-risk should an event occur that requires all of it to be re-downloaded via the internet.
We see this $12,000 appliance as a simple hedge against interrupting ongoing research activities. Totally worth it.
How would you use one?
What would YOU do with 100TB of usable space for short money? I’ve been thinking through a number of different scenarios …
Offsite backup? Got data that is not worth putting on tape or sending to Iron Mountain? Maybe you do that already but want one more level of “just in case” protection? Put one or more of these pods at the end of a Comcast Business IP circuit or Verizon FiOS link and replicate via rsync or even the slick new data movement/sync stuff from Aspera. Heck, this box is only 4U tall and would not be all that expensive to colocate if needed.
End-run around cloud storage provider SLAs? Got data in the cloud? Worried by recent data loss events at AWS and elsewhere? What if your cloud storage provider only offers 180 days of file retention and your CIO is demanding more? Something like this could allow for one more level of “just in case” protection for files & data or even just an annual or semi-annual data dump.
Object storage? I have not been hands-on enough with private cloud software from Eucalyptus or OpenNebula, but if they really do come out with usable clones of AWS S3 object stores then running such a system across a few pods seems reasonable (as long as the cloud storage layer handles 3x replication and redundancy across the pods).
Tier-Z storage? More and more software like StorNext is capable of policy-driven full data lifecycle activities like automatic file movement, archive, storage tiering and replication. Especially with systems like StorNext that know how to replicate and move data between and among datacenters and storage tiers, it seems possible that a number of Backblaze pods in geographically distinct facilities could act as a sink for the “storage tier of last resort”. That said, if you have the money to afford StorNext licensing you can probably also afford storage kit with fewer operational downsides than the Backblaze pod.