I remember the mid-1990s debate over how many minutes away from a diversion airport a two-engine passenger jet should be allowed to fly, given the possibility of an engine failure. Staying in the air long enough is one of those high-availability functions that really matters. The Boeing 777 was the first aircraft to enter service with a 180-minute extended operations (ETOPS) certification [1]. This meant that longer over-water and remote-terrain routes were immediately possible.
The question was, “Can a two-engine passenger aircraft be as safe as a four-engine aircraft for long-haul flights?” The short answer is yes. Reducing the points of failure from four engines to two, while meeting strict maintenance requirements and maintaining redundant systems, reduces the probability of a failure. The 777 and many other aircraft have proven to be safe for these longer flights. More recently, the 777 received FAA approval for a 330-minute ETOPS rating [2], which allows airlines to offer routes that are longer, straighter and more economical.
What does this have to do with a datacenter? Some hyperscale datacenters house hundreds of thousands of servers, each with its own boot drive. Each of these boot drives is a potential point of failure, which raises both the odds of a breakdown and the cost of acquiring and operating the servers. Datacenter managers need to control CapEx, so given the sheer volume of server boot drives they commonly use the lowest-cost 2.5-inch notebook SATA hard drives. The problem is that these commodity hard drives tend to fail more often. This is not a huge issue with only a few servers. But in a datacenter with 200,000 servers, LSI has found through internal research that, on average, 40 to 200 drives fail per week! (This assumes 2.5-inch hard drives with a lifespan of roughly 2.5 to 4 years, which equates to a conservative failure rate of 5% per year.)
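To make the arithmetic behind those weekly figures concrete, here is a minimal sketch. It assumes failures are spread evenly across the year, and the function name is hypothetical. The 5% annual rate is the one quoted above; the 1% rate is only an assumption, chosen here to show roughly what the lower end of the range would imply.

```python
def expected_weekly_failures(drive_count: int, annual_failure_rate: float,
                             weeks_per_year: float = 52.0) -> float:
    """Average number of drive failures per week, assuming failures
    are spread evenly over the year (a simplification)."""
    return drive_count * annual_failure_rate / weeks_per_year


servers = 200_000  # one boot drive per server, as in the example above

# 5% per year is the failure rate quoted in the text.
high = expected_weekly_failures(servers, 0.05)

# 1% per year is an illustrative assumption, not a figure from LSI.
low = expected_weekly_failures(servers, 0.01)

print(f"5% annual failure rate: about {high:.0f} drives per week")
print(f"1% annual failure rate: about {low:.0f} drives per week")
```

At 5% per year the estimate lands just under the 200-drives-per-week upper figure, which is why even a modest failure rate becomes a daily operational task at this scale.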
Traditionally, a hyperscale datacenter has a sea of racks filled with servers. LSI estimates that, in the majority of large datacenters, at least 60% of the servers (Web servers, database servers, etc.) use a boot drive that needs no more than 40GB of storage capacity, since it handles only boot-up and journaling or logging. For higher reliability, the key is to consolidate these low-capacity drives, virtually speaking. With our Syncro™ MX-B Rack Boot Appliance, we can consolidate the boot drives for 24 or 48 of these servers into a single mirrored array (using LSI MegaRAID technology), which makes 40GB of virtual disk space available to each server.
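As a rough sizing sketch of that consolidation, the snippet below works out how much capacity 24 or 48 servers at 40GB each would need, and how many appliances the 200,000-server example might call for if 60% of its servers qualify. The function names are hypothetical, and the sketch assumes a simple two-way mirror with no formatting or metadata overhead; the appliance's actual drive configuration is not described here.

```python
def appliance_capacity_gb(ports: int, gb_per_server: int = 40,
                          mirror_copies: int = 2) -> tuple[int, int]:
    """Return (usable_gb, raw_gb) for one appliance.

    Assumes a plain mirror that stores `mirror_copies` full copies
    of the data and ignores any formatting or metadata overhead.
    """
    usable = ports * gb_per_server
    raw = usable * mirror_copies
    return usable, raw


def appliances_needed(servers: int, eligible_percent: int, ports: int) -> int:
    """Appliances required if `eligible_percent` of servers use one port each."""
    eligible = servers * eligible_percent // 100
    return -(-eligible // ports)  # integer ceiling division


for ports in (24, 48):
    usable, raw = appliance_capacity_gb(ports)
    count = appliances_needed(200_000, 60, ports)
    print(f"{ports}-port: {usable} GB usable, {raw} GB raw, "
          f"{count} appliances for 120,000 eligible servers")
```

Under these assumptions, a 48-port appliance presents 1,920GB of virtual disk space, a volume that a small number of large mirrored drives could plausibly hold in place of 48 individual boot drives.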
Replacing all these boot drives with fewer, larger, mirrored drives helps reduce total cost of ownership (TCO) and improves reliability, availability and serviceability. If a rack boot appliance drive fails, an alert is sent to the IT operator. The operator then simply replaces the failed drive, and the appliance automatically copies the disk image from the working drive. The upshot is that operations are simplified, OpEx is reduced, and there is usually no downtime.
Syncro MX-B not only improves reliability by reducing failure points; it also significantly reduces power requirements (up to 40% less in the 24-port version, up to 60% less in the 48-port version), which is good for both the corporate utility bill and the climate. Lower power draw, in turn, reduces cooling requirements. Disaggregating the boot drives from the servers also makes hardware upgrades less costly: the drives are typically still functional when the servers are upgraded, so there's no need to replace them at the same time.
In the case of both commercial aircraft and servers, less really can be more (or at least better) in some situations. Eliminating excess can make the whole system simpler and more efficient.
To learn more, please visit the LSI® Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx