I started working years ago to engage large datacenters, learn what their problems are and try to craft solutions for their problems. It’s taken years, but we engaged them, learned, changed how we thought about storage and began creating solutions that are being deployed at scale.
We’ve started to do the same with the Chinese Internet giants. They’re growing at an incredible rate. They have similar problems, but it’s surprising how different their solution approaches are. Each one is unique. And we’re constantly learning from these guys.
So to wrap up the blog series on my interview with CEO & CIO magazine, here are the last two questions and answers, which explain a bit more.
CEO & CIO: Please use examples to tell the stories about the forward-looking technologies and architectures that LSI has jointly developed with Internet giants.
While our host bus adapters (HBAs) and MegaRAID® solutions have been part of the hyperscale Internet companies’ infrastructure since the beginning, we have only recently worked very closely with them to drive joint innovation. In 2009 I led the first LSI engagement with what we then called “mega datacenters.” It took a while to understand what they were doing and why. By 2010 we realized there were specialized needs, and began to imagine new hardware products that worked with these datacenters. Out of this work came the realization that flash was important for efficiency and capability, and the “invention” of the LSI® Nytro™ product portfolio. (More products are in the pipeline.) We have worked closely with hyperscale datacenters to evolve and tune these solutions, to the point where Nytro products have become the backbone of their main revenue platforms. Facebook has been a vitally important partner in evolving our Nytro platform – teaching us what was truly needed – and now much of their infrastructure runs on LSI products. These same products are a good fit for other hyperscale customers, and we are slowly winning many of the large ones.
Looking forward, we are partnered with several Internet giants in the U.S. and China to work on cold storage solutions and, more importantly, shared DAS (distributed DAS, or D-DAS) solutions. We have been demonstrating prototypes. These solutions enable pooled and rack-scale architectures, and can be made to work tightly with software-defined datacenters (SDDCs). They simplify management and resource allocation – making task deployment more efficient and easier. Shared DAS solutions increase infrastructure efficiency and improve lifecycle management of components. And they have the potential to radically improve application performance and lower infrastructure costs.
Looking further into the future, we see even more radical changes in silicon supporting transport protocols and storage models, and in rack-scale architectures supporting storage and pooled memory. And cold storage is a huge, though some would say boring, problem that we are also focused on – storing lots of data for free and using no power to do it… but I really can’t talk about any of that.
CEO & CIO: LSI maintains good contact with big Internet companies in China. What are the biggest differences between dealing with these Internet enterprises and dealing with traditional partners?
Yes, we have a very good relationship with the large Chinese Internet companies. In fact, I will be visiting Tencent, Alibaba and Baidu in a few weeks. I would like to say one of the CTOs is a friend – that is, we have fun talking together about the future.
These meetings have evolved. The first meetings LSI had about two years ago were sales calls, or support for OEM storage solutions. These accomplished very little. Once we began visiting as architects speaking to architects, real dialogs began. Our CEO has been spending time in China meeting with these Internet companies both to learn, and to make it clear that they are important to us, and we want a chance to solve their problems. But the most interesting conversations have been the architectural ones. There have been very clear changes in the two years I have traveled within China – from standard enterprise to hyperscale architectures.
We’ve received fascinating feedback on architecture, use, application profiles, platforms, problems and goals. We have strong engagement with the U.S. Internet giants. At the highest level, the Chinese Internet companies have similar problems and goals. But the details quickly diverge because of revenue per user, resources, power availability, datacenter ownership and Internet company age. The use of flash is very different.
The Chinese Internet giants are at an amazing change point. Most are ready for explosive growth of infrastructure and deployment of cloud services. Most are changing from standard OEM systems and architectures to self-designed hyperscale systems after experimenting with Scorpio and microserver deployments. Several, like JD.com (an Amazon-like company), are moving from hosted to self-built infrastructure. And there seems to be a general realization that the datacenter has changed from a compute-centric model to a dataflow model, where storage and network dictate how much work gets done more than the CPU does. These giants are leveraging their experience and capability to move very quickly, and in a few cases are working to create true pooled rack-level architectures much like Facebook and Google have started in the U.S. In fact, Baidu is similar to Facebook in this approach, but is different in its longer-term goals for the architecture.
The Chinese companies are amazingly diverse, even within one datacenter, and arguments on architectural direction are raging within these Internet giants – it’s healthy and exciting. However, the innovations that are coming are similar to those developed by large U.S. Internet companies. Personally, I have found these Internet companies much more exciting and satisfying to work with than traditional OEMs. The speed and cadence of advancement, the recognition of problems and their importance, and the focus on efficiency and optimization all stand out. And the youthful mentality and view of problems, unburdened by “the way we’ve always done this,” has been wonderful.
Also see these blogs of mine over the past year, where you can read more about some of these changes:
“Postcard from Shenzhen: China’s hyperscale datacenter growth, mixed with a more traditional approach”
“China in the clouds, again”
“China: A lot of talk about resource pooling, a better name for disaggregation”
Or see them (and others) all here.
Summary: So it’s taken years, but we engaged U.S. Internet giants, learned about their problems, changed how we thought about storage and began creating solutions that are now being deployed at scale. And we’re constantly learning from these guys. Constantly, because their problems are constantly changing.
We’ve now started to do the same with the Chinese Internet giants. They have similar problems, and will need similar solutions, but they are not the same. And just like the U.S. Internet giants, each one is unique.
Tags: Alibaba, Amazon, Baidu, CEO & CIO Magazine, China, cloud services, cold storage, D-DAS, DAS, datacenter, datacenter ecosystem, direct attached storage, distributed DAS, Facebook, flash, flash storage, Google, HBA, host bus adapter, hyperscale datacenter, Internet, JD.com, MegaRAID, OEM, original equipment manufacturer, Scorpio, Tencent
Pushing your enterprise cluster solution to deliver the highest performance at the lowest cost is key in architecting scale-out datacenters. Administrators must expand their storage to keep pace with their compute power as capacity and processing demands grow.
Beyond price and capacity, storage resources must also deliver enough bandwidth to support these growing demands. Without enough I/O bandwidth, connected servers and users can bottleneck, requiring sophisticated storage tuning to maintain reasonable performance. By using direct attached storage (DAS) server architectures, IT administrators can reduce the complexities and performance latencies associated with storage area networks (SANs). Now, with LSI 12Gb/s SAS or MegaRAID® technology, or both, connected to 12Gb/s SAS expander-based storage enclosures, administrators can leverage DataBolt™ technology to clear I/O bandwidth bottlenecks. The result: better overall resource utilization, while preserving legacy drive investments. Typically, a slower end device would step down the entire 12Gb/s SAS storage subsystem to 6Gb/s SAS speeds. How does DataBolt technology overcome this? Well, without diving too deep into the nuts and bolts, intelligence in the expander buffers data and then transfers it out to the drives at 6Gb/s speeds in order to match the bandwidth between faster hosts and slower SAS or SATA devices.
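To make the bandwidth-matching idea concrete, here is a back-of-the-envelope sketch in Python (my own illustration with assumed lane counts and link rates, not LSI performance data):

```python
# Rough arithmetic for expander-level bandwidth matching (illustrative numbers only).
HOST_LANES = 4            # assumed wide port between the RAID controller and the expander
HOST_GBPS_PER_LANE = 12   # 12Gb/s SAS host links
DRIVE_GBPS = 6            # 6Gb/s SAS/SATA end devices
DRIVES = 32               # drive count from the AIS demo described below

full_speed_host_bw = HOST_LANES * HOST_GBPS_PER_LANE   # 48 Gb/s upstream at 12Gb/s
stepped_down_host_bw = HOST_LANES * DRIVE_GBPS         # 24 Gb/s if the subsystem drops to 6Gb/s
aggregate_drive_bw = DRIVES * DRIVE_GBPS               # 192 Gb/s the drives can source together

print(f"Host link at 12Gb/s: {full_speed_host_bw} Gb/s; stepped down to 6Gb/s: {stepped_down_host_bw} Gb/s")
print(f"32 drives can source {aggregate_drive_bw} Gb/s, so buffering in the expander keeps")
print("the 12Gb/s host link saturated even though each drive only talks at 6Gb/s.")
```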
So for this demonstration at AIS, we are showcasing two Hadoop Distributed File System (HDFS) servers. Each server houses the newly shipping MegaRAID 9361-8i 12Gb/s SAS RAID controller connected to a drive enclosure featuring a 12Gb/s SAS expander and 32 6Gb/s SAS hard drives. One configuration has DataBolt enabled, while the other has it disabled.
For the benchmarks, we ran DFSIO, which simulates MapReduce workloads and is typically used to detect network performance bottlenecks and to tune hardware configurations and overall I/O performance.
The primary goal of the DFSIO benchmarks is to saturate storage arrays with random read workloads in order to ensure maximum performance of a cluster configuration. Our tests showed MapReduce jobs completing faster in 12Gb/s mode, with overall throughput increasing by 25%.
Many of you may have heard of a book written by Robert Fulghum 25 years ago called “All I Really Need to Know I Learned in Kindergarten.” In it he offers such pearls of wisdom as “Play fair,” “Clean up your own mess,” “Don’t take things that aren’t yours” and “Flush.” By now you’re wondering what any of this has to do with storage technology. Well, the #1 item on the kindergarten knowledge list is “Share Everything.” And from my perspective that includes DAS (direct-attached storage).
Sharable DAS has been a primary topic of discussion at this year’s annual LSI Accelerating Innovation Summit (AIS). During one keynote session I proposed a continuum of data sharing, spanning from traditional server-based DAS to traditional external NAS and SAN with multiple points in between – including external DAS, simple pooled storage, advanced pooled storage, shared storage and HA (high-availability) shared storage. Each step along the continuum adds incremental features and value, giving datacenter architects the latitude to choose – and pay for – only the level of sharing absolutely required, and no more. This level of choice is being very warmly received by the market as storage requirements vary widely among Web-cloud, private cloud, traditional enterprise, and SMB configurations and applications.
Sharable DAS pools storage for operational benefits and efficiencies
Sharable DAS, with its inherent storage resource pooling, offers a number of operational benefits and efficiencies when applied at the rack level.
LSI rolls out proof-of-concept Rack Scale architecture using sharable DAS
In addition to just talking about sharable DAS at AIS, we also rolled out a proof-of-concept Rack Scale architecture employing sharable DAS. In it we configured 20 servers with 12Gb/s SAS RAID controllers, a prototype 40-port 12Gb/s SAS switch (that’s 160 12Gb/s SAS lanes) and 10 JBODs with 12Gb/s SAS for a total of 200 disk drives – all in a single rack. The drives were configured as a single storage resource pool with our media sharing (ability to spread volumes across multiple disk drives and aggregate disk drive bandwidth) and distributed RAID (ability to disperse data protection across multiple disk drives) features. This configuration pools the server storage into a single resource, delivering substantial, tangible performance and availability improvements when compared to 20 stand-alone servers.
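As a rough way to see why pooling the spindles helps, here is a toy comparison (my own simplification, with an assumed per-drive throughput) of the bandwidth one busy server can reach with dedicated drives versus the shared pool:

```python
# Toy comparison of dedicated vs. pooled DAS bandwidth (per-drive throughput is assumed).
SERVERS = 20
DRIVES = 200
MB_PER_SEC_PER_DRIVE = 150          # assumed sequential throughput of one HDD

# Stand-alone servers: each owns a fixed slice of the spindles.
drives_per_server = DRIVES // SERVERS                       # 10 drives each
dedicated_peak = drives_per_server * MB_PER_SEC_PER_DRIVE   # 1,500 MB/s ceiling per server

# Pooled with media sharing: volumes are spread across many more spindles, so a
# bursting server can draw on far more aggregate bandwidth, up to its SAS link limit.
pooled_peak = DRIVES * MB_PER_SEC_PER_DRIVE                 # 30,000 MB/s pool-wide

print(f"Dedicated per-server ceiling: {dedicated_peak:,} MB/s")
print(f"Pool-wide bandwidth a busy server can tap (before link limits): {pooled_peak:,} MB/s")
```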
I’m sure you’ll agree with me that Rack Scale architecture with sharable DAS is clearly a major step forward in providing a wide range of storage solutions under a single architecture. This in turn provides a multitude of operational efficiencies and performance benefits, giving datacenter architects wide latitude to employ what is needed – and only what is needed.
Now that we’ve tackled the #1 item on the kindergarten learning list, maybe I’ll set my sights on another item, like “Take a nap every afternoon.”
You might be surprised to find out how big the infrastructure for cloud and Web 2.0 is. It is mind-blowing. Microsoft has acknowledged packing more than 1 million servers into its datacenters, and by some accounts that is fewer than Google’s massive server count but a bit more than Amazon’s.
Facebook’s server count is said to have skyrocketed from 30,000 in 2012 to 180,000 just this past August, serving 900 million plus users. And the social media giant is even putting its considerable weight behind the Open Compute effort to make servers fit better in a rack and draw less power. The list of mega infrastructures also includes Tencent, Baidu and Alibaba and the roster goes on and on.
Even more jaw-dropping is that almost 99.9% of these hyperscale infrastructures are built with servers featuring direct-attached storage. That’s right – they do the computing and store the data. In other words, no special, dedicated storage gear. Yes, your Facebook photos, your SkyDrive personal cloud and all the content you use for entertainment, on-demand video and gaming data are stored inside the server.
Direct-attached storage reigns supreme
Everything in these infrastructures – compute and storage – is built out of x86-based servers with storage inside. What’s more, growth of direct-attached storage is many times greater than that of any other storage deployment in IT. Rising deployments of cloud, or cloud-like, architectures are behind much of this expansion.
The prevalence of direct-attached storage is not unique to hyperscale deployments. Large IT organizations are looking to reap the rewards of creating similar on-premise infrastructures. The benefits are impressive: build one kind of infrastructure (server racks), host anything you want (any of your properties), and scale easily when you need to. TCO is much lower than that of infrastructures relying on network storage or SANs.
With direct-attached you no longer need dedicated appliances for your database tier, your email tier, your analytics tier, your EDA tier. All of that can be hosted on scalable, share-nothing infrastructure. And just as with hyperscale, the storage is all in the server. No SAN storage required.
Open Compute, OpenStack and software-defined storage drive DAS growth
Open Compute is part of the picture. A recent Open Compute show I attended was mostly sponsored by hyperscale customers and suppliers, and many big-bank IT folks attended. But Open Compute isn’t the only initiative driving growing deployments of direct-attached storage; so are software-defined storage and OpenStack. Big application vendors such as Oracle, Microsoft, VMware and SAP are also on board, providing solutions that support server-based storage/compute platforms that are easy and cost-effective to deploy, maintain and scale, and that need no external storage (or SAN, including all-flash arrays).
So if you are a network-storage or SAN manufacturer, you have to be doing some serious thinking (many have already) about how you’re going to catch and ride this huge wave of growth.
Tags: Alibaba, Amazon, Baidu, cloud computing, DAS, direct attached storage, enterprise, enterprise IT, Google, hyperscale, Microsoft, Open Compute, OpenStack, Oracle, SAP, Tencent, VMware
Optimizing the work per dollar spent is a high priority in datacenters around the world. But there aren’t many ways to accomplish that. I’d argue that integrating flash into the storage system drives the best – sometimes most profound – improvement in the cost of getting work done.
Yeah, I know work/$ is a US-centric metric, but replace the $ with your favorite currency. The principle remains the same.
I had the chance to talk with one of the execs who’s responsible for Google’s infrastructure last week. He talked about how his fundamental job was improving performance/$. I asked about that, and he explained “performance” as how much work an application could get done. I asked if work/$ at the application was the same, and he agreed – yes – pretty much.
Remember as a kid when you brought along a big brother as authoritative backup? OK – so my big brother Google and I agree – you should be trying to optimize your work/$. Why? Well – it could be to spend less, or to do more with the same spend, or to do things you could never do before, or simply to cope with the non-linear expansion in IT demands even as budgets are shrinking. Hey – that’s the definition of improving work/$… (And as a bonus, if you do it right, you’ll have a positive green impact that is bound to be worth brownie points.)
Here’s the point. Processors are no longer scaling the same – sure, there are more threads, but not all applications can use all those threads. Systems are becoming harder to balance for efficiency. And often storage is the bottleneck. Especially for any application built on a database. So sure – you can get 5% or 10% gain, or even in the extreme 100% gain in application work done by a server if you’re willing to pay enough and upgrade all aspects of the server: processors, memory, network… But it’s almost impossible to increase the work of a server or application by 200%, 300% or 400% – for any money.
I’m going to explain how and why you can do that, and what you get back in work/$. So much back that you’ll probably be spending less and getting more done. And I’m going to explain how even for the risk-averse, you can avoid risk and get the improvements.
More work/$ from general-purpose DAS servers and large databases
Let me start with a customer. It’s a bank, and it likes databases. A lot. And it likes large databases even more. So much so that it needs disks to hold the entire database. Using an early version of an LSI Nytro™ MegaRAID® card, it got 6x the work from the same individual node and database license. You can read that as 600% if you want. It’s big. To be fair – that early version had much more flash than our current products, and was much more expensive. Our current products give much closer to 3x-4x improvement. Again, you can think of that as 300%-400%. Again, slap a Nytro MegaRAID into your server and it’s going to do the work of 3 to 4 servers. I just did a web search and, depending on configuration, Nytro MegaRAIDs are $1,800 to $2,800 online. I don’t know about you, but I would have a hard time buying 2 to 3 configured servers + software licenses for that little, but that’s the net effect of this solution. It’s not about faster (although you get that). It’s about getting more work/$.
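To put rough numbers on that, here is a quick work-per-dollar comparison using the card prices quoted above; the server and database license figures are placeholders I made up for illustration:

```python
# Back-of-the-envelope work/$ comparison. Card price is from the post; server and
# license costs are assumed placeholders, not quotes.
card_cost = 2_800                 # high end of the quoted Nytro MegaRAID street price
work_multiplier = 3               # conservative end of the 3x-4x claim

server_cost = 8_000               # assumed: one configured database server
db_license_cost = 20_000          # assumed: per-server database license

# Option A: add a card to the existing server -> ~3x the work of a stock server.
option_a_cost = card_cost

# Option B: buy extra servers plus licenses to reach the same ~3x total work.
option_b_cost = (work_multiplier - 1) * (server_cost + db_license_cost)

print(f"Card route:   ~{work_multiplier}x work for ${option_a_cost:,}")
print(f"Server route: ~{work_multiplier}x work for ${option_b_cost:,}")
print(f"Rough work/$ advantage of the card route: {option_b_cost / option_a_cost:.0f}x")
```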
But you also want to feel safe – that you’re absolutely minimizing risk. OK. Nytro MegaRAID is a MegaRAID card. That’s overwhelmingly the most common RAID controller in the world; it’s used by 9 of the top 10 OEMs and protects tens to hundreds of millions of disks every day. The Nytro version adds private flash caching on the card and stores hot reads and writes there. Writes to the cache use a RAID 1 pair, so if a flash module dies, you’re protected. If flash blocks or chip die wear out, the bad blocks are removed from the cache pool and the cache shrinks by that much, but everything keeps operating – it’s not like a normal LUN that can’t change size. What’s more, flash blocks usually finally wear out during the erase cycle – so no data is lost. And as a bonus, you can eliminate the traditional battery most RAID cards use – the embedded flash covers that – so no more annual battery service needed. This is a solution that will continue to improve work/$ for years and years, all the while getting 3x-4x the work from that server.
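The wear-out behavior described above (the cache shrinks instead of failing) can be sketched like this, as a toy model of the idea rather than the actual card firmware:

```python
# Toy model of a flash cache pool that retires worn blocks and keeps serving I/O.
class ShrinkingCachePool:
    def __init__(self, total_blocks):
        self.good_blocks = set(range(total_blocks))   # blocks still usable for caching

    def retire(self, block):
        """Drop a worn-out block from the pool; capacity shrinks, service continues."""
        self.good_blocks.discard(block)

    @property
    def capacity(self):
        return len(self.good_blocks)

pool = ShrinkingCachePool(total_blocks=1024)
for worn in (17, 256, 900):          # pretend these blocks hit their erase limit
    pool.retire(worn)
print(f"Cache still online with {pool.capacity} of 1024 blocks")
```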
More work/$ from SAN-attached servers (without actually touching the SAN)
That example was great – but you don’t use DAS systems. Instead, you use a big iron SAN. (OK, not all SANs are big iron, but I like the sound of that expression.) There are a few ways to improve the work from servers attached to SANs. The easiest of course is to upgrade the SAN head, usually with a flash-based cache in the SAN controller. This works, and sometimes is “good enough” to cover needs for a year or two. However, the server still needs to reach across the SAN to access data, and it’s still forced to interact with other servers’ IO streams in deeper queues. That puts a hard limit on the possible gains.
Nytro XD caches hot data in the server. It works with virtual machines. It intercepts storage traffic at the block layer – the same place LSI’s drivers have always been. If the data isn’t hot, and isn’t cached, it simply passes the traffic through to the SAN. I say this so you understand – it doesn’t actually touch the SAN. No risk there. More importantly, the hot storage traffic never has to be squeezed through the SAN fabric, and it doesn’t get queued in the SAN head. In other words, it makes the storage really, really fast.
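Here is a minimal sketch of that pass-through idea at the block layer (my own illustration of the concept, not LSI driver code; real caching drivers also track heat, write-back state and so on):

```python
# Conceptual block-layer read cache that passes cold I/O straight through to the SAN.
class FakeSAN:
    def read(self, lba):
        return f"data@{lba}"          # stand-in for a real SAN round trip

class BlockCache:
    def __init__(self, backing, flash_capacity_blocks):
        self.backing = backing        # the SAN LUN behind the cache
        self.flash = {}               # LBA -> data held in local flash
        self.capacity = flash_capacity_blocks

    def read(self, lba):
        if lba in self.flash:                    # hot block: served locally, no SAN hop
            return self.flash[lba]
        data = self.backing.read(lba)            # cold block: pass through to the SAN
        if len(self.flash) < self.capacity:      # keep a copy if there is room
            self.flash[lba] = data
        return data

cache = BlockCache(FakeSAN(), flash_capacity_blocks=1024)
cache.read(7)          # miss: fetched from the "SAN"
print(cache.read(7))   # hit: served from local flash
```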
We’ve typically found work from a server can increase 5x to 10x, and that’s been verified by independent reviewers. What’s more, the Nytro XD solution only costs around 4x the price of a high-end SAN NIC. It’s not cheap, but it’s way cheaper than upgrading your SAN arrays, it’s way cheaper than buying more servers, and it’s proven to enable you to get far more work from your existing infrastructure. When you need to get more work – way more work – from your SAN, this is a really cost-effective approach. Seriously – how else would you get 5x-10x more work from your existing servers and software licenses?
More work/$ from databases
A lot of hyperscale datacenters are built around databases of a finite size. That may be 1, 2 or even 4 TBytes. If you use Apple’s online services for iTunes or iCloud, or if you use Facebook, you’re using this kind of infrastructure.
If your datacenter has a database that can fit within a few TBytes (or less), you can use the same approach. Move the entire LUN into a Nytro WarpDrive® card, and you will get 10x the work from your server and database software. It makes such a difference that some architects argue Facebook and Apple cloud services would never have been possible without this type of solution. I don’t know, but they’re probably right. You can buy a Nytro WarpDrive for as little as a low-end server. I mean low end. But it will give you the work of 10. If you have a fixed-size database, you owe it to yourself to look into this one.
More work/$ from virtualized and VDI (Virtual Desktop) systems
Virtual machines are installed on a lot of servers, for very good reason. They help improve the work/$ in the datacenter by reducing the number of servers needed and thereby reducing management, maintenance and power costs. But what if they could be made even more efficient?
Wall Street banks have benchmarked virtual desktops. They found that Nytro products drive these results: support of 2x the virtual desktops, 33% improvement in boot time during boot storms, and 33% lower cost per virtual desktop. In a more general application mix, Nytro increases work per server 2x-4x. And it also gives 2x performance for virtual storage appliances.
While that’s not as great as 10x the work, it’s still a real work/$ value that’s hard to ignore. And it’s the same reliable MegaRAID infrastructure that’s the backbone of enterprise DAS storage.
A real example from our own datacenter
Finally – a great example of getting far more work/$ was an experiment our CIO Bruce Decock did. We use a lot of servers to fuel our chip-design business. We tape out a lot of very big leading-edge process chips every year. Hundreds. And that takes an unbelievable amount of processing to get what we call “design closure” – that is, a workable chip that will meet performance requirements and yield. We use a tool called PrimeTime that figures out timing for every signal on the chip across different silicon process points and operating conditions. There are tens to hundreds of millions of signals. And we run every active design – tens to hundreds of chips – each night so we can see how close we’re getting, and we make multiple runs per chip. That’s a lot of computation… The thing is, electronic CAD tools are designed to avoid using storage (or they would never finish) beyond /tmp space, but they do use huge amounts of memory for their data structures, and that means swap space on the order of terabytes. These CAD tools usually don’t need to run faster. They run overnight and results are ready when the engineers come in the next day. These are impressive machines: 384G or 768G of DRAM and 32 threads. How do you improve work/$ in that situation? What did Bruce do?
He put LSI Nytro WarpDrives in the servers and pointed /tmp at the WarpDrives. Yep. Pretty complex. I don’t think he even had to install new drivers. The drivers are already in the latest OS distributions. Anyway – like I said – complex.
The result? WarpDrive allowed the machines to fully use the CPU and memory with no I/O contention. With WarpDrive, the PrimeTime jobs for static timing closure of a typical design could be done on 15 vs. 40 machines. That’s each Nytro node doing 260% of the work vs. a normal node and license. Remember – those are expensive machines (have you priced 768G of DRAM and do you know how much specialized electronic design CAD licenses are?) So the point wasn’t to execute faster. That’s not necessary. The point is to use fewer servers to do the work. In this case we could do 11 runs per server per night instead of just 4. A single chip design needs more than 150 runs in one night.
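The server-count math falls out directly from those run rates (the only assumption here is that runs pack evenly onto machines):

```python
import math

# Nightly static-timing runs needed for one large chip design (figures from the text).
runs_needed = 150
runs_per_node_stock = 4      # /tmp and swap on ordinary storage
runs_per_node_nytro = 11     # same machine with /tmp pointed at a Nytro WarpDrive

stock_nodes = math.ceil(runs_needed / runs_per_node_stock)   # ~38, in line with "40 machines"
nytro_nodes = math.ceil(runs_needed / runs_per_node_nytro)   # ~14, in line with "15 machines"

print(f"Stock servers needed:     {stock_nodes}")
print(f"WarpDrive servers needed: {nytro_nodes}")
```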
To be clear, the Nytro WarpDrives are a lot less expensive than the servers they displace. And the savings go beyond that – less power and cooling. Lower maintenance. Less admin time and overhead. Fewer licenses. That’s definitely improved work/$ for years to come. Those Nytro cards are part of our standard flow, and they should probably be part of every chip company’s design flow.
So you can improve work/$ no matter the application, no matter your storage model, and no matter how risk-averse you are.
Optimizing the work per dollar spent is a high – maybe the highest – priority in datacenters around the world. And just to be clear – Google agrees with me. There aren’t many ways to accomplish that improvement, and almost no ways to dramatically improve it. I’d argue that integrating flash into the storage system is the best – sometimes most profound – improvement in the cost of getting work done. Not so much the performance, but the actual work done for the money spent. And it ripples through the datacenter, from original CapEx, to licenses, maintenance, admin overhead, power and cooling, and floor space for years. That’s a pretty good deal. You should look into it.
For those of you who are interested, I already wrote about flash in these posts:
What are the driving forces behind going diskless?
LSI is green – no foolin’
Tags: Bruce Decock, DAS, datacenter, direct attached storage, enterprise IT, flash, Google, hyperscale datacenter, Nytro MegaRAID, Nytro WarpDrive, Nytro XD, PrimeTime, RAID, SAN, server storage, storage area network, VDI, virtual desktop infrastructure, work per dollar
When I am out on the road in Europe, visiting customers and partners, one common theme that comes up on a daily basis is that high-availability systems are essential to nearly all businesses regardless of size or industry. Sadly, all too often we see what can happen when systems running business-critical applications such as transaction processing, Web servers or electronic commerce are not accessible – potentially lost revenue and lost productivity, leading to dramatically downward-spiralling customer satisfaction.
To reduce this risk, the industry focus has been on achieving the best level of high availability, and for the enterprise market segment this has often meant installing and running storage area network (SAN) solutions. SANs can offer users a complete package – scalability, performance, centralised management and the all-important uptime or high availability.
Drawbacks of SAN
But for all its positives, the SAN also has its downsides. To ensure continuous application availability, server clustering and shared-node connections that build redundancy into a cluster and eliminate single points of failure are crucial. The solution is not only extremely complex, it can have a hefty price tag, amounting to tens of thousands of dollars, and can be hard for many smaller to medium-sized businesses to afford.
When considering budgets and storage needs, many businesses have shied away from investing in a SAN and opted for a far simpler direct attached storage (DAS) solution – mainly because it can be far easier to implement and considerably cheaper. Historically, however, the biggest problem with this was that DAS could not offer high availability, and recovery from a server or storage failure could take several hours or even days.
Combining the simplicity of DAS with the high availability of SAN storage
As businesses work to reduce storage costs, simplify deployment, and increase agility and uptime in the face of massive data growth, storage architects are often looking for a way to combine the best of both worlds: the simplicity of DAS storage and the high availability of SAN storage. The goal for many is to create a system that is not only cheaper than a regular SAN but also offers full redundancy, less management complexity and guarantees uptime for the business in case a server goes down.
LSI has pioneered an HA-DAS solution, Syncro™ CS, that costs approximately 30% less than traditional HA entry-level SAN solutions, depending on the solution/configuration. It reduces complexity by providing fully redundant, shared-node storage and application failover, without requiring storage networking hardware. Syncro CS solutions are also designed to reduce latency compared to SAN-based solutions, helping to accelerate storage I/O performance and speed applications.
The good news for businesses that rely on DAS is that they now have an option, Syncro CS, to more easily upgrade their DAS infrastructure to help achieve high availability, with easier management and lower cost. The result is a much simpler failover solution that provides more affordable business continuity and reduces downtime.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the meat of the topic and that’s this: Ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost and performance have all been concerns to grapple with. Of course the flash itself is changing too as the process nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single-level cell (SLC) to multi-level cell (MLC) to triple-level cell (TLC), and all the variants of these “trimmed” for specific use cases. The spec “endurance” has gone from 1 million program/erase (P/E) cycles to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity has increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the “requirement” for state-of-the-art single-operation write latency has fallen well below the write latency of the flash itself. (What the ?? Yeah – we’ll talk about that in some other blog. But flash is ~1,500µs write latency, where state-of-the-art flash cards are ~50µs.) When I describe the state of the technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
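For context on how a card’s write latency can sit below the program time of the flash itself: a common approach is to acknowledge the write once it lands in a small power-protected buffer and program the NAND in the background. Here is a toy timeline of that idea, with made-up buffer and controller timings (only the ~1,500µs and ~50µs figures come from the text):

```python
# Toy latency model: acknowledge writes from a power-protected buffer, program NAND later.
NAND_PROGRAM_US = 1_500        # flash program time (from the text)
BUFFER_COMMIT_US = 40          # assumed: landing data in a capacitor-backed buffer
CONTROLLER_OVERHEAD_US = 10    # assumed: driver/controller path cost

host_visible_latency = BUFFER_COMMIT_US + CONTROLLER_OVERHEAD_US    # ~50us, what the host sees
nand_complete_at = host_visible_latency + NAND_PROGRAM_US           # when bits actually reach NAND

print(f"Host-visible write latency: ~{host_visible_latency}us (data already power-safe)")
print(f"NAND program finishes ~{nand_complete_at}us after submission, in the background")
```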
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else out there in many ways. We’ve learned a lot in the years since – both from making them and from dealing with customers and users – about our own solutions and our competitors’. We’re lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but about what we could have and what we could do. They’re generous enough to share what works and what doesn’t, what the values of solutions are and what the pitfalls are too. Honestly? It’s the hyperscale datacenters in the lead, both practically and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast, and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPS limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is this: PCIe flash cards are awesome. They radically change the performance profiles of most applications, especially databases, allowing servers to run efficiently and the actual work done by a server to multiply 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.
If you think these through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals follow from the frustrations above.
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure – and more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN-type flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low-bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front-bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVMe SSDs you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard power limits of 9W or 25W per device. And to protect the storage in these devices, you need enough of them to RAID or otherwise protect the data. Once you have enough of them for protection, they give you awesome capacity, IOPS and bandwidth – too much, in fact – but that’s not what applications need. They need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash) and maintain low-latency, bottleneck-free physical links to each server… To me that implies an external, rack-level pool of flash with dedicated, low-latency connections back to each server.
That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
Tags: DAS, datacenter, direct attached storage, enterprise IT, flash, hard disk drive, HDD, hyperscale, latency, NAS, network attached storage, NVMe, PCIe, SAN, solid state drive, SSD, storage area network
Big data and Hadoop are all about exploiting new value and opportunities with data. In financial trading, business and some areas of science, it’s all about being fastest or first to take advantage of the data. The bigger the data sets, the smarter the analytics. The next competitive edge with big data comes when you layer in flash acceleration. The challenge is scaling performance in Hadoop clusters.
The most cost-effective option emerging for breaking through disk I/O bottlenecks to scale performance is to use high-performance read/write flash cache acceleration cards. This is essentially a way to get more work for less cost by bringing data closer to the processing. The LSI® Nytro™ product has been shown during testing to improve the time it takes to complete Hadoop software framework jobs by up to 33%.
Flash cache cards increase Hadoop application performance
Combining flash cache acceleration cards with Hadoop software is a big opportunity for end users and suppliers. LSI estimates that less than 10% of Hadoop software installations today incorporate flash acceleration [1]. This will grow rapidly as companies see the increased productivity and ROI of flash to accelerate their systems. And Hadoop software adoption is also growing fast. IDC predicts a CAGR of as much as 60% by 2016 [2]. Drivers include IT security, e-commerce, fraud detection and mobile data user management. Gartner predicts that Hadoop software will be in two-thirds of advanced analytics products by 2015 [3]. Many thousands of Hadoop software clusters are already deployed.
Where flash makes the most immediate sense is with those who have smaller clusters doing lots of in-place batch processing. Hadoop is purpose-built for analyzing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Hadoop enables scaling that allows an unprecedented volume of data to be analyzed quickly and cost-effectively on clusters of commodity servers. Speed gains are about data proximity. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express® (PCIe) bus.
Combining the best of flash and HDDs to drive higher performance and storage capacity
PCIe flash cache cards are now available with multiple terabytes of NAND flash storage, which substantially increases the hit rate. We offer a solution with both onboard flash modules and Serial-Attached SCSI (SAS) interfaces to enable high-performance direct-attached storage (DAS) configurations consisting of solid state and hard disk drive storage. This couples the low-latency performance benefits of flash with the capacity and cost-per-gigabyte advantages of HDDs.
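To see why the mix is attractive, here is a simple blended cost-per-gigabyte calculation; the capacities and prices are assumptions for illustration, not LSI pricing:

```python
# Blended $/GB for a flash-plus-HDD DAS tier (capacities and prices are assumed).
flash_gb, flash_cost_per_gb = 2_000, 1.50        # assumed: 2TB of onboard NAND modules
hdd_gb, hdd_cost_per_gb = 48_000, 0.05           # assumed: 12 x 4TB SAS hard drives

total_gb = flash_gb + hdd_gb
total_cost = flash_gb * flash_cost_per_gb + hdd_gb * hdd_cost_per_gb
blended_cost_per_gb = total_cost / total_gb

print(f"All-flash $/GB:      {flash_cost_per_gb:.2f}")
print(f"All-HDD $/GB:        {hdd_cost_per_gb:.2f}")
print(f"Blended $/GB:        {blended_cost_per_gb:.3f} across {total_gb // 1000}TB,")
print(f"with {flash_gb / total_gb:.0%} of the capacity in flash for the hot working set")
```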
To keep the processor close to the data, Hadoop uses servers with DAS. And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM). An additional benefit: smart implementation of Hadoop and flash components can reduce the overall server footprint and simplify scaling, with some solutions enabling up to 128 devices to share a very high-bandwidth interface. Most commodity servers provide 8 or fewer SATA ports for disks, limiting expandability.
Hadoop is great, but flash-accelerated Hadoop is best. It’s an effective way, as you work to extract full value from big data, to secure a competitive edge.
I remember in the mid-1990s the question of how many minutes away from a diversion airport a two-engine passenger jet should be allowed to fly in the event of an engine failure. Staying in the air long enough is one of those high-availability functions that really matters. In the case of the Boeing 777, it was the first aircraft to enter service with a 180-minute extended operations certification (ETOPS) [1]. This meant that longer over-water and remote terrain routes were immediately possible.
The question was “can a two-engine passenger aircraft be as safe as a four-engine aircraft for long-haul flights?” The short answer is yes. Reducing the points of failure from four engines to two, while meeting strict maintenance requirements and maintaining redundant systems, reduces the probability of a failure. The 777 and many other aircraft have proven to be safe for these longer flights. Recently, the 777 received FAA approval for a 330-minute ETOPS rating [2], which allows airlines to offer routes that are longer, straighter and more economical.
What does this have to do with a datacenter? It turns out that some hyperscale datacenters house hundreds of thousands of servers, each with its own boot drive. Each of these boot drives is a potential point of failure, which can drive up acquisition and operating costs and the odds of a breakdown. Datacenter managers need to control CapEx, so for the sheer volume of server boot drives they commonly use the lowest cost 2.5-inch notebook SATA hard drives. The problem is that these commodity hard drives tend to fail more often. This is not a huge issue with only a few servers. But in a datacenter with 200,000 servers, LSI has found through internal research that, on average, 40 to 200 drives fail per week! (2.5″ hard drive, ~2.5 to 4-year lifespan, which equates to a conservative 5% failure rate/year).
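The weekly failure figure follows directly from the fleet size and the annual failure rate (a quick check of the arithmetic in the post):

```python
# Boot-drive failure arithmetic for a large fleet (figures from the post).
servers = 200_000              # one boot drive per server
annual_failure_rate = 0.05     # conservative 5% per year for commodity 2.5" SATA drives

failures_per_year = servers * annual_failure_rate
failures_per_week = failures_per_year / 52

print(f"Expected failures: {failures_per_year:,.0f} per year, ~{failures_per_week:.0f} per week")
# ~192 per week at 5%/year; the post's 40-200/week range also covers lower observed rates.
```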
Traditionally, a hyperscale datacenter has a sea of racks filled with servers. LSI approximates that, in the majority of large datacenters, at least 60% of the servers (Web servers, database servers, etc.) use a boot drive requiring no more than 40GB of storage capacity since it performs only boot-up and journaling or logging. For higher reliability, the key is to consolidate these low-capacity drives, virtually speaking. With our Syncro™ MX-B Rack Boot Appliance, we can consolidate the boot drives for 24 or 48 of these servers into a single mirrored array (using LSI MegaRAID technology), which makes 40GB of virtual disk space available to each server.
Combining all these boot drives with fewer larger drives that are mirrored helps reduce total cost of ownership (TCO) and improves reliability, availability and serviceability. If a rack boot appliance drive fails, an alert is sent to the IT operator. The operator then simply replaces the failed drive, and the appliance automatically copies the disk image from the working drive. The upshot is that operations are simplified, OpEx is reduced, and there is usually no downtime.
Syncro MX-B not only improves reliability by reducing failure points; it also significantly reduces power requirements (up to 40% less in the 24-port version, up to 60% less in the 48-port version) – a good thing for the corporate utility bill and climate change. This, in turn, reduces cooling requirements, and helps make hardware upgrades less costly. With the boot drives disaggregated from the servers, there’s no need to simultaneously upgrade the drives, which typically are still functional during server hardware upgrades.
In the case of both commercial aircraft and servers, less really can be more (or at least better) in some situations. Eliminating excess can make the whole system simpler and more efficient.
To learn more, please visit the LSI® Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx
One of the big challenges that I see so many IT managers struggling with is how they are supposed to deal with the almost exponential growth of data that has to be stored, accessed and protected when IT budgets are flat or growing at rates far below the nonstop increases in storage volumes.
I’ve found that it doesn’t seem to matter if it is a departmental or small business datacenter, or a hyperscale datacenter with many thousands of servers. The data growth continues to outpace the budgets.
At LSI we call this disparity between IT budget growth and data growth the “data deluge gap.”
Of course, smaller datacenters have different issues than the hyperscale datacenters. However, no matter the datacenter size, concerns generally center on TCO. This, of course, includes both CapEx and OpEx for the storage systems.
It’s a good feeling to know that we are tackling these datacenter growth and operations issues head-on for many different environments – large and small.
LSI has developed and is starting to provide a new shared DAS (sDAS) architecture that supports the sharing of storage across multiple servers. We call it the LSI® Syncro™ architecture, and it really is the next step in the evolution of DAS. Our Syncro solutions deliver increased uptime, help to reduce overall costs, increase agility, and are designed for ease of deployment. The fact that the Syncro architecture is built on our proven MegaRAID® technology means that our customers can trust that it will work in all types of environments.
The Syncro architecture is a very exciting new capability that addresses storage and data protection needs for numerous datacenter environments. Our first product, Syncro MX-B, is targeted at hyperscale datacenter environments, including Web 2.0 and cloud. I will be blogging about that offering in the near future. We will soon be announcing details on our Syncro CS product line, previously known as High Availability DAS, for small and medium businesses, and I will blog about what it can mean for our customers and users.
Both of these initial versions of the Syncro architecture are very exciting, and I really like to watch how datacenter managers react when they find out about these game-changing capabilities.
We say that “with the LSI Syncro architecture you take DAS out of the box and make it sharable and scalable. The LSI Syncro architecture helps make your storage Simple. Smart. On.” Our tag line for Syncro is “The Smarter Way to ON.™” It really is.
To learn more, please visit the LSI Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx