Last week at LSI’s annual Accelerating Innovation Summit (AIS), the company took the wraps off a vision that should lead its technical direction for the next few years.
LSI CEO Abhi Talwalkar shared a video of three situations as they might evolve in the future:
I’ll focus on just one of these to show how LSI expects the future to develop. In the bicycle accident scenario, a businessman falls to the ground while riding a bicycle in a foreign country. Security cameras that have been upgraded to understand what they see notify an emergency services agency, which sends an ambulance to the scene. The paramedic performs a retinal scan on the victim, using it to retrieve his medical records, including his DNA sequence, from the web.
The businessman’s wearable body monitoring system also communicates with the paramedic’s instruments to share his vital signs. All of this information is used by cloud-based computers to determine a course of action which, in the video, requires an injection that has been custom-tuned to the victim’s current situation, his medical history, and his genetic makeup.
That’s a pretty tall order, and it will require several advances in the state of the art, but LSI is using this and other scenarios to work with its clients and translate this vision into the products of the future.
What are the key requirements to make this happen? Talwalkar told the audience that we need to create a society that is supported by preventive, predictive and assisted analytics to move in a direction where the general welfare is assisted by all that the Internet and advanced computing have to offer. Since data is growing at an exponential rate, he argued that this will require the instant retrieval of interlinked data objects at scale. Everything that is key to solving the task must be immediately available, and must be quickly analyzed to provide a solution to the problem at hand. The key will be the ability to process interlinked pieces of data that have not been previously structured to handle any particular situation.
To achieve this we will need larger-scale computing resources than are currently available, all closely interconnected and all operating at very high speeds. LSI hopes to tap into these needs through its strengths in networking and communications chips for the communications, its HDD and server and storage connectivity array chips and boards for large-scale data, and its flash memory controller and PCIe SSD expertise for high performance.
LSI brought to AIS several of the customers and partners it is working with to develop these technologies. Speakers from Intel, Microsoft, IBM, Toshiba, Ericsson and others showed how they are working with LSI’s various technologies to improve the performance of their own systems. On the exhibition floor, booths from LSI and many of its clients demonstrated new technologies that performed everything from high-speed stock market analysis to fast flash management.
It’s pretty exciting to see a company that has a clear vision of its future and is committed to moving its entire ecosystem ahead to make that happen and help companies manage their business more effectively during what LSI calls the “Datacentric Era.” LSI has certainly put a lot of effort into creating a vision and determining where its talents can be brought to bear to improve our lives in the future.
Tags: Abhi Talwalkar, AIS, chips, communications, connectivity, data, Datacentric Era, Ericsson, flash, flash memory, hard disk drive, HDD, IBM, Intel, large-scale data, Microsoft, Networking, server, Storage, Toshiba
The lifeblood of any online retailer is the speed of its IT infrastructure. Shoppers aren’t infinitely patient. Sluggish infrastructure performance can make shoppers wait precious seconds longer than they can stand, sending them fleeing to other sites for a faster purchase. Our federal government’s halting rollout of the Health Insurance Marketplace website is a glaring example of what can happen when IT infrastructure isn’t solid. A few bad user experiences that go viral can be damaging enough. Tens of thousands can be crippling.
In hyperscale datacenters, any number of problems including network issues, insufficient scaling and inconsistent management can undermine end users’ experience. But one that hits home for me is the impact of slow storage on the performance of databases, where the data sits. With the database at the heart of all those online transactions, retailers can ill afford to have their tier of database servers operating at anything less than peak performance.
Slow storage undermines database performance
Typically, Web 2.0 and e-commerce companies run relational databases (RDBs) on these massive server-centric infrastructures. (Take a look at my blog last week to get a feel for the size of these hyperscale datacenter infrastructures.) If you are running that many servers to support millions of users, you are likely using some kind of open-source RDB such as MySQL or other variations. Keep in mind that Oracle 11gR2 likely retails around $30K per core but MySQL is free. But the performance of both, and most other relational databases, suffers immensely when transactions are retrieving data from storage (or disk). You can only throw so much RAM and CPU power at the performance problem … sooner rather than later you have to deal with slow storage.
Almost everyone in industry – Web 2.0, cloud, hyperscale and other providers of massive database infrastructures – is lining up to solve this problem the best way they can. How? By deploying flash as the sole storage for database servers and applications. But is low-latency flash enough? For sheer performance it beats rotational disk hands down. But … even flash storage has its limitations, most notably when you are trying to drive ultra-low latencies for write IOs. Most IO accesses by RDBs, which do the transactional processing, are a mix of reads and writes to the storage. Specifically, the mix is 70%/30% reads/writes. These are also typically low q-depth accesses (less than 4). It is those writes that can really slow things down.
PCIe flash reduces write latencies
The good news is that the right PCIe flash technology in the mix can solve the slowdowns. Some interesting PCIe flash technologies designed to tackle this latency problem are on display at AIS this week. DRAM and in particular NVDRAM are being deployed as a tier in front of flash to really tackle those nasty write latencies.
Among other demos, we’re showing how a Nytro™ 6000 series PCIe flash card helps solve the MySQL database performance issues. The typical response time for a small data read (this is what the database will see for a database IO) from an HDD is 5ms. Flash-based devices such as the Nytro WarpDrive® card can complete the same read in less than 50μs on average during testing, an improvement of at least two orders of magnitude in response time. This response time translates to getting many more transactions out of the same infrastructure – but with less space (flash is denser) and a lot less power (flash consumes a lot less power than HDDs).
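To make the size of that gap concrete, here is a minimal sketch that works only from the two latencies quoted above (5ms for the HDD, 50μs as the upper bound for the flash card). The "reads per second" figure is a simplifying assumption: it treats the database as issuing one read at a time (queue depth 1), which is not how a real system is measured.

```python
import math

HDD_READ_LATENCY_S = 5e-3      # 5 ms, from the text
FLASH_READ_LATENCY_S = 50e-6   # 50 microseconds, from the text


def speedup(slow_s: float, fast_s: float) -> float:
    """How many times faster the second device answers a single read."""
    return slow_s / fast_s


def serial_reads_per_second(latency_s: float) -> float:
    """Upper bound on reads per second with one outstanding request (assumption)."""
    return 1.0 / latency_s


ratio = speedup(HDD_READ_LATENCY_S, FLASH_READ_LATENCY_S)
print(f"speedup: about {ratio:.0f}x, or {math.log10(ratio):.0f} orders of magnitude")
print(f"HDD, queue depth 1:   about {serial_reads_per_second(HDD_READ_LATENCY_S):.0f} reads/s")
print(f"flash, queue depth 1: about {serial_reads_per_second(FLASH_READ_LATENCY_S):.0f} reads/s")
```

Since the flash figure is quoted as "less than 50μs", the ratio here is a floor rather than a measurement.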
We’re also showing the Nytro 7000 series PCIe flash cards. They reach even lower write latencies than the 6000 series at very low q-depths. The 7000 series cards also provide DRAM buffering while maintaining data integrity even in the event of a power loss.
For online retailers and other businesses, higher database speeds mean more than just faster transactions. They can help keep those cash registers ringing.
Tags: AIS, database, DRAM, e-commerce, flash, flash memory, hard disk drive, HDD, hyperscale datacenter, latency, MySQL, NVDRAM, Nytro 6000, Nytro 7000, Nytro WarpDrive, Oracle, PCIe flash, relational database, storage latency, web 2.0, write latency
I often think about green, environmental impact, and what we’re doing to the environment. One major reason I became an engineer was to leave the world a little better than when I arrived. I’ve gotten sidetracked a few times, but I’ve tried to help, even if just a little.
The good people in LSI’s EHS (Environment, Health & Safety) asked me a question the other day about carbon footprint, energy impact, and materials use. Which got me thinking … OK – I know most people in LSI don’t really think of ourselves as a “green tech” company. But we are – really. No foolin’. We are having a big impact on the global power consumption and material consumption of the IT industry. And I mean that in a good way.
There are many ways to look at this, both from what we enable datacenters to do, to what we enable integrators to do, all the way to hard core technology improvements and massive changes in what it’s possible to do.
Back in 2008 I got to speak at the AlwaysOn GoingGreen conference. (I was lucky enough to be just after Elon Musk – he’s a lot more famous now with Tesla doing so well.)
http://www.smartplanet.com/video/making-the-case-for-green-it/305467 (at 2:09 in video)
The massive consumption of IT equipment, and all the ancillary metal, plastic, wiring, etc. that goes with it, consumes energy as it’s being shipped and moved halfway around the world, and, more importantly, then gets scrapped out quickly. This has been a concern for me for quite a while. I mean – think about that. As an industry we are generating about 9 million servers a year, about 3 million go into hyperscale datacenters (or hyperscale if you prefer). Many of those are scrapped on a 2, 3 or 4 year cycle – so in steady state, maybe 1 million to 2 million a year are scrapped. Worse – there is amazing use of energy by that many servers (even as they have advanced the state of the art unbelievably since 2008). And frankly, you and I are responsible for using all that power. Did you know thousands of servers are activated every time you make a Google® query from your phone?
I want to take a look at basic silicon improvements we make, the impact of disk architecture improvement, SSDs, system improvements, efficiency improvements, and also where we’re going in the near future with eliminating scrap in hard drives and batteries. In reality, it’s the massive pressure on work/$ that has made us optimize everything – being able to do much more work at a lower cost, when a lot of cost is the energy and material that goes into the products, is what forces our hand. But the result is a real, profound impact on our carbon footprint that we should be proud of.
Sure we have a general silicon roadmap where each node enables reduced power, even as some standards and improvements actually increase individual device power. For example, our transition from 28nm semi process to 14 FinFET can literally cut the power consumption of a chip in half. But that’s small potatoes.
How about Ethernet? It’s everywhere – right? Did you know servers often have 4 Ethernet ports, and that there are a matching 4 ports on a network switch? LSI pioneered something called Energy Efficient Ethernet (EEE). We’re also one of the biggest manufacturers of Ethernet PHYs – the part that drives the cable – and we come standard in everything from personal computers to servers to enterprise switches. The savings are hard to estimate, because they depend very much on how much traffic there is, but you can realistically save Watts per interface link, and there are often 256 links in a rack. 500 Watts per rack is no joke, and in some datacenters it adds up to 1 or 2 MegaWatts.
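For anyone who wants to check the rack math, here is a small sketch using only the figures above (256 links per rack, 500 Watts saved per rack). The per-link saving and the rack counts are derived from those two numbers, not measured values.

```python
LINKS_PER_RACK = 256        # from the text
SAVINGS_PER_RACK_W = 500    # from the text

# Implied saving per Ethernet link, in Watts.
watts_per_link = SAVINGS_PER_RACK_W / LINKS_PER_RACK


def racks_for_savings(total_megawatts: float) -> float:
    """Number of racks needed for the per-rack saving to reach the given total."""
    return total_megawatts * 1e6 / SAVINGS_PER_RACK_W


print(f"implied saving per link: about {watts_per_link:.2f} W")
for mw in (1, 2):
    print(f"{mw} MW of savings corresponds to about {racks_for_savings(mw):.0f} racks")
```

So the 1 to 2 MegaWatt figure corresponds to datacenters with a few thousand racks, at roughly 2 Watts saved per link.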
How about something a little bigger and more specific? Hard disk drives. Did you know a typical hyperscale datacenter has between 1 million and 1.5 million disk drives? Each one of those consumes about 9 Watts, and most have 2 TBytes of capacity. So for easy math, 1 million drives is about 9 MegaWatts (!?) and about 2 Exabytes of capacity (remember – data is often replicated 3 or more times). Data capacities in these facilities need to grow about 50% per year. So if we did nothing, we would need to go from 1 million drives to 1.5 million drives: 9 MegaWatts goes to 13.5 MegaWatts. Wow! Instead – our high linearity, low noise PA and read channel designs are allowing drives to go to 4 TBytes per drive. (Sure the chip itself may use slightly more power, but that’s not the point, what it enables is a profound difference.) So to get that 50% increase in capacity we could actually reduce the number of drives deployed, with a net savings of 6.75 MegaWatts. Consider an average US home, with air conditioning, uses 1 kiloWatt. That’s almost 7,000 homes. In reality – they won’t get deployed that way – but it will still be a huge savings. Instead of buying another 0.5 million drives they would buy 0.25 million drives with a net savings of about 2.2 MegaWatts. That’s still HUGE! (way to go, guys!) How many datacenters are doing that? Dozens. So that’s easily 20 or 30 MegaWatts globally. Did I say we saved them money too? A lot of money.
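The drive math above is easy to lose track of, so here it is as a short sketch. Every input comes from the paragraph itself (9 Watts per drive, 1 million drives, 2 TBytes growing to 4 TBytes, 50% capacity growth). The sketch assumes, as the text does, that a 4 TByte drive draws the same 9 Watts as a 2 TByte drive.

```python
WATTS_PER_DRIVE = 9          # from the text; assumed equal for 2 TB and 4 TB drives
DRIVES_TODAY = 1_000_000     # from the text
TB_PER_DRIVE_OLD = 2
TB_PER_DRIVE_NEW = 4
ANNUAL_GROWTH = 0.5          # capacity must grow about 50% per year


def megawatts(drives: float) -> float:
    """Total drive power in MegaWatts."""
    return drives * WATTS_PER_DRIVE / 1e6


capacity_today_tb = DRIVES_TODAY * TB_PER_DRIVE_OLD             # 2 Exabytes
capacity_needed_tb = capacity_today_tb * (1 + ANNUAL_GROWTH)    # 3 Exabytes

# Case 1: do nothing, keep buying 2 TB drives.
drives_do_nothing = capacity_needed_tb / TB_PER_DRIVE_OLD
# Case 2: hold all of the needed capacity on 4 TB drives.
drives_all_new = capacity_needed_tb / TB_PER_DRIVE_NEW
# Case 3: keep the existing drives and add only the growth.
added_tb = capacity_needed_tb - capacity_today_tb
added_old = added_tb / TB_PER_DRIVE_OLD
added_new = added_tb / TB_PER_DRIVE_NEW

full_refresh_savings_mw = megawatts(drives_do_nothing) - megawatts(drives_all_new)
incremental_savings_mw = megawatts(added_old) - megawatts(added_new)

print(f"do nothing:     {drives_do_nothing:,.0f} drives, {megawatts(drives_do_nothing):.2f} MW")
print(f"all 4 TB:       {drives_all_new:,.0f} drives, {megawatts(drives_all_new):.2f} MW")
print(f"savings if everything moved to 4 TB: {full_refresh_savings_mw:.2f} MW")
print(f"savings if only the growth is 4 TB:  {incremental_savings_mw:.2f} MW")
```

The incremental case works out to 2.25 MegaWatts, which is the "about 2.2" quoted in the text.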
SSDs don’t always get the credit they deserve. Yes, they really are fast, and they are awesome in your laptop, but they also end up being much lower power than hard drives. Our controllers were in about half the flash solutions shipped last year. Think tens of millions. If you just assume they were all laptop SSDs (at least half were not) then that’s another 20 MegaWatts in savings.
Did you know that in a traditional datacenter, about 30% of the power going into the building is used for air conditioning? It doesn’t actually get used on the IT equipment at all, but is used to remove the heat that the IT equipment generates. We design our solutions so they can accommodate 40C ambient inlet air (that’s a little over 100F… hot). What that means is that the 30% of power used for the air conditioners disappears. Gone. That’s not theoretical either. Most of the large social media, search engine, web shopping, and web portal companies are using our solutions this way. That’s a 30% reduction in the power of storage solutions globally. Again, it’s MegaWatts in savings. And mega money savings too.
But let’s really get to the big hitters: improved work per server. Yep – we do that. In fact adding a Nytro™ MegaRAID® solution will almost always give you 4x the work out of a server. It’s a slam dunk if you’re running a database. You heard me – 1 server doing the work that it previously took 4 servers to do. Not only is that a huge savings in dollars (especially if you pay for software licenses!) but it’s a massive savings in power. You can replace 4 servers with 1, saving at least 900 Watts, and that lone server that’s left is actually dissipating less power too, because it’s actively using fewer HDDs, and using flash for most traffic instead. If you go a step further and use Nytro WarpDrive Flash cards in the servers, you can get much more – 6 to 8 times the work. (Yes, sometimes up to 10x, but let’s not get too excited). If you think that’s just theoretical again, check your Facebook® account, or download something from iTunes®. Those two services are the biggest users of PCIe® flash in the world. Why? It works cost effectively. And in case you haven’t noticed those two companies like to make money, not spend it. So again, we’re talking about MegaWatts of savings. Arguably on the order of 150 MegaWatts. Yeah – that’s pretty theoretical, because they couldn’t really do the same work otherwise, but still, if you had to do the work in a traditional way, it would be around that.
It’s hard to be more precise than giving round numbers at these massive scales, but the numbers are definitely in the right zone. I can say with a straight face we save the world 10’s, and maybe even 100’s of MegaWatts per year. But no one sees that, and not many people even think about it. Still – I’d say LSI is a green hero.
Hey – we’re not done by a long shot. Let’s just look at scrap. If you read my earlier post on false disk failure, you’ll see some scary numbers. (http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/) A normal hyperscale datacenter can expect 40-60 disks per day to be mistakenly scrapped out. That’s around 20,000 disk drives a year that should not have been scrapped, from just one web company. Think of the material waste, shipping waste, manufacturing waste, and eWaste issues. Wow – all for nothing. We’re working on solutions to that. And batteries. Ugly, eWaste, recycle only, heavy metal batteries. They are necessary for RAID protected storage systems. And much of the world’s data is protected that way – the battery is needed to save meta-data and transient writes in the event of a power failure, or server failure. We ship millions a year. (Sorry, mother earth). But we’re working diligently to make that a thing of the past. And that will also result in big savings for datacenters in both materials and recycling costs.
Can we do more? Sure. I know I am trying to get us the core technologies that will help reduce power consumption, raise capability and performance, and reduce waste. But we’ll never be done with that march of technology. (Which is a good thing if engineering is your career…)
I still often think about green, environmental impact, and what we’re doing to the environment. And I guess in my own small way, I am leaving the world a little better than when I arrived. And I think we at LSI should at least take a moment and pat ourselves on the back for that. You have to celebrate the small victories, you know? Even as the fight goes on.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the meat of the topic and that’s this: Ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost, performance have all been concerns that have been grappled with. Of course the flash is changing too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single level cell (SLC) to multi level cell (MLC) to triple level cell (TLC) and all the variants of these “trimmed” for specific use cases. The spec “endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the “requirement” for state-of-the-art single operation write latency has fallen well below the write latency of the flash itself. (What the ?? Yeah – I’ll talk about that later in some other blog. But flash has ~1500μs write latency, where state-of-the-art flash cards are at ~50μs.) When I describe the state of technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else out there in many ways. We’ve learned a lot in the years since – both from making them, and from dealing with customers and users – both of our own solutions and our competitors’. We’re lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but what we could have and what we could do. They’re generous enough to share what works and what doesn’t. What the values of solutions are and what the pitfalls are too. Honestly? It’s the hyperscale datacenters in the lead both practically and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast, and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPs limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are awesome. They radically change performance profiles of most applications, especially databases, allowing servers to run efficiently and the actual work done by a server to multiply 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.
If you think these through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN-type flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market today. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVMe SSDs, you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. But to protect the storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much in fact, but that’s not what applications need – they need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that implies:
That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
Tags: DAS, datacenter, direct attached storage, enterprise IT, flash, hard disk drive, HDD, hyperscale, latency, NAS, network attached storage, NVMe, PCIe, SAN, solid state drive, SSD, storage area network
Big data and Hadoop are all about exploiting new value and opportunities with data. In financial trading, business and some areas of science, it’s all about being fastest or first to take advantage of the data. The bigger the data sets, the smarter the analytics. The next competitive edge with big data comes when you layer in flash acceleration. The challenge is scaling performance in Hadoop clusters.
The most cost-effective option emerging for breaking through disk-to-I/O bottlenecks to scale performance is to use high-performance read/write flash cache acceleration cards for caching. This is essentially a way to get more work for less cost, by bringing data closer to the processing. The LSI® Nytro™ product has been shown during testing to improve the time it takes to complete Hadoop software framework jobs by up to 33%.
Combining flash cache acceleration cards with Hadoop software is a big opportunity for end users and suppliers. LSI estimates that less than 10% of Hadoop software installations today incorporate flash acceleration [1]. This will grow rapidly as companies see the increased productivity and ROI of flash to accelerate their systems. And use of Hadoop software is also growing fast. IDC predicts a CAGR of as much as 60% by 2016 [2]. Drivers include IT security, e-commerce, fraud detection and mobile data user management. Gartner predicts that Hadoop software will be in two-thirds of advanced analytics products by 2015 [3]. There are many thousands of Hadoop software clusters already employed.
Where flash makes the most immediate sense is with those who have smaller clusters doing lots of in-place batch processing. Hadoop is purpose-built for analyzing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Hadoop enables scaling that allows an unprecedented volume of data to be analyzed quickly and cost-effectively on clusters of commodity servers. Speed gains are about data proximity. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express® (PCIe) bus.
PCIe flash cache cards are now available with multiple terabytes of NAND flash storage, which substantially increases the hit rate. We offer a solution with both onboard flash modules and Serial-Attached SCSI (SAS) interfaces to create high-performance direct-attached storage (DAS) configurations consisting of solid state and hard disk drive storage. This couples the low latency performance benefits of flash with the capacity and cost per gigabyte advantages of HDDs.
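Why the hit rate matters so much can be seen with a simple weighted average. The sketch below is illustrative only: the two latencies (50μs for a flash hit, 5ms for an HDD miss) and the hit rates are assumptions chosen for the example, not measurements of any particular card or cluster.

```python
FLASH_HIT_US = 50.0      # assumed latency of a read served from the flash cache
HDD_MISS_US = 5000.0     # assumed latency of a read that falls through to an HDD


def effective_read_latency_us(hit_rate: float,
                              flash_us: float = FLASH_HIT_US,
                              hdd_us: float = HDD_MISS_US) -> float:
    """Average read latency for a cache with the given hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * flash_us + (1.0 - hit_rate) * hdd_us


for hit_rate in (0.50, 0.90, 0.99):
    latency = effective_read_latency_us(hit_rate)
    print(f"hit rate {hit_rate:.0%}: about {latency:.1f} microseconds per read")
```

Under these assumptions the misses dominate the average until the hit rate gets very high, which is why a larger cache that holds more of the working set pays off.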
To keep the processor close to the data, Hadoop uses servers with DAS. And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM). As an additional benefit, a smart implementation of Hadoop and flash components can reduce the overall server footprint required. Scaling is simplified, with some solutions allowing up to 128 devices that share a very high bandwidth interface. Most commodity servers provide 8 or fewer SATA ports for disks, reducing expandability.
Hadoop is great, but flash-accelerated Hadoop is best. It’s an effective way, as you work to extract full value from big data, to secure a competitive edge.
It may sound crazy, but hard disk drives (HDDs) do not actually have a delete command. Now we all know HDDs have a fixed capacity, so over time the older data must somehow get removed, right? Actually it is not removed, but overwritten. The operating system (OS) uses a reference table to track the locations (addresses) of all data on the HDD. This table tells the OS which spots on the HDD are used and which are free. When the OS or a user deletes a file from the system, the OS simply marks the corresponding spot in the table as free, making it available to store new data.
The HDD is told nothing about this change, and it does not need to know since it would not do anything with that information. When the OS is ready to store new data in that location, it just sends the data to the HDD and tells it to write to that spot, directly overwriting the prior data. It is simple and efficient, and no delete command is required.
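The idea is easy to see in a toy model. The sketch below is not how any real file system or drive is implemented; `ToyDisk` and `ToyFileSystem` are hypothetical names invented for this illustration. It only shows the point made above: a delete changes the OS's table, the disk hears nothing, and the old data sits there until a later write lands on the same spot.

```python
class ToyDisk:
    """A drive that only understands 'write this data at this address'."""

    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks
        self.writes_received = 0

    def write(self, addr, data):
        self.blocks[addr] = data
        self.writes_received += 1


class ToyFileSystem:
    """The OS side: a table of which addresses are free and which file is where."""

    def __init__(self, disk):
        self.disk = disk
        self.free = set(range(len(disk.blocks)))
        self.files = {}  # file name -> address

    def save(self, name, data):
        addr = min(self.free)          # pick the lowest free address
        self.free.remove(addr)
        self.files[name] = addr
        self.disk.write(addr, data)    # overwrites whatever was there
        return addr

    def delete(self, name):
        addr = self.files.pop(name)
        self.free.add(addr)            # table update only; no command to the disk


disk = ToyDisk(4)
fs = ToyFileSystem(disk)

old_addr = fs.save("old.txt", "old data")
fs.delete("old.txt")
data_after_delete = disk.blocks[old_addr]     # stale data is still on the disk
writes_after_delete = disk.writes_received    # the delete sent nothing to the disk

new_addr = fs.save("new.txt", "new data")     # reuses the freed address
print(data_after_delete, writes_after_delete, new_addr == old_addr, disk.blocks[new_addr])
```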
However, with the advent of NAND flash-based solid state drives (SSDs) a new problem emerged. In my blog, Gassing up your SSD, I explain how NAND flash memory pages cannot be directly overwritten with new data, but must first be erased at the block level through a process called garbage collection (GC). I further describe how the SSD uses non-user space in the flash memory (over provisioning or OP) to improve performance and longevity of the SSD. In addition, any user space not consumed by the user becomes what we call dynamic over provisioning â€“ dynamic because it changes as the amount of stored data changes. When less data is stored by the user, the amount of dynamic OP increases, further improving performance and endurance. The problem I alluded to earlier is caused by the lack of a delete command. Without a delete command, every SSD will eventually fill up with data, both valid and invalid, eliminating any dynamic OP. The result would be the lowest possible performance at that factory OP level. So unlike HDDs, SSDs need to know what data is invalid in order to provide optimum performance and endurance.
Keeping your SSD TRIM
A number of years ago, the storage industry got together and developed a solution between the OS and the SSD by creating a new SATA command called TRIM. It is not a command that forces the SSD to immediately erase data like some people believe. Actually the TRIM command can be thought of as a message from the OS about what previously used addresses on the SSD are no longer holding valid data. The SSD takes those addresses and updates its own internal map of its flash memory to mark those locations as invalid. With this information, the SSD no longer moves that invalid data during the GC process, eliminating wasted time rewriting invalid data to new flash pages. It also reduces the number of write cycles on the flash, increasing the SSDâ€™s endurance. Another benefit of the TRIM command is that more space is available for dynamic OP.
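Here is a toy sketch of why that matters for garbage collection. It is a deliberately simplified model, not a description of any real SSD controller: `ToySSD` is a hypothetical class, a "block" is just a list of eight logical addresses, and GC cost is counted as the number of pages the drive believes it must copy before it can erase the block.

```python
class ToySSD:
    """Tracks which logical addresses (LBAs) the drive believes hold valid data."""

    def __init__(self):
        self.valid_lbas = set()

    def write(self, lba):
        self.valid_lbas.add(lba)

    def trim(self, lbas):
        # The OS reports addresses that no longer hold valid data.
        # Nothing is erased here; only the drive's internal map changes.
        self.valid_lbas.difference_update(lbas)

    def pages_copied_by_gc(self, block_lbas):
        # Before a block can be erased, every page the drive still
        # considers valid has to be rewritten somewhere else.
        return sum(1 for lba in block_lbas if lba in self.valid_lbas)


def run(send_trim):
    ssd = ToySSD()
    block = list(range(8))               # one flash block holding LBAs 0..7
    for lba in block:
        ssd.write(lba)
    deleted_by_os = [0, 1, 2, 3, 4, 5]   # the OS deletes files on these LBAs
    if send_trim:
        ssd.trim(deleted_by_os)
    return ssd.pages_copied_by_gc(block)


print("pages copied without TRIM:", run(send_trim=False))
print("pages copied with TRIM:   ", run(send_trim=True))
```

In this model the drive copies all eight pages when it never hears about the deletes, and only the two genuinely valid pages when it does. Fewer copies means less wasted time and fewer write cycles on the flash.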
Today, most current operating systems and SSDs support TRIM, and all SandForce Driven™ member SSDs have always supported TRIM. Note that most RAID environments do not support TRIM, although some RAID 0 configurations have claimed to support it. I have presented on this topic in detail previously. You can view the presentation in full here. In my next blog I will explain how there may be an alternate solution using SandForce Driven member SSDs.
Iâ€™ve been travelling to China quite a bit over the last year or so. Iâ€™m sitting in Shenzhen right now (If you know Chinese internet companies, youâ€™ll know who Iâ€™m visiting). The growth is staggering. Iâ€™ve had a bit of a trains, planes, automobiles experience this trip, and thatâ€™s exposed me to parts of China I never would have seen otherwise. Just to accommodate sheer population growth and the modest increase in wealth, there is construction everywhere â€“ a press of people and energy, constant traffic jams, unending urban centers, and most everything is new. Very new. It must be exciting to be part of that explosive growth. What a market. Â I mean â€“ come on â€“ there are 1.3 billion potential users in China.
The amazing thing for me is the rapid growth of hyperscale datacenters in China, which is truly exponential. Their infrastructure growth has been 200%-300% CAGR for the past few years. It’s also fantastic walking into a building in China, say Baidu, and feeling very much at home – just like you walked into Facebook or Google. It’s the same young vibe, energy, and ambition to change how the world does things. And it’s also the same pleasure – talking to architects who are super-sharp, have few technical prejudices, and have very little vanity – just a will to get to business and solve problems. Polite, but blunt. We’re lucky that they recognize LSI as a leader, and are willing to spend time to listen to our ideas, and to give us theirs.
Even their infrastructure has a similar feel to the US hyperscale datacenters. The same only different. ;-)
A lot of these guys are growing revenue at 50% per year, several getting 50% gross margin. Those are nice numbers in any country. One has $100’s of billions in revenue. And they’re starting to push out of China. So far their pushes into Japan have not gone well, but other countries should be better. They all have unique business models. “We” in the US like to say things like “Alibaba is the Chinese eBay” or “Sina Weibo is the Chinese Twitter”…. But that’s not true – they all have more hybrid business models, unique, and so their datacenter goals, revenue and growth have a slightly different profile. And there are some very cool services that simply are not available elsewhere. (You listening Apple®, Google®, Twitter®, Facebook®?) But they are all expanding their services, products and user base. Interestingly, there is very little public cloud in China. So there are no real equivalents to Amazon’s services or Microsoft’s Azure. I have heard about current development of that kind of model with the government as initial customer. We’ll see how that goes.
100’s of thousands of servers. They’re not the scale of Google, but they sure are the scale of Facebook, Amazon, Microsoft…. It’s a serious market for an outfit like LSI. Really it’s a very similar scale now to the US market. Close to 1 million servers installed among the main 4 players, and exabytes of data (we’ve blown past mere petabytes). Interestingly, they still use many co-location facilities, but that will change. More important – they’re all planning to probably double their infrastructure in the next 1-2 years – they have to – their growth rates are crazy.
Often 5 or 6 distinct platforms, just like the US hyperscale datacenters. Database platforms, storage platforms, analytics platforms, archival platforms, web server platforms…. But they tend to be a little more like a rack of traditional servers that enterprise buys with integrated disk bays, still a lot of 1G Ethernet, and they are still mostly from established OEMs. In fact I just ran into one OEM’s American GM, who I happen to know, in Tencent’s offices today. The typical servers have 12 HDDs in drive bays, though they are starting to look at SSDs as part of the storage platform. They do use PCIe® flash cards in some platforms, but the performance requirements are not as extreme as you might imagine. Reasonably low latency and consistent latency are the premium they are looking for from these flash cards – not maximum IOPS or bandwidth – very similar to their American counterparts. I think hyperscale datacenters are sophisticated in understanding what they need from flash, and not requiring more than that. Enterprise could learn a thing or two.
Some server platforms have RAIDed HDDs, but most are direct map drives using a high availability (HA) layer across the server center – Hadoop® HDFS or self-developed Hadoop-like platforms. Some have also started to deploy microserver archival “bit buckets.” A small ARM® SoC with 4 HDDs totaling 12 TBytes of storage, giving densities like 72 TBytes of file storage in 2U of rack. While I can only find about 5,000 of those in China that are the first generation experiments, it’s the first of a growing wave of archival solutions based on lower performance ARM servers. The feedback is clear – they’re not perfect yet, but the writing is on the wall. (If you’re wondering about the math, that’s 5,000 x 12 TBytes = 60 Petabytes….)
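If you want to check that math yourself, here it is as a few lines of Python. The only inputs are the figures quoted above. The per-HDD capacity and the number of microservers per 2U are simply what falls out of dividing them, not numbers I was given, and I’m using decimal units (1 PB = 1,000 TB).

```python
HDDS_PER_NODE = 4      # from the text: one ARM SoC with 4 HDDs
TB_PER_NODE = 12       # from the text: 12 TBytes per microserver
TB_PER_2U = 72         # from the text: density per 2U of rack
NODES_DEPLOYED = 5_000 # from the text: first-generation units found so far

tb_per_hdd = TB_PER_NODE / HDDS_PER_NODE       # implied drive size
nodes_per_2u = TB_PER_2U // TB_PER_NODE        # implied microservers per 2U
total_pb = NODES_DEPLOYED * TB_PER_NODE / 1_000

print(f"{tb_per_hdd:.0f} TB per HDD")
print(f"{nodes_per_2u} microservers per 2U")
print(f"{total_pb:.0f} PB deployed")
```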
Yes, it’s important, but maybe more than we’re used to. It’s harder to get licenses for power in China. So it’s really important to stay within the envelope of power your datacenter has. You simply can’t get more. That means they have to deploy solutions that do more in the same power profile, especially as they move out of co-located datacenters into private ones. Annually, 50% more users supported, more storage capacity, more performance, more services, all in the same power. That’s not so easy. I would expect solar power in their future, just as Apple has done.
Here’s where it gets interesting. They are developing a cousin to OpenCompute that’s called Scorpio. It’s Tencent, Alibaba, Baidu, and China Telecom so far driving the standard. The goals are similar to OpenCompute, but more aligned to standardized sub-systems that can be co-mingled from multiple vendors. There is some harmonization and coordination between OpenCompute and Scorpio, and in fact the Scorpio companies are members of OpenCompute. But where OpenCompute is trying to change the complete architecture of scale-out clusters, Scorpio is much more pragmatic – some would say less ambitious. They’ve finished version 1 and rolled out about 200 racks as a “test case” to learn from. Baidu was the guinea pig. That’s around 6,000 servers. They weren’t expecting more from version 1. They’re trying to learn. They’ve made mistakes, learned a lot, and are working on version 2.
Even if it’s not exciting, it will have an impact because of the sheer size of deployments these guys are getting ready to roll out in the next few years. They see the progression as 1) they were using standard equipment, 2) they’re experimenting and learning from trial runs of Scorpio versions 1 and 2, and then they’ll work on 3) new architectures that are efficient and powerful, and different.
Information is pretty sketchy if you are not one of the member companies or one of their direct vendors. We were just invited to join Scorpio by one of the founders, and would be the first group outside of China to do so. If that all works out, I’ll have a much better idea of the details, and hopefully can influence the standards to be better for these hyperscale datacenter applications. Between OpenCompute and Scorpio we’ll be seeing a major shift in the industry – a shift that will undoubtedly be disturbing to a lot of current players. It makes me nervous, even though I’m excited about it. One thing is sure – just as the server market volume is migrating from traditional enterprise to hyperscale datacenter (25-30% of the server market and growing quickly), we’re starting to see a migration to Chinese hyperscale datacenters from US-based ones. They have to grow just to stay still. I mean – come on – there are 1.3 billion potential users in China….
There’s no need to wait for higher speed. Server builders can take advantage of 12Gb/s SAS now. And this is even as HDD and SSD makers continue to tweak, tune and otherwise prepare their 12Gb/s SAS products for market. The next generation of 12Gb/s SAS without supporting drives? What gives?
It’s simple. LSI is already producing 12Gb/s ROC and IOC solutions, meaning that customers can take advantage of 12Gb/s SAS performance today with currently shipping systems and storage. As for the numbers, LSI 12Gb/s SAS enables performance increases of up to 45% in throughput and up to 58% in IOPS when compared to 6Gb/s SAS.
True, 12Gb/s SAS isn’t a Big Bang Disruption in storage systems; rather it’s an evolutionary change, but a big step forward. It may not be clear why it matters so much, so I want to briefly explain. In latest generation PCIe 3 systems, 6Gb/s SAS is the bottleneck that prevents systems from achieving full PCIe 3 throughput of 6,400 MB/s.
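A back-of-envelope calculation shows where that bottleneck comes from. This is a rough sketch under two assumptions of mine that are not stated above: the controller has 8 SAS lanes, and both SAS generations use 8b/10b line encoding, so every Gb/s of line rate carries about 100 MB/s of payload. Protocol overhead is ignored, so real numbers will be somewhat lower.

```python
PCIE3_TARGET_MB_S = 6_400  # throughput figure quoted in the text
LANES = 8                  # assumption: an 8-lane SAS controller


def sas_payload_mb_s(line_rate_gbps, lanes=LANES):
    """Aggregate payload bandwidth in MB/s, ignoring protocol overhead."""
    # 8b/10b encoding: 8 data bits per 10 line bits, then 8 bits per byte.
    per_lane = line_rate_gbps * 1_000 * 8 // 10 // 8
    return per_lane * lanes


for rate in (6, 12):
    total = sas_payload_mb_s(rate)
    verdict = "bottleneck" if total < PCIE3_TARGET_MB_S else "keeps up"
    print(f"{rate}Gb/s SAS x{LANES}: {total} MB/s ({verdict})")
```

Under those assumptions, eight 6Gb/s lanes top out at 4,800 MB/s, short of the 6,400 MB/s the PCIe 3 side can deliver, while eight 12Gb/s lanes reach 9,600 MB/s.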
With 12Gb/s SAS, customers will be able to take full advantage of the performance of PCIe 3 systems. Earlier this month at CeBIT computer expo in Hanover, Germany, we announced that we are the first to ship production-level 12Gb/s SAS ROC (RAID on Chip) and IOC (I/O Controllers) to OEM customers. This convergence of new technologies and the expansion of existing capabilities create significant improvements for datacenters of all kinds.
At CeBIT, we demonstrated our 12Gb/s SAS solutions with the unique DataBolt™ feature and how, with DataBolt, systems with 6Gb/s SAS HDDs can achieve 12Gb/s SAS performance.
DataBolt uses bandwidth aggregation to create throughput performance acceleration. Most importantly, customers don’t have to wait for the next inflection in drive design to get the highest possible performance and connectivity.
I’ve spent a lot of time with hyperscale datacenters around the world trying to understand their problems – and I really don’t care what area those problems are in as long as they’re important to the datacenter. What is the #1 Real Problem for many hyperscale datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some hyperscale datacenters have crafted their own solutions – but most have not.
Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. Hyperscale datacenters get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:
Let’s put some scale to this problem, and you’ll begin to understand the issue. One modest size hyperscale datacenter has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other hyperscale datacenters, but they are still huge – more than 200k servers). Other hyperscale datacenters I have checked with say – yep, that’s about right. And one engineer I know at an HDD manufacturer said – “wow – I expected worse than that. That’s pretty good.” To be clear – these are very good HDDs they are using, it’s just that the numbers add up.
The raw data:
RAIDed SAS HDDs
Non-RAIDed (direct map) SATA drives behind HBAs
What’s interesting is the relative failure rate of SAS drives vs. SATA. It’s about an order of magnitude worse in SATA drives than SAS. Frankly some of this is due to protocol differences. SAS allows far more error recovery capabilities, and because SAS drives also tend to be more expensive, I believe manufacturers invest in slightly higher quality electronics and components. I know the electronics we ship into SAS drives are certainly more sophisticated than what goes into SATA drives.
False fail? What? Yeah, that’s an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn’t actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why? No one knows. I have my suspicions, though.
I used to work on engine controllers. That’s a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that’s millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine. Why? – the events were statistically rare. The average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state. But the events do happen, and so do recalls and lawsuits… HDD controllers don’t have these protections, which is reasonable. It would be an inappropriate cost burden for their price point.
You remember the Toyota Prius accelerator problems? I know that controller was not protected for soft errors. And the source of the problem remained a “mystery.” Maybe it just lost its marbles for a while? A false fail if you will. Just sayin’.
Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn’t need to be replaced. All the operational costs were paid for no reason. The disk just needed a power cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to manage that 10 second power reset cycle and application traffic created in that time – but we can handle that.)
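The policy that falls out of this is “reset first, replace only if the drive is still dead.” Here is a minimal sketch of that triage logic against simulated drives. The `Drive` class, its `hard_fault` flag and the `triage` function are hypothetical names invented for illustration. A real implementation would live in the RAID controller or the application layer and would also have to hold off I/O during the reset window, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Simulated drive that has just been reported as failed."""
    name: str
    hard_fault: bool          # True only if the hardware is really broken
    responsive: bool = False  # a drive reported as failed starts unresponsive

    def power_cycle(self):
        # A reset clears transient (soft) state but cannot fix a hard fault.
        self.responsive = not self.hard_fault


def triage(drive):
    """Power cycle a 'failed' drive and decide what to do with it."""
    drive.power_cycle()
    return "returned to service" if drive.responsive else "replace"


print(triage(Drive("sda", hard_fault=False)))  # false fail
print(triage(Drive("sdb", hard_fault=True)))   # real failure
```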
Daily, this datacenter has to:
And 1/2 of that is for no reason at all.
First – why not rebuild the disk if it’s RAIDed? Usually hyperscale datacenters use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive, under heavy application load (definition of a hyperscale datacenter) can truly take up to a week. 50% performance for a week? In a cluster that means the overall cluster is running at ~50% performance. Say 200 nodes in a cluster – that means you just lost ~100 nodes of work – or 50% of cluster performance. It’s much simpler to just take the node with the failed drive offline, and get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this. We’ll talk about them on another day. Or you can simply run direct mapped storage, and unmount the failed drive.
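The 50% versus 99.5% comparison is easy to reproduce. The sketch below assumes a tightly coupled cluster where every job waits on its slowest node, which is the assumption behind the “whole cluster runs at 50%” claim. Loosely coupled workloads would degrade less, and the function names here are just illustrative.

```python
def throughput_while_rebuilding(nodes, rebuild_speed=0.5):
    """Cluster throughput if the degraded node stays in (lockstep model)."""
    # Assumption: work is gated by the slowest node in the cluster.
    node_speeds = [1.0] * (nodes - 1) + [rebuild_speed]
    return min(node_speeds)


def throughput_node_offline(nodes):
    """Cluster throughput if the node with the failed drive is removed."""
    return (nodes - 1) / nodes


NODES = 200
print(f"rebuild in place: {throughput_while_rebuilding(NODES):.1%}")
print(f"node offline:     {throughput_node_offline(NODES):.1%}")
```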
Next – why replicate data over the network, and why is that a big deal? For geographic redundancy (say a natural disaster at one facility) and regional locality, hyperscale datacenters need multiple data copies. Often 3 copies so they can do double duty as high-availability copies, or in the case of some erasure coding, 2.2 to 2.5 copies (yeah – weird math – how do you have 0.5 copy…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and re-distribute it across the datacenter on multiple servers. That’s network traffic, and possibly congestion, which affects other aspects of the operations of the datacenter. In this datacenter it’s about 50 hours of 10G Ethernet traffic every day.
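To put “50 hours of 10G Ethernet traffic” into bytes, here is a quick conversion. It assumes the link runs at its full 10 Gb/s line rate with no protocol overhead, which is my simplification, so treat the result as an upper bound on the data actually moved each day.

```python
LINK_GBPS = 10       # 10G Ethernet line rate
HOURS_PER_DAY = 50   # from the text: link-hours of replication traffic per day

bytes_per_second = LINK_GBPS * 1e9 / 8           # 1.25 GB/s at line rate
seconds = HOURS_PER_DAY * 3600
replicated_tb = bytes_per_second * seconds / 1e12

print(f"about {replicated_tb:.0f} TB re-replicated per day")
```

Since 50 link-hours cannot fit on one link in a 24-hour day, the traffic is necessarily spread over several links at once.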
To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in-situ. And there is the start of discussion of the same around SATA – but that’s more problematic. Whatever the case, it will be years before the ecosystem is in place to handle the problems this way.
What’s that mean to you?
Well. You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead for a drive that has not actually failed – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what, $600 or so? Depending on your size, that’s either a don’t care, or a big deal. There are ways to handle this, and they’re not expensive – much less than the disk carrier you already pay for to allow you to replace that drive – and it can be handled transparently – just a log entry without seeing any performance hiccups. You just need to convince your OEM to carry the solution.
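Here is a quick way to see which side of “don’t care” or “big deal” you land on. The 1-in-100 rates and the $600 price are the rough figures above. The 10,000-drive fleet is a made-up example, and the estimate counts only the replacement price, not the rebuild, labor and downtime costs.

```python
def annual_drive_costs(fleet_size, replacement_cost=600):
    """Rough yearly failure counts and the cost spent on false failures."""
    real_failures = fleet_size // 100   # about 1 in 100 really fails
    false_failures = fleet_size // 100  # about 1 in 100 only appears to fail
    return {
        "real_failures": real_failures,
        "false_failures": false_failures,
        "avoidable_cost": false_failures * replacement_cost,
    }


# Hypothetical fleet of 10,000 drives.
print(annual_drive_costs(10_000))
```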
Congratulations to Seagate on achieving this monumental milestone of shipping 2 billion hard disk drives (HDDs)! To put this in perspective, if you measured the Earth’s circumference in inches, it is only about 1.57 billion.
Those HDDs have used a lot of parts and intellectual property (IP), and LSI has been a very happy, long-time partner to Seagate providing read channel IP for nearly all of the 2 billion HDDs it has shipped. The read channel is a critical piece of technology inside the HDD that translates the magnetically encoded information on the rotating media to electronic signals that can be understood by the host computer. Advances in the LSI read channel IP over the years have also contributed to the continuing HDD capacity increases we depend upon today.
Seagate reported it took 29 years to ship its first billion drives and only 4 years for the second. This jaw-dropping growth is a stark reminder that the global data deluge continues to swell. Now more than ever, IT architects and managers need smarter ways – down to the silicon – to produce new storage and networking efficiencies and reduce costs.