For many years LSI has known the importance of truly understanding the complexities of interfacing with NAND flash memory to optimize its performance and lifetime. For that reason LSI created a group focused on characterizing NAND flash behavior as it interfaces with LSI flash controllers. I recently spoke to LSI’s expert in this area, Bill Hunt, Engineering Director Flash Analytics at LSI, to better understand what his group produces for LSI and how that translates into better solutions for our customers.
Q: Is all NAND flash created equal?
Bill: Definitely not. NAND flash specs, performance and ratings not only vary from vendor to vendor, they also vary between process geometries, between models within the same NAND family, and over the production life – especially during early production ramp. Also, NAND vendors intentionally create unequal models of the same part to address their different markets, like client and enterprise. Understanding the difference between NAND types is critical to building a robust solution.
Q: How does NAND vary from vendor to vendor?
Bill: There are really two levels of differences among NAND vendors: differences that result from different architectures, and differences that remain even when NAND vendors share architectures. For NAND vendors with completely different designs and fab processes, there are many differences in the NAND specifications. These include different pin-outs, power requirements, block and page layouts, addressing schemes, timing specifications, commands, and read recovery procedures. I could go on.
Some NAND vendors have common designs and fab processes. But even these devices can have significant operational differences for each vendor. Each device can have unique features enabled with different device trims (editor’s note: manufacturing settings), commands, diagnostics, and read recovery steps. Even using standard interfaces like ONFI and Toggle doesn’t guarantee common operations. Each vendor has its own interpretation and implementation of these standards.
Q: How does NAND vary from generation to generation?
Bill: Shrinking the process geometry requires a new device architecture. The new architecture drives changes to the operation and specification of a NAND device. The greatest changes are driven by NAND capacity increases. For example, the size and layout of the planes, blocks, and pages have to be modified to deal with the new architecture and increased capacity. Since the NAND cells are smaller and closer together, the error handling capability also has to be increased. The error-correcting code (ECC) requirements and resulting spare areas increase. The NAND also has to change to deal with increased bad block rates. The data rate and performance of each generation must also improve to keep up with what users are asking for. This drives changes to the interface timing specifications and adds new feature sets. In general NAND endurance gets worse with shrinking geometries and it is critical to understand changes due to new generations to develop more powerful and effective ECC algorithms.
Q: Does LSI have any dedicated facility to evaluate NAND from different suppliers?
Bill: Yes, LSI’s Flash Analytics lab is dedicated to evaluating and characterizing NAND flash that will be used with LSI flash controllers.
Q: What kinds of testing does LSI do in the Flash Analytics lab?
Bill: The Flash Analytics lab has two main functions. First, we integrate NAND devices into solid-state drives (SSDs) with LSI SandForce controllers to ensure they work well together. Second, we characterize NAND devices to see how the NAND flash performs and operates over the lifetime of the device. We do this in various operational modes. It is critical to understand the behavior of the raw NAND to design and develop solutions with the reliability and performance demanded by the market.
Q: Does LSI test flash memory beyond their rated lifetimes?
Bill: Yes. NAND vendors do not always share their own characterization testing results beyond their rated endurance limit, so we gather that data. Typically we perform program-erase cycles on devices until the raw bit error rate becomes very poor or a catastrophic error occurs. We also exceed other specifications, such as retention limits and read disturb limits. Understanding what happens to flash as it ages gives us valuable information on how devices might fail in real-world scenarios.
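To make the idea of cycling a device until its raw bit error rate gets too poor more concrete, here is a minimal Python sketch. It is not LSI's test software; the function names and the threshold-based stopping rule are assumptions made purely for illustration. It shows how a raw bit error rate (bits that read back wrong, divided by bits read, before any ECC) can be computed, and how a series of measurements can be scanned for the first cycle count that exceeds a chosen limit.

```python
def count_bit_errors(written: bytes, read_back: bytes) -> int:
    """Count the bits that differ between what was written and what was read."""
    if len(written) != len(read_back):
        raise ValueError("buffers must be the same length")
    # XOR leaves a 1 in every bit position that flipped; count those 1s.
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))


def raw_bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Raw bit error rate: flipped bits divided by total bits read (pre-ECC)."""
    total_bits = len(written) * 8
    if total_bits == 0:
        raise ValueError("cannot compute an error rate for an empty buffer")
    return count_bit_errors(written, read_back) / total_bits


def cycles_until_limit(rber_by_cycle, rber_limit):
    """Return the first program-erase cycle count whose measured RBER exceeds
    rber_limit, or None if the limit is never exceeded.

    rber_by_cycle is an iterable of (cycle_count, measured_rber) pairs in
    increasing cycle order (hypothetical input format).
    """
    for cycles, rber in rber_by_cycle:
        if rber > rber_limit:
            return cycles
    return None
```

A characterization run would feed real page reads into `raw_bit_error_rate` at each checkpoint and plot the results against program-erase cycles, which is the kind of graph the characterization report described below contains.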
Q: What type of data is generated from all the tests that are conducted?
Bill: We generate a characterization report for each device we test. This report compares our results to the vendor specifications, including graphs of error rates vs. program-erase cycles for different retention limits and error correction limits. The report also evaluates the effects of read disturb over endurance and retention lifetimes. Other sections include an analysis of the physical location of errors, and read recovery effectiveness. We also evaluate the impact on performance over the life of the drive.
Q: What does LSI do with this data?
Bill: First, we use it to validate the flash vendor specifications. Second, we use it to design and optimize our LSI SandForce flash controller designs. In particular, we use the data to optimize our error recovery and SHIELD technology. We also use it to evaluate possible trade-offs: for example, to trade performance for increasing the endurance of the NAND and extending its life. Last, we share information with customers whenever possible. The goal of collecting this data is to develop the most advanced ECC possible to increase SSD reliability, endurance, and performance.
Q: Is LSI able to generate better products because we collect this data?
Bill: The information we gather in the LSI Flash Analytics lab has certainly helped improve our products. Our testing has improved quality by assuring NAND parts are meeting the vendor specs. When we show our data to the NAND vendor, they are more motivated to share their detailed data with us. Our lab is also equipped to run specific tests to help diagnose problems seen with our products during qualification and production. As an example, we have run specific tests to evaluate read recovery issues and physical location stress. We also gather raw data during our characterization testing that is used by our product architecture team. The raw data is fed into simulation models and used to optimize our flash channel and SHIELD technology. In a nutshell, our improved understanding of flash memory helps us build better flash controllers – which helps our customers build better SSDs.
Q: Does LSI work closely with NAND vendors on this analysis?
Bill: Yes, we have regular meetings with all of the NAND flash vendors that our flash controllers support. We work closely with NAND vendors to assure we have the latest information. We not only make sure we have the latest roadmaps, datasheets and application notes, but we also get clarifications about flash operations, quality and performance. We share the characterization data we collect, and get insight into our results. We also keep the NAND vendors informed about our controller roadmap and features so they can ensure their products are tracking with us.
Mastering NAND flash memory is critical to flash controller development success
Any NAND Flash memory controller developer would be remiss if the engineers did not perform in-depth testing and characterization to better understand this very complex technology. Also, any company that can support more than one flash vendor must be able to understand the differences between manufacturers to better design and modify the controller to support the widest selection of NAND flash providing the greatest flexibility for their customers.
Ever since SandForce introduced data reduction technology with the DuraWrite™ feature in 2009, some users have been confused about how it works and questioned whether it delivers the benefits we claim. Some even believe there are downsides to using DuraWrite with an SSD. In this blog, I will dispel those misconceptions.
Data reduction technology refresher
Four of my previous blogs cover the many advantages of using data reduction technology like DuraWrite:
In a nutshell, data reduction technology reduces the size of data written to the flash memory, but returns 100% of the original data when reading it back from the flash. This reduction in the required storage space helps accelerate reads and writes, extend the life of the flash and increase the dynamic over provisioning (OP).
What is incompressible data?
Data is incompressible when data reduction technology is unable to reduce the size of a dataset – in which case the technology offers no benefit for the user. File types that are altogether or mostly incompressible include MPEG, JPEG, ZIP and encrypted files. However, data reduction technology is applied to an entire SSD, so the free space resulting from the smaller, compressed files increases OP for all file types, even incompressible files.
The images below help illustrate this process. The image on the left represents a standard 256GB SSD filled to about 80% capacity with a typical operating system, applications and user data. The remaining 20% of free space is automatically used by the SSD as dynamic OP. The image on the right shows how the same data stored on a data reduction-capable SSD can nearly double the available OP for the SSD because the operating system, applications and half of the user data can be reduced in this example.
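The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the assumption that half of everything stored shrinks to half its size is a round number picked for the example, not a measured DuraWrite result.

```python
def dynamic_op_gb(capacity_gb, stored_gb, compressible_fraction=0.0, reduced_ratio=1.0):
    """Flash left over as dynamic over-provisioning, in GB.

    compressible_fraction: share of the stored data that can be reduced (0..1).
    reduced_ratio: size of that data after reduction, relative to its
                   original size (1.0 means no reduction at all).
    Both parameters are illustrative assumptions.
    """
    if not 0.0 <= compressible_fraction <= 1.0:
        raise ValueError("compressible_fraction must be between 0 and 1")
    if not 0.0 < reduced_ratio <= 1.0:
        raise ValueError("reduced_ratio must be in (0, 1]")
    incompressible_gb = stored_gb * (1.0 - compressible_fraction)
    reduced_gb = stored_gb * compressible_fraction * reduced_ratio
    return capacity_gb - (incompressible_gb + reduced_gb)


capacity = 256.0
stored = 0.8 * capacity  # drive filled to about 80%

standard = dynamic_op_gb(capacity, stored)
with_reduction = dynamic_op_gb(capacity, stored, compressible_fraction=0.5, reduced_ratio=0.5)

print(f"standard SSD:       {standard:.1f} GB of dynamic OP")
print(f"data reduction SSD: {with_reduction:.1f} GB of dynamic OP")
```

Under these assumed numbers the free space goes from 51.2GB to 102.4GB. The incompressible half of the data is stored at full size in both cases, yet it still shares the larger pool of OP.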
Why is dynamic OP so important?
OP is the lifeblood of a flash memory-based SSD (nearly all of them available today). Without OP the SSD could not operate. Allocating more space for OP increases an SSD’s performance and endurance, as well as reducing its power consumption. In the illustrations above, both SSDs are storing about 30% of user data as incompressible files like MPEG movies and JPG images. As I mentioned, data reduction technology can’t compress those files, but the rest of the data can be reduced. The result is the SSD with data reduction delivers higher overall performance than the standard SSD even with incompressible data.
Misconception 1: Data reduction technology is a trick
There’s no trickery with data reduction technology. The process is simple: It reduces the size of data differently depending on the content, increasing SSD speed and endurance.
Misconception 2: Users with movie, picture, and audio files will not benefit from data reduction
As illustrated above, as long as an operating system and other applications are stored on the SSD, there will be at least some increase in dynamic OP and performance despite the incompressible files.
Misconception 3: Testing with all incompressible data delivers worst-case performance
Given that a typical SSD stores an operating system, programs and other data files, an SSD test that writes only incompressible data to the device would underestimate the performance of the SSD in user deployments.
Data reduction technology delivers
Data reduction technology, like LSI® SandForce® DuraWrite, is often misunderstood to the point that users believe they would be better off without it. The truth is, with data reduction technology, nearly every user will see performance and endurance gains with their SSD regardless of how much incompressible data is stored.
I started working years ago to engage large datacenters, learn what their problems are and try to craft solutions for their problems. It’s taken years, but we engaged them, learned, changed how we thought about storage and began creating solutions that are being deployed at scale.
We’ve started to do the same with the Chinese Internet giants. They’re growing at an incredible rate. They have similar problems, but it’s surprising how different their solution approaches are. Each one is unique. And we’re constantly learning from these guys.
So to wrap up the blog series on my interview with CEO & CIO magazine, here are the last two questions to explain a bit more.
CEO & CIO: Please use examples to tell the stories about the forward-looking technologies and architectures that LSI has jointly developed with Internet giants.
While our host bus adapters (HBAs) and MegaRAID® solutions have been part of the hyperscale Internet companies’ infrastructure since the beginning, we have only recently worked very closely with them to drive joint innovation. In 2009 I led the first LSI engagement with what we then called “mega datacenters.” It took a while to understand what they were doing and why. By 2010 we realized there were specialized needs, and began to imagine new hardware products that worked with these datacenters. Out of this work came the realization that flash was important for efficiency and capability, and the “invention” of the LSI® Nytro™ product portfolio (more are in the pipeline). We have worked closely with hyperscale datacenters to evolve and tune these solutions, to the point where Nytro products have become the backbone of their main revenue platforms. Facebook has been a vitally important partner in evolving our Nytro platform – teaching us what was truly needed, and now much of their infrastructure runs on LSI products. These same products are a good fit for other hyperscale customers, and we are slowly winning many of the large ones.
Looking forward, we are partnered with several Internet giants in the U.S. and China to work on cold storage solutions, and more importantly shared DAS (Distributed DAS: D-DAS) solutions. We have been demonstrating prototypes. These solutions enable pooled architectures and rack scale architecture, and can be made to work tightly with software-defined datacenters (SDDCs). They simplify management and resource allocation – making task deployment more efficient and easier. Shared DAS solutions increase infrastructure efficiency and improve lifecycle management of components. And they have the potential to radically improve application performance and infrastructure costs.
Looking further into the future, we see even more radical changes in silicon supporting transport protocols and storage models, and in rack scale architectures supporting storage and pooled memory. And cold storage is a huge though, some would say, boring problem that we are also focused on – storing lots of data for free and using no power to do it… but I really can’t talk about any of that.
CEO & CIO: LSI maintains good contact with big Internet companies in China. What are the biggest differences between dealing with these Internet enterprises and dealing with traditional partners?
Yes, we have a very good relationship with large Chinese Internet companies. In fact, I will be visiting Tencent, Alibaba and Baidu in a few weeks. One of the CTOs I would like to say is a friend. That is, we have fun talking together about the future.
These meetings have evolved. The first meetings LSI had about two years ago were sales calls, or support for OEM storage solutions. These accomplished very little. Once we began visiting as architects speaking to architects, real dialogs began. Our CEO has been spending time in China meeting with these Internet companies both to learn, and to make it clear that they are important to us, and we want a chance to solve their problems. But the most interesting conversations have been the architectural ones. There have been very clear changes in the two years I have traveled within China – from standard enterprise to hyperscale architectures.
We’ve received fascinating feedback on architecture, use, application profiles, platforms, problems and goals. We have strong engagement with the U.S. Internet giants. At the highest level, the Chinese Internet companies have similar problems and goals. But the details quickly diverge because of revenue per user, resources, power availability, datacenter ownership and Internet company age. The use of flash is very different.
The Chinese Internet giants are at an amazing change point. Most are ready for explosive growth of infrastructure and deployment of cloud services. Most are changing from standard OEM systems and architectures to self-designed hyperscale systems after experimenting with Scorpio and microserver deployments. Several, like JD.com (an Amazon-like company) are moving from hosted to self-built infrastructure. And there seems to be a general realization that the datacenter has changed from a compute-centric model to a dataflow model, where storage and network dictate how much work gets done more than the CPU does. These giants are leveraging their experience and capability to move very quickly, and in a few cases are working to create true pooled rack level architectures much like Facebook and Google have started in the U.S. In fact, Baidu is similar to Facebook in this approach, but is different in its longer term goals for the architecture.
The Chinese companies are amazingly diverse, even within one datacenter, and arguments on architectural direction are raging within these Internet giants – it’s healthy and exciting. However, the innovations that are coming are similar to those developed by large U.S. Internet companies. Personally I have found these Internet companies much more exciting and satisfying to work with than traditional OEMs. The speed and cadence of advancement, the recognition of problems and their importance, the focus on efficiency and optimization have been much more exciting. And the youthful mentality and view to problems, without being burdened by “the way we’ve always done this” has been wonderful.
Also see these blogs of mine over the past year, where you can read more about some of these changes:
“Postcard from Shenzhen: China’s hyperscale datacenter growth, mixed with a more traditional approach”
“China in the clouds, again”
“China: A lot of talk about resource pooling, a better name for disaggregation”
Or see them (and others) all here.
Summary: So it’s taken years, but we engaged U.S. Internet giants, learned about their problems, changed how we thought about storage and began creating solutions that are now being deployed at scale. And we’re constantly learning from these guys. Constantly, because their problems are constantly changing.
We’ve now started to do the same with the Chinese Internet giants. They have similar problems, and will need similar solutions, but they are not the same. And just like the U.S. Internet giants, each one is unique.
My “Size matters: Everything you need to know about SSD form factors” blog in January spawned some interesting questions, a number of them on Z-height.
What is a Z-height anyway?
For a solid state drive (SSD), Z-height describes its thickness and is generally its smallest dimension. Z-height is a redundant term, since Z is a variable representing the height of an SSD. The “Z” is one of the variables – X, Y and Z, synonymous with length, width and height – that describe the measurements of a 3-dimensional object. Ironically, no one says X-length or Y-width, but Z-height is widely used.
What’s the state of affairs with SSD Z-height?
The Z-height has typically been associated with the 2.5” SSD form factor. As I covered in my January form factor blog, the initial dimensions of SSDs were modeled after hard disk drives (HDDs). The 2.5” HDD form factor featured various heights depending on the platter count – the more disks, the greater the capacity and the thicker the HDD. The first 2.5” full-capacity HDDs had a maximum Z-height of 19mm, but quickly dropped to a 15mm maximum to enable full-capacity HDDs in thinner laptops. By the time SSDs hit high-volume market acceptance, the dimensional requirements for storage had shrunk even more, to a maximum height of 12.5mm in the 2.5” form factor. Today, the Z-height of most 2.5” SSDs generally ranges from 5.0mm to 9.5mm.
With printed circuit board (PCB) form factor SSDs—those with no outer case—the Z-height is defined by the thickness of the board and its components, which can be 3mm or less. Some laptops have unique shape or height restrictions for the SSD space allocation. For example, the MacBook Air’s ultra-thin profile requires some of the thinnest SSDs produced.
A new standard in SSD thickness
The platter count of an HDD determines its Z-height. In contrast, an SSD’s Z-height is generally the same regardless of capacity. The proportion of SSD form factors deployed in systems is shifting from the traditional, encased SSDs to the new bare PCB SSDs. As SSDs drift away from the older form factors with different heights, consumers and OEM system designers will no longer need to consider Z-height because the thickness of most bare PCB SSDs will be standard.
The introduction of LSI® SF3700 flash controllers has prompted many questions about the PCIe® (PCI Express) interface and how it benefits solid state storage, and there’s no better person to turn to for insights than our resident expert, Jeremy Werner, Sr. Director of Product and Customer Management in LSI’s Flash Components Division (SandForce):
Most client-based SSDs have used SATA in the past, while PCIe was mainly used for enterprise applications. Why is the PCIe interface becoming so popular for the client market?
Jeremy: Over the past few decades, the performance of host interfaces for client devices has steadily climbed. Parallel ATA (PATA) interface speed grew from 33MB/s to 100MB/s, while the performance of the Serial ATA (SATA) connection rose from 1.5Gb/s to 6Gb/s. Today, some solid state drives (SSDs) use the PCIe Gen2 x4 (second-generation speeds with four data communication lanes) interface, supporting up to 20Gb/s (in each direction). Because the PCIe interface can simultaneously read and write (full duplex) and SATA can only read or write at one time (half-duplex), PCIe can potentially double the 20Gb/s speeds in a mixed (read and write) workload, making it nearly seven times faster than SATA.
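The "nearly seven times" figure follows from simple arithmetic on the raw line rates. Here is a small Python sketch of it. The helper names are made up for illustration, and the numbers ignore protocol and encoding overhead, just as the comparison above does.

```python
SATA_GBPS = 6.0            # SATA 6Gb/s, half-duplex
PCIE_GEN2_LANE_GBPS = 5.0  # PCIe Gen2 raw rate per lane, in each direction


def pcie_raw_gbps(lane_gbps, lanes):
    """Raw one-direction link rate: per-lane rate times the number of lanes."""
    return lane_gbps * lanes


def mixed_workload_gbps(per_direction_gbps, full_duplex):
    """Peak combined read+write rate. A full-duplex link moves data both
    ways at once; a half-duplex link shares one direction's worth."""
    return per_direction_gbps * (2 if full_duplex else 1)


pcie_one_way = pcie_raw_gbps(PCIE_GEN2_LANE_GBPS, lanes=4)        # 20 Gb/s
pcie_mixed = mixed_workload_gbps(pcie_one_way, full_duplex=True)  # 40 Gb/s
sata_mixed = mixed_workload_gbps(SATA_GBPS, full_duplex=False)    # 6 Gb/s

print(f"PCIe Gen2 x4 vs SATA, mixed workload: {pcie_mixed / sata_mixed:.1f}x")
```

The ratio is 40Gb/s over 6Gb/s, about 6.7. It is a best case that assumes a perfectly balanced mix of reads and writes.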
Will the PCIe interface replace SATA for SSDs?
Jeremy: Eventually the replacement is likely, but it will probably take many years in the single-drive client PC market given two hindrances. First, some single-drive client platforms must use a common HDD and SSD connection to give users the choice between the two devices. And because the 6Gb/s SATA interface delivers much higher speeds than hard disk drives can reach, there is no immediate need for HDDs to move to the faster PCIe connection, leaving SATA as the sole interface for the client market. Second, the older personal computers already in consumers’ homes that need an SSD upgrade support only SATA storage devices, so there’s no opportunity for PCIe in that upgrade market.
By contrast, the enterprise storage market, and even some higher-end client systems, will migrate quickly to PCIe since they will see significant speed increases and can more easily integrate PCIe SSD solutions available now.
It is noteworthy that some standards, like M.2 and SATA Express, have defined a single connector that supports SATA or PCIe devices. The recently announced LSI SF3700 is one example of an SSD controller that supports both of those interfaces on an M.2 board.
What is meant by the terms “x1, x2, x4, x16” when referencing a particular PCIe interface?
Jeremy: These numbers are the PCIe lane counts in the connection. Either the host (computer) or the device (SSD) could limit the number of lanes used. The theoretical maximum speed of the connection (not including protocol overhead) is the number of lanes multiplied by the speed of each lane.
What is protocol overhead?
Jeremy: PCIe, like many bus interfaces, uses a transfer encoding scheme – a set number of data bits represented by a slightly larger number of bits called a symbol. The additional bits in the symbol are overhead required to manage the transmitted user data. PCIe Gen3 features a more efficient data transfer encoding with 128b/130b (about 1.5% overhead) instead of the 8b/10b (20% overhead) of PCIe Gen2, improving encoding efficiency by about 23%.
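For readers who want to check the overhead numbers, here is a short Python sketch. It assumes the encodings defined in the PCIe specifications, 8b/10b for Gen2 and 128b/130b for Gen3; the function name is just for illustration.

```python
from fractions import Fraction


def encoding_efficiency(data_bits, symbol_bits):
    """Fraction of transmitted bits that carry user data."""
    return Fraction(data_bits, symbol_bits)


gen2 = encoding_efficiency(8, 10)     # PCIe Gen2: 8b/10b
gen3 = encoding_efficiency(128, 130)  # PCIe Gen3: 128b/130b

print(f"Gen2 overhead: {float(1 - gen2):.1%}")
print(f"Gen3 overhead: {float(1 - gen3):.1%}")
print(f"Efficiency gain from encoding alone: {float(gen3 / gen2 - 1):.1%}")
```

This covers only the line encoding. Packet headers and flow control add further protocol overhead that the sketch does not model.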
What is defined in the PCIe 2.0 and 3.0 specifications, and do end users really care?
Jeremy: Although each PCIe Gen3 lane is faster than PCIe Gen2 (8Gb/s vs 5Gb/s, respectively), lanes can be combined to boost performance in both versions. The changes most relevant to consumers pertain to higher speeds. For example, today consumer SSDs top out at 150K random read IOPS at 4KB data transfer sizes. That translates to about 600MB/s, which is insufficient to saturate a PCIe Gen2 x2 link, so consumers would see little benefit from a PCIe Gen3 solution over PCIe Gen2. The maximum performance of PCIe Gen2 x4 and PCIe Gen3 x2 devices is almost identical because of the different transfer encoding schemes mentioned previously.
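Both claims above can be checked with a few lines of Python. This is a rough sketch with made-up helper names. It assumes 4KB means 4,096 bytes, uses the per-lane rates quoted above, and accounts only for line encoding (8b/10b for Gen2, 128b/130b for Gen3), not for packet-level protocol overhead.

```python
def iops_to_mb_per_s(iops, transfer_bytes):
    """Throughput in MB/s for a given IOPS figure and transfer size."""
    return iops * transfer_bytes / 1e6


def effective_mb_per_s(lane_gbps, lanes, data_bits, symbol_bits):
    """Usable one-direction link bandwidth in MB/s after line encoding."""
    raw_bits_per_s = lane_gbps * 1e9 * lanes
    return raw_bits_per_s * data_bits / symbol_bits / 8 / 1e6


ssd = iops_to_mb_per_s(150_000, 4096)
gen2_x2 = effective_mb_per_s(5, 2, 8, 10)
gen2_x4 = effective_mb_per_s(5, 4, 8, 10)
gen3_x2 = effective_mb_per_s(8, 2, 128, 130)

print(f"150K IOPS at 4KB: {ssd:.0f} MB/s")
print(f"PCIe Gen2 x2:     {gen2_x2:.0f} MB/s")
print(f"PCIe Gen2 x4:     {gen2_x4:.0f} MB/s")
print(f"PCIe Gen3 x2:     {gen3_x2:.0f} MB/s")
```

The SSD's roughly 600MB/s sits well under the Gen2 x2 ceiling of about 1,000MB/s, and the Gen2 x4 and Gen3 x2 links land within a couple of percent of each other.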
Are there mandatory features that must be supported in any of these specifications?
Jeremy: Yes, but nearly all of these features have little impact on performance, so most users have no interest in the specs. It’s important to keep in mind that the PCIe speeds I’ve cited are defined as the maximums, and the spec has no minimum speed requirement. This means a PCIe Gen3 solution might support only a maximum of 5Gb/s, but still be considered a PCIe Gen3 solution if it meets the necessary specifications. So buyers need to be aware of the actual speed rating of any PCIe solution.
Is a PCIe Gen3 SSD faster than a PCIe Gen2 SSD?
Jeremy: Not necessarily. For example, a PCIe Gen2 x4 SSD is capable of higher speeds than a PCIe Gen3 x1 SSD. However, bottlenecks other than the front-end PCIe interface will limit the performance of many SSDs. Examples of other choke points include the bandwidth of the flash, the processing/throughput of the controller, the power or thermal limitations of the drive and its environment, and the ability to remove heat from that environment. All of these factors can, and typically do, prevent the interface from reaching its full steady-state performance potential.
In what form factors are PCIe cards available?
Jeremy: PCIe cards are typically referred to as plug-in products, much like SSDs, graphics cards and host-bus adapters. PCIe SSDs come in many form factors, with the most popular called “half-height, half-length.” But the popularity of the new, tiny M.2 form factors is growing, driven by rising demand for smaller consumer computers. There are other PCIe form factors that resemble traditional hard disk drives, such as the SFF-8639, a 2.5” hard disk drive form factor that features four PCIe lanes and is hot pluggable. What’s more, its socket is compatible with the SAS and SATA interfaces. The adoption of the SATA Express 2.5” form factor has been limited, but could be given a boost with the availability of new capabilities like SRIS (Separate Refclk with Independent SSC), which enables the use of lower cost interconnection cables between the device and host.
Are all M.2 cards the same?
Jeremy: No. All SSD M.2 cards are 22 mm wide (while some WAN cards are 30 mm wide), but the specification allows for different lengths (30, 42, 60, 80, and 110 mm). What’s more, the cards can be single- or double-sided to account for differences in the thickness of the products. Also, they are compatible with two different sockets (socket 2 and socket 3). SSDs compatible with both socket types, or only socket 2, can connect only two lanes (x2), while SSDs compatible with only socket 3 can connect up to four (x4).
In my last few blogs, I covered various aspects of SSD form factors and included many images of the types that Jeremy mentioned above. I also delve deeper into details of the M.2 form factor in my blog “M.2: Is this the Prince of SSD form factors?” One thing about PCIe is certain: It is the next step in the evolution of computer interfaces and will give rise to more SSDs with higher performance, lower power consumption and better reliability.
How did he do that?
Growing up, I watched a little TV. Okay, a lot of TV as I did not have my DVR or iPad and a man who would one day occupy the White House as VP had not yet invented the Internet. Of the many shows I watched, MacGyver was one of my favorites. He would take ordinary objects and use them to solve complicated problems in a way no one could have imagined. Out of all the things he used, his trusty Swiss army knife was the most awesome. With all its blades, tools and accessories, it could solve multiple problems at the same time. It was easy to use, did not take up a lot of space and was very cost-effective.
Nytro MegaRAID – the Swiss Army knife of server storage
LSI has its own multi-function, get-yourself-out-of-a-fix workhorse – the Nytro MegaRAID® card, part of the Nytro product family. It combines caching intelligence, RAID protection and flash on a single PCIe® card to accelerate applications, so it can be deployed to solve problems across a broad number of applications.
A feature for every challenge!
The Nytro MegaRAID card is built on the same trusted technology as the MegaRAID cards deployed in datacenters worldwide. That means it is enterprise-architected, hardened and datacenter-tested. Its Swiss Army knife-like features include, as I mentioned, on-board flash storage that can be configured to monitor the flow of data from an application to the attached RAID-protected storage, intelligently identify hot, or the most frequently accessed, data, and automatically move a copy of that data to the flash storage to accelerate applications. The next time the application needs that data, the information is fetched from flash, not the much slower traditional hard disk drive (HDD) storage.
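The general idea of hot-data caching can be sketched in Python. To be clear, this is not the Nytro MegaRAID algorithm, which isn't described here. It is a toy model with made-up names: it counts reads per block, copies a block into a small "flash" cache once it has been read a chosen number of times, and evicts the least recently used block when the cache is full.

```python
from collections import Counter, OrderedDict


class HotDataCache:
    """Toy read cache: blocks read often enough get a copy in 'flash'."""

    def __init__(self, backing_store, capacity_blocks, hot_threshold):
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be at least 1")
        self.backing_store = backing_store  # dict-like stand-in for the HDD volume
        self.capacity = capacity_blocks     # how many blocks the flash can hold
        self.hot_threshold = hot_threshold  # reads needed before a block is "hot"
        self.access_counts = Counter()
        self.flash = OrderedDict()          # cached copies, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        self.access_counts[address] += 1
        if address in self.flash:
            self.flash.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return self.flash[address]
        self.misses += 1
        data = self.backing_store[address]   # slow path: go to the HDDs
        if self.access_counts[address] >= self.hot_threshold:
            self._promote(address, data)
        return data

    def _promote(self, address, data):
        if len(self.flash) >= self.capacity:
            self.flash.popitem(last=False)   # evict the least recently used block
        self.flash[address] = data           # keep a copy; the HDD still holds the original


# Hypothetical usage: ten blocks on "disk", room for two in "flash".
disk = {i: f"block{i}" for i in range(10)}
cache = HotDataCache(disk, capacity_blocks=2, hot_threshold=2)
for _ in range(3):
    cache.read(1)  # first two reads miss, the third is served from flash
```

Because the cache holds copies rather than the only version of the data, losing a cached block in this model costs performance, not data.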
Hard drives can lead to slowdowns in another way, too, when the mechanics wear out and fail. When they do, your storage (and application) performance can dramatically decrease – in a RAID storage environment, this is called degraded mode. The good news is that the Nytro MegaRAID card stores much of an application’s frequently used data in its intelligent flash-based cache, boosting the performance of a connected HDD in degraded mode by as much as 10x, depending on the configuration. The Swiss Army knife follow-on benefit is that when you replace the failed drive, Nytro MegaRAID speeds RAID storage rebuilds by as much as 4x. RAID rebuilds add to IT admin time, and IT time is money, so that’s money you get to keep in your pocket.
The Nytro MegaRAID card also can be configured so you can use half of its onboard flash as a pair of mirrored boot drives. In big data environments, this mirroring frees up two boot drives for use as data storage to help increase your server storage density (aka available storage capacity), often significantly, while dramatically improving boot time. What’s more, that same flash can be deployed instead as primary storage to complement your secondary HDD storage with higher speeds, providing a superfast repository for key files like virtual desktop infrastructure (VDI) golden images or key database log files.
One MacGyver Swiss Army knife, one Nytro MegaRAID card – both easy-to-use solutions for a number of complex problems.
I was asked some interesting questions recently by CEO & CIO, a Chinese business magazine. The questions ranged from how Chinese Internet giants like Alibaba, Baidu and Tencent differ from other customers and what leading technologies big Internet companies have created to questions about emerging technologies such as software-defined storage (SDS) and software-defined datacenters (SDDC) and changes in the ecosystem of datacenter hardware, software and service providers. These were great questions. Sometimes you need the press or someone outside the industry to ask a question that makes you step back and think about what’s going on.
I thought you might be interested, so this blog, the first of a 3-part series covering the interview, shares details of the first two questions.
CEO & CIO: In recent years, Internet companies have built ultra large-scale datacenters. Compared with traditional enterprises, they also take the lead in developing datacenter technology. From an industry perspective, what are the three leading technologies of ultra large-scale Internet data centers in your opinion? Please describe them.
There are so many innovations and important contributions to the industry from these hyperscale datacenters in hardware, software and mechanical engineering. To choose three is difficult. While I would prefer to choose hardware innovations as their big ones, I would suggest the following as they have changed our world and our industry and are changing our hardware and businesses:
Autonomous behavior and orchestration
An architect at Microsoft once told me, “If we had to hire admins for our datacenter in a normal enterprise way, we would hire all the IT admins in the world, and still not have enough.” There are now around 1 million servers in Microsoft datacenters. Hyperscale datacenters have had to develop autonomous, self-managing, sometimes self-deploying datacenter infrastructure simply to expand. They are pioneering datacenter technology for scale – innovating, learning by trial and error, and evolving their practices to drive more work/$. Their practices are specialized but beginning to be emulated by the broader IT industry. OpenStack is the best example of how that specialized knowledge and capability is being packaged and deployed broadly in the industry. At LSI, we’re working with both hyperscale and orchestration solutions to make better autonomous infrastructure.
High availability at datacenter level vs. machine level
As systems get bigger they have more components and more modes of failure, and keeping them reliable becomes more complex and expensive. As storage is used more, and more aggressively, drives tend to fail. They are simply being used more. And yet there is continued pressure to reduce costs and complexity. By the time hyperscale datacenters had evolved to massive scale – hundreds of thousands of servers in multiple datacenters – they had created solutions for absolute reliability, even as individual systems got less expensive, less complex and much less reliable. This is what has enabled the very low cost structures of the cloud, and made it a reliable resource.
These solutions are well timed too, as more enterprise organizations need to maintain on-premises data across multiple datacenters with absolute reliability. The traditional view that a single server requires 99.999% reliability is giving way to a more pragmatic view of maintaining high reliability at the macro level – across the entire datacenter. This approach accepts the failure of individual systems and components even as it maintains data center level reliability. Of course – there are currently operational issues with this approach. LSI has been working with hyperscale datacenters and OEMs to engineer improved operational efficiency and resilience, and minimized impact of individual component failure, while still relying on the datacenter high-availability (HA) layer for reliability.
Big data
It’s such an overused term. It’s difficult to believe the term barely existed a few years ago. The gift of Hadoop® to the industry – an open source attempt to copy Google® MapReduce and Google File System – has truly changed our world unbelievably quickly. Today, Hadoop and the other big data applications enable search, analytics, advertising, peta-scale reliable file systems, genomics research and more – even services like Apple® Siri run on Hadoop. Big data has changed the concept of analytics from statistical sampling to analysis of all data. And it has already enabled breakthroughs and changes in research, where relationships and patterns are looked for empirically, rather than based on theories.
Overall, I think big data has been one of the most transformational technologies this century. Big data has changed the focus from compute to storage as the primary enabler in the datacenter. Our embedded hard disk controllers, SAS (Serial Attached SCSI) host bus adaptors and RAID controllers have been at the heart of this evolution. The next evolutionary step in big data is the broad adoption of graph analysis, which integrates the relationship of data, not just the data itself.
CEO & CIO: Due to cloud computing, mobile connectivity and big data, the traditional IT ecosystem or industrial chain is changing. What are the three most important changes in LSI’s current cooperation with the ecosystem chain? How does LSI see the changes in the various links of the traditional ecosystem chain? What new links are worth attention? Please give some examples.
Cloud computing and the explosion of data driven by mobile devices and media have changed, and continue to change, our industry and ecosystem contributors dramatically. It’s true the enterprise market (customers, OEMs, technology, applications and use cases) has been pretty stable for 10-20 years, but as cloud computing has become a significant portion of the server market, it has increasingly affected ecosystem suppliers like LSI.
Timing: It’s no longer enough to follow Intel’s ticktock product roadmap. Development cycles for datacenter solutions used to be 3 to 5 years. But these cycles are becoming shorter. Now, demand for solutions is closer to 6 months – forcing hardware vendors to plan and execute to far tighter development cycles. Hyperscale datacenters also need to be able to expand resources very quickly, as customer demand dictates. As a result they incorporate new architectures, solutions and specifications out of cycle with the traditional Intel roadmap changes. This has also disrupted the ecosystem.
End customers: Hyperscale datacenters now have purchasing power in the ecosystem, with single purchase orders sometimes amounting to 5% of the server market. While OEMs still are incredibly important, they are not driving large-scale deployments or innovating and evolving nearly as fast. The result is more hyperscale design-win opportunities for component or sub-system vendors if they offer something unique or a real solution to an important problem. This also may shift profit pools away from OEMs to strong, nimble technology solution innovators. It also has the potential to reduce overall profit pools for the whole ecosystem, which is a potential threat to innovation speed and re-investment.
New players: Traditionally, a few OEMs and ISVs globally have owned most of the datacenter market. However, the supply chain of the hyperscale cloud companies has changed that. Leading datacenters have architected, specified or even built (in Google’s case) their own infrastructure, though many large cloud datacenters have been equipped with hyperscale-specific systems from Dell and HP. But more and more systems built exactly to datacenter specifications are coming from suppliers like Quanta. Newer network suppliers like Arista have increased market share. Some new hyperscale solution vendors have emerged, like Nebula. And software has shifted to open source, sometimes supported for-pay by companies copying the Redhat® Linux model – companies like Cloudera, Mirantis or United Stack. Personally, I am still waiting for the first 3rd-party hardware service emulating a Linux support and service company to appear.
Open initiatives: Yes, we’ve seen Hadoop and its derivatives deployed everywhere now – even in traditional industries like oil and gas, pharmacology, genomics, etc. And we’ve seen the emergence of open-source alternatives to traditional databases being deployed, like Cassandra. But now we’re seeing new initiatives like Open Compute and OpenStack. Sure, these are helpful to hyperscale datacenters, but they are also enabling smaller companies and universities to deploy hyperscale-like infrastructure and get the same kind of automated control, efficiency and cost structures that hyperscale datacenters enjoy. (Of course they don’t get fully there on any front, but it’s a lot closer). This trend has the potential to hurt OEM and ISV business models and markets and establish new entrants – even as we see Quanta, TYAN, Foxconn, Wistron and others tentatively entering the broader market through these open initiatives.
New architectures and new algorithms: There is a clear movement toward pooled resources (or rack scale architecture, or disaggregated servers). Developing pooled resource solutions has become a partnership between core IP providers like Intel and LSI with the largest hyperscale datacenter architects. Traditionally new architectures were driven by OEMs, but that is not so true anymore. We are seeing new technologies emerge to enable these rack-scale architectures (RSA) – technologies like silicon photonics, pooled storage, software-defined networks (SDN), and we will soon see pooled main memory and new nonvolatile main memories in the rack.
We are also seeing the first tries at new processor architectures about to enter the datacenter: ARM 64 for cool/cold storage and web tier and OpenPower P8 for high power processing – multithreaded, multi-issue, pooled memory processing monsters. This is exciting to watch. There is also an emerging interest in application acceleration: general-purpose computing on graphics processing units (GPGPUs), regular expression (regex) processors, live stream analytics, etc. We are also seeing the first generation of graph analysis deployed at massive scale in real time.
Innovation: The pace of innovation appears to be accelerating, although maybe I’m just getting older. But the easy gains are done. On one hand, datacenters need exponentially more compute and storage, and they need to operate 10x to 1000x more quickly. On the other, memory, processor cores, disks and flash technologies are getting no faster. The only way to fill that gap is through innovation. So it’s no surprise there are lots of interesting things happening at OEMs and ISVs, chip and solution companies, as well as open source community and startups. This is what makes it such an interesting time and industry.
Consumption shifts: We are seeing a decline in laptop and personal computer shipments, a drop that naturally is reducing storage demand in those markets. Laptops are also seeing a shift to SSD from HDD. This has been good for LSI, as our footprint in laptop HDDs had been small, but our presence in laptop SSDs is very strong. Smart phones and tablets are driving more cloud content, traffic and reliance on cloud storage. We have seen a dramatic increase in large HDDs for cloud storage, a trend that seems to be picking up speed, and we believe the cloud HDD market will be very healthy and will see the emergence of new, cloud-specific HDDs that are radically different and specifically designed for cool and cold storage.
There is also an explosion of SSD and PCIe flash cards in cloud computing for databases, caches, low-latency access and virtual machine (VM) enablement. Many applications that we take for granted would not be possible without these extreme low-latency, high-capacity flash products. But very few companies can make a viable storage system from flash at an acceptable cost, opening up an opportunity for many startups to experiment with different solutions.
Summary: So I believe the biggest hyperscale innovations are autonomous behavior and orchestration, HA at the datacenter level vs. machine level, and big data. These are radically changing the whole industry. And what are those changes for our industry and ecosystem? You name it: timing, end customers, new players, open initiatives, new architectures and algorithms, innovation, and consumption patterns. All that’s staying the same are legacy products and solutions.
These were great questions. Sometimes you need the press or someone outside the industry to ask a question that makes you step back and think about what’s going on. Great questions.
Tags: Alibaba, Apple Siri, Arista, ARM 64, Baidu, big data, Cassandra, CEO & CIO Magazine, China, cloud storage, Cloudera, cold storage, cool storage, datacenter, datacenter ecosystem, Dell, flash, Foxconn, Google File System, Google MapReduce, Hadoop, hard disk drive, HDD, high availability, HP, hyperscale datacenter, Intel, Internet, latency, Microsoft, Mirantis, Nebula, OEM, Open Compute, OpenPower P8, OpenStack, original equipment manufacturer, Quanta, rack scale, RAID, Redhat Linux, SAS, SDDC, SDN, SDS, Serial Attached SCSI, software-defined datacenter, software-defined networks, software-defined storage, solid state drive, SSD, Tencent, TYAN, United Stack, virtual machine, VM, Wistron
I was recently speaking to a customer about data reduction technology and I remembered a conversation I had with my mother when I was a teenager. She used to complain how chaotic my bedroom looked, and one time I told her “I was illustrating the second law of thermodynamics” for my physics class. I was referring to the mess and the tendency of things to evolve towards the state of maximum entropy, or randomness. I have to admit I only used that line once with my mom because it pissed her off and she likened me to an intelligent donkey.
I never expected those early lessons in theoretical physics to be useful in the real world, but as it turns out entropy can be a significant factor in determining solid state drive (SSD) performance. When an SSD employs data reduction technology, the degree of entropy or randomness in the data stream becomes inversely related to endurance and performance—the lower the data entropy, the higher the endurance and performance of the SSD.
Entropy affects data reduction
In this context I am defining entropy as the degree of randomness in data stored by an SSD. Theoretically, minimal or nonexistent entropy would be characterized by data bits of all ones or all zeros, and maximum entropy by a completely random series of ones and zeros. In practice, the entropy of what we often call real-world data falls somewhere in between these two extremes. Today we have hardware engines and software algorithms that can perform deduplication, string substitution and other advanced procedures that can reduce files to a fraction of their original size with no loss of information. The greater the predictability of data – that is, the lower the entropy – the more it can be reduced. In fact, some data can be reduced by 95% or more!
Files such as documents, presentations and email generally contain repeated data patterns with low randomness, so are readily reducible. In contrast, video files (which are usually compressed) and encrypted files (which are inherently random) are poor candidates for data reduction.
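A rough way to see this effect for yourself is to compress a low-entropy sample and a high-entropy sample and compare the results. The sketch below is only an illustration: gzip stands in for a data reduction engine (it is not what any SSD controller actually runs), and the names `reduced_size` and `SAMPLE_BYTES` are made up for this example.

```shell
#!/bin/sh
# Compare how well low-entropy and high-entropy data compress.
# gzip is only a stand-in for a data reduction engine (assumption for illustration).

# reduced_size N SOURCE: print the gzip-compressed size, in bytes,
# of the first N bytes read from SOURCE.
reduced_size() {
    head -c "$1" "$2" | gzip -c | wc -c | tr -d ' '
}

SAMPLE_BYTES=1000000
low=$(reduced_size "$SAMPLE_BYTES" /dev/zero)      # all zeros: minimal entropy
high=$(reduced_size "$SAMPLE_BYTES" /dev/urandom)  # random bytes: maximal entropy

echo "low-entropy sample:  $SAMPLE_BYTES -> $low bytes"
echo "high-entropy sample: $SAMPLE_BYTES -> $high bytes"
```

The all-zeros sample should shrink to a tiny fraction of its original size, while the random sample should not shrink at all. Real-world files land somewhere between those two extremes.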
A reminder is in order not to confuse random data with random I/O. Random (and sequential) I/Os describe the way data is accessed from the storage media. The mix of random vs. sequential I/Os also influences performance, but in a different way than entropy, described in my blog “Teasing out the lies in SSD benchmarking.”
Why data reduction matters in an SSD
The NAND flash memory inside SSDs is very sensitive to the cumulative amount of data written to it. The more data written to flash, the shorter the SSD’s service life and the sooner its performance will degrade. Writing less data, therefore, means better endurance and performance. You can read more about this topic in my two blogs “Can data reduction technology substitute for TRIM” and “Write Amplification – Part 2.”
Real-world examples in client computing
Take an encrypted text document. The file started out as mostly text with some background formatting data. All things considered, the original text file is fairly simple and organized. The encryption, by design, turns the data into almost completely random gibberish with almost no predictability to the file. The original text file, then, has low entropy and the encrypted file high entropy.
Intel Labs examined entropy in the context of compressibility as background research to support its Intel SSD 520 Series. The following chart summarizes Intel’s findings for the kinds of data commonly found on client storage drives, and the amount of compression that might be achieved:
According to Intel, “75% of the file types observed can be typically compressed 60% or more.” Granted, the kind of files found on drives varies widely according to the type of user. Home systems might contain more compressed audio and video, for example – poor candidates, as we mentioned, for data reduction. But after examining hundreds of systems from a wide range of environments, LSI estimates that the entropy of typical user data averages about 50%, suggesting that many users would see at least a moderate improvement in performance and endurance from data reduction because most data can be reduced before it is written to the SSD.
Real-world examples in the enterprise
Enterprise IT managers might be surprised at the extent to which data reduction technology can increase workload performance. While gauging the level of improvement with any precision would require data-specific benchmarking, sample data can provide useful insights. LSI examined the entropy of various data types, shown in the chart below. I found the high reducibility of the Oracle® database file very surprising because I had previously been told by database engineers that I should expect higher entropy. I later came to understand these enterprise databases are designed for speed, not capacity optimization. Therefore it is faster to store the data in its raw form rather than use a software compression application to compress and decompress the database on the fly and slow it down.
Putting it all together
The chief goals of PC and laptop users and IT managers have long been, and remain, to maximize the performance and lifespan of storage devices – SSDs and HDDs – and at a competitive price point. The challenge for SSD users is to find a device that delivers on all three fronts. LSI® SandForce® DuraWrite™ technology helps give SSD users exactly what they want. By reducing the amount of data written to flash memory, DuraWrite increases SSD endurance and performance without additional cost – even if it doesn’t help organize your teenager’s bedroom.
It’s the start of the new year, and it’s traditional to make predictions – right? But predicting the future of the datacenter has been hard lately. There have been and continue to be so many changes in flight that possibilities spin off in different directions. Fractured visions through a kaleidoscope. Changes are happening in the businesses behind datacenters, the scale, the tasks and what is possible to accomplish, the value being monetized, and the architectures and technologies to enable all of these.
A few months ago I was asked to describe the datacenter in 2020 for some product planning purposes. Dave Vellante of Wikibon & John Furrier of SiliconANGLE asked me a similar question a few weeks ago. 2020 is out there – almost 7 years. It’s not easy to look into the crystal ball that far and figure out what the world will look like then, especially when we are in the midst of those tremendous changes. For some context I had to think back 7 years – what was the datacenter like then, and how profound have the changes been over the past 7 years?
And 7 years ago, our forefathers…
It was a very different world. Facebook barely existed, and had just barely passed the “university only” membership. Google was using Velcro, Amazon didn’t have its services, cloud was a non-existent term. In fact DAS (direct attach storage) was on the decline because everyone was moving to SAN/NAS. 10GE networking was in the future (1GE was still in growth mode). Linux was not nearly as widely accepted in enterprise – Amazon was in the vanguard of making it usable at scale (with Werner Vogels saying “it’s terrible, but it’s free, as in free beer”). Servers were individual – no “PODs,” and VMware was not standard practice yet. SATA drives were nowhere in datacenters.
An enterprise disk drive topped out at around 200GB in capacity. Nobody used the term petabyte. People, including me, were just starting to think about flash in datacenters, and it was several years later that solutions became available. Big data did not even exist. Not as a term or as a technology, definitely not Hadoop or graph search. In fact, Google’s seminal paper on MapReduce had just been published, and it would become the inspiration for Hadoop – something that would take many years before Yahoo picked it up and helped make it real.
Analytics were statistical and slow, and you had to be very explicitly looking for something. Advertising on the web was a modest business. Cold storage was tape or MAID, not vast pools of cheap disks in the cloud at absurdly low price points. None of the Chinese web-cloud guys existed… In truth, at LSI we had not even started looking at or getting to know the web datacenter guys. We assumed they just bought from OEMs…
No one streamed mainstream media – TV and movies – and there were no tablets to stream them to. YouTube had just been purchased by Google. Blu-ray was just getting started and competing with HD-DVD (which I foolishly bought 7 years ago), and integrated GPS’s in your car were a high-tech growth area. The iPhone or Android had not launched, Danger’s Sidekick was the cool phone, flip phones were mainstream, there was no App store or the billions of sales associated with that, and a mobile web browser was virtually useless.
Dell, IBM, and HP were the only real server companies that mattered, and the whole industry revolved around them, as well as EMC and NetApp for storage. Cisco, Lenovo and Huawei were not server vendors. And Sun was still Sun.
7 years from now
So – 7 years from now? That’s hard to predict, so take this with a grain of salt… There are many ways things could play out, especially when global legal, privacy, energy, hazardous waste recycling, and data retention requirements come into play, not to mention random chaos and invention along the way.
Compute-centric to dataflow-centric
Major applications are changing (have changed) from compute-centric to dataflow architectures. That is big data. The result will probably be a decline in the influence of processor vendors, and the increased focus on storage, network and memory, and optimized rack-level architectures. A handful of hyperscale datacenters are leading the way, and dragging the rest of us along. These types of solutions are already being deployed in big enterprise for specialized use cases, and their adoption will only increase with time. In 7 years, the main deployment model will echo what hyperscale datacenters are doing today: disaggregated racks of compute, memory and storage resources.
The datacenter is now being viewed as a profit growth enabler, rather than a cost center. That implies more compute = more revenue. That changes the investment profile and the expectations for IT. It will not be enough for enterprise IT departments to minimize change and risk because then they would be slowing revenue growth.
Customers and vendors
We are in the early stages of a customer revolt. Whether it’s deserved or not is immaterial, though I believe it’s partially deserved. Large customers have decided (and I’m doing broad brush strokes here) that OEMs are charging them too much and adding “features” that add no value and burn power, that the service contracts are excessively expensive and that there is very poor management interoperability among OEM offerings – on purpose to maintain vendor lock-in. The cost structures of public cloud platforms like Amazon are proof there is some merit to the argument. Management tools don’t scale well, and require a lot of admin intervention. ISVs are seen as no better. Sure, the platforms and apps are valuable and critical, but they’re really expensive too, and in a few cases, open source solutions actually scale better (though ISVs are catching up quickly).
The result? We’re seeing a push to use whitebox solutions that are interoperable and simple. Open source solutions – both software and hardware – are gaining traction in spite of their problems. Just witness the latest Open Compute Summit and the adoption rate of Hadoop and OpenStack. In fact many large enterprises have a policy that’s pretty much – any new application needs to be written for open source platforms on scale-out infrastructure.
Those 3 OEMs are struggling. Dell, HP and IBM are selling more servers, but at a lower revenue. Or in the case of IBM – selling the business. They are trying to upsell storage systems to offset those lost margins, and they are trying to innovate and vertically integrate to compensate for the changes. In contrast we’re seeing a rapid increase planned from self-built, self-architected hyperscale datacenters, especially in China. To be fair – those pressures on price and supplier revenue are not necessarily good for our industry. As well, there are newer entrants like Huawei and Cisco taking a noticeable chunk of the market, as well as an impending growth of ISV and 3rd party full rack “shrink wrapped” systems. Everybody is joining the party.
Storage, cold storage and storage-class memory
Stepping further out on the limb, I believe (but who really knows) that by 2020 storage as we know it is no longer shipping. SMB is hollowed out to the cloud – that is – why would any small business use anything but cloud services? The costs are too compelling. Cloud storage is stratified into 3 levels: storage-class memory, flash/NVM and cool/cold bulk disk storage. Cold storage is going to be a very, very important area. You need to save that data, but spend zero power, and zero $ on storing it. Just look at some of the radical ideas like Facebook’s Blu-ray jukebox to address that, which was masterminded by a guy I really like – Gio Coglitore – and I am very glad is getting some rightful attention. (http://www.wired.com/wiredenterprise/2014/02/facebook-robots/)
I believe that pooled storage class memory is inevitable and will disrupt high-performance flash storage, probably beginning in 2016. My processor architect friends and I have been daydreaming about this since 2005. That disruption’s OK, because flash use will continue to grow, even as disk use grows. There is just too much data. I’ve seen one massive vendor’s data showing average servers are adding something like 0.2 hard disks per year and 0.1 SSDs per year – and that’s for the average server including diskless nodes that are usually the most common in hyperscale datacenters. So growth in spite of disruption and capacity growth.
Data will be pooled, and connected by fabric as distributed objects or key/value pairs, with erasure coding. In fact, Object store (key/value – whatever) may have “obsoleted” block storage. And the need for these larger objects will probably also obsolete file as we’re used to it. Sure disk drives may still be block based, though key/value gives rise to all sorts of interesting opportunities to support variable size structures, obscure small fault domains, and variable encryption/compression without wasting space on disk platters. I even suspect that disk drives as we know them will be morphing into cold store specialty products that physically look entirely different and are made from different materials – for a lot of reasons. 15K drives will be history, and 10K drives may too. In fact 2” drives may not make sense anymore as the laptop drive and 15K drive disappear and performance and density are satisfied by flash.
Enterprise becomes private cloud that is very similar structurally to hyperscale, but is simply in an internal facility. And SAN/NAS products as we know them will be starting on the long end of the tail as legacy support products. Sure new network based storage models are about to emerge, but they’re different and more aligned to key/value.
Rack-scale architectures will have taken over clustered deployments. That means pooled resources. Processing will be pools of single-socket SoC servers enabling massive clusters, rather than lots of 2-socket servers. These SoCs might even be mobile device SoCs at some point or at least derived from that – the economics of scale and fast cadence of consumer SoCs will make that interesting, maybe even inevitable. After all, the current Apple A7 in the iPhone 5S is a dual core, 64-bit ARMv8 at 1.4GHz and the whole iPhone costs as much as mainstream server processor chips. In a few years, an 8 or 16 core equivalent at 1.5GHz or 2GHz is not hard to imagine, and the cost structure should be excellent.
Rapidly evolving open source applications will have morphed into eventually consistent dataflow tasks. Or they will be emerging in-memory applications working on vast data structures in the pooled storage class memory at the rack or larger scale, which will add tremendous monetary value to businesses. Whatever the evolutionary paths – the challenge for the next 10 years is optimizing dataflow as the amount used continues to exponentially grow. After all – data has value in aggregate, so why would you throw anything away, even as the amount we generate increases?
Clusters will be autonomous. Really autonomous. As in a new term I love: “emergent.” It’s when you can start using big data analytics to monitor the datacenter, and make workload/management and data placement decisions in real time, automatically, and the datacenter begins to take on un-predicted characteristics. Deployment will be autonomous too. Power on a pod of resources, and it just starts working. Google does that already.
Layer 2 datacenter network switches will either be disappearing or will have migrated to a radically different location in the rack hierarchy. There are many ways this can evolve. I’m not sure which one(s) will dominate, but I know it will look different. And it will have different bandwidth. 100G moving to 400G interconnect fabric over fiber.
So there you have it. Guaranteed correct…
Different applications and dataflow, different architectures, different processors, different storage, different fabrics. Probably even a re-alignment of vendors.
Predicting the future of the datacenter has not been easy. There have been, and are so many changes happening. The businesses behind them. The scale, the tasks and what is possible to accomplish, the value being monetized, and the architectures and technologies to enable all of these. But at least we have some idea what’s ahead. And it’s pretty different, and exciting.
Tags: 10 gigabit ethernet, 2020, Amazon, Apple, China, Cisco, cloud storage, cold storage, datacenter, Dell, EMC, Facebook, flash, Google, Hadoop, HP, Huawei, hyperscale datacenter, IBM, iPhone, kaleidoscope, Lenovo, NAS, NetApp, non-volatile memory, NVM, Open Compute, OpenStack, rack scale architecture, SAN, SoC, Sun, VMware, YouTube
A major reason enterprise customers see high latency and poorer than expected performance when implementing flash technology is that the flash partition is not aligned on a sector boundary that allows the flash device to access its data efficiently. When creating a Logical Volume (LVM), things can get even more complicated. Proper partition alignment is critical to performance when implementing flash in your enterprise.
An aligned partition is one that starts on a sector whose byte offset is evenly divisible by 4k or 8k. With 512-byte sectors, that means a starting sector number divisible by eight for 4k alignment, or by 16 for 8k alignment. Aligned input/output (IO) operations will start at sector 8 for 4k alignments, 16 for 8k alignments, and so forth, with sector 2048 for 1M alignments.
If a flash partition is unaligned – its IO operations start at a sector number not divisible by eight – the device will perform two IOs over adjacent blocks instead of one. These extra IOs will degrade performance of the flash device. In our testing, we have seen up to 4x performance gains by properly aligning the flash device.
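The alignment rule comes down to simple arithmetic, which a few lines of shell can check. This is a minimal sketch that assumes 512-byte sectors by default; the function name `is_aligned` and its arguments are invented for this example and are not part of any tool.

```shell
#!/bin/sh
# is_aligned START_SECTOR ALIGN_BYTES [SECTOR_BYTES]
# Succeeds (exit status 0) when the starting byte offset of the partition
# is an exact multiple of ALIGN_BYTES. SECTOR_BYTES defaults to 512 (assumption).
is_aligned() {
    start=$1
    align=$2
    sector=${3:-512}
    [ $(( (start * sector) % align )) -eq 0 ]
}

# Try a few start sectors against a 4k (4096-byte) boundary.
for s in 8 63 2048; do
    if is_aligned "$s" 4096; then
        echo "sector $s: 4k aligned"
    else
        echo "sector $s: NOT 4k aligned"
    fi
done
```

Sector 63, a start sector that older partitioning tools often used, fails the 4k check, while sectors 8 and 2048 pass. On a Linux system, the start sector of an existing partition can usually be read from a sysfs file such as /sys/block/sdX/sdX1/start and passed to the function.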
Out with the old … in with the new
There are many articles, websites, and Linux system administrator best-practice documents describing how to create a logical volume (LVM), an abstraction of a number of flash devices into a single storage volume that enables dynamic volume resizing and makes it easier to replace, re-partition and back up individual devices in Linux. However, most of these practices were developed before the advent of PCIe® flash devices. I have worked with customers who used these older practices to create LVMs, and some of them are seeing very poor performance when implementing flash in their environments.
My conversations with customers and the documents I’ve read on creating LVMs have revealed that the first step in creating an LVM, creating a physical volume (PV), needs refinement. The reason is that the PV create process can use a raw device, a partitioned device, or a mix. I suggest getting into the habit of aligning all flash devices on a physical sector boundary so that all PVs are aligned. The PV command is typically specified as either “pvcreate /dev/sdX”, which allocates the whole (non-partitioned) device to the PV, or “pvcreate /dev/sdX1”, which uses a partition to create the PV. If the PV is created using a mix of raw devices and partitioned devices, or multiple partitioned devices, is there alignment over all the PVs? Maybe! Maybe not! That’s the problem!
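For partitioned devices, one way to answer that question is to read each partition's starting sector from Linux sysfs and test it against the alignment you want. The sketch below is a rough illustration: `report_alignment` and the `SYSFS_ROOT` override are hypothetical names introduced here, the partition names are placeholders, and it assumes sysfs reports start sectors in 512-byte units.

```shell
#!/bin/sh
# Assumption: the "start" file under /sys/class/block/<partition>/ holds the
# partition's first sector, counted in 512-byte units.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/class/block}   # overridable, e.g. for testing

# report_alignment ALIGN_BYTES PARTITION...
# Prints one line per partition saying whether its start is aligned.
report_alignment() {
    align=$1
    shift
    for part in "$@"; do
        file="$SYSFS_ROOT/$part/start"
        if [ ! -r "$file" ]; then
            # Whole (non-partitioned) devices have no start file.
            echo "$part no start sector found"
            continue
        fi
        start=$(cat "$file")
        if [ $(( (start * 512) % align )) -eq 0 ]; then
            echo "$part start=$start aligned"
        else
            echo "$part start=$start UNALIGNED"
        fi
    done
}

# Example (placeholder partition names), checking for 1M alignment:
#   report_alignment 1048576 sdX1 sdY1
```

Running this over every partition intended for the PV gives a quick view of whether the devices share a common alignment before `pvcreate` is run.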
Aligning for higher speed
I recommend a new approach to creating LVMs when using flash technology. My suggestion is to align each of the flash devices on a 1M boundary before creating the PV. Here are the steps to help make sure that you are using boundary-aligned devices when creating an LVM:
echo "2048,,8e" | sfdisk -uS /dev/sdX --force
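A fuller sequence might look like the sketch below. It is an assumption-laden illustration rather than a definitive procedure: it assumes two flash devices named /dev/sdX and /dev/sdY whose first partition is the device name plus "1", and the volume group and logical volume names (`flash_vg`, `flash_lv`) and the `plan_flash_lvm` helper are hypothetical. The function only prints the commands so they can be reviewed before anything touches a disk.

```shell
#!/bin/sh
# plan_flash_lvm VG_NAME LV_NAME DEVICE...
# Prints (does not run) the commands to partition each device at sector 2048
# (a 1M boundary with 512-byte sectors) and build an LVM on the partitions.
plan_flash_lvm() {
    vg=$1
    lv=$2
    shift 2
    parts=""
    for dev in "$@"; do
        # Partition type 8e marks the partition as Linux LVM.
        echo "echo '2048,,8e' | sfdisk -uS $dev --force"
        # Assumption: the first partition is named <device>1.
        parts="$parts ${dev}1"
    done
    echo "pvcreate$parts"
    echo "vgcreate $vg$parts"
    echo "lvcreate -l 100%FREE -n $lv $vg"
}

plan_flash_lvm flash_vg flash_lv /dev/sdX /dev/sdY
```

Because every PV is built on a partition that starts at sector 2048, no raw and partitioned devices are mixed, and each PV begins on the same 1M boundary.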
Implementing flash in the enterprise is a great way to achieve low latency while providing high IOPS and throughput. By following these steps, you can set up an LVM over multiple flash devices that are aligned on a proper boundary to get the best performance.