Customer dilemma: I just purchased PCIe® flash cards to increase performance of my enterprise applications that run on Linux® and Unix®. How do I set them up to get the best performance?
Good question. I wish there were a simple answer but each environment is different. There is no cookie-cutter configuration that fits all, though a few questions will reveal how the PCIe flash cards should be configured for optimum performance.
Most of the popular relational and non-relational databases run on many different operating systems. I will be describing Linux-specific configurations, but most of them should also work with Unix systems that are supported by the PCIe flash card vendor. I’m a database guy, but the principles and techniques I’ll be covering apply not only to databases but also to other applications such as mail servers, web servers and application servers.
Aligning PCIe flash devices
The most important step to perform on each PCIe flash card is to create a partition that is aligned on a specific boundary (such as 4k or 8k) so each read and write to the flash device will require only one physical input/output (IO) operation. If the card is not partitioned on such a boundary, reads and writes can span two sector groups, so a single request needs two physical IOs and its latency roughly doubles.
To align a partition, I use the sfdisk command to start the partition on a 1M boundary (sector 2048, assuming 512-byte sectors). Because 1M is an exact multiple of 4k, 8k and 64k, a partition that starts on a 1M boundary is automatically aligned to all of those smaller boundaries as well. But before I do this, I need to know how I am going to use this device. Will this be a standalone partition? Part of a logical volume? Or part of a RAID group?
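The arithmetic behind the 1M choice can be sketched in a few lines of shell. This is only an illustration: it assumes 512-byte logical sectors (the unit that sfdisk -uS counts in), the function names are hypothetical, and sector 63 is included purely as a familiar example of an unaligned start used by older partitioning tools.

```shell
#!/bin/sh
# Assumption: 512-byte logical sectors, the unit used by "sfdisk -uS".
SECTOR_BYTES=512

# check_alignment START_SECTOR BOUNDARY_BYTES
# Prints "aligned" if the partition start is a multiple of the boundary.
check_alignment() (
    start_bytes=$(( $1 * SECTOR_BYTES ))
    if [ $(( start_bytes % $2 )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
)

# blocks_touched START_SECTOR BLOCK_BYTES
# Counts how many BLOCK_BYTES-sized blocks a single IO of BLOCK_BYTES
# touches when it is issued at the very start of the partition.
blocks_touched() (
    first=$(( $1 * SECTOR_BYTES / $2 ))
    last=$(( ($1 * SECTOR_BYTES + $2 - 1) / $2 ))
    echo $(( last - first + 1 ))
)

for boundary in 4096 8192 65536 1048576; do
    echo "sector 2048 vs $boundary bytes: $(check_alignment 2048 "$boundary")"
done
echo "sector 63 vs 4096 bytes: $(check_alignment 63 4096)"
echo "4k IO at sector 2048 touches $(blocks_touched 2048 4096) block(s)"
echo "4k IO at sector 63 touches $(blocks_touched 63 4096) block(s)"
```

The unaligned case touches two blocks for one request, which is where the extra physical IO comes from.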
Which one is best?
If I were deploying the PCIe flash device for database caching, I would use a single-partitioned PCIe flash card if I knew the capacity would meet my needs now and over the next 5 years. (The Oracle database has provided this kind of caching for years through its Database Smart Flash Cache feature, and Facebook created the open source Flashcache used with MySQL databases.) If I selected this configuration, the sfdisk command to create the partition would be:
echo "2048,," | sfdisk -uS /dev/sdX --force
This single partitioning is also required with the Oracle® Automatic Storage Management system (ASM). Oracle has provided ASM for many years and I will go over how to use this storage feature in Part 3 of this series.
If I need to deploy multiple PCIe flash cards for database caching, I would create Logical Volume Manager (LVM) over all the flash devices to simplify administration. The sfdisk command to create a partition for each PCIe flash card would be:
echo "2048,,8e" | sfdisk -uS /dev/sdX --force
“8e” is the partition type ID for Linux LVM, which marks the partition for use as an LVM physical volume.
Neither of these solutions needs fault tolerance since they will be used for write-through caching. My recent blog “How to optimize PCIe flash cards – a new approach to creating logical volumes” covers this process in detail.
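As a rough sketch of the multi-card LVM case, the commands involved could be generated like this. The script only prints the commands for review rather than running them. The volume group name flashvg, the logical volume name flashcache_lv and the device names are placeholders chosen for this example, not values from the post; pvcreate, vgcreate and lvcreate are the standard LVM2 tools.

```shell
#!/bin/sh
# Dry run: prints LVM commands instead of executing them.
# Assumptions: each argument is an aligned partition of type 8e; the names
# "flashvg" and "flashcache_lv" are placeholders chosen for this sketch.
print_lvm_commands() (
    vg=flashvg
    lv=flashcache_lv
    for part in "$@"; do
        # Mark each partition as an LVM physical volume.
        echo "pvcreate $part"
    done
    # Pool all cards into one volume group.
    echo "vgcreate $vg $*"
    # Create a single logical volume spanning the whole pool.
    echo "lvcreate -l 100%FREE -n $lv $vg"
)

print_lvm_commands /dev/sdX1 /dev/sdY1
```

Printing first makes it easy to check device names before anything destructive is run as root.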
If I want to use the PCIe flash card for persisting data, I would need to make the PCIe flash cards fault tolerant, using two or more cards to build the RAID array and eliminate any single point of failure. There are a number of ways to create a RAID over multiple PCIe flash cards, two of which are LVM on RAID and a RAID array created with MDADM.
But what type of RAID setup is best to use?
Oracle coined the term S.A.M.E. – Stripe And Mirror Everything – in 1999 and popularized the practice, which many database administrators (DBAs) and storage administrators have followed ever since. I follow this practice and suggest you do the same.
First, you need to determine how these cards will be accessed:
In database deployments, your choice is usually among online transaction processing (OLTP) applications like airline and hotel reservation systems and corporate financial or enterprise resource planning (ERP) applications, or data warehouse/data mining/data analytics applications, or a mix of both environments. OLTP applications involve small random reads and writes as well as many sequential writes for log files. Data warehouse/data mining/data analytics applications involve mostly large sequential reads with very few sequential log writes.
Before setting up one or many PCIe flash cards in a RAID array either using LVM on RAID or creating a RAID array using MDADM, you need to know the access pattern of the IO, capacity requirements and budget. These requirements will dictate which RAID level will work best for your environment and fit your budget.
I would pick either a RAID 1/RAID 10 configuration (mirroring without striping, or striping and mirroring respectively), or RAID 5 (striping with parity). RAID 1/RAID 10 costs more but delivers the best performance, whereas RAID 5 costs less but imposes a significant write penalty.
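The cost versus write-penalty trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below relies on commonly quoted rules of thumb, which are assumptions here rather than measurements: mirroring stores every block twice, so half the raw capacity is usable and each host write costs 2 back-end IOs, while RAID 5 gives up one card's worth of capacity to parity and a small random write costs 4 back-end IOs. The card count and size are made-up inputs, and real controllers may do better with caching or full-stripe writes.

```shell
#!/bin/sh
# usable_gb LEVEL CARDS CARD_GB -> usable capacity in GB
# RAID 1 is treated as a two-way mirror, like each pair in RAID 10.
usable_gb() (
    case $1 in
        1|10) echo $(( $2 * $3 / 2 )) ;;     # every block is stored twice
        5)    echo $(( ($2 - 1) * $3 )) ;;   # one card's worth holds parity
        *)    echo "unsupported RAID level: $1" >&2; exit 1 ;;
    esac
)

# write_penalty LEVEL -> back-end IOs per small random host write
write_penalty() (
    case $1 in
        1|10) echo 2 ;;   # write both mirror copies
        5)    echo 4 ;;   # read data, read parity, write data, write parity
        *)    echo "unsupported RAID level: $1" >&2; exit 1 ;;
    esac
)

# Hypothetical example: four 400 GB cards.
for level in 10 5; do
    echo "RAID $level: $(usable_gb "$level" 4 400) GB usable," \
         "$(write_penalty "$level") back-end IOs per small write"
done
```

With these example inputs, RAID 5 offers more usable capacity for the same spend, while RAID 10 halves the back-end work for each small write.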
Optimizing OLTP application performance
To optimize performance of an OLTP application, I would implement either a RAID 1 or RAID 10 array. If I were budget constrained, or implementing a data warehouse application, I would use a RAID 5 array. Normally a RAID 5 array will produce a higher throughput (megabytes per second) appropriate for a data warehouse/data mining application.
In a nutshell, knowing how to tune the configuration to the application is key to reaping the best performance.
For either RAID array, you need to create an aligned partition using sfdisk:
echo "2048,,fd" | sfdisk -uS /dev/sdX --force
“fd” is the system identifier for a Linux RAID auto device.
Keep in mind that it is not mandatory to create a partition for LVMs or RAID arrays. Instead, you can assign RAW devices. It’s important to remember to align the sectors if combining RAW and partitioned devices or just creating a basic partition. It’s sound practice to always create an aligned partition when using PCIe flash cards.
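One way to double-check an existing partition is to read its start sector back from sysfs and test it against the 1M boundary. This is a minimal sketch: the function name is hypothetical and sdX/sdX1 are placeholders. It relies on Linux exposing the start sector in /sys/block/<disk>/<partition>/start, counted in 512-byte units.

```shell
#!/bin/sh
# start_is_aligned FILE [BOUNDARY_BYTES]
# FILE holds a partition's start sector, as Linux exposes it in
# /sys/block/<disk>/<partition>/start (counted in 512-byte units).
# Returns success when the start falls on BOUNDARY_BYTES (default 1 MiB).
start_is_aligned() (
    boundary=${2:-1048576}
    start_sector=$(cat "$1") || exit 2
    [ $(( start_sector * 512 % boundary )) -eq 0 ]
)

# Example with placeholder names sdX/sdX1; substitute the real device.
start_file=/sys/block/sdX/sdX1/start
if [ -r "$start_file" ]; then
    if start_is_aligned "$start_file"; then
        echo "sdX1 starts on a 1M boundary"
    else
        echo "sdX1 is NOT on a 1M boundary"
    fi
fi
```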
At this point, aligned partitions have been created and are now ready to be used in LVMs or RAID arrays. Instructions for creating these are on the web or in Linux/Unix reference manuals. Here are a couple of websites that go over the process of creating LVM, RAID, or LVM on RAID:
Specifying a stripe width value
Also remember that, when creating LVMs with striping or RAID arrays, you’ll need to specify a stripe width value. Many years ago, Oracle and EMC conducted a number of studies on this and concluded that a 1M stripe width performed the best as long as the database IO request was equal to or less than 1M. When implementing Oracle ASM, Oracle’s standard is to use 1M allocation units, which matches its coarse striping size of 1M.
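To show where the 1M value would be supplied, here is a dry-run sketch that prints an mdadm command and a striped lvcreate command without executing them. It assumes the 1M figure is applied as the per-device chunk size (mdadm --chunk) or stripe size (lvcreate -I), both of which take KiB by default; whether that matches the "stripe width" tested in the studies mentioned above is an assumption. The array name /dev/md0, the volume names and the device names are placeholders.

```shell
#!/bin/sh
STRIPE_KB=1024   # 1M expressed in KiB, the default unit for both tools

# print_mdadm LEVEL PARTITION...
# Prints an mdadm command that would build a RAID array of the given level.
print_mdadm() (
    level=$1; shift
    echo "mdadm --create /dev/md0 --level=$level --raid-devices=$# --chunk=$STRIPE_KB $*"
)

# print_lvcreate STRIPES VG_NAME LV_NAME
# Prints an lvcreate command for a logical volume striped across STRIPES devices.
print_lvcreate() (
    echo "lvcreate -i $1 -I $STRIPE_KB -l 100%FREE -n $3 $2"
)

print_mdadm 10 /dev/sdW1 /dev/sdX1 /dev/sdY1 /dev/sdZ1
print_lvcreate 2 flashvg flash_lv
```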
Part 2 of this series will describe how to create RAW devices or file systems.
Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.
Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
Most consumers are skeptical when they see a manufacturer whipping out grandiose performance claims. And for good reason. The manufacturer could be stretching the truth, twisting the results, or just being downright misleading. From this distrust grew demand for third-party writers to review products, test claims and provide an unbiased analysis of the device’s performance and other capabilities, as consumers would experience them.
Who can really claim to be an SSD benchmarking expert?
Solid state drive (SSD) technology is still relatively new in the computer industry, and in many ways SSDs are profoundly different from hard disk drives – perhaps most notably, in the way they record data, to a NAND cell rather than on spinning media. Because of differences in their operation, SSDs have to be tested in ways that are not necessarily obvious.
Can anyone who simply runs a benchmark application claim to be an expert? I would say not. Just as anyone sitting behind the wheel of a car is not necessarily an expert driver. The problem is that it is hard to determine the thoroughness and expertise of an SSD reviewer. Does the author really understand the details behind the technology to run adequate tests and analyze the results?
Can “experts” present bad data?
Maybe it is obvious, but of course experts can be wrong, especially when they are self-proclaimed mavens without deep experience in the technology they cover. At a minimum, you can generally count on them to act in good faith – that is, to not be intentionally misleading – but they can easily be misinformed (for instance, by manufacturers) and perpetuate the misinformation. What’s more, some reviewers are pressured to do a cursory analysis of an SSD as they crank through countless product evaluations under unremitting deadlines – a crush that can cause oversights in telling aspects of a drive’s performance. In any case, it is not good to rely on bad data no matter the intention.
What makes for a thorough SSD review?
Some reviewers have gone to great lengths to ensure their SSD analysis is extremely detailed and represents a real-world environment and performance. These reviewers will generally talk about how their analysis simulates a true user or server environment. The trouble can begin if a reviewer doesn’t recognize normal operation of an SSD in its own environment. With SSDs, “normal” is when garbage collection is operating, which greatly impacts overall performance. It’s important for reviewers to recognize that, with a new SSD, garbage collection is inactive until at least one full physical capacity of data has been sequentially written to the device. For example, with a 256GB SSD, 256GB of data must be written to trigger garbage collection. At that point, garbage collection is ongoing, the drive has reached its steady-state performance, and the device is ready for evaluation. Random writes are another story, requiring up to three passes (full-capacity writes) randomly written to the SSD before the steady-state performance level shown below is reached.
You can see that running only a few minutes of random write tests on this SSD logs performance of over 275 MB/s. However, once garbage collection starts, performance plunges, and it takes up to 3 hours before the true performance of 25 MB/s (roughly a 90% drop) is finally evident. This phenomenon is often neither communicated clearly in reviews nor widely understood.
Good benchmarkers will discuss how their review factors in both garbage collection preparation and steady-state performance testing. Test results that purportedly achieve steady state in less time than in the example above are unlikely to reflect real-world performance. This is all part of what is called SSD preconditioning, but keep in mind that different tests require different steps for preconditioning.
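The preconditioning figures quoted above reduce to simple arithmetic, sketched here in shell. The function names are hypothetical, and the pass counts (one full-capacity pass for sequential writes, up to three for random writes) are taken from the text rather than from any standard, so treat the results as rough planning numbers.

```shell
#!/bin/sh
# precondition_gb CAPACITY_GB PATTERN -> GB to write before measuring
# Pass counts follow the text above: 1 pass sequential, up to 3 random.
precondition_gb() (
    case $2 in
        sequential) passes=1 ;;
        random)     passes=3 ;;
        *) echo "unknown pattern: $2" >&2; exit 1 ;;
    esac
    echo $(( $1 * passes ))
)

# percent_drop FRESH_MBPS STEADY_MBPS -> percentage lost, rounded down
percent_drop() (
    echo $(( ($1 - $2) * 100 / $1 ))
)

echo "256GB SSD, sequential test: write $(precondition_gb 256 sequential) GB first"
echo "256GB SSD, random test: write up to $(precondition_gb 256 random) GB first"
echo "275 MB/s falling to 25 MB/s is a $(percent_drop 275 25)% drop"
```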
For additional information on this topic, you can review my presentation from Flash Memory Summit 2013 on “Don’t let your favorite benchmarks lie to you.”