My first blog in this series, “How to maximize performance of PCIe flash for enterprise applications running on Linux,” describes the steps for aligning PCIe® flash devices. This blog covers the next stage of setting up the PCIe flash device when using the Linux® operating system: creating a RAW device or a file system.
At this point, one or more PCIe flash cards have been partitioned on a sector boundary. Depending on their use, these partitioned devices are either set up as a single RAW device or as part of a logical volume or RAID array.
Next step is to determine how these devices will be used. Most administrators will create file systems on these partitions. Some Oracle administrators will use them as RAW devices and assign them to Automatic Storage Management (ASM). Still others, those looking for the best performance possible from the device will stick to a RAW device. For many years, the recommendation was not to use RAW devices because the complexity of managing them outweighed their small potential gains in performance.
ASM uses RAW devices but makes administration of these devices much easier. More on ASM in Part 3 of this series.
Building a file system
Next is to build a file system on the RAW device, LVM or RAID. But first we need to determine the best type of file system to use. There are many to choose from including:
To keep this brief, I will only go over EXT-4. This type of file system is the most current and provides the latest enhancements for increasing capacity, disabling journaling and many other capabilities, though XFS can be a higher performance alternative.
To create an EXT-4 file system, use this command:
You can now turn off or on certain features of the EXT-4 file system by using “tune2fs. Here are a couple of examples of using tune2fs:
tune2fs –l /dev/sdX1 | grep ‘Filesystem features’
tune2fs -O ^has_journal /dev/sdX1
Mounting the file system
The next step is to mount the file system and assign the owner:group to the mount point. There are also many tuning options that can be added to the mount command when using PCIe flash cards. The mount options I use are:
The mount command for /dev/sda1 to /u01 would be:
mount –o noatime,nodiratime,max_batch_time=0, nobarrier,discard /dev/sda1 /u01
To make these mount points persistent over reboots, add them to the mount entries in ‘/etc/fstab’. Finally, you need to give a user rights for reading and writing to this mount point, and to assign ownership to /u01 – for example, assigning ownership of /u01 to the oracle userid and to the dba group. To do this, use the “chown” command:
chown oracle:dba /u01
The PCIe flash device is now ready to be used.
Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.
Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
Customer dilemma: I just purchased PCIe® flash cards to increase performance of my enterprise applications that run on Linux® and Unix®. How do I set them up to get the best performance?
Good question. I wish there were a simple answer but each environment is different. There is no cookie-cutter configuration that fits all, though a few questions will reveal how the PCIe flash cards should be configured for optimum performance.
Most of the popular relational and non-relational databases run on many different operating systems. I will be describing Linux-specific configurations, but most of them should also work with Unix systems that are supported by the PCIe flash card vendor. I’m a database guy, but these same principals and techniques that I’ll be covering apply to other applications like mail servers, web servers, application servers and, of course, databases.
Aligning PCIe flash devices
The most important step to perform on each PCIe flash card is to create a partition that is aligned on a specific boundary (such as 4k or 8k) so each read and write to the flash device will require only one physical input/output (IO) operation. If the card is not partitioned on such a boundary, then reads and writes will span the sector groups, which doubles the IO latency for each read or write request.
To align a partition, I use the sfdisk command to start a partition on a 1M boundary (sector 2048). Aligning to a 1M boundary resolves the dependency to align to a 4k, 8k, or even a 64k boundary. But before I do this, I need to know how I am going to use this device. Will this be a standalone partition? Part of a logical volume? Or part of a RAID group?
Which one is best?
If I were deploying the PCIe flash device for database caching (for example, the Oracle database has provided this caching functionality for years using the Database Smart Flash Cache feature, and Facebook created the open source Flashcache used in MySQL databases), I would use a single-partitioned PCIe flash card if I knew the capacity would meet my needs now and over the next 5 years. If I selected this configuration, the sfdisk command to create the partition would be:
echo “2048,,” | sfdisk –uS /dev/sdX –force
This single partitioning is also required with the Oracle® Automatic Storage Management system (ASM). Oracle has provided ASM for many years and I will go over how to use this storage feature in Part 3 of this series.
If I need to deploy multiple PCIe flash cards for database caching, I would create Logical Volume Manager (LVM) over all the flash devices to simplify administration. The sfdisk command to create a partition for each PCIe flash card would be:
echo “2048,,8e” | sfdisk –uS /dev/sdX –force
“8e” is the system partition type for creating a logical volume.
Neither of these solutions needs fault tolerance since they will be used for write-thru caching. My recent blog “How to optimize PCIe flash cards – a new approach to creating logical volumes” covers this process in detail.
If I want to use the PCIe flash card for persisting data, I would need to make the PCIe flash cards fault tolerant, using two or more cards to build the RAID array and eliminate any single point of failure. There are a number of ways to create a RAID over multiple PCIe flash cards, two of which are:
But what type of RAID setup is best to use?
Oracle coined the term S.A.M.E. – Stripe And Mirror Everything – in 1999 and popularized the practice, which many database administrators (DBA) and storage administrators have followed ever since. I follow this practice and suggest you do the same.
First, you need to determine how these cards will be accessed:
In database deployments, your choice is usually among online transaction processing (OLTP) applications like airline and hotel reservation systems and corporate financial or enterprise resource planning (ERP) applications, or data warehouse/data mining/data analytics applications, or a mix of both environments. OLTP applications involve small random reads and writes as well as many sequential writes for log files. Data warehouse/data mining/data analytics applications involve mostly large sequential reads with very few sequential log writes.
Before setting up one or many PCIe flash cards in a RAID array either using LVM on RAID or creating a RAID array using MDADM, you need to know the access pattern of the IO, capacity requirements and budget. These requirements will dictate which RAID level will work best for your environment and fit your budget.
I would pick either a RAID 1/RAID 10 configuration (mirroring without striping, or striping and mirroring respectively), or RAID 5 (striping with parity). RAID 1/RAID 10 costs more but delivers the best performance, whereas RAID 5 costs less but imposes a significant write penalty.
Optimizing OLTP application performance
To optimize performance of an OLTP application, I would implement either a RAID 1 or RAID 10 array. If I were budget constrained, or implementing a data warehouse application, I would use a RAID 5 array. Normally a RAID 5 array will produce a higher throughput (megabits per second) appropriate for a data warehouse/data mining application.
In a nutshell, knowing how to tune the configuration to the application is key to reaping the best performance.
For either RAID array, you need to create an aligned partition using sfdisk:
echo “2048,,fd” | sfdisk –uS /dev/sdX –force
“fd” is the system identifier for a Linux RAID auto device.
Keep in mind that it is not mandatory to create a partition for LVMs or RAID arrays. Instead, you can assign RAW devices. It’s important to remember to align the sectors if combining RAW and partitioned devices or just creating a basic partition. It’s sound practice to always create an aligned partition when using PCIe flash cards.
At this point, aligned partitions have been created and are now ready to be used in LVMs or RAID arrays. Instructions for creating these are on the web or in Linux/Unix reference manuals. Here are a couple of websites that go over the process of creating LVM, RAID, or LVM on RAID:
Specifying a stripe width value
Also remember that, when creating LVMs with striping or RAID arrays, you’ll need to specify a stripe width value. Many years ago, Oracle and EMC conducted a number studies on this and concluded that a 1M stripe width performed the best as long as the database IO request was equal to or less than 1M. When implementing Oracle ASM, Oracle’s standard is to use 1M allocation units, which matches its coarse striping size of 1M.
Part 2 of this series will describe how to create RAW devices or file systems.
Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.
Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
Tags: ASM, automatic storage management, data analytics, data mining, data warehouse, Database Smart Cache, EMC, enterprise resource planning, ERP, Facebook, flash storage, Flashcache, Linux, logical volume, Logical Volume Manager, LVM, MDADM, multiple device administration, MySQL, non-relational database, OLTP, online transaction processing, Oracle, partition, PCI Express, PCIe flash, performance, RAID, RAW, relational database, SAME, sector, Stripe and Mirror Everything, Unix
The lifeblood of any online retailer is the speed of its IT infrastructure. Shoppers aren’t infinitely patient. Sluggish infrastructure performance can make shoppers wait precious seconds longer than they can stand, sending them fleeing to other sites for a faster purchase. Our federal government’s halting rollout of the Health Insurance Marketplace website is a glaring example of what can happen when IT infrastructure isn’t solid. A few bad user experiences that go viral can be damaging enough. Tens of thousands can be crippling.
In hyperscale datacenters, any number of problems including network issues, insufficient scaling and inconsistent management can undermine end users’ experience. But one that hits home for me is the impact of slow storage on the performance of databases, where the data sits. With the database at the heart of all those online transactions, retailers can ill afford to have their tier of database servers operating at anything less than peak performance.
Slow storage undermines database performance
Typically, Web 2.0 and e-commerce companies run relational databases (RDBs) on these massive server-centric infrastructures. (Take a look at my blog last week to get a feel for the size of these hyperscale datacenter infrastructures). If you are running that many servers to support millions of users, you are likely using some kind of open-sourced RDB such as MySQL or other variations. Keep in mind that Oracle 11gR2 likely retails around $30K per core but MSQL is free. But the performance of both, and most other relational databases, suffer immensely when transactions are retrieving data from storage (or disk). You can only throw so much RAM and CPU power at the performance problem … sooner rather than later you have to deal with slow storage.
Almost everyone in industry – Web 2.0, cloud, hyperscale and other providers of massive database infrastructures – is lining up to solve this problem the best way they can. How? By deploying flash as the sole storage for database servers and applications. But is low-latency flash enough? For sheer performance it beats rotational disk hands down. But … even flash storage has its limitations, most notably when you are trying to drive ultra-low latencies for write IOs. Most IO accesses by RDBs, which do the transactional processing, are a mix or read/writes to the storage. Specifically, the mix is 70%/30% reads/writes. These are also typically low q-depth accesses (less than 4). It is those writes that can really slow things down.
PCIe flash reduces write latencies
The good news is that the right PCIe flash technology in the mix can solve the slowdowns. Some interesting PCIe flash technologies designed to tackle this latency problem are on display at AIS this week. DRAM and in particular NVDRAM are being deployed as a tier in front of flash to really tackle those nasty write latencies.
Among other demos, we’re showing how a Nytro™ 6000 series PCIe flash card helps solve the MySQL database performance issues. The typical response time for a small data read (this is what the database will see for a Database IO) from an HDD is 5ms. Flash-based devices such as the Nytro WarpDrive® card can complete the same read in less than 50μs on average during testing, an improvement of several orders-of-magnitude in response time. This response time translates to getting much higher transactions out of the same infrastructure – but with less space (flash is denser) and a lot less power (flash consumes a lot lower power than HDDs).
We’re also showing the Nytro 7000 series PCIe flash cards. They reach even lower write latencies than the 6000 series and very low q-depths. The 7000 series cards also provide DRAM buffering while maintaining data-integrity even in the event of a power loss.
For online retailers and other businesses, higher database speeds mean more than just faster transactions. They can help keep those cash registers ringing.
Tags: AIS, database, DRAM, e-commerce, flash, flash memory, hard disk drive, HDD, hyperscale datacenter, latency, MySQL, NVDRAM, Nytro 6000, Nytro 7000, Nytro WarpDrive, Oracle, PCIe flash, relational database, storage latency, web 2.0, write latency
You might be surprised to find out how big the infrastructure for cloud and Web 2.0 is. It is mind-blowing. Microsoft has acknowledged packing more than 1 million servers into its datacenters, and by some accounts that is fewer than Google’s massive server count but a bit more than Amazon.
Facebook’s server count is said to have skyrocketed from 30,000 in 2012 to 180,000 just this past August, serving 900 million plus users. And the social media giant is even putting its considerable weight behind the Open Compute effort to make servers fit better in a rack and draw less power. The list of mega infrastructures also includes Tencent, Baidu and Alibaba and the roster goes on and on.
Even more jaw-dropping is that almost 99.9% of these hyperscale infrastructures are built with servers featuring direct-attached storage. That’s right – they do the computing and store the data. In other words, no special, dedicated storage gear. Yes, your Facebook photos, your Skydrive personal cloud and all the content you use for entertainment, on-demand video and gaming data are stored inside the server.
Direct-attached storage reigns supreme
Everything in these infrastructures – compute and storage – is built out of x-86 based servers with storage inside. What’s more, growth of direct-attached storage is many folds bigger than any other storage deployments in IT. Rising deployments of cloud, or cloud-like, architectures are behind much of this expansion.
The prevalence of direct-attached storage is not unique to hyperscale deployments. Large IT organizations are looking to reap the rewards of creating similar on-premise infrastructures. The benefits are impressive: Build one kind of infrastructure (server racks), host anything you want (any of your properties), and scale if you need to very easily. TCO is much less than infrastructures relying on network storage or SANs.
With direct-attached you no longer need dedicated appliances for your database tier, your email tier, your analytics tier, your EDA tier. All of that can be hosted on scalable, share-nothing infrastructure. And just as with hyperscale, the storage is all in the server. No SAN storage required.
Open Compute, OpenStack and software-defined storage drive DAS growth
Open Compute is part of the picture. A recent Open Compute show I attended was mostly sponsored by hyperscale customers/suppliers. Many big-bank IT folks attended. Open Compute isn’t the only initiative driving growing deployments of direct-attached storage. So is software-defined storage and OpenStack. Big application vendors such as Oracle, Microsoft, VMware and SAP are also on board, providing solutions that support server-based storage/compute platforms that are easy and cost-effective to deploy, maintain and scale and need no external storage (or SAN including all-flash arrays).
So if you are a network-storage or SAN manufacturer, you have to be doing some serious thinking (many have already) about how you’re going to catch and ride this huge wave of growth.
Tags: Alibaba, Amazon, Baidu, cloud computing, DAS, direct attached storage, enterprise, enterprise IT, Google, hyperscale, Microsoft, Open Compute, OpenStack, Oracle, SAP, Tencent, VMware
I was lucky enough to get together for dinner and beer with old friends a few weeks ago. Between the 4 of us, we’ve been involved in or responsible for a lot of stuff you use every day, or at least know about.
Supercomputers, minicomputers, PCs, Macs, Newton, smart phones, game consoles, automotive engine controllers and safety systems, secure passport chips, DRAM interfaces, netbooks, and a bunch of processor architectures: Alpha, PowerPC, Sparc, MIPS, StrongARM/XScale, x86 64-bit, and a bunch of other ones you haven’t heard of (um – most of those are mine, like TriCore). Basically if you drive a European car, travel internationally, use the Internet , if you play video games, or use a smart phone, well… you’re welcome.
Why do I tell you this? Well – first I’m name dropping – I’m always stunned I can call these guys friends and be their peers. But more importantly, we’ve all been in this industry as architects for about 30 years. Of course our talk went to what’s going on today. And we all agree that we’ve never seen more changes – inflexions – than the raft unfolding right now. Maybe its pressure from the recession, or maybe un-naturally pent up need for change in the ecosystem, but change there is.
Changes in who drives innovation, what’s needed, the companies on top and on bottom at every point in the food chain, who competes with whom, how workloads have changed from compute to dataflow, software has moved to opensource, how abstracted code is now from processor architecture, how individual and enterprise customers have been revolting against the “old” ways, old vendors, old business models, and what the architectures look like, how processors communicate, and how systems are purchased, and what fundamental system architectures look like. But not much besides that…
Ok – so if you’re an architect, that’s as exciting as it gets (you hear it in my voice – right ?), and it makes for a lot of opportunities to innovate and create new or changed businesses. Because innovation is so often at the intersection of changing ways of doing things. We’re at a point where the changes are definitely not done yet. We’re just at the start. (OK – now try to imagine a really animated 4-way conversation over beers at the Britannia Arms in Cupertino… Yea – exciting.)
I’m going to focus on just one sliver of the market – but it’s important to me – and that’s enterprise IT. I think the changes are as much about business models as technology.
Hyperscale datacenters drive innovation
I’ll start in a strange place. Hyperscale datacenters (think social media, search, etc.) and the scale of deployment changes the optimization point. Most of us starting to get comfortable with rack as the new purchase quantum. And some of us are comfortable with the pod or container as the new purchase quantum. But the hyperscale dataenters work more at the datacenter as the quantum. By looking at it that way, they can trade off the cost of power, real estate, bent sheet metal, network bandwidth, disk drives, flash, processor type and quantity, memory amount, where work gets done, and what applications are optimized for. In other words, we shifted from looking at local optima to looking for global optima. I don’t know about you, but when I took operations research in university, I learned there was an unbelievable difference between the two – and global optima was the one you wanted…
Hyperscale datacenters buy enough (top 6 are probably more than 10% of the market today) that 1) they need to determine what they deploy very carefully on their own, and 2) vendors work hard to give them what they need.
That means innovation used to be driven by OEMs, but now it’s driven by hyperscale datacenters and it’s driven hard. That global optimum? It’s work/$ spent. That’s global work, and global spend. It’s OK to spend more, even way more on one thing if over-all you get more done for the $’s you spend.
That’s why the 3 biggest consumers of flash in servers are Facebook, Google, and Apple, with some of the others not far behind. You want stuff, they want to provide it, and flash makes it happen efficiently. So efficiently they can often give that service away for free.
Hyperscale datacenters have started to publish their cost metrics, and open up their architectures (like OpenCompute), and open up their software (like Hadoop and derivatives). More to the point, services like Amazon have put a very clear $ value on services. And it’s shockingly low.
Enterprises are paying attention
Enterprises have looked at those numbers. Hard. That’s catalyzed a customer revolt against the old way of doing things – the old way of buy and billing. OEMs and ISVs are creating lots of value for enterprise, but not that much. They’ve been innovating around “stickiness” and “lock-in” (yea – those really are industry terms) for too long, while hyperscale datacenters have been focused on getting stuff done efficiently. The money they save per unit just means they can deploy more units and provide better services.
That revolt is manifesting itself in 2 ways. The first is seen in the quarterly reports of OEMs and ISVs. Rumors of IBM selling its X-series to Lenovo, Dell going private, Oracle trying to shift business, HP talking of the “new style of IT”… The second is enterprises are looking to emulate hyperscale datacenters as much as possible, and deploy private cloud infrastructure. And often as not, those will be running some of the same open source applications and file systems as the big hyperscale datacenters use.
Where are the hyperscale datacenters leading them? It’s a big list of changes, and they’re all over the place.
But they’re also looking at a few different things. For example, global name space NAS file systems. Personally? I think this one’s a mistake. I like the idea of file systems/object stores, but the network interconnect seems like a bottleneck. Storage traffic is shared with network traffic, creates some network spine bottlenecks, creates consistency performance bottlenecks between the NAS heads, and – let’s face it – people usually skimp on the number of 10GE ports on the server and in the top of rack switch. A typical SAS storage card now has 8 x 12G ports – that’s 96G of bandwidth. Will servers have 10 x 10G ports? Yea. I didn’t think so either.
Anyway – all this is not academic. One Wall Street bank shared with me that – hold your breath – it could save 70% of its spend going this route. It was shocked. I wasn’t shocked, because at first blush this seems absurd – not possible. That’s how I reacted. I laughed. But… The systems are simpler and less costly to make. There is simply less there to make or ship than OEMs force into the machines for uniqueness and “value.” They are purchased from much lower margin manufacturers. They have massively reduced maintenance costs (there’s less to service, and, well, no OEM service contracts). And also important – some of the incredibly expensive software licenses are flipped to open source equivalents. Net savings of 70%. Easy. Stop laughing.
Disaggregation: Or in other words, Pooled Resources
But probably the most important trend from all of this is what server manufacturers are calling “disaggregation” (hey – you’re ripping apart my server!) but architects are more descriptively calling pooled resources.
First – the intent of disaggregation is not to rip the parts of a server to pieces to get lowest pricing on the components. No. If you’re buying by the rack anyway – why not package so you can put like with like. Each part has its own life cycle after all. CPUs are 18 months. DRAM is several years. Flash might be 3 years. Disks can be 5 to 7 years. Networks are 5 to 10 years. Power supplies are… forever? Why not replace each on its own natural failure/upgrade cycle? Why not make enclosures appropriate to the technology they hold? Disk drives need solid vibration-free mechanical enclosures of heavy metal. Processors need strong cooling. Flash wants to run hot. DRAM cool.
Second – pooling allows really efficient use of resources. Systems need slush resources. What happens to a systems that uses 100% of physical memory? It slows down a lot. If a database runs out of storage? It blue screens. If you don’t have enough network bandwidth? The result is, every server is over provisioned for its task. Extra DRAM, extra network bandwidth, extra flash, extra disk drive spindles.. If you have 1,000 nodes you can easily strand TBytes of DRAM, TBytes of flash, a TByte/s of network bandwidth of wasted capacity, and all that always burning power. Worse, if you plan wrong and deploy servers with too little disk or flash or DRAM, there’s not much you can do about it. Now think 10,000 or 100,000 nodes… Ouch.
If you pool those things across 30 to 100 servers, you can allocate as needed to individual servers. Just as importantly, you can configure systems logically, not physically. That means you don’t have to be perfect in planning ahead what configurations and how many of each you’ll need. You have sub-assemblies you slap into a rack, and hook up by configuration scripts, and get efficient resource allocation that can change over time. You need a lot of storage? A little? Higher performance flash? Extra network bandwidth? Just configure them.
That’s a big deal.
And of course, this sets the stage for immense pooled main memory – once the next generation non-volatile memories are ready – probably starting around 2015.
You can’t underestimate the operational problems associated with different platforms at scale. Many hyperscale datacenters today have around 6 platforms. If you think they are rolling out new versions of those before old ones are retired they often have 3 generations of each. That’s 18 distinct platforms, with multiple software revisions of each. That starts to get crazy when you may have 200,000 to 400,000 servers to manage and maintain in a lights out environment. Pooling resources and allocating them in the field goes a huge way to simplifying operations.
Alternate Processor Architecture
It didn’t always used to be Intel x86. There was a time when Intel was an upstart in the server business. It was Power, MIPs, Alpha, SPARC… (and before that IBM mainframes and minis, etc). Each of the changes was brought on by changing the cost structure. Mainframes got displaced by multi-processor RISC, which gave way to x86.
Today, we have Oracle saying they’re getting out of x86 commodity servers and doubling down on SPARC. IBM is selling off its x86 business and doubling down on Power (hey – don’t confuse that with PowerPC – which started as an architectural cut-down of Power – I was there…). And of course there is a rash of 64-bit ARM server SOCs coming – with HP and Dell already dabbling in it. What’s important to realize is that all of these offerings are focusing on the platform architecture, and how applications really perform in total, not just the processor.
Let me warp up with an email thread cut/paste from a smart friend – Wayne Nation. I think he summed up some of what’s going on well, in a sobering way most people don’t even consider.
“Does this remind you of a time, long ago, when the market was exploding with companies that started to make servers out of those cheap little desktop x86 CPUs? What is different this time? Cost reduction and disaggregation? No, cost and disagg are important still, but not new.
A new CPU architecture? No, x86 was “new” before. ARM promises to reduce cost, as did Intel.
Disaggregation enables hyperscale datacenters to leverage vanity-free, but consistent delivery will determine the winning supplier. There is the potential for another Intel to rise from these other companies. “