When implementing an LSI Nytro WarpDrive (NWD) or Nytro MegaRAID (NMR) PCIe flash card in a Linux server, you need to modify quite a few variables to get the best performance out of these cards.

On a Linux server, device assignments can change after reboots: the PCIe flash card may come up as /dev/sda one time and /dev/sdd, or any other device name, the next. This variability can wreak havoc when modifying the Linux performance variables. To get around the issue, reference the card by its SCSI address so that all of the Linux performance variables persist properly across reboots. If using a filesystem, use the device UUID in the mount entry in /etc/fstab to persist the mount across reboots.

Cut and paste the script
The first step is to cut and paste the following script, substituting the SCSI address of your PCIe card for the example address in the first line. You’ll need to enter the card’s SCSI address before executing the script.

# Map the card's persistent SCSI address to its current /dev/sdX assignment
ls -al /dev/disk/by-id |grep 'scsi-3600508e07e726177965e06849461a804 ' |grep /sd > nwddevice.txt
awk '{split($11,arr,"/"); print arr[3]}' nwddevice.txt > nwd1device.txt
variable1=$(cat nwd1device.txt)
# Apply the performance settings to that device
echo "4096" > /sys/block/$variable1/queue/nr_requests
echo "512" > /sys/block/$variable1/device/queue_depth
echo "deadline" > /sys/block/$variable1/queue/scheduler
echo "2" > /sys/block/$variable1/queue/rq_affinity
echo 0 > /sys/block/$variable1/queue/rotational
echo 0 > /sys/block/$variable1/queue/add_random
echo 1024 > /sys/block/$variable1/queue/max_sectors_kb
echo 0 > /sys/block/$variable1/queue/nomerges
# Disable read-ahead; flash has no seek penalty
blockdev --setra 0 /dev/$variable1

The SCSI address in the first line of the script needs to be replaced with the SCSI address of your PCIe flash card. To get the address, issue this command:

ls -al /dev/disk/by-id

When you install the Nytro PCIe flash card, Linux assigns the device a name of the form /dev/sdX, where X can be any letter. The output of the ls command above shows the SCSI address for this PCIe device. Don’t use an address containing “-partX” in it. Be sure to note this SCSI address since you will need it in the script above, and include a single space between the SCSI address and the closing single quote.
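To see what the awk command in the script actually extracts, here is a quick check against a sample symlink line of the kind `ls -al /dev/disk/by-id` produces (the address and device name are illustrative): the symlink target is field 11, and splitting it on “/” leaves the kernel device name in arr[3].

```shell
# Sample symlink line as printed by `ls -al /dev/disk/by-id`
line='lrwxrwxrwx 1 root root 9 Jan 10 09:30 scsi-3600508e07e726177965e06849461a804 -> ../../sdb'
# Field 11 is the symlink target "../../sdb"; split on "/" gives
# arr[1]="..", arr[2]="..", arr[3]="sdb" -- the current device name
dev=$(echo "$line" | awk '{split($11,arr,"/"); print arr[3]}')
echo "$dev"    # -> sdb
```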

Create nwd_getdevice.sh file
Next, save the script with the modified SCSI address to a file called “nwd_getdevice.sh”.

After saving this file, change the file permissions to executable (for example, chmod +x nwd_getdevice.sh) and then invoke the script from the /etc/rc.local file.


Test the script
To test the script, execute it on the command line exactly as you entered it in the rc.local file. On every subsequent reboot, the settings will be applied to the appropriate device.

Multiple PCIe flash cards
If you plan to deploy multiple LSI PCIe flash cards in the server, the easiest way is to duplicate all of the commands in the nwd_getdevice.sh script, paste them at the end, and change the SCSI address in the newly pasted block to that of the next card. You can repeat this procedure for as many LSI PCIe flash cards as are installed in the server. For example:

ls -al /dev/disk/by-id |grep 'scsi-1stscsiaddr83333365e06849461a804 ' |grep /sd > nwddevice.txt
awk '{split($11,arr,"/"); print arr[3]}' nwddevice.txt > nwd1device.txt
variable1=$(cat nwd1device.txt)
echo "4096" > /sys/block/$variable1/queue/nr_requests
echo "512" > /sys/block/$variable1/device/queue_depth
echo "deadline" > /sys/block/$variable1/queue/scheduler
echo "2" > /sys/block/$variable1/queue/rq_affinity
echo 0 > /sys/block/$variable1/queue/rotational
echo 0 > /sys/block/$variable1/queue/add_random
echo 1024 > /sys/block/$variable1/queue/max_sectors_kb
echo 0 > /sys/block/$variable1/queue/nomerges
blockdev --setra 0 /dev/$variable1
ls -al /dev/disk/by-id |grep 'scsi-2ndscsiaddr1234566666654444444444 ' |grep /sd > nwddevice.txt
awk '{split($11,arr,"/"); print arr[3]}' nwddevice.txt > nwd1device.txt
variable1=$(cat nwd1device.txt)
echo "4096" > /sys/block/$variable1/queue/nr_requests
echo "512" > /sys/block/$variable1/device/queue_depth
echo "deadline" > /sys/block/$variable1/queue/scheduler
echo "2" > /sys/block/$variable1/queue/rq_affinity
echo 0 > /sys/block/$variable1/queue/rotational
echo 0 > /sys/block/$variable1/queue/add_random
echo 1024 > /sys/block/$variable1/queue/max_sectors_kb
echo 0 > /sys/block/$variable1/queue/nomerges
blockdev --setra 0 /dev/$variable1
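Since the per-card block is identical except for the address, the duplication can also be replaced with a loop. This is a hedged sketch, with both addresses as placeholders, that simply skips any card that is not present:

```shell
# Hedged sketch: loop over the SCSI addresses instead of duplicating the
# commands per card (both addresses are placeholders).
for addr in scsi-1stscsiaddr83333365e06849461a804 \
            scsi-2ndscsiaddr1234566666654444444444; do
    # Map this card's persistent address to its current /dev/sdX name
    dev=$(ls -al /dev/disk/by-id 2>/dev/null | grep "$addr " | grep /sd |
          awk '{split($11,arr,"/"); print arr[3]}')
    [ -n "$dev" ] || continue      # card not present; skip it
    echo 4096     > /sys/block/$dev/queue/nr_requests
    echo 512      > /sys/block/$dev/device/queue_depth
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 2        > /sys/block/$dev/queue/rq_affinity
    echo 0        > /sys/block/$dev/queue/rotational
    echo 0        > /sys/block/$dev/queue/add_random
    echo 1024     > /sys/block/$dev/queue/max_sectors_kb
    echo 0        > /sys/block/$dev/queue/nomerges
    blockdev --setra 0 /dev/$dev
done
```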

Final thoughts
The most important step in implementing Nytro PCIe flash cards under Linux is aligning the card on a boundary, which I cover in Part 1 of this series. This step alone can deliver a 3x or greater performance gain, based on our in-house tests as well as testing by some of our customers. The rest of the series walks you through setting up these aligned flash cards using a file system, ASM or RAW device and, finally, persisting all of the Linux performance variables for the card across reboots.

Links to the other posts in this series:

How to maximize PCIe flash performance under Linux

Part 1: Aligning PCIe flash devices
Part 2: Creating the RAW device or filesystem
Part 3: Oracle ASM


My “Size matters: Everything you need to know about SSD form factors” blog in January spawned some interesting questions, a number of them on Z-height.

What is a Z-height anyway?
For a solid state drive (SSD), Z-height describes its thickness and is generally its smallest dimension. Z-height is a redundant term, since Z is a variable representing the height of an SSD. The “Z” is one of the variables – X, Y and Z, synonymous with length, width and height – that describe the measurements of a 3-dimensional object. Ironically, no one says X-length or Y-width, but Z-height is widely used.

What’s the state of affairs with SSD Z-height?
The Z-height has typically been associated with the 2.5" SSD form factor. As I covered in my January form factor blog, the initial dimensions of SSDs were modeled after hard disk drives (HDDs). The 2.5" HDD form factor featured various heights depending on the platter count – the more disks, the greater the capacity and the thicker the HDD. The first full-capacity 2.5" HDDs had a maximum Z-height of 19mm, which quickly dropped to 15mm to enable full-capacity HDDs in thinner laptops. By the time SSDs hit high-volume market acceptance, the dimensional requirements for storage had shrunk even more, to a maximum height of 12.5mm in the 2.5" form factor. Today, the Z-height of most 2.5" SSDs generally ranges from 5.0mm to 9.5mm.

With printed circuit board (PCB) form factor SSDs—those with no outer case—the Z-height is defined by the thickness of the board and its components, which can be 3mm or less. Some laptops have unique shape or height restrictions for the SSD space allocation. For example, the MacBook Air’s ultra-thin profile requires some of the thinnest SSDs produced.

A new standard in SSD thickness
The platter count of an HDD determines its Z-height. In contrast, an SSD’s Z-height is generally the same regardless of capacity. The proportion of SSD form factors deployed in systems is shifting from the traditional, encased SSDs to the new bare PCB SSDs. As SSDs drift away from the older form factors with different heights, consumers and OEM system designers will no longer need to consider Z-height because the thickness of most bare PCB SSDs will be standard.


The introduction of LSI® SF3700 flash controllers has prompted many questions about the PCIe® (PCI Express) interface and how it benefits solid state storage, and there’s no better person to turn to for insights than our resident expert, Jeremy Werner, Sr. Director of Product and Customer Management in LSI’s Flash Components Division (SandForce):

Most client-based SSDs have used SATA in the past, while PCIe was mainly used for enterprise applications. Why is the PCIe interface becoming so popular for the client market?

Jeremy: Over the past few decades, the performance of host interfaces for client devices has steadily climbed. Parallel ATA (PATA) interface speed grew from 33MB/s to 100MB/s, while the performance of the Serial ATA (SATA) connection rose from 1.5Gb/s to 6Gb/s. Today, some solid state drives (SSDs) use the PCIe Gen2 x4 (second-generation speeds with four data communication lanes) interface, supporting up to 20Gb/s (in each direction). Because the PCIe interface can simultaneously read and write (full duplex) and SATA can only read or write at one time (half-duplex), PCIe can potentially double the 20Gb/s speeds in a mixed (read and write) workload, making it nearly seven times faster than SATA.
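The arithmetic behind those figures can be sketched quickly:

```shell
# Four Gen2 lanes at 5Gb/s each, doubled for full duplex, compared
# against a 6Gb/s half-duplex SATA link
lanes=4; per_lane_gbps=5
one_direction=$((lanes * per_lane_gbps))   # 20 Gb/s in each direction
full_duplex=$((one_direction * 2))         # 40 Gb/s aggregate in a mixed workload
echo "$one_direction $full_duplex"
```

40Gb/s against SATA’s 6Gb/s works out to roughly 6.7x, hence “nearly seven times faster.”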

Will the PCIe interface replace SATA for SSDs?

Jeremy: Eventually the replacement is likely, but it will probably take many years in the single-drive client PC market given two hindrances. First, some single-drive client platforms must use a common HDD and SSD connection to give users the choice between the two devices, and because the 6Gb/s SATA interface already delivers more speed than hard disk drives can use, there is no immediate need for HDDs to move to the faster PCIe connection, leaving SATA as the sole interface for that segment. Second, the older personal computers already in consumers’ homes that need an SSD upgrade support only SATA storage devices, so there’s no opportunity for PCIe in that upgrade market.

By contrast, the enterprise storage market, and even some higher-end client systems, will migrate quickly to PCIe since they will see significant speed increases and can more easily integrate PCIe SSD solutions available now.

It is noteworthy that some standards, like M.2 and SATA Express, have defined a single connector that supports SATA or PCIe devices.  The recently announced LSI SF3700 is one example of an SSD controller that supports both of those interfaces on an M.2 board.

What is meant by the terms “x1, x2, x4, x16” when referencing a particular PCIe interface?

Jeremy: These numbers are the PCIe lane counts in the connection. Either the host (computer) or the device (SSD) could limit the number of lanes used. The theoretical maximum speed of the connection (not including protocol overhead) is the number of lanes multiplied by the speed of each lane.

What is protocol overhead?

Jeremy: PCIe, like many bus interfaces, uses a transfer encoding scheme – a set number of data bits represented by a slightly larger number of bits called a symbol. The additional bits in the symbol constitute the overhead of metadata required to manage the transmitted user data. PCIe Gen3 uses a more efficient 128b/130b encoding (about 1.5% overhead) instead of the 8b/10b (20% overhead) of PCIe Gen2, increasing usable data transfer speeds by roughly 23% on top of the raw signaling-rate increase.
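Using the PCIe spec’s encoding ratios (8b/10b for Gen2, 128b/130b for Gen3), the efficiency gain from the encoding change alone works out as:

```shell
# Fraction of raw bits carrying data: Gen2 = 8/10 (80%), Gen3 = 128/130 (~98.5%)
gain=$(awk 'BEGIN { printf "%.0f", (128/130)/(8/10)*100 - 100 }')
echo "~${gain}% more usable bandwidth from the encoding change alone"
```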

What is defined in the PCIe 2.0 and 3.0 specifications, and do end users really care?

Jeremy: Although each PCIe Gen3 lane is faster than PCIe Gen2 (8Gb/s vs 5Gb/s, respectively), lanes can be combined to boost performance in both versions. The changes most relevant to consumers pertain to higher speeds. For example, today consumer SSDs top out at 150K random read IOPS at 4KB data transfer sizes. That translates to about 600MB/s, which is insufficient to saturate a PCIe Gen2 x2 link, so consumers would see little benefit from a PCIe Gen3 solution over PCIe Gen2. The maximum performance of PCIe Gen2 x4 and PCIe Gen3 x2 devices is almost identical because of the different transfer encoding schemes mentioned previously.

Are there mandatory features that must be supported in any of these specifications?

Jeremy: Yes, but nearly all of these features have little impact on performance, so most users have no interest in the specs. It’s important to keep in mind that the PCIe speeds I’ve cited are defined as the maximums, and the spec has no minimum speed requirement. This means a PCIe Gen3 solution might support only a maximum of 5Gb/s, but still be considered a PCIe Gen3 solution if it meets the necessary specifications. So buyers need to be aware of the actual speed rating of any PCIe solution.

Is a PCIe Gen3 SSD faster than a PCIe Gen2 SSD?

Jeremy: Not necessarily. For example, a PCIe Gen2 x4 SSD is capable of higher speeds than a PCIe Gen3 x1 SSD. However, bottlenecks other than the front-end PCIe interface will limit the performance of many SSDs. Examples of other choke points include the bandwidth of the flash, the processing/throughput of the controller, the power or thermal limitations of the drive and its environment, and the ability to remove heat from that environment. All of these factors can, and typically do, prevent the interface from reaching its full steady-state performance potential.

In what form factors are PCIe cards available?

Jeremy: PCIe cards are typically referred to as plug-in products, much like SSDs, graphics cards and host-bus adapters. PCIe SSDs come in many form factors, with the most popular called “half-height, half-length.” But the popularity of the new, tiny M.2 form factors is growing, driven by rising demand for smaller consumer computers. There are other PCIe form factors that resemble traditional hard disk drives, such as the SFF-8639, a 2.5” hard disk drive form factor that features four PCIe lanes and is hot pluggable. What’s more, its socket is compatible with the SAS and SATA interfaces. The adoption of the SATA Express 2.5” form factor has been limited, but could be given a boost with the availability of new capabilities like SRIS (Separate Refclk with Independent SSC), which enables the use of lower cost interconnection cables between the device and host.

Are all M.2 cards the same?

Jeremy: No. All SSD M.2 cards are 22 mm wide (while some WAN cards are 30 mm wide), but the specification allows for different lengths (30, 42, 60, 80, and 110 mm). What’s more, the cards can be single- or double-sided to account for differences in the thickness of the products. Also, they are compatible with two different sockets (socket 2 and socket 3). SSDs compatible with both socket types, or only socket 2, can connect only two lanes  (x2), while SSDs compatible with only socket 3 can connect up to four (x4).

In my last few blogs, I covered various aspects of SSD form factors and included many images of the types that Jeremy mentioned above. I also delve deeper into details of the M.2 form factor in my blog “M.2: Is this the Prince of SSD form factors?” One thing about PCIe is certain: It is the next step in the evolution of computer interfaces and will give rise to more SSDs with higher performance, lower power consumption and better reliability.



What is Oracle ASM?
The Oracle® automatic storage management system (ASM) was developed 10 years ago to make it much easier for database administrators (DBAs) to use and tune database storage. Oracle ASM enables DBAs to:

  • Automatically stripe data across all RAW devices in a disk group to improve database storage performance
  • Mirror data for greater fault tolerance
  • Simplify the management and extension of database storage for the cloud and, with the ASM Cluster File System (ACFS), use the snapshot and replication functionality to increase availability
  • Add the Oracle Real Application Clusters (RAC) capability to help reduce total cost of ownership (TCO), expand scalability and increase availability, among other benefits
  • Easily move data from one device to another while the database is active with no performance degradation
  • Reduce or eliminate storage or Linux administrator time for configuring database storage
  • Use ASM as a Linux®/Unix operating system file system called ACFS. (I know what you are thinking. Since you need Oracle Grid up and running to mount and use ASM, how can an ACFS device be available to the operating system at system boot? The reason is that the kernel has been modified to allow this functionality. Learn more about ACFS here.)
  • Do all of this at no extra cost – ASM comes with Oracle Grid

The drawbacks of using Oracle ASM:

  • DBAs now control the storage they are using. Therefore, they need to know more about the storage and how the logical unit numbers (LUNs) are being used by Oracle ASM, and how to create ASM disk groups for higher performance.
  • Most ASM commands are executed through SQLPlus, not through the command line. That means storage is accessed through SQLPlus and sometimes ASMCMD, isolating the storage and making it harder for Linux admins to identify storage issues.
  • Recovery Manager (RMAN) is the only guaranteed/supported method of backing up databases on ASM.

What will be covered in this blog and what won’t
ASM is quite complex to learn and to set up properly for both performance and high availability. I won’t be going over all the commands and configurations of ASM, but I will cover how to set up an aligned LSI Nytro WarpDrive and Nytro MegaRAID PCIe® card and create an ASM disk to be assigned to an ASM disk group. There are many websites and books that go over all the details of Oracle ASM, and the most current book that I would recommend is “Database Cloud Storage: The Essential Guide to Oracle Automatic Storage Management.” Or visit Oracle’s docs.oracle.com website.

Setting up ASM
The following steps cover configuring a LUN for ASM. In order to use ASM, you will need to install the Oracle Grid software from otn.oracle.com. I prefer using Oracle ASMLIB when configuring ASM. Included with the latest version of Oracle Linux, ASMLIB offers an easier way to configure ASM. If you are using an older version, you will need to install the ASM RPMs from support.oracle.com.

Step 1: Create aligned partition
Refer to Part 1 of this series to create a LUN on a 1M boundary. Oracle recommends using the full disk for ASM, so just create one large aligned partition. I suggest using this command:

echo "2048,," | sfdisk -uS /dev/sdX --force

Step 2: Create an ASM disk
Once the device has an aligned partition created on it, we can assign it to ASM by using the ASM createdisk command with two input parameters – ASM disk name and the PCIe flash partitioned device name – as follows:

/etc/init.d/oracleasm createdisk ASMDISK1 /dev/sda1

To verify that the create ASM disk process was successful, and the device was marked as an ASM disk, enter the following commands:

/etc/init.d/oracleasm querydisk /dev/sda1

(the output should state: “/dev/sda1 is an Oracle ASM disk [OK]”)

/etc/init.d/oracleasm listdisks

(the output should state: ASMDISK1)

Step 3: Assign ASM disk to disk group
The ASM disk group is the primary component of ASM as well as the highest level data structure in ASM. A disk group is a container of multiple ASM disks, and it is the disk group that the database references when creating Oracle Tablespaces.

There are multiple ways to create an ASM disk group. The easiest way is to use ASM Configuration Assistant (ASMCA), which walks you through the creation process. See Oracle ASM documentation on how to use ASMCA.

Here are the steps for creating a disk group:

a: Log in to GRID using sqlplus / as sysasm.

b: Select name, path, header_status and state from v$asm_disk to confirm the disk is visible:

c: Create diskgroup DG1 external redundancy disk using this command:
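A hedged sketch of the SQL behind steps b and c (the disk group name DG1 is from the step above; the ORCL: disk path assumes ASMLIB and should be adjusted for your environment):

```sql
-- Step b: confirm the disk is visible to ASM
-- (header_status should read PROVISIONED or MEMBER)
SELECT name, path, header_status, state FROM v$asm_disk;

-- Step c: create an external-redundancy disk group on the
-- ASMLIB disk created earlier
CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY DISK 'ORCL:ASMDISK1';
```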


The disk group is now ready to be used in creating an Oracle database Tablespace. To use this disk group in an Oracle database, please refer to Oracle’s database documentation at docs.oracle.com.

In Part 4, the final installment of this series, I’ll discuss how to persist assignment to dynamically changing  Nytro WarpDrive and Nytro MegaRAID PCIe cards.


One of the coolest parts of my job is talking with customers and partners about their production environment challenges around database technology.  A topic of particular interest lately is in-memory database (IMDB) systems and their integration into an existing environment.

The need for speed
Much of the media coverage of IMDB integrations is heavily focused on speed and loaded with terms like real-time processing, on-demand analytics and memory speed.  But zeroing in on the performance benefits comes at the expense of so many other key aspects of IMDBs. The technology needs to be evaluated as a whole.

Granted, in-memory databases can store data structures in DRAM with latency that is measured in nanoseconds. (Latency of disk-based technology, comparatively, is glacial – clocked in milliseconds.)  Depending on the workload and the vendor’s database engine architecture, DRAM processing can improve database performance by as much as 50X-100X.

How durable is it?
Keep in mind that most relational database systems conform to the ACID (Atomicity, Consistency, Isolation and Durability) properties of transactions. (You can find a more thorough investigation of these properties in the paper “The Transaction Concept: Virtues and Limitations,” authored by database pioneer Jim Gray.) The durability property naturally raises the question: how is data protected from DRAM failures when things go haywire, and what is the recovery experience like? Relational databases implement the durability property to ensure transaction information survives power loss or hardware failure.

The commonly used WAL (Write Ahead Logging) method ensures that the transaction data is written to a log file (persisted on non-volatile storage) before it is committed and subsequently written to a data file (persisted on non-volatile storage). When the database engine restarts after a failure, it switches to recovery mode to read the log file and determine if the transactions should be rolled forward (committed) or rolled back (cancelled), depending on their state at the time of failure.

Current in-memory database systems support durability and their implementations vary by vendor.  Here is a sampling of durability techniques they use:

  • WAL (Write Ahead Logging)
    • Traditional method described above using a log file.
    • Changes are persisted to non-volatile storage that is used for recovery.
  • Replication
    • Data is copied to more than one location, and can be across different nodes.
    • Recovery can be handled using failover to alternate nodes.
  • Snapshots
    • Database snapshots are taken at intervals.
    • Previous snapshots can be used for recovery.
  • Data Tiering
    • Frequently accessed data resides only in in-memory DRAM structures.
    • Archival or less frequently accessed data resides only on non-volatile storage.
    • Replication can be used as well.

Shopping tip: Consider durability when evaluating your options
If changes in your data environment are frequent and require greater persistence and consistency, be sure to also consider durability when evaluating and comparing vendor implementations.  Durability is no less important than query speed.  Different implementations may or may not be a good fit and in some cases might require additional hardware that can increase cost.

It’s easy to get swept away by all the media attention about how in-memory databases deliver blazing performance, but customers often tell me they would gladly give up some performance for rock-solid stability and interoperability.

For our part, LSI enterprise PCIe® flash storage solutions not only perform well but also include DuraClass™ technology, which can increase the endurance, reliability and power efficiency of non-volatile storage used for in-memory database systems.




My first blog in this series, “How to maximize performance of PCIe flash for enterprise applications running on Linux,” describes the steps for aligning PCIe® flash devices. This blog covers the next stage of setting up the PCIe flash device when using the Linux® operating system: creating a RAW device or a file system.

At this point, one or more PCIe flash cards have been partitioned on a sector boundary. Depending on their use, these partitioned devices are either set up as a single RAW device or as part of a logical volume or RAID array.

The next step is to determine how these devices will be used. Most administrators will create file systems on the partitions. Some Oracle administrators will use them as RAW devices and assign them to Automatic Storage Management (ASM). Still others, looking for the best possible performance from the device, will stick with a RAW device. For many years, the general recommendation was to avoid RAW devices because the complexity of managing them outweighed their small potential performance gains.

ASM uses RAW devices but makes administration of these devices much easier. More on ASM in Part 3 of this series.

Building a file system
Next is to build a file system on the RAW device, LVM or RAID array. But first we need to determine the best type of file system to use. There are many to choose from, including:

  • EXT-2
  • EXT-3
  • EXT-4
  • XFS
  • ZFS

To keep this brief, I will only go over EXT-4. This file system is the most current of the EXT family and provides the latest enhancements, including larger capacities, the ability to disable journaling and many other capabilities, though XFS can be a higher-performance alternative.

To create an EXT-4 file system, use this command:

mkfs.ext4 /dev/sdX1

You can now turn certain features of the EXT-4 file system on or off by using “tune2fs”. Here are a couple of examples of using tune2fs:

  • To list all file system features for /dev/sdX1, use this tune2fs command:

tune2fs -l /dev/sdX1 | grep 'Filesystem features'

  • To disable journaling on /dev/sdX1, use this tune2fs command:

tune2fs -O ^has_journal /dev/sdX1

Mounting the file system
The next step is to mount the file system and assign the owner:group to the mount point. There are also many tuning options that can be added to the mount command when using PCIe flash cards. The mount options I use are:

  • noatime, nodiratime – skip file and directory access-time updates
  • max_batch_time=0 – don’t make journal commits wait to batch with other transactions
  • nobarrier – disable write barriers
  • discard – enable online TRIM of freed blocks

The mount command for /dev/sda1 to /u01 would be:

mount -o noatime,nodiratime,max_batch_time=0,nobarrier,discard /dev/sda1 /u01

To make these mount points persistent across reboots, add them to /etc/fstab. Finally, give the owning user rights for reading and writing to the mount point by assigning ownership. For example, to assign /u01 to the oracle userid and the dba group, use the “chown” command:

chown oracle:dba /u01
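For reference, a hedged sketch of a persistent fstab entry using these options (the UUID below is a placeholder; find the real one with blkid):

```
# Find the filesystem UUID (stable across reboots, unlike /dev/sda1):
#   blkid /dev/sda1
# /etc/fstab entry, all on one line:
UUID=0a1b2c3d-e4f5-6789-abcd-ef0123456789  /u01  ext4  noatime,nodiratime,max_batch_time=0,nobarrier,discard  0 0
```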

The PCIe flash device is now ready to be used.

Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.

Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.


Customer dilemma: I just purchased PCIe® flash cards to increase performance of my enterprise applications that run on Linux® and Unix®. How do I set them up to get the best performance?

Good question. I wish there were a simple answer but each environment is different. There is no cookie-cutter configuration that fits all, though a few questions will reveal how the PCIe flash cards should be configured for optimum performance.

Most of the popular relational and non-relational databases run on many different operating systems. I will be describing Linux-specific configurations, but most of them should also work with Unix systems that are supported by the PCIe flash card vendor. I’m a database guy, but the same principles and techniques I’ll be covering apply to other applications like mail servers, web servers and application servers as well as databases.

Aligning PCIe flash devices
The most important step to perform on each PCIe flash card is to create a partition that is aligned on a specific boundary (such as 4k or 8k) so each read and write to the flash device will require only one physical input/output (IO) operation. If the card is not partitioned on such a boundary, then reads and writes will span the sector groups, which doubles the IO latency for each read or write request.

To align a partition, I use the sfdisk command to start the partition on a 1M boundary (sector 2048). Aligning to a 1M boundary automatically satisfies any smaller alignment requirement, such as 4k, 8k or even 64k. But before I do this, I need to know how the device will be used. Will it be a standalone partition? Part of a logical volume? Or part of a RAID group?

Which one is best?
If I were deploying the PCIe flash device for database caching (for example, the Oracle database has provided this caching functionality for years using the Database Smart Flash Cache feature, and Facebook created the open source Flashcache used in MySQL databases), I would use a single-partitioned PCIe flash card if I knew the capacity would meet my needs now and over the next 5 years. If I selected this configuration, the sfdisk command to create the partition would be:

echo "2048,," | sfdisk -uS /dev/sdX --force
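As a quick sanity check on the 2048 figure: with 512-byte logical sectors, sector 2048 puts the partition start at exactly 1MiB, which every 4k, 8k and 64k alignment divides evenly.

```shell
# Start offset in bytes for a partition beginning at sector 2048
offset=$((2048 * 512))
echo "start offset: $offset bytes"
# All three remainders are zero, so the 1MiB start satisfies 4k/8k/64k alignment
echo "remainders: $((offset % 4096)) $((offset % 8192)) $((offset % 65536))"
```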

This single partitioning is also required with the Oracle® Automatic Storage Management system (ASM). Oracle has provided ASM for many years and I will go over how to use this storage feature in Part 3 of this series.

If I need to deploy multiple PCIe flash cards for database caching, I would create Logical Volume Manager (LVM) over all the flash devices to simplify administration. The sfdisk command to create a partition for each PCIe flash card would be:

echo "2048,,8e" | sfdisk -uS /dev/sdX --force

“8e” is the system partition type for creating a logical volume.

Neither of these solutions needs fault tolerance since they will be used for write-thru caching. My recent blog “How to optimize PCIe flash cards – a new approach to creating logical volumes” covers this process in detail.

If I want to use the PCIe flash card for persisting data, I would need to make the PCIe flash cards fault tolerant, using two or more cards to build the RAID array and eliminate any single point of failure. There are a number of ways to create a RAID over multiple PCIe flash cards, two of which are:

  • Use LVM with the RAID option.
  • Use the software RAID utility MDADM (multiple device administration) to create the RAID array.
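For illustration only, a hedged sketch of the MDADM option (device names, array name and chunk size are assumptions; these commands need root and real devices):

```
# RAID 10 across two aligned card partitions, 1MiB (1024KiB) chunk:
mdadm --create /dev/md0 --level=10 --raid-devices=2 --chunk=1024 /dev/sda1 /dev/sdb1

# Or RAID 5 across three cards for the budget/data-warehouse case:
mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=1024 /dev/sda1 /dev/sdb1 /dev/sdc1
```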

But what type of RAID setup is best to use?
Oracle coined the term S.A.M.E. – Stripe And Mirror Everything – in 1999 and popularized the practice, which many database administrators (DBA) and storage administrators have followed ever since. I follow this practice and suggest you do the same.

First, you need to determine how these cards will be accessed:

  • Small random reads and writes
  • Larger sequential reads
  • Hybrid (mix of both)

In database deployments, your choice is usually among online transaction processing (OLTP) applications like airline and hotel reservation systems and corporate financial or enterprise resource planning (ERP) applications, or data warehouse/data mining/data analytics applications, or a mix of both environments. OLTP applications involve small random reads and writes as well as many sequential writes for log files. Data warehouse/data mining/data analytics applications involve mostly large sequential reads with very few sequential log writes.

Before setting up one or many PCIe flash cards in a RAID array either using LVM on RAID or creating a RAID array using MDADM, you need to know the access pattern of the IO, capacity requirements and budget. These requirements will dictate which RAID level will work best for your environment and fit your budget.

I would pick either a RAID 1/RAID 10 configuration (mirroring without striping, or striping and mirroring respectively), or RAID 5 (striping with parity). RAID 1/RAID 10 costs more but delivers the best performance, whereas RAID 5 costs less but imposes a significant write penalty.

Optimizing OLTP application performance
To optimize performance of an OLTP application, I would implement either a RAID 1 or RAID 10 array. If I were budget constrained, or implementing a data warehouse application, I would use a RAID 5 array. Normally a RAID 5 array will produce the higher throughput (megabytes per second) appropriate for a data warehouse/data mining application.

In a nutshell, knowing how to tune the configuration to the application is key to reaping the best performance.

For either RAID array, you need to create an aligned partition using sfdisk:

echo "2048,,fd" | sfdisk -uS /dev/sdX --force

“fd” is the system identifier for a Linux RAID auto device.

Keep in mind that it is not mandatory to create a partition for LVMs or RAID arrays. Instead, you can assign RAW devices. It’s important to remember to align the sectors if combining RAW and partitioned devices or just creating a basic partition. It’s sound practice to always create an aligned partition when using PCIe flash cards.

At this point, aligned partitions have been created and are ready to be used in LVMs or RAID arrays. Instructions for creating these are available on the web and in Linux/Unix reference manuals, and many sites walk through the process of creating LVM, RAID, or LVM on RAID.


Specifying a stripe width value
Also remember that, when creating LVMs with striping or RAID arrays, you’ll need to specify a stripe width value. Many years ago, Oracle and EMC conducted a number of studies on this and concluded that a 1M stripe width performed the best as long as the database IO request was equal to or less than 1M. When implementing Oracle ASM, Oracle’s standard is to use 1M allocation units, which matches its coarse striping size of 1M.
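As a hedged illustration (the volume group and logical volume names are hypothetical, and the command needs root and real physical volumes), a striped LVM matching that 1M stripe width might look like:

```
# Two stripes across the PVs in vgflash, 1024KiB (1MiB) stripe size;
# vgflash and dblv are placeholder names
lvcreate -i 2 -I 1024 -L 500G -n dblv vgflash
```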

Part 2 of this series will describe how to create RAW devices or file systems.

Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.

Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
