I started working years ago to engage large datacenters, learn what their problems are and try to craft solutions for their problems. It’s taken years, but we engaged them, learned, changed how we thought about storage and began creating solutions that are being deployed at scale.
We’ve started to do the same with the Chinese Internet giants. They’re growing at an incredible rate. They have similar problems, but it’s surprising how different their solution approaches are. Each one is unique. And we’re constantly learning from these guys.
So to wrap up the blog series on my interview with CIO & CEO magazine, here are the last two questions to explain a bit more.
CEO & CIO: Please use examples to tell the stories about the forward-looking technologies and architectures that LSI has jointly developed with Internet giants.
While our host bus adapters (HBAs) and MegaRAID® solutions have been part of the hyperscale Internet companies’ infrastructure since the beginning, we have only recently worked very closely with them to drive joint innovation. In 2009 I led the first LSI engagement with what we then called “mega datacenters.” It took a while to understand what they were doing and why. By 2010 we realized there were specialized needs, and began to imagine new hardware products that worked with these datacenters. Out of this work came the realization that flash was important for efficiency and capability, and the “invention” of LSI® Nytro™ product portfolio. (More are in the pipeline). We have worked closely with hyperscale datacenters to evolve and tune these solutions, to where Nytro products have become the backbone of their main revenue platforms. Facebook has been a vitally important partner in evolving our Nytro platform – teaching us what was truly needed, and now much of their infrastructure runs on LSI products. These same products are a good fit for other hyperscale customers, and we are slowly winning many of the large ones.
Looking forward, we are partnered with several Internet giants in the U.S. and China to work on cold storage solutions, and more importantly shared DAS (Distributed DAS: D-DAS) solutions. We have been demonstrating prototypes. These solutions enable pooled architectures and rack scale architecture, and can be made to work tightly with software-defined datacenters (SDDCs). They simplify management and resource allocation – making task deployment more efficient and easier. Shared DAS solutions increase infrastructure efficiency and improves lifecycle management of components. And they have the potential to radically improve application performance and infrastructure costs.
Looking further into the future, we see even more radical changes in silicon supporting transport protocols and storage models, and in rack scale architectures supporting storage and pooled memory. And cold storage is a huge though, some would say, boring problem that we are also focused on – storing lots of data for free and using no power to do it… but I really can’t talk about any of that.
CEO & CIO: LSI maintains good contact with big Internet companies in China. What are the biggest differences between dealing with these Internet enterprises and dealing with traditional partners?
Yes, we have a very good relationship with large Chinese Internet companies. In fact, I will be visiting Tencent, Alibaba and Baidu in a few weeks. One of the CTOs I would like to say is a friend. That is, we have fun talking together about the future.
These meetings have evolved. The first meetings LSI had about two years ago were sales calls, or support for OEM storage solutions. These accomplished very little. Once we began visiting as architects speaking to architects, real dialogs began. Our CEO has been spending time in China meeting with these Internet companies both to learn, and to make it clear that they are important to us, and we want a chance to solve their problems. But the most interesting conversations have been the architectural ones. There have been very clear changes in the two years I have traveled within China – from standard enterprise to hyperscale architectures.
We’ve received fascinating feedback on architecture, use, application profiles, platforms, problems and goals. We have strong engagement with the U.S. Internet giants. At the highest level, the Chinese Internet companies have similar problems and goals. But the details quickly diverge because of revenue per user, resources, power availability, datacenter ownership and Internet company age. The use of flash is very different.
The Chinese Internet giants are at an amazing change point. Most are ready for explosive growth of infrastructure and deployment of cloud services. Most are changing from standard OEM systems and architectures to self-designed hyperscale systems after experimenting with Scorpio and microserver deployments. Several, like JD.com (an Amazon-like company) are moving from hosted to self-built infrastructure. And there seems to be a general realization that the datacenter has changed from a compute-centric model to a dataflow model, where storage and network dictate how much work gets done more than the CPU does. These giants are leveraging their experience and capability to move very quickly, and in a few cases are working to create true pooled rack level architectures much like Facebook and Google have started in the U.S. In fact, Baidu is similar to Facebook in this approach, but is different in its longer term goals for the architecture.
The Chinese companies are amazingly diverse, even within one datacenter, and arguments on architectural direction are raging within these Internet giants – it’s healthy and exciting. However, the innovations that are coming are similar to those developed by large U.S. Internet companies. Personally I have found these Internet companies much more exciting and satisfying to work with than traditional OEMs. The speed and cadence of advancement, the recognition of problems and their importance, the focus on efficiency and optimization have been much more exciting. And the youthful mentality and view to problems, without being burdened by “the way we’ve always done this” has been wonderful.
Also see these blogs of mine over the past year, where you can read more about some of these changes:
“Postcard from Shenzhen: China’s hyperscale datacenter growth, mixed with a more traditional approach”
“China in the clouds, again”
“China: A lot of talk about resource pooling, a better name for disaggregation”
Or see them (and others) all here.
Summary: So it’s taken years, but we engaged U.S. Internet giants, learned about their problems, changed how we thought about storage and began creating solutions that are now being deployed at scale. And we’re constantly learning from these guys. Constantly, because their problems are constantly changing.
We’ve now started to do the same with the Chinese Internet giants. They have similar problems, and will need similar solutions, but they are not the same. And just like the U.S. Internet giants, each one is unique.
Tags: Alibaba, Amazon, Baidu, CEO & CIO Magazine, China, cloud services, cold storage, D-DAS, DAS, datacenter, datacenter ecosystem, direct attached storage, distributed DAS, Facebook, flash, flash storage, Google, HBA, host bus adapter, hyperscale datacenter, Internet, JD.com, MegaRAID, OEM, original equipment manufacturer, Scorpio, Tencent
One of the coolest parts of my job is talking with customers and partners about their production environment challenges around database technology. A topic of particular interest lately is in-memory database (IMDB) systems and their integration into an existing environment.
The need for speed
Much of the media coverage of IMDB integrations is heavily focused on speed and loaded with terms like real-time processing, on-demand analytics and memory speed. But zeroing in on the performance benefits comes at the expense of so many other key aspects of IMDBs. The technology needs to be evaluated as a whole.
Granted, in-memory databases can store data structures in DRAM with latency that is measured in nanoseconds. (Latency of disk-based technology, comparatively, is glacial – clocked in milliseconds.) Depending on the workload and the vendor’s database engine architecture, DRAM processing can improve database performance by as much as 50X-100X.
How durable is it?
Keep in mind that most relational database systems conform to the ACID (Atomicity, Consistency, Isolation, and Durability) properties of transactions. (You can find a more thorough investigation of these properties in this paper – “The Transaction Concept: Virtues and Limitation” – authored by database pioneer Jim Gray.) The matter of relational database system durability naturally raises the question: But how is data protected from DRAM failures when things go haywire and what is the recovery experience like? Relational databases implement the durable property to prevent problems associated with power loss or hardware failure to ensure transaction information is permanently captured.
The commonly used WAL (Write Ahead Logging) method ensures that the transaction data is written to a log file (persisted on non-volatile storage) before it is committed and subsequently written to a data file (persisted on non-volatile storage). When the database engine restarts after a failure, it switches to recovery mode to read the log file and determine if the transactions should be rolled forward (committed) or rolled back (cancelled), depending on their state at the time of failure.
Current in-memory database systems support durability and their implementations vary by vendor. Here is a sampling of durability techniques they use:
Shopping tip: Consider durability when evaluating your options
If changes in your data environment are frequent and require greater persistence and consistency, be sure to also consider durability when evaluating and comparing vendor implementations. Durability is no less important than query speed. Different implementations may or may not be a good fit and in some cases might require additional hardware that can increase cost.
It’s easy to get swept away by all the media attention about how in-memory databases deliver blazing performance, but customers often tell me they would gladly give up some performance for rock-solid stability and interoperability.
For our part, LSI enterprise PCIe® flash storage solutions not only perform well but also include DuraClass™ technology, which can increase the endurance, reliability and power efficiency of non-volatile storage used for in-memory database systems.
Tags: ACID, data tiering, database, DRAM, durability, DuraClass, flash storage, IMDB, in-memory database systems, PCI Express, PCIe flash, relational database, replication, snapshots, WAL, write ahead logging
My first blog in this series, “How to maximize performance of PCIe flash for enterprise applications running on Linux,” describes the steps for aligning PCIe® flash devices. This blog covers the next stage of setting up the PCIe flash device when using the Linux® operating system: creating a RAW device or a file system.
At this point, one or more PCIe flash cards have been partitioned on a sector boundary. Depending on their use, these partitioned devices are either set up as a single RAW device or as part of a logical volume or RAID array.
Next step is to determine how these devices will be used. Most administrators will create file systems on these partitions. Some Oracle administrators will use them as RAW devices and assign them to Automatic Storage Management (ASM). Still others, those looking for the best performance possible from the device will stick to a RAW device. For many years, the recommendation was not to use RAW devices because the complexity of managing them outweighed their small potential gains in performance.
ASM uses RAW devices but makes administration of these devices much easier. More on ASM in Part 3 of this series.
Building a file system
Next is to build a file system on the RAW device, LVM or RAID. But first we need to determine the best type of file system to use. There are many to choose from including:
To keep this brief, I will only go over EXT-4. This type of file system is the most current and provides the latest enhancements for increasing capacity, disabling journaling and many other capabilities, though XFS can be a higher performance alternative.
To create an EXT-4 file system, use this command:
You can now turn off or on certain features of the EXT-4 file system by using “tune2fs. Here are a couple of examples of using tune2fs:
tune2fs –l /dev/sdX1 | grep ‘Filesystem features’
tune2fs -O ^has_journal /dev/sdX1
Mounting the file system
The next step is to mount the file system and assign the owner:group to the mount point. There are also many tuning options that can be added to the mount command when using PCIe flash cards. The mount options I use are:
The mount command for /dev/sda1 to /u01 would be:
mount –o noatime,nodiratime,max_batch_time=0, nobarrier,discard /dev/sda1 /u01
To make these mount points persistent over reboots, add them to the mount entries in ‘/etc/fstab’. Finally, you need to give a user rights for reading and writing to this mount point, and to assign ownership to /u01 – for example, assigning ownership of /u01 to the oracle userid and to the dba group. To do this, use the “chown” command:
chown oracle:dba /u01
The PCIe flash device is now ready to be used.
Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.
Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
Customer dilemma: I just purchased PCIe® flash cards to increase performance of my enterprise applications that run on Linux® and Unix®. How do I set them up to get the best performance?
Good question. I wish there were a simple answer but each environment is different. There is no cookie-cutter configuration that fits all, though a few questions will reveal how the PCIe flash cards should be configured for optimum performance.
Most of the popular relational and non-relational databases run on many different operating systems. I will be describing Linux-specific configurations, but most of them should also work with Unix systems that are supported by the PCIe flash card vendor. I’m a database guy, but these same principals and techniques that I’ll be covering apply to other applications like mail servers, web servers, application servers and, of course, databases.
Aligning PCIe flash devices
The most important step to perform on each PCIe flash card is to create a partition that is aligned on a specific boundary (such as 4k or 8k) so each read and write to the flash device will require only one physical input/output (IO) operation. If the card is not partitioned on such a boundary, then reads and writes will span the sector groups, which doubles the IO latency for each read or write request.
To align a partition, I use the sfdisk command to start a partition on a 1M boundary (sector 2048). Aligning to a 1M boundary resolves the dependency to align to a 4k, 8k, or even a 64k boundary. But before I do this, I need to know how I am going to use this device. Will this be a standalone partition? Part of a logical volume? Or part of a RAID group?
Which one is best?
If I were deploying the PCIe flash device for database caching (for example, the Oracle database has provided this caching functionality for years using the Database Smart Flash Cache feature, and Facebook created the open source Flashcache used in MySQL databases), I would use a single-partitioned PCIe flash card if I knew the capacity would meet my needs now and over the next 5 years. If I selected this configuration, the sfdisk command to create the partition would be:
echo “2048,,” | sfdisk –uS /dev/sdX –force
This single partitioning is also required with the Oracle® Automatic Storage Management system (ASM). Oracle has provided ASM for many years and I will go over how to use this storage feature in Part 3 of this series.
If I need to deploy multiple PCIe flash cards for database caching, I would create Logical Volume Manager (LVM) over all the flash devices to simplify administration. The sfdisk command to create a partition for each PCIe flash card would be:
echo “2048,,8e” | sfdisk –uS /dev/sdX –force
“8e” is the system partition type for creating a logical volume.
Neither of these solutions needs fault tolerance since they will be used for write-thru caching. My recent blog “How to optimize PCIe flash cards – a new approach to creating logical volumes” covers this process in detail.
If I want to use the PCIe flash card for persisting data, I would need to make the PCIe flash cards fault tolerant, using two or more cards to build the RAID array and eliminate any single point of failure. There are a number of ways to create a RAID over multiple PCIe flash cards, two of which are:
But what type of RAID setup is best to use?
Oracle coined the term S.A.M.E. – Stripe And Mirror Everything – in 1999 and popularized the practice, which many database administrators (DBA) and storage administrators have followed ever since. I follow this practice and suggest you do the same.
First, you need to determine how these cards will be accessed:
In database deployments, your choice is usually among online transaction processing (OLTP) applications like airline and hotel reservation systems and corporate financial or enterprise resource planning (ERP) applications, or data warehouse/data mining/data analytics applications, or a mix of both environments. OLTP applications involve small random reads and writes as well as many sequential writes for log files. Data warehouse/data mining/data analytics applications involve mostly large sequential reads with very few sequential log writes.
Before setting up one or many PCIe flash cards in a RAID array either using LVM on RAID or creating a RAID array using MDADM, you need to know the access pattern of the IO, capacity requirements and budget. These requirements will dictate which RAID level will work best for your environment and fit your budget.
I would pick either a RAID 1/RAID 10 configuration (mirroring without striping, or striping and mirroring respectively), or RAID 5 (striping with parity). RAID 1/RAID 10 costs more but delivers the best performance, whereas RAID 5 costs less but imposes a significant write penalty.
Optimizing OLTP application performance
To optimize performance of an OLTP application, I would implement either a RAID 1 or RAID 10 array. If I were budget constrained, or implementing a data warehouse application, I would use a RAID 5 array. Normally a RAID 5 array will produce a higher throughput (megabits per second) appropriate for a data warehouse/data mining application.
In a nutshell, knowing how to tune the configuration to the application is key to reaping the best performance.
For either RAID array, you need to create an aligned partition using sfdisk:
echo “2048,,fd” | sfdisk –uS /dev/sdX –force
“fd” is the system identifier for a Linux RAID auto device.
Keep in mind that it is not mandatory to create a partition for LVMs or RAID arrays. Instead, you can assign RAW devices. It’s important to remember to align the sectors if combining RAW and partitioned devices or just creating a basic partition. It’s sound practice to always create an aligned partition when using PCIe flash cards.
At this point, aligned partitions have been created and are now ready to be used in LVMs or RAID arrays. Instructions for creating these are on the web or in Linux/Unix reference manuals. Here are a couple of websites that go over the process of creating LVM, RAID, or LVM on RAID:
Specifying a stripe width value
Also remember that, when creating LVMs with striping or RAID arrays, you’ll need to specify a stripe width value. Many years ago, Oracle and EMC conducted a number studies on this and concluded that a 1M stripe width performed the best as long as the database IO request was equal to or less than 1M. When implementing Oracle ASM, Oracle’s standard is to use 1M allocation units, which matches its coarse striping size of 1M.
Part 2 of this series will describe how to create RAW devices or file systems.
Part 3 of this series will describe how to use Oracle ASM when deploying PCIe flash cards.
Part 4 of this series will describe how to persist assignment to dynamically changing NWD/NMR devices.
Tags: ASM, automatic storage management, data analytics, data mining, data warehouse, Database Smart Cache, EMC, enterprise resource planning, ERP, Facebook, flash storage, Flashcache, Linux, logical volume, Logical Volume Manager, LVM, MDADM, multiple device administration, MySQL, non-relational database, OLTP, online transaction processing, Oracle, partition, PCI Express, PCIe flash, performance, RAID, RAW, relational database, SAME, sector, Stripe and Mirror Everything, Unix
We all watch the local weather and wonder how forecasters predict (or in some cases mis-predict) the future of weather. While they may not all agree on the forecast, they do agree that the more current and historical data you have, the better your ability to predict what might happen over the next hours, days and weeks.
A term used to describe this growing amount of information is Big Data, and more and more of it leverages Hadoop, a flexible architecture that provides the analysis tools and scalability required to comb through and utilize all available data. When recently talking to a US-based meteorologist (the technical name for a degreed weather forecaster), I learned that meteorologists rely on many different weather models from various sources to help create their forecasts.
Weather spawns downpour of Big Data
These models collect massive amounts of weather information from around the world. Using this information, computers then run billions of calculations to mimic the motion of weather patterns in the Earth’s dynamic atmosphere and produce forecasts for any given location over time. It was interesting to learn that not all weather models are equal.
While weather modeling websites worldwide collect this atmospheric data and provide it to meteorologists, the European community is seen as having the most accurate information. When I asked why, I learned that European weather modeling sites have some of the fastest computer hardware and technology, enabling them to analyze more data faster, which produces better overall forecasts. The US weather professional I spoke with tends to use these European sites as part of his analysis, and when European models conflict with those from US sites, he often leans toward the European data.
His use of the European weather modeling sites points to the value of fast, accurate analysis of Big Data. It also underscores the implications of vast amounts of data overwhelming the ability of the compute and storage resources available to process it. An accurate and timely weather forecast is critical and a bad or missed forecast can have terrible and even deadly consequences.
A case in point: Hurricane Sandy
In this article on Hurricane Sandy forecast speed and accuracy, you can see how removing just one source of data can dramatically reduce the accuracy of predicting a critical event such as where a hurricane will make landfall. To be sure, the more data you can store and the faster you can process it for analysis, the greater your potential competitive advantage, even in the vaunted halls of meteorological analysis and prediction.
The Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. This gets interesting as you think about and explore the ripple effect of accurate or inaccurate forecasting in many areas. In my next blog post I will explore one of those – flu vaccines.
Whether in meteorology or other fields that leverage Big Data technologies, the use of Hadoop for high levels of speed and accuracy in Big Data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.
This three-part series examines some of the diverse uses of Big Data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize Big Data application performance.
Tags: application accleration, big data, European weather modeling, flash, flash storage, Hadoop, Hurricane Sandy, meterology, Nytro, processing performance, storage performance, weather modeling