Try using a sledgehammer to pack 15 pounds of potatoes into a bag with a 5-pound capacity, and what do you end up with? Too much messy and disgusting material crammed into a vessel too small for the job and a lot of sloppy overspill.
Unless you have the right sledgehammer. What does this have to do with computer storage? Plenty. And it all starts with a new data-reshaping capability of LSI® DuraWrite™ technology. Keep the numbers in mind: 15 pounds of data in a 5-pound bag. Or three units of data in a space designed for one. It’s key. More on that later.
What does DuraWrite technology do again?
My blog Write Amplification (Part 2) talks about how DuraWrite data reduction technology can make more space for over provisioning. That translates into faster data transfers, longer flash memory endurance and lower power draw. What it has NOT done is increase the space available for data storage.
DuraWrite Virtual Capacity is like the Dr. Who TARDIS
Excuse my Dr. Who phone booth reference, but for those who know the TARDIS, it provides a great analogy. For those unfamiliar with the TARDIS, it is a fictional time machine that looks like a British police call box. It is very small by external appearances, but inside it is vast in its carrying capacity, taking occupants on an odyssey through time.
DuraWrite Virtual Capacity (DVC) is a new feature of our SandForce® SSD controllers, and it’s a bit like the TARDIS. While there is no time travel involved, it does provide a lot more than can be seen. DVC takes advantage of data entropy (randomness) as data is written to the SSD. Some people like to think of it as data compression. Whatever you call it, the end result is the same—less data is written to the flash than what is written from the host. DuraWrite technology alone will increase the over provisioning of an SSD, but DVC increases the user data storage available in an SSD (not the over provisioning).
How much more space can it add?
The efficiency of DVC is inversely related to the entropy. High entropy data like JPEG, encrypted and similar compressed files do little to increase data capacity. In contrast, files like Microsoft OfficeOutlook® PST, Oracle® databases, EXE, and DLL (operating system) files have much lower entropy and can increase usable storage space on the order of two to three times for the same physical flash memory. Yes, I said two to three times. Better still, that translates into a two to three times reduction in the net cost of the flash storage. Again, no typo: two to three times more affordable. Since most enterprise deployments of flash memory are limited by the cost per GB of the flash, this kind of advance has the potential to further accelerate flash memory deployments in the enterprise.
Why hasn’t this been done in the past?
Think TARDIS again. Step into the booth, and take a joy ride through space and time. It happens on-the-fly. Simple … but, only in a fictional world. With on-the-fly data reduction and compression, the process is filled with complexities. The biggest problem is most operating systems don’t understand that the maximum capacity of a primary storage device (hard disk or solid state drive) can increase or decrease over time. However, open source operating systems can address the issue with customized drivers.
The other problem is any storage device that includes data reduction or compression must use a variable mapping table to track the location of the data on the device once data is reduced. Hard disk drives (HDDs) do not require any kind of mapping table because the operating system can write new data over old data. However, the lack of a mapping table prevents HDDs from supporting an on-the-fly data reduction and compression system.
All solid state drives (SSDs) using NAND flash feature a basic mapping table, typically called the flash translation layer (FTL). This mapping table is required because NAND flash memory pages cannot be rewritten directly, but must first be erased in larger blocks. The SSD controller needs to relocate valid data while the old data gets erased. This process, called garbage collection, uses the mapping table. However, the data reduction and compression system requires a mapping table that is variable in size. Most SSDs lack that capability, but not those using a SandForce controller, making SSDs with SandForce controllers perfectly suited to the job.
What use cases can be applied to DVC?
DVC can be used to increase usable data storage space or provide more cache capacity flexibility by two to three times. To create more usable data storage space, the operating system must be altered with new primary storage device drivers for it to understand the drive’s maximum capacity, which can fluctuate over time based on how much the data is reduced or compressed.
To support greater cache capacity flexibility, a host controller would manage the flash memory directly as a cache. The controller would isolate the flash memory capacity from the host so the operating system does not even see it. The dynamic cache capacity would increase cache performance at a lower price per GB depending upon the entropy of the data. The LSI Nytro product line and some SandForce Driven® program SSDs already support both of these use cases.
When will this appear in my personal computer?
While DVC is already being deployed and evaluated in enterprise datacenters around the world, the use in personal computers will take a bit longer due to the need to have the operating system changed with new storage device drivers that understand the fluctuating maximum capacity.
When these operating system changes come together, you will not need that sledgehammer to pack more data into your TARDIS (SSD). Now that’s a space odyssey to write home about.
Where did my email go?
This week I was dragged into to the virtualized cloud kicking and screaming … well, sort of. LSI has moved me, and all my co-workers, from my nice, safe local Exchange server to one in the amorphous, mysterious cloud. Scary. My IT department says the new cloud email is a great thing. They are now promising me unlimited email storage. Long gone will be the days of harrowing emails telling me I am approaching my storage limit and soon will be unable to send new email.
With cloud email, I can keep everything forever! I am not quite sure that saving mountains of email will be a good thing :-). Other than having to redirect my tablet and smartphone to this new service, update my webmail bookmark and empty my email inbox, there was not much I had to do. So far, so good. I have not experienced any challenges or performance issues. A key reason is flash storage.
To be sure, virtualization is a great tool for improving physical server utilization and flexibility as well as reducing power, cooling and datacenter footprint costs. That’s why the use of virtualization for email, databases and desktops is growing dramatically. But virtualized servers are only as effective as the storage performance that supports them. If, as a datacenter manager, your clients cannot access their application data quickly or boot their virtual desktop in a reasonable time, your company’s productivity and client satisfaction can drop dramatically.
Today, applications most likely run on virtualized servers. The upside of server virtualization is that a company can improve server utilization and run more applications on fewer physical servers. This can reduce power consumption, make more efficient use of datacenter floor space and make it easier to configure servers and deploy applications. The cloud also helps streamline application development, allowing companies to more efficiently and cost effectively test software applications across a broad set of configurations and operating systems.
A heated dispute – storage contention
Once application testing is complete, a virtual server’s configuration and image can be put on a virtual shelf until they are needed again, freeing up memory, processing, storage and other resources on the physical server for new virtual servers with just a few keystrokes. But with virtualization and the cloud there can be downsides, like slow performance – especially storage performance.
When a number of virtual servers are all using the same physical storage, there can be infighting for storage capacity, generally known as storage contention. These internecine battles can slow application response to a frustrating glacial pace and lead to issues like VDI Boot and Login Storm that can extend the time it takes for users to login to tens of minutes.
Flash helps alleviate slowdowns in storage performance
Here is where flash comes to the rescue. New flash storage solutions are being deployed to help improve virtualized storage performance and alleviate productivity-sapping slowdowns caused by VDI Boot and Login Storm — the crush of end users booting up or logging in within a small window that overwhelms the server with data requests and degrades response times. Flash can be used as primary storage inside servers running virtual machines to dramatically speed storage response time. Flash can also be deployed as an intelligent cache for DAS- or SAN-connected storage and even as an external shared storage pool.
It’s clear that virtualization will require higher storage performance and better, more cost-effective ways to deploy flash storage. But how much flash you need depends on your particular virtualization challenge, configuration and of course budget: while flash storage is extremely fast, it is costlier than disk-based storage. So choosing the right storage acceleration solution – one is LSI® Nytro™ Application Acceleration – can be as important as choosing the right cloud provider for your company’s email.
While my email is now stored in the cloud in Timbuktu, I know the flash storage solutions in that datacenter help keep my mail quickly accessible 24/7 whether I access it from my computer, tablet or smartphone, giving my productivity a boost. I can be assured that every action item I am sent will quickly make it to my inbox and be added to my ever-growing to-do list. Now my next big challenge is to improve my own response performance to those email requests!
August was always an exciting time at my childhood home. We were excited that was school was starting in September and mom was relieved that summer was coming to an end. I remember the annual trips to the local department stores to buy school clothes every year. It was always exciting to pick out a new school clothing and a new winter coat. With only a few stores to choose from, many of us wore similar clothes and coats when classes started.
As consumers, we have far more fashion and store options today. There are specialty stores at the mall, big box outlets, membership stores and specialty online portals. With so many more clothing designers than in years past, retailers are also inundated with fashion choices. The question becomes, “how does the fashion chain – from textile suppliers and clothing manufacturers to the retailers themselves – choose what to carry?”
They all rely on big data to make critical decisions. Let’s go to the start of the chain: the textile manufacturer. It may analyze previous years’ orders, competitive intelligence, purchasing trend data, and raw material and manufacturing costs. While tracking analytics on one data source is relatively easy, capturing and analyzing multiple data sources can be a tremendous challenge – a point underscored in a 2012 research report from Gartner. In its analysis, Gartner found that big data processing challenges don’t come from analysis or a single data set or source but rather from the complexity of interaction between two or more data sets.
“When combining large assets and new asset types, how they relate to each other becomes more complex,” the Gartner report explains. “As more assets are combined, the tempo of record creation and the qualification of the data within use cases becomes more complex.”
The next link is the clothing companies that create the fashion. They have a much more complex job, using big data to analyze fashion trends and improve their decision-making. Information such as historical sales, weather predictions, demographic data and economic details help them chose the right colors, sizes and price points for the clothing they make.
Swim Suits and Snow Parkas
This is where we, as consumers, come into the picture. Just as I did many years ago, people still shop for school and winter clothing this time of year. The clothes on the racks at our favorite retailer or from an online catalogue were chosen and ordered 6-9 months ago. Take Kohl’s. The nationwide retailer uses a blend of geographic weather prediction data sources to know where to best sell those snow parkas versus swim suits, economic and competitive data to price it right, demographic data sources to better predict the required sizes and customer demand, and market trends data sources to better forecast the colors and styles that will sell best. The more accurately Kohl’s buyers can predict consumer behavior using big data, the less the retailer will need to discount overstock, and the higher its sales and profit.
As I stated in my previous blog posts, the Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. As with flu strain and weather predictions, the more data you can quickly and efficiently analyze, the more accurate your prediction. When it comes to weather and flu vaccines, these predictions can help save lives, but in the fashion industry it is all about improving the bottom line.
Whether in fashion, medical, weather or other fields , the use of Hadoop for high levels of speed and accuracy in big data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.
Part three of this three-part series continues to examine some of the diverse and potentially life-saving uses of big data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize big data application performance.
Every year I diligently get in line for my annual flu (or more technically accurate “seasonal influenza”) shot. I’m not particularly fond of needles, but I have seen what the flu can do and the how many die each year from this seasonal virus.
When you get the flu shot – or, now, the nasal mist – you and I are trusting a lot of people that what you are taking will actually help protect you. According to the CDC (Centers for Disease Control and Prevention), there are 3 three strains, (A, B &C Antigenic) of influenza virus and of those three types, two cause the seasonal epidemics we suffer through each year.
Not to get too technical, but I learned that the A strain is further segregated by 2 proteins and are given code names like H1N1, H3N2 and H5N1. They can even be updated by year if there is a change in them. An example of this was in 2009, when the H1N1 became the 2009 H1N1. So where we may just call it H1N1, the World Health Organization has a whole taxonomy to describe a seasonal influenza strain.
This taxonomy includes:
As you can see, it can really get complicated quickly. If you would like to go deeper, you can read more about this here. While much of this information seems pretty arcane to the lay reader, you quickly can see that the sheer volume of information collected, stored and analyzed to combat seasonal influenza is a great example of big data.
In the US, once the CDC sifts through this data – using big data analytics tools – it uses its findings to determine what strains might affect the US and build a flu shot to combat those strains. During the 2012/2013 season, the predominant virus was Influenza A (H3N2), though some influenza B viruses contained a dash of influenza A (H1N1) pdm09 (pH1N1). (See the full report here.)
In addition to identifying dominant viruses, the CDC also uses big data to track the spread and potential effect on the population. Reviewing information from prior outbreaks, population data, and even weather patterns, the CDC uses big data analytics to quickly estimate and attempt to determine where viruses might hit first, hardest and longest so that a targeted vaccine can be produced in sufficient quantities, in the required timeframe and even for the right geography. The faster and more accurately this can be done, the more people can get this potentially life saving vaccine before the virus travels to their area.
As I stated in my previous blog post, the Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. As with weather predictions, the more data you can quickly and efficiently analyze, the greater the likelihood of an accurate prediction. When it comes to weather and flu vaccines, these predictions can help save lives. In my final blog post in this series, I will explore how big data helps the fashion industry.
Whether in medical, weather or other fields that leverage big data technologies, the use of Hadoop for high levels of speed and accuracy in big data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.
Part two of this three-part series continues to examine some of the diverse and potentially life-saving uses of big data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize big data application performance.
We all watch the local weather and wonder how forecasters predict (or in some cases mis-predict) the future of weather. While they may not all agree on the forecast, they do agree that the more current and historical data you have, the better your ability to predict what might happen over the next hours, days and weeks.
A term used to describe this growing amount of information is Big Data, and more and more of it leverages Hadoop, a flexible architecture that provides the analysis tools and scalability required to comb through and utilize all available data. When recently talking to a US-based meteorologist (the technical name for a degreed weather forecaster), I learned that meteorologists rely on many different weather models from various sources to help create their forecasts.
Weather spawns downpour of Big Data
These models collect massive amounts of weather information from around the world. Using this information, computers then run billions of calculations to mimic the motion of weather patterns in the Earth’s dynamic atmosphere and produce forecasts for any given location over time. It was interesting to learn that not all weather models are equal.
While weather modeling websites worldwide collect this atmospheric data and provide it to meteorologists, the European community is seen as having the most accurate information. When I asked why, I learned that European weather modeling sites have some of the fastest computer hardware and technology, enabling them to analyze more data faster, which produces better overall forecasts. The US weather professional I spoke with tends to use these European sites as part of his analysis, and when European models conflict with those from US sites, he often leans toward the European data.
His use of the European weather modeling sites points to the value of fast, accurate analysis of Big Data. It also underscores the implications of vast amounts of data overwhelming the ability of the compute and storage resources available to process it. An accurate and timely weather forecast is critical and a bad or missed forecast can have terrible and even deadly consequences.
A case in point: Hurricane Sandy
In this article on Hurricane Sandy forecast speed and accuracy, you can see how removing just one source of data can dramatically reduce the accuracy of predicting a critical event such as where a hurricane will make landfall. To be sure, the more data you can store and the faster you can process it for analysis, the greater your potential competitive advantage, even in the vaunted halls of meteorological analysis and prediction.
The Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. This gets interesting as you think about and explore the ripple effect of accurate or inaccurate forecasting in many areas. In my next blog post I will explore one of those – flu vaccines.
Whether in meteorology or other fields that leverage Big Data technologies, the use of Hadoop for high levels of speed and accuracy in Big Data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.
This three-part series examines some of the diverse uses of Big Data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize Big Data application performance.
Tags: application accleration, big data, European weather modeling, flash, flash storage, Hadoop, Hurricane Sandy, meterology, Nytro, processing performance, storage performance, weather modeling
Big data and Hadoop are all about exploiting new value and opportunities with data. In financial trading, business and some areas of science, it’s all about being fastest or first to take advantage of the data. The bigger the data sets, the smarter the analytics. The next competitive edge with big data comes when you layer in flash acceleration. The challenge is scaling performance in Hadoop clusters.
The most cost-effective option emerging for breaking through disk-to-I/O bottlenecks to scale performance is to use high-performance read/write flash cache acceleration cards for caching. This is essentially a way to get more work for less cost, by bringing data closer to the processing. The LSI® Nytro™ product has been shown during testing to improve the time it takes to complete Hadoop software framework jobs up to a 33%.
Flash cache cards increase Hadoop application performance
Combining flash cache acceleration cards with Hadoop software is a big opportunity for end users and suppliers. LSI estimates that less than 10% of Hadoop software installations today incorporate flash acceleration1. This will grow rapidly as companies see the increased productivity and ROI of flash to accelerate their systems. And Hadoop software adoption is also growing fast. IDC predicts a CAGR of as much as 60% by 20162. Drivers include IT security, e-commerce, fraud detection and mobile data user management. Gartner predicts that Hadoop software will be in two-thirds of advanced analytics products by 20153. Many thousands of Hadoop software clusters are already deployed.
Where flash makes the most immediate sense is with those who have smaller clusters doing lots of in-place batch processing. Hadoop is purpose-built for analyzing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Hadoop enables scaling that allows an unprecedented volume of data to be analyzed quickly and cost-effectively on clusters of commodity servers. Speed gains are about data proximity. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express® (PCIe) bus.
Combining the best of flash and HDDs to drive higher performance and storage capacity
PCIe flash cache cards are now available with multiple terabytes of NAND flash storage, which substantially increases the hit rate. We offer a solution with both onboard flash modules and Serial-Attached SCSI (SAS) interfaces to enable high-performance direct-attached storage (DAS) configurations consisting of solid state and hard disk drive storage. This couples the low-latency performance benefits of flash with the capacity and cost-per-gigabyte advantages of HDDs.
To keep the processor close to the data, Hadoop uses servers with DAS. And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM). An additional benefit: Smart implementation of Hadoop and flash components can reduce the overall server footprint and simplify scaling, with some solutions enabling up to 128 devices to share a very high bandwidth interface. Most commodity servers provide 8 or less SATA ports for disks, reducing expandability.
Hadoop is great, but flash-accelerated Hadoop is best. It’s an effective way, as you work to extract full value from big data, to secure a competitive edge.
The term global warming can be very polarizing in a conversation and both sides of the argument have mountains of material that support or discredit the overall situation. The most devout believers in global warming point to the average temperature increases in the Earth’s atmosphere over the last 100+ years. They maintain the rise is primarily caused by increased greenhouse gases from humans burning fossil fuels and deforestation.
The opposition generally agrees with the measured increase in temperature over that time, but claims that increase is part of a natural cycle of the planet and not something humans can significantly impact one way or another. The US Energy Information Administration estimates that 90% of world’s marketed energy consumption is from non-renewable energy sources like fossil fuels. Our internet-driven lives run through datacenters that are well-known to consume large quantities of power. No matter which side of the global warming argument you support, most people agree that wasting power is not a good long-term position. Therefore, if the power consumed by datacenters can be reduced, especially as we live in an increasingly digitized world, this would benefit all mankind.
When we look at the most power-hungry components of a datacenter, we find mainly server and storage systems. However, people sometimes forget that those systems require cooling to counteract the heat generated. But the cooling itself consumes even more energy. So anything that can store data more efficiently and quickly will reduce both the initial energy consumption and the energy to cool those systems. As datacenters demand faster data storage, they are shifting to solid state drives (SSDs). SSDs generally provide higher performance per watt of power consumed over hard disk drives, but there is still more that can be done.
Reducing data to help turn down the heat
The good news is that there’s a way to reduce the amount of data that reaches the flash memory of the SSD. The unique DuraWrite™ technology found in all LSI® SandForce® flash controllers reduces the amount of data written to the flash memory to cut the time it takes to complete the writes and therefore reduce power consumption, below levels of other SSD technologies. That, in turn, reduces the cooling needed to further reduce overall power consumption. Now this data reduction is “loss-less,” meaning 100% of what is saved is returned to the host, unlike MPEG, JPEG, and MP3 files, which tolerate some amount of data loss to reduce file sizes.
Today you can find many datacenters already using SandForce Driven SSDs and LSI Nytro™ application acceleration products (which use DuraWrite technology as well). When we start to see datacenters deploying these flash storage products by the millions, you will certainly be able to measure the reduction in power consumed by datacenters. Unfortunately, LSI will not be able to claim it stopped global warming, but at least we, and our customers, can say we did something to help defer the end result.