Big data is the buzzword of the year, and it’s generating a lot of attention. Countless articles fervently repeat the words “variety, velocity and volume,” citing click streams, RFID tags, email, surveillance cameras, Twitter® feeds, Facebook® posts, Flickr® images, blog musings, YouTube® videos, cellular texting, healthcare monitoring … (gasps for air). We have become a society that sweats buckets of data every day (the latest estimates are approximately 34 GB per person every 24 hours), and businesses are scrambling to capture all this information to learn more about us.

Save every scrap of data!
“Save all your data” has become the new business mantra, because data – no matter how seemingly meaningless it appears – contains information, and information provides insight, and improved insight makes for better decision-making, and better decision-making leads to a more efficient and profitable business.

Okay, so we get why we save data, but if the electronic bit bucket costs become prohibitive, big data could turn into its own worst enemy, undermining the value of mining data.  While Hadoop® software is an excellent (and cost-free) tool for storing and analyzing data, most organizations use a multitude of applications in conjunction with Hadoop to create a system for data ingest, analytics, data cleansing and record management. Several Hadoop vendors (Cloudera, MapR, Hortonworks, Intel, IBM, Pivotal) offer bundled software packages that ease integration and installation of these applications.

Installing a Hadoop cluster to manage big data can be a chore
With the demand for data scientists growing, the challenge can become finding the right talent to help build and manage a big data infrastructure.  A case in point: Installing a Hadoop cluster involves more than just installing the Hadoop software. Here is the sequence of steps:

  1. Install the hardware, disks, cables.
  2. Install the operating system.
  3. Optimize the file system and operating system (OS) parameters (e.g. open file limits, virtual memory).
  4. Configure and optimize the network and switches.
  5. Plan node management (for Hadoop 1.x this would be NameNode, Secondary NameNode, JobTracker, ZooKeeper, etc.).
  6. Install Hadoop across all the nodes. Configure each node according to its planned role.
  7. Configure high availability (HA) (when required).
  8. Configure security (e.g. Kerberos, Secure Shell [ssh]).
  9. Apply optimizations (I have several years’ experience in Hadoop optimization, so I can say with some authority that this is not a job to be taken lightly. The benefits of a well-optimized cluster are incredible, but it can be a challenge to balance the resources correctly without adding undue system pressure elsewhere.)
  10. Install and integrate additional software and connectors (e.g. to connect to a data warehousing system, input streams or database management system [DBMS] servers).
  11. Test the system.
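
As a small illustration of step 3, here is a minimal Python sketch that reads the open file limit of the current process on a Linux or other Unix node and compares it with a target. The threshold and the function name are assumptions made for this example, not recommended values; the right setting depends on your distribution and workload.

```python
import resource

# Hypothetical target for illustration only; not a value from this article.
ASSUMED_MIN_OPEN_FILES = 32768


def check_open_file_limit(minimum=ASSUMED_MIN_OPEN_FILES):
    """Report the soft/hard open-file limits and whether the soft limit meets `minimum`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the minimum.
    unlimited = soft == resource.RLIM_INFINITY
    return {"soft": soft, "hard": hard, "ok": unlimited or soft >= minimum}


if __name__ == "__main__":
    print(check_open_file_limit())
```

Running a check like this on every node before installing Hadoop is one way to catch a node that was missed during OS tuning.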

Setup, from bare bones to a simple 15-node cluster, can take weeks to months including planning, research, installation and integration. It’s no small job.

Appliances simplify Hadoop cluster deployments
Enter appliances: low-cost, pre-validated, easy-to-deploy “bricks.” According to a Gartner forecast (Forecast: Data Center Hardware Spending to Support Big Data Projects, Worldwide 2013), appliance spending for big data projects will grow from 0.9% of hardware spending in 2012 to 9.3% by 2017. I have found myself inside a swirl of new big data appliance projects all designed to provide highly integrated systems with easy support and fully tested integration. An appliance is a great turnkey solution for companies that can’t (or don’t wish to) employ a hardware and software installation team: Simply pick up the box from the shipping area, unpack it and start analyzing data within minutes. In addition, many companies are just beginning to dabble in Hadoop, and appliances can be an easy, cost-effective way to demonstrate the value of Hadoop before making a larger investment.

While Hadoop is commonplace in the big data infrastructure, the use models can be quite varied. I’ve heard my fair share of highly connected big data engineers attempt to identify core categories for Hadoop deployments, and they generally fall into one of four categories:

  1. Business intelligence, querying, reporting, searching – such as filtering, indexing, trend analysis, search optimization – and good old-fashioned information retrieval.
  2. Higher performance for common data management operations including log storage, data storage and archiving, extract, transform, load (ETL) processing and data conversions.
  3. Non-database applications such as image processing, data sequencing, web crawling and workflow processing.
  4. Data mining and analytical applications including social network/sentiment analysis, profile matching, machine learning, personalization and recommendation analysis, ad optimization and behavioral analysis.

Finding the right appliance for you
While appliances lower the barrier to entry to Hadoop clusters, their designs and costs are as varied as their use cases. Some appliances build in the flexibility of cloud services, while others focus on integration of application components and reducing service level agreements (SLAs). Still others focus primarily on low-cost storage. And some appliances are just hardware: although they are validated designs, they still require a separate software agreement and installation via a third-party vendor.

Pricing is usually quoted either by capacity ($/TB) or per node or rack, depending on the vendor and product. Licensing can significantly increase overall costs, with annual maintenance costs (software subscription and support) and license renewals adding to the cost of doing business. The good news is that, with so many appliances to choose from, any organization can find one that enables it to design a cluster that fits its budget, operating costs and value expectations.
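
To compare quotes made on different bases, it can help to reduce each one to a multi-year total and an effective $/TB. The Python sketch below does that; every price, capacity and fee in it is a made-up figure for illustration, and the function names are hypothetical.

```python
def total_cost(upfront, annual_fees, years):
    """Upfront price plus recurring maintenance/licensing over `years`."""
    return upfront + annual_fees * years


def effective_cost_per_tb(total, usable_tb):
    """Normalize a multi-year total to dollars per usable terabyte."""
    return total / usable_tb


# All figures below are hypothetical, for illustration only.
USABLE_TB = 200
YEARS = 3

# Quote A: priced by capacity ($/TB).
quote_a = total_cost(upfront=1000 * USABLE_TB, annual_fees=20000, years=YEARS)

# Quote B: priced per node, for a 15-node cluster of the same usable capacity.
quote_b = total_cost(upfront=12000 * 15, annual_fees=30000, years=YEARS)

if __name__ == "__main__":
    for name, total in (("A (per TB)", quote_a), ("B (per node)", quote_b)):
        print(name, total, effective_cost_per_tb(total, USABLE_TB))
```

In this invented case the per-node quote is cheaper upfront but costs more over three years once the annual fees are included, which is the kind of effect that recurring licensing can have.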


Big data and Hadoop are all about exploiting new value and opportunities with data. In financial trading, business and some areas of science, it’s all about being fastest or first to take advantage of the data. The bigger the data sets, the smarter the analytics. The next competitive edge with big data comes when you layer in flash acceleration. The challenge is scaling performance in Hadoop clusters.

The most cost-effective option emerging for breaking through disk I/O bottlenecks to scale performance is to use high-performance read/write flash cache acceleration cards. This is essentially a way to get more work for less cost, by bringing data closer to the processing. In testing, the LSI® Nytro™ product has been shown to reduce the time it takes to complete Hadoop software framework jobs by up to 33%.
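
To put that figure in throughput terms, the short Python sketch below converts a reduction in job completion time into a speedup factor. It assumes the 33% refers to a reduction in completion time, and the function name is made up for this example.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional reduction in run time into a speedup factor.

    A job that takes (1 - reduction) of its original time finishes
    1 / (1 - reduction) times as fast.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Best case quoted above, read as 33% less time per job.
    print(round(speedup_from_time_reduction(0.33), 2))
```

Under that reading, a 33% shorter run time lets the same cluster complete roughly one and a half times as many jobs in the same period.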

Flash cache cards increase Hadoop application performance
Combining flash cache acceleration cards with Hadoop software is a big opportunity for end users and suppliers. LSI estimates that less than 10% of Hadoop software installations today incorporate flash acceleration [1]. This will grow rapidly as companies see the increased productivity and ROI of flash to accelerate their systems. And Hadoop software adoption is also growing fast. IDC predicts a CAGR of as much as 60% by 2016 [2]. Drivers include IT security, e-commerce, fraud detection and mobile data user management. Gartner predicts that Hadoop software will be in two-thirds of advanced analytics products by 2015 [3]. Many thousands of Hadoop software clusters are already deployed.

Where flash makes the most immediate sense is with those who have smaller clusters doing lots of in-place batch processing. Hadoop is purpose-built for analyzing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Hadoop enables scaling that allows an unprecedented volume of data to be analyzed quickly and cost-effectively on clusters of commodity servers. Speed gains are about data proximity. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express® (PCIe) bus.

Combining the best of flash and HDDs to drive higher performance and storage capacity
PCIe flash cache cards are now available with multiple terabytes of NAND flash storage, which substantially increases the hit rate. We offer a solution with both onboard flash modules and Serial-Attached SCSI (SAS) interfaces to enable high-performance direct-attached storage (DAS) configurations consisting of solid state and hard disk drive storage. This couples the low-latency performance benefits of flash with the capacity and cost-per-gigabyte advantages of HDDs.
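
The reason hit rate matters can be seen with a simple weighted average: each read is served either from flash (a hit) or from the HDD (a miss). The Python sketch below is a simplified model under that assumption; the latency figures are hypothetical placeholders, not measurements of any product.

```python
def effective_latency_us(hit_rate, flash_us, hdd_us):
    """Average access latency when a fraction `hit_rate` of reads hit the flash cache."""
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * flash_us + (1 - hit_rate) * hdd_us


# Hypothetical latencies in microseconds, for illustration only.
ASSUMED_FLASH_US = 100
ASSUMED_HDD_US = 10000

if __name__ == "__main__":
    for hit_rate in (0.0, 0.5, 0.9):
        print(hit_rate, effective_latency_us(hit_rate, ASSUMED_FLASH_US, ASSUMED_HDD_US))
```

With these placeholder numbers, average latency stays dominated by the HDD until most reads hit the cache, which is why a larger flash capacity (and so a higher hit rate) pays off.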

To keep the processor close to the data, Hadoop uses servers with DAS. And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM). An additional benefit: Smart implementation of Hadoop and flash components can reduce the overall server footprint and simplify scaling, with some solutions enabling up to 128 devices to share a very high bandwidth interface. Most commodity servers provide eight or fewer SATA ports for disks, reducing expandability.

Hadoop is great, but flash-accelerated Hadoop is best. It’s an effective way, as you work to extract full value from big data, to secure a competitive edge.

  1. Based on internal LSI research.
  2. “IDC Worldwide Hadoop-MapReduce Ecosystem Software 2012-2016 Forecast,” May 2012.
  3. “Gartner Predicts 2013: Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources,” December 2012.
