
Big data, it’s the buzzword of the year, and it’s generating a lot of attention. An incalculable number of articles fervently repeat the words “variety, velocity and volume,” citing click streams, RFID tags, email, surveillance cameras, Twitter® feeds, Facebook® posts, Flickr® images, blog musings, YouTube® videos, cellular texting, healthcare monitoring … (gasps for air). We have become a society that sweats buckets of data every day (the latest estimates put it at approximately 34GB per person every 24 hours), and businesses are scrambling to capture all this information to learn more about us.

Save every scrap of data!
“Save all your data” has become the new business mantra, because data, no matter how meaningless it appears, contains information, and information provides insight, and improved insight makes for better decision-making, and better decision-making leads to a more efficient and profitable business.

Okay, so we get why we save data. But if the cost of the electronic bit bucket becomes prohibitive, big data could turn into its own worst enemy, undermining the value of mining the data. While Hadoop® software is an excellent (and cost-free) tool for storing and analyzing data, most organizations use a multitude of applications in conjunction with Hadoop to create a system for data ingest, analytics, data cleansing and record management. Several Hadoop vendors (Cloudera, MapR, Hortonworks, Intel, IBM, Pivotal) offer bundled software packages that ease the integration and installation of these applications.

Installing a Hadoop cluster to manage big data can be a chore
With the demand for data scientists growing, the challenge becomes finding the right talent to help build and manage a big data infrastructure. A case in point: installing a Hadoop cluster involves much more than just installing the Hadoop software. Here is the sequence of steps:

  1. Install the hardware, disks, cables.
  2. Install the operating system.
  3. Optimize the file system and operating system (OS) parameters (e.g., open file limits, virtual memory).
  4. Configure and optimize the network and switches.
  5. Plan node management (for Hadoop 1.x this would be the NameNode, Secondary NameNode, JobTracker, ZooKeeper, etc.).
  6. Install Hadoop across all the nodes. Configure each node according to its planned role.
  7. Configure high availability (HA) (when required).
  8. Configure security (e.g., Kerberos, Secure Shell [SSH]).
  9. Apply optimizations (I have several years’ experience in Hadoop optimization, so I can say with some authority that this is not a job to be taken lightly. The benefits of a well-optimized cluster are incredible, but it can be a challenge to balance the resources correctly without adding undue system pressure elsewhere.)
  10. Install and integrate additional software and connectors (e.g., to connect to a data warehousing system, input streams or database management system [DBMS] servers).
  11. Test the system.

Setup, from bare bones to a simple 15-node cluster, can take weeks to months, including planning, research, installation and integration. It’s no small job.
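
To make step 11 concrete, here is a minimal smoke test, assuming the Hadoop 2.x client libraries are on the classpath and using a placeholder NameNode address. It simply connects to HDFS, creates and lists a directory, and cleans up after itself; think of it as a sketch of a first sanity check, not a substitute for full cluster validation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal HDFS connectivity check to run after cluster installation. */
public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; substitute your own cluster's value.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Create a scratch directory, list the root namespace, then clean up.
        Path testDir = new Path("/tmp/cluster-smoke-test");
        fs.mkdirs(testDir);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + (status.isDirectory() ? " (dir)" : " (file)"));
        }
        fs.delete(testDir, true);
        fs.close();
    }
}
```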

Appliances simplify Hadoop cluster deployments
Enter appliances: low-cost, pre-validated, easy-to-deploy “bricks.” According to a Gartner forecast (Forecast: Data Center Hardware Spending to Support Big Data Projects, Worldwide 2013), appliance spending for big data projects will grow from 0.9% of hardware spending in 2012 to 9.3% by 2017. I have found myself inside a swirl of new big data appliance projects, all designed to provide highly integrated systems with easy support and fully tested integration. An appliance is a great turnkey solution for companies that can’t (or don’t wish to) employ a hardware and software installation team: simply pick up the box from the shipping area, unpack it and start analyzing data within minutes. In addition, many companies are just beginning to dabble in Hadoop, and appliances can be an easy, cost-effective way to demonstrate the value of Hadoop before making a larger investment.

While Hadoop is commonplace in big data infrastructure, the use models can be quite varied. I’ve heard my fair share of highly connected big data engineers attempt to categorize Hadoop deployments, and the deployments generally fall into one of four groups:

  1. Business intelligence, querying, reporting, searching – such as filtering, indexing, trend analysis, search optimization – and good old-fashioned information retrieval.
  2. Higher performance for common data management operations, including log storage, data storage and archiving, extract, transform and load (ETL) processing and data conversions (a minimal MapReduce sketch of this kind of log processing follows the list).
  3. Non-database applications such as image processing, data sequencing, web crawling and workflow processing.
  4. Data mining and analytical applications including social network/sentiment analysis, profile matching, machine learning, personalization and recommendation analysis, ad optimization and behavioral analysis.
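
As a flavor of the second category, here is a minimal MapReduce sketch that counts log entries by severity level. The log format, class names and paths are illustrative assumptions; the job itself uses only the standard Hadoop MapReduce APIs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts occurrences of each log level (INFO, WARN, ERROR, ...) in raw log files. */
public class LogLevelCount {

    public static class LevelMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text level = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes a log format whose second whitespace-delimited field is the level.
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 1) {
                level.set(fields[1]);
                context.write(level, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log level count");
        job.setJarByClass(LogLevelCount.class);
        job.setMapperClass(LevelMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this would be submitted with the standard hadoop jar command, with input and output paths as placeholders for your own directories.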

Finding the right appliance for you
While appliances lower the barrier to entry for Hadoop clusters, their designs and costs are as varied as their use cases. Some appliances build in the flexibility of cloud services, while others focus on integrating application components and reducing service level agreements (SLAs). Still others focus primarily on low-cost storage. And while some appliances are hardware only (albeit validated designs), they still require a separate software agreement and installation via a third-party vendor.

Pricing is usually quoted either by capacity ($/TB) or per node or rack, depending on the vendor and product. Licensing can significantly increase overall costs, with annual maintenance (software subscription and support) and license renewals adding to the cost of doing business. The good news is that, with so many appliances to choose from, any organization can find one that lets it design a cluster to fit its budget, operating costs and value expectations.



Hadoop has grown from an identity-challenged adolescent, a budding technology unsure of which use cases to call its own, into a fairly mature young adult with its most recent release, Hadoop® 2.0. Apache™ Hadoop® was introduced in 2007 with the primary intent of providing MapReduce-based batch processing for big data. While the original Hadoop certainly made a big impact on how we use big data, it also had its limitations, chief among them:

  1. Batch processing compute resources were allocated strictly on a (static) slot basis. The number of slots per node was fixed by simple configuration math, and each slot was designated as either a map or a reduce resource. In addition, jobs were managed by the JobTracker, which had little knowledge of the resources actually available on the worker (TaskTracker) nodes. Map and reduce processes generally had very little overlap, and jobs were scheduled in batch mode. The collective upshot: inefficient use of memory and compute resources.
  2. The NameNode, which provides critical Hadoop Distributed File System (HDFS) namespace services, was the single point of failure (SPOF) for the entire cluster, Hadoop’s Achilles’ heel. While custom solutions were available to eliminate the failure point, they made clusters harder to manage and added cost.
  3. Scalability of a single cluster was limited, generally to around 4,000 nodes, because a single NameNode had to hold the entire file system namespace in memory.

YARN beefs up Hadoop for big data
Hadoop 2.0 overcomes these shortcomings. Apache’s newest software introduces YARN (Yet Another Resource Negotiator), a cluster resource manager that replaces the resource management side of the original MapReduce framework. YARN provides a better structure for running applications in Hadoop, making it more of a big data operating system. In the new framework, node resources are tracked by NodeManagers, and each application’s scheduling is negotiated by its ApplicationMaster. And instead of static map and reduce slots, resources are allocated dynamically as containers, bundles of cluster resources such as memory and CPU.
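
The sketch below, using the Hadoop 2.x YARN client API, illustrates the shift from slots to containers: an ApplicationMaster registers with the ResourceManager and asks for a container described by memory and virtual cores. The resource values are illustrative, and in practice this code runs inside an ApplicationMaster that the ResourceManager has launched, not as a standalone program.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

/**
 * Sketch of an ApplicationMaster asking the ResourceManager for a container.
 * In Hadoop 2.x, resources are expressed as containers (memory + virtual cores)
 * rather than fixed map/reduce slots.
 */
public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Register this ApplicationMaster with the ResourceManager
        // (a real AM reports its own host, RPC port and tracking URL).
        AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
        amClient.init(conf);
        amClient.start();
        amClient.registerApplicationMaster("", 0, "");

        // Ask for a container with 2 GB of memory and 1 virtual core (illustrative values).
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // A real ApplicationMaster would call allocate() in a heartbeat loop,
        // launch work in the containers it is granted, and eventually unregister.
        amClient.allocate(0.0f);
    }
}
```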

While Hadoop still supports MapReduce, it now runs as just one application framework on top of YARN. Make no mistake: YARN is a game changer for Hadoop, allowing virtually any distributed application to work within the Hadoop architecture. Many applications have already made the move: HBase, Giraph, Storm and Tez, to name just a few. With YARN providing more of an operating system layer for the Hadoop architecture, the use cases are limitless. Going forward, Hadoop may very well lay the foundation for more than just analytical batch jobs, with greater scalability and lower-cost storage adding more oxygen for the growth of relational database management systems, data warehousing and cold storage.

Automated failover and almost limitless node scalability
With Hadoop 2.0 and the new HDFS 2 features, NameNode high availability with automated failover is a standard feature, all but guaranteeing uninterrupted service to the cluster. In addition, HDFS Federation, which partitions the namespace across multiple NameNodes, provides almost limitless node scalability.
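
From a client’s point of view, high availability boils down to addressing a logical nameservice instead of a single NameNode host. The sketch below sets the relevant HDFS HA properties programmatically for illustration; the logical name, host names and values are placeholders, and in a real deployment these settings live in hdfs-site.xml and core-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of a client-side configuration for an HA-enabled HDFS nameservice. */
public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Clients address the logical nameservice, not an individual NameNode.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // The failover proxy provider routes requests to whichever NameNode is active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client keeps working across a NameNode failover with no code changes.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Root exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}
```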

Other Hadoop 2.0 features include HDFS snapshots, which allow point-in-time recovery of data, and enhanced security features that help ensure regulatory compliance and authentication in multi-tenant clusters.
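
Snapshots can be taken from the command line (hdfs dfsadmin -allowSnapshot followed by hdfs dfs -createSnapshot) or through the FileSystem API. The sketch below shows the API route; the directory path and snapshot name are placeholders, and allowing snapshots on a directory requires administrator privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/** Sketch of taking a point-in-time HDFS snapshot with Hadoop 2.x. */
public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/data/warehouse"); // placeholder directory

        // Mark the directory as snapshottable (superuser only; equivalent to
        // "hdfs dfsadmin -allowSnapshot /data/warehouse").
        if (fs instanceof DistributedFileSystem) {
            ((DistributedFileSystem) fs).allowSnapshot(dir);
        }

        // Create a named, read-only, point-in-time image of the directory.
        Path snapshot = fs.createSnapshot(dir, "nightly-backup");
        System.out.println("Snapshot created at " + snapshot);

        fs.close();
    }
}
```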

The ability to run so many parallel applications on top of YARN has given rise to a wide range of data access patterns, from streaming sequential reads for typical batch operations to low-latency random access for interactive queries. To accommodate this new, dizzying array of patterns, evolving datacenter infrastructures for big data will need to take advantage of a variety of hardware, including spinning media, SSDs and various volatile and non-volatile memory architectures. Features such as HDFS-2832 (heterogeneous storage support) and HDFS-4949 (centralized cache management) will give users the benefit of non-homogeneous storage hierarchies, helping ensure the highest performance for applications such as real-time analytics or extract, transform and load (ETL) operations.

Hadoop 2.0 is easy to come by. Apache released its first general-availability version of Hadoop 2.0, called Hadoop 2.2, in mid-October, and within days Hortonworks released its Hortonworks Data Platform 2. Cloudera has been beta testing its CDH 5 version since November 2013, and MapR last week announced plans to release a YARN-based version in March.

Big data: more growth, greater efficiencies
The growing momentum around YARN and HDFS 2.0 promises to drive more growth and greater efficiencies in big data as more companies and open source projects build applications and toolsets that fuel more innovation. The broad availability of these tools will enable organizations of all sizes to derive deeper insight, enhance their competitiveness and efficiency and, ultimately, improve their profitability from the staggering amount of data available to them.
