Hadoop has grown from an identity-challenged adolescent, a budding technology unsure of which use cases to call its own, to a fairly mature young adult with its most recent release of Hadoop® 2.0. Apache™ Hadoop® was introduced in 2007 with the primary intent to provide MapReduce-based batch processing for big data. While the original Hadoop certainly has made a big impact on how we use big data, it also had its limitations, chief among them:
YARN beefs up Hadoop for big data
Hadoop 2.0 overcomes these shortcomings. Apache’s newest release introduces the workload manager YARN (Yet Another Resource Negotiator), which takes over resource management from the original MapReduce framework. YARN provides a better structure for running applications in Hadoop, making it more of a big data operating system. In the new framework, system resources are monitored by Node Managers and Application Masters. And instead of fixed slots, resources are allocated dynamically as containers: bundles of cluster resources such as memory and processing capacity.
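To make the slots-versus-containers difference concrete, here is a toy sketch in Python. It is not YARN's scheduler or API. The function names, the node size and the greedy first-fit policy are all assumptions chosen for illustration. It only shows why fixed map and reduce slots can sit idle while resource-sized containers let the same node take on more work.

```python
# Toy model only: names, sizes and the first-fit policy are illustrative
# assumptions, not YARN's actual scheduler or API.

def tasks_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1.x style: each slot type can only run its own task type."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def tasks_with_containers(node_memory_mb, node_vcores, requests):
    """Container style: grant each (memory_mb, vcores) request, in order,
    as long as the node still has enough of both resources left."""
    free_mem, free_cores = node_memory_mb, node_vcores
    granted = 0
    for mem, cores in requests:
        if mem <= free_mem and cores <= free_cores:
            free_mem -= mem
            free_cores -= cores
            granted += 1
    return granted


if __name__ == "__main__":
    # A hypothetical node with 16 GB of memory and 8 cores,
    # asked to run eight map tasks and no reduce tasks.
    print(tasks_with_slots(4, 4, pending_maps=8, pending_reduces=0))
    print(tasks_with_containers(16384, 8, [(2048, 1)] * 8))
```

With four map slots and four reduce slots, the map-only workload can use only half the node. With containers, all eight requests fit because nothing is reserved for a task type that is not running.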
While Hadoop still supports MapReduce, it’s now an add-on feature. Make no mistake: YARN is a game changer for Hadoop, allowing any distributed application to work within the Hadoop architecture. Many applications have already done this – HBase, Giraph, Storm and Tez just to name a few. With YARN providing more of an operating system layer for the Hadoop architecture, the use cases are limitless. Going forward, Hadoop may very well lay the foundation for more than just analytical batch jobs, enabling greater scalability and lower cost storage to add more oxygen for the growth of relational database management systems, data warehousing and cold storage.
Automated failover and almost limitless node scalability
With Hadoop 2.0 and the new HDFS 2 features, NameNode high availability with automated failover is a standard feature – almost guaranteeing uninterrupted service to the cluster. In addition, cluster Federation, a way of carving up the NameNode’s namespace, provides almost limitless node scalability.
Other Hadoop 2.0 features include HDFS snapshots that allow point-in-time recovery of data, and enhanced security features that help ensure government compliance and authentication in multi-tenant clusters.
The ability to run so many parallel applications on top of YARN has given rise to a wide range of application data access patterns, including streaming sequential access for typical batch operations and low-latency random access for interactive queries. To accommodate this new, dizzying array of patterns, evolving datacenter infrastructures for big data will need to take advantage of a variety of hardware, including spinning media, SSDs and various volatile and non-volatile memory architectures. Features such as HDFS-2832 (heterogeneous storage) and HDFS-4949 (centralized cache management) will give users the benefits of heterogeneous storage hierarchies to help ensure the highest performance for applications such as real-time analytics processing or extract, transform and load (ETL) operations.
Hadoop 2.0 is easy to come by. Apache released its first general-availability version of Hadoop 2.0, called Hadoop 2.2, in mid-October, and within days Hortonworks released its Hortonworks Data Platform 2. Cloudera has been beta testing its CDH 5 version since November 2013, and MapR last week announced plans to release a YARN-based version in March.
Big data: more growth, greater efficiencies
The growing momentum around YARN and HDFS 2.0 promises to drive more growth and greater efficiencies in big data as more companies and open source projects build applications and toolsets that fuel more innovation. The broad availability of these tools will enable organizations of all sizes to derive deeper insight, enhance their competitiveness and efficiency and, ultimately, improve their profitability from the staggering amount of data available to them.
Pushing your enterprise cluster solution to deliver the highest performance at the lowest cost is key in architecting scale-out datacenters. Administrators must expand their storage to keep pace with their compute power as capacity and processing demands grow.
Beyond price and capacity, storage resources must also deliver enough bandwidth to support these growing demands. Without enough I/O bandwidth, connected servers and users can bottleneck, requiring sophisticated storage tuning to maintain reasonable performance. By using direct attached storage (DAS) server architectures, IT administrators can reduce the complexities and performance latencies associated with storage area networks (SANs). Now, with LSI 12Gb/s SAS or MegaRAID® technology, or both, connected to 12Gb/s SAS expander-based storage enclosures, administrators can leverage DataBolt™ technology to clear I/O bandwidth bottlenecks. The result: better overall resource utilization, while preserving legacy drive investments. Typically, a slower end device would step down the entire 12Gb/s SAS storage subsystem to 6Gb/s SAS speeds. How does DataBolt technology overcome this? Well, without diving too deep into the nuts and bolts, intelligence in the expander buffers data and then transfers it out to the drives at 6Gb/s, matching the bandwidth between faster hosts and slower SAS or SATA devices.
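The bandwidth matching described above can be approximated with a little arithmetic. The Python sketch below is an idealized model, not a description of the expander's internals. The function name and the "one lane, N active drives" framing are assumptions, and it counts link line rates only, ignoring protocol overhead and the fact that a hard drive's media rate is well below its link rate.

```python
# Idealized line-rate model; the function and its parameters are
# illustrative assumptions, not part of any LSI tool or API.

def lane_rate_gbps(host_gbps, drive_gbps, active_drives, buffering):
    """Estimate the rate one host-side lane can sustain.

    Without buffering, the lane steps down to the slower device's speed.
    With buffering, the lane stays at host speed while the expander feeds
    several slower drives, so the lane is limited by whichever is smaller:
    the host rate or the combined rate of the active drives.
    """
    if active_drives <= 0:
        return 0.0
    if not buffering:
        return float(min(host_gbps, drive_gbps))
    return float(min(host_gbps, drive_gbps * active_drives))


if __name__ == "__main__":
    for buffering in (False, True):
        rate = lane_rate_gbps(12, 6, active_drives=2, buffering=buffering)
        print(f"buffering={buffering}: {rate} Gb/s per lane")
```

Under these assumptions, two busy 6Gb/s drives are enough to keep a 12Gb/s lane full when buffering is on, while a single drive still tops out at its own 6Gb/s.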
So for this demonstration at AIS, we are showcasing two Hadoop Distributed File System (HDFS) servers. Each server houses the newly shipping MegaRAID 9361-8i 12Gb/s SAS RAID controller connected to a drive enclosure featuring a 12Gb/s SAS expander and 32 6Gb/s SAS hard drives. One server has DataBolt enabled, while the other has it disabled.
For the benchmarks, we ran DFSIO, which simulates MapReduce workloads and is typically used to detect network performance bottlenecks and to tune hardware configurations and overall I/O performance.
The primary goal of the DFSIO benchmarks is to saturate storage arrays with random read workloads in order to ensure maximum performance of a cluster configuration. In our tests, MapReduce jobs completed faster in 12Gb/s mode, and overall throughput increased by 25%.
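For a rough sense of what a 25% throughput gain means for run time, the small Python sketch below converts a throughput increase into a time ratio. It assumes a job that is entirely I/O-bound and moves the same amount of data in both runs. That is a simplifying assumption, not something measured in the demonstration, and the function name is made up for illustration.

```python
# Back-of-the-envelope conversion; assumes a fully I/O-bound job moving
# a fixed amount of data. The function name is hypothetical.

def time_ratio(throughput_gain):
    """Return new_time / old_time for a fractional throughput gain.

    time = data / throughput, so with the same data the ratio is
    1 / (1 + gain).
    """
    if throughput_gain <= -1:
        raise ValueError("throughput gain must be greater than -100%")
    return 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    ratio = time_ratio(0.25)
    print(f"run time falls to {ratio:.0%} of the original")
```

In other words, a 25% throughput gain trims at most about a fifth off the I/O-bound portion of a job. Real jobs that also spend time on compute or network transfers would see a smaller improvement.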