Big data is the buzzword of the year, and it’s generating a lot of attention. An incalculable number of articles fervently repeat the words “variety, velocity and volume,” citing click streams, RFID tags, email, surveillance cameras, Twitter® feeds, Facebook® posts, Flickr® images, blog musings, YouTube® videos, cellular texting, healthcare monitoring … (gasps for air). We have become a society that sweats buckets of data every day (the latest estimates are approximately 34GB per person every 24 hours), and businesses are scrambling to capture all this information to learn more about us.

Save every scrap of data!
“Save all your data” has become the new business mantra because data – no matter how meaningless it may seem – contains information, information provides insight, improved insight makes for better decision-making, and better decision-making leads to a more efficient and profitable business.

Okay, so we get why we save data. But if the costs of the electronic bit bucket become prohibitive, big data could turn into its own worst enemy, undermining the value of mining data. While Hadoop® software is an excellent (and cost-free) tool for storing and analyzing data, most organizations use a multitude of applications in conjunction with Hadoop to create a system for data ingest, analytics, data cleansing and record management. Several Hadoop vendors (Cloudera, MapR, Hortonworks, Intel, IBM, Pivotal) offer bundled software packages that ease integration and installation of these applications.

Installing a Hadoop cluster to manage big data can be a chore
With demand for data scientists growing, the challenge becomes finding the right talent to help build and manage a big data infrastructure. A case in point: Installing a Hadoop cluster involves more than just installing the Hadoop software. Here is the sequence of steps:

  1. Install the hardware, disks, cables.
  2. Install the operating system.
  3. Optimize the file system and operating system (OS) parameters (e.g., open file limits, virtual memory; see the sketch after this list).
  4. Configure and optimize the network and switches.
  5. Plan node management (for Hadoop 1.x this would be the NameNode, Secondary NameNode, JobTracker, ZooKeeper, etc.).
  6. Install Hadoop across all the nodes. Configure each node according to its planned role.
  7. Configure high availability (HA) (when required).
  8. Configure security (e.g., Kerberos, Secure Shell [SSH]).
  9. Apply optimizations. (I have several years’ experience in Hadoop optimization, so I can say with some authority that this is not a job to be taken lightly. The benefits of a well-optimized cluster are incredible, but it can be a challenge to balance the resources correctly without adding undue system pressure elsewhere.)
  10. Install and integrate additional software and connectors (e.g., to connect to a data warehousing system, input streams or database management system [DBMS] servers).
  11. Test the system.
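
To make step 3 a little more concrete, here is a minimal pre-flight check sketch in Python. It assumes Linux nodes, and the thresholds (a 64,000 open-file limit and vm.swappiness of 10 or lower) are common rules of thumb rather than official Hadoop requirements; the script only reports what it finds.

    import resource

    # Illustrative node pre-flight check: report a few OS settings that commonly
    # matter on Hadoop nodes. Thresholds are assumptions, not official requirements.
    MIN_OPEN_FILES = 64000   # consider raising "ulimit -n" if the soft limit is below this
    MAX_SWAPPINESS = 10      # low swappiness helps keep JVM heaps in memory

    def check_open_files():
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        status = "OK" if soft >= MIN_OPEN_FILES else "consider raising the limit"
        print(f"open files: soft={soft} hard={hard} -> {status}")

    def check_swappiness():
        try:
            with open("/proc/sys/vm/swappiness") as f:
                value = int(f.read().strip())
        except OSError:
            print("vm.swappiness: not readable on this platform")
            return
        status = "OK" if value <= MAX_SWAPPINESS else "consider lowering it"
        print(f"vm.swappiness={value} -> {status}")

    if __name__ == "__main__":
        check_open_files()
        check_swappiness()

Running a check like this on every node before installing Hadoop catches the most common configuration misses early, when they are cheap to fix.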

Setup, from bare bones to a simple 15-node cluster, can take weeks to months, including planning, research, installation and integration. It’s no small job.

Appliances simplify Hadoop cluster deployments
Enter appliances: low-cost, pre-validated, easy-to-deploy “bricks.” According to a Gartner forecast (Forecast: Data Center Hardware Spending to Support Big Data Projects, Worldwide 2013), appliance spending for big data projects will grow from 0.9% of hardware spending in 2012 to 9.3% by 2017. I have found myself inside a swirl of new big data appliance projects, all designed to provide highly integrated systems with easy support and fully tested integration. An appliance is a great turnkey solution for companies that can’t (or don’t wish to) employ a hardware and software installation team: Simply pick up the box from the shipping area, unpack it and start analyzing data within minutes. In addition, many companies are just beginning to dabble in Hadoop, and appliances can be an easy, cost-effective way to demonstrate the value of Hadoop before making a larger investment.

While Hadoop is commonplace in the big data infrastructure, the use models can be quite varied. I’ve heard my fair share of highly connected big data engineers attempt to categorize Hadoop deployments, and the deployments generally fall into one of four categories:

  1. Business intelligence, querying, reporting, searching – such as filtering, indexing, trend analysis, search optimization – and good old-fashioned information retrieval.
  2. Higher performance for common data management operations, including log storage, data storage and archiving, extract/transform/load (ETL) processing and data conversions.
  3. Non-database applications such as image processing, data sequencing, web crawling and workflow processing.
  4. Data mining and analytical applications including social network/sentiment analysis, profile matching, machine learning, personalization and recommendation analysis, ad optimization and behavioral analysis.

Finding the right appliance for you
While appliances lower the barrier to entry for Hadoop clusters, their designs and costs are as varied as their use cases. Some appliances build in the flexibility of cloud services, while others focus on integrating application components and reducing service level agreement (SLA) times. Still others focus primarily on low-cost storage. And some appliances are hardware only (albeit validated designs), still requiring a separate software agreement and installation through a third-party vendor.

Pricing is usually quoted by capacity ($/TB), per node or per rack, depending on the vendor and product. Licensing can significantly increase overall costs, with annual maintenance costs (software subscription and support) and license renewals adding to the cost of doing business. The good news is that, with so many appliances to choose from, any organization can find one that enables it to design a cluster that fits its budget, operating costs and value expectations.
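
To show how the different pricing models can be compared, here is a toy Python sketch that normalizes quotes to $/TB of raw capacity. The appliance names, prices and capacities are made up for illustration only, and a real comparison would also factor in licensing, support and operating costs.

    # Hypothetical quotes, invented for illustration only.
    quotes = [
        {"name": "Appliance A", "basis": "per_node", "price": 30000, "tb_per_node": 24},
        {"name": "Appliance B", "basis": "per_rack", "price": 400000, "tb_per_rack": 384},
        {"name": "Appliance C", "basis": "per_tb", "price": 900},
    ]

    def cost_per_tb(q):
        """Normalize a quote to $/TB of raw capacity."""
        if q["basis"] == "per_node":
            return q["price"] / q["tb_per_node"]
        if q["basis"] == "per_rack":
            return q["price"] / q["tb_per_rack"]
        return q["price"]  # already quoted per TB

    for q in quotes:
        print(f'{q["name"]}: ${cost_per_tb(q):,.0f}/TB raw, before licensing and support')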



August was always an exciting time at my childhood home. We were excited that school was starting in September, and mom was relieved that summer was coming to an end. I remember the annual trips to the local department stores to buy school clothes. It was always exciting to pick out new school clothes and a new winter coat. With only a few stores to choose from, many of us wore similar clothes and coats when classes started.

As consumers, we have far more fashion and store options today. There are specialty stores at the mall, big box outlets, membership stores and specialty online portals. With so many more clothing designers than in years past, retailers are also inundated with fashion choices. The question becomes, “How does the fashion chain – from textile suppliers and clothing manufacturers to the retailers themselves – choose what to carry?”

They all rely on big data to make critical decisions. Let’s go to the start of the chain: the textile manufacturer. It may analyze previous years’ orders, competitive intelligence, purchasing trend data, and raw material and manufacturing costs. While tracking analytics on one data source is relatively easy, capturing and analyzing multiple data sources can be a tremendous challenge – a point underscored in a 2012 research report from Gartner. In its analysis, Gartner found that big data processing challenges don’t come from the analysis of a single data set or source, but rather from the complexity of interaction between two or more data sets.

“When combining large assets and new asset types, how they relate to each other becomes more complex,” the Gartner report explains. “As more assets are combined, the tempo of record creation and the qualification of the data within use cases becomes more complex.”
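
To make that “interaction between data sets” point concrete, here is a toy Python sketch with made-up numbers. Even a simple question like “which fabrics are getting more expensive relative to last year’s orders?” requires relating two separate data sources, and gaps appear as soon as one source covers something the other doesn’t.

    # Made-up data sets, for illustration only.
    last_year_orders = {        # fabric -> units ordered last season
        "cotton": 120000,
        "polyester": 80000,
        "wool": 15000,
    }
    material_cost_change = {    # fabric -> year-over-year raw material cost change (%)
        "cotton": 8.5,
        "polyester": -2.0,
        "denim": 4.0,           # appears in one data set but not the other
    }

    for fabric in sorted(set(last_year_orders) | set(material_cost_change)):
        units = last_year_orders.get(fabric)
        change = material_cost_change.get(fabric)
        if units is None or change is None:
            print(f"{fabric}: incomplete picture -- only one data source covers it")
        else:
            print(f"{fabric}: {units} units ordered, material cost {change:+.1f}% year over year")

Multiply the handful of fabrics above by thousands of products and dozens of data feeds, and the complexity Gartner describes becomes obvious.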

The next link is the clothing companies that create the fashion. They have a much more complex job, using big data to analyze fashion trends and improve their decision-making. Information such as historical sales, weather predictions, demographic data and economic details helps them choose the right colors, sizes and price points for the clothing they make.

Swimsuits and Snow Parkas
This is where we, as consumers, come into the picture. Just as I did many years ago, people still shop for school and winter clothing this time of year. The clothes on the racks at our favorite retailer or from an online catalogue were chosen and ordered 6-9 months ago. Take Kohl’s. The nationwide retailer uses a blend of geographic weather prediction data sources to know where best to sell snow parkas versus swimsuits, economic and competitive data to price them right, demographic data sources to better predict the required sizes and customer demand, and market trend data sources to better forecast the colors and styles that will sell best. The more accurately Kohl’s buyers can predict consumer behavior using big data, the less the retailer will need to discount overstock, and the higher its sales and profit.

As I stated in my previous blog posts, the Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. As with flu strain and weather predictions, the more data you can quickly and efficiently analyze, the more accurate your prediction. When it comes to weather and flu vaccines, these predictions can help save lives, but in the fashion industry it is all about improving the bottom line.

Whether in fashion, medicine, weather or other fields, using Hadoop for high levels of speed and accuracy in big data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.

Part three of this three-part series continues to examine some of the diverse and potentially life-saving uses of big data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize big data application performance.



Every year I diligently get in line for my annual flu (or, more technically accurate, “seasonal influenza”) shot. I’m not particularly fond of needles, but I have seen what the flu can do and how many people die each year from this seasonal virus.

When you get the flu shot – or, now, the nasal mist – we are trusting a lot of people that what we are taking will actually help protect us. According to the CDC (Centers for Disease Control and Prevention), there are three antigenic types (A, B and C) of influenza virus, and of those three, two cause the seasonal epidemics we suffer through each year.

Not to get too technical, but I learned that the A type is further divided by two proteins and given code names like H1N1, H3N2 and H5N1. These names can even be updated by year if a strain changes; an example was in 2009, when H1N1 became the 2009 H1N1. So while we may just call it H1N1, the World Health Organization has a whole taxonomy for describing a seasonal influenza strain.

 

This taxonomy includes:

  • The antigenic type (e.g., A, B, C)
  • The host of origin (e.g., swine, equine, chicken, etc.)
  • Geographical origin (e.g., Denver, Taiwan, etc.)
  • Strain number (e.g., 15, 7, etc.)
  • Year of isolation (e.g., 57, 2009, etc.)
  • For influenza A viruses, parentheses denote the hemagglutinin and neuraminidase antigens [e.g., (H1N1), (H5N1)]

As you can see, it can get complicated quickly. If you would like to go deeper, you can read more about this here. While much of this information seems pretty arcane to the lay reader, you can quickly see that the sheer volume of information collected, stored and analyzed to combat seasonal influenza is a great example of big data.
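
As a small illustration of how the components listed above fit together, here is a Python sketch that assembles a strain name in the style of the WHO convention (type/host/place/strain number/year, with the HxNx subtype in parentheses for influenza A). The sample values are just examples.

    def strain_name(antigenic_type, geo_origin, strain_number, year, host=None, subtype=None):
        """Assemble an influenza strain name from the taxonomy components above.
        The host of origin is omitted for human-origin viruses; the (HxNx)
        subtype applies to influenza A. Sample values are illustrative."""
        parts = [antigenic_type]
        if host:                                 # e.g., swine, equine, chicken
            parts.append(host)
        parts += [geo_origin, str(strain_number), str(year)]
        name = "/".join(parts)
        if subtype:                              # hemagglutinin/neuraminidase antigens
            name += f" ({subtype})"
        return name

    print(strain_name("A", "Perth", 16, 2009, subtype="H3N2"))              # A/Perth/16/2009 (H3N2)
    print(strain_name("A", "Alberta", 35, 76, host="duck", subtype="H1N1")) # A/duck/Alberta/35/76 (H1N1)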

In the US, once the CDC sifts through this data – using big data analytics tools – it uses its findings to determine what strains might affect the US and to build a flu shot to combat those strains. During the 2012/2013 season, the predominant virus was influenza A (H3N2), though influenza B viruses and a dash of influenza A (H1N1)pdm09 (pH1N1) also circulated. (See the full report here.)

In addition to identifying dominant viruses, the CDC also uses big data to track the spread and potential effect on the population. Reviewing information from prior outbreaks, population data and even weather patterns, the CDC uses big data analytics to quickly estimate where viruses might hit first, hardest and longest, so that a targeted vaccine can be produced in sufficient quantities, in the required timeframe and even for the right geography. The faster and more accurately this can be done, the more people can get this potentially life-saving vaccine before the virus travels to their area.

As I stated in my previous blog post, the Hadoop® architecture is a great tool for efficiently storing and processing the growing amount of data worldwide, but Hadoop is only as good as the processing and storage performance that supports it. As with weather predictions, the more data you can quickly and efficiently analyze, the greater the likelihood of an accurate prediction. When it comes to weather and flu vaccines, these predictions can help save lives. In my final blog post in this series, I will explore how big data helps the fashion industry.

Whether in medicine, weather forecasting or other fields that leverage big data technologies, using Hadoop for high levels of speed and accuracy in big data analysis requires computers with application acceleration. One such tool is LSI® Nytro™ Application Acceleration. You can go to TheSmarterWayToFaster™ for more information on the Nytro product family.

Part two of this three-part series continues to examine some of the diverse and potentially life-saving uses of big data in our everyday lives. It also explores how expanded data access and higher processing and storage speed can help optimize big data application performance.

 
