It’s the start of the new year, and it’s traditional to make predictions – right? But predicting the future of the datacenter has been hard lately. There have been and continue to be so many changes in flight that possibilities spin off in different directions. Fractured visions through a kaleidoscope. Changes are happening in the businesses behind datacenters, the scale, the tasks and what is possible to accomplish, the value being monetized, and the architectures and technologies to enable all of these.
A few months ago I was asked to describe the datacenter in 2020 for some product planning purposes. Dave Vellante of Wikibon & John Furrier of SiliconANGLE asked me a similar question a few weeks ago. 2020 is out there – almost 7 years. It’s not easy to look into the crystal ball that far and figure out what the world will look like then, especially when we are in the midst of those tremendous changes. For some context I had to think back 7 years – what was the datacenter like then, and how profound have the changes been over the past 7 years?
And 7 years ago, our forefathers…
It was a very different world. Facebook barely existed, and had just barely passed the “university only” membership. Google was using Velcro, Amazon didn’t have its services, cloud was a non-existent term. In fact DAS (direct attach storage) was on the decline because everyone was moving to SAN/NAS. 10GE networking was in the future (1GE was still in growth mode). Linux was not nearly as widely accepted in enterprise – Amazon was in the vanguard of making it usable at scale (with Werner Vogels saying “it’s terrible, but it’s free, as in free beer”). Servers were individual – no “PODs,” and VMware was not standard practice yet. SATA drives were nowhere in datacenters.
An enterprise disk drive topped out at around 200GB in capacity. Nobody used the term petabyte. People, including me, were just starting to think about flash in datacenters, and it was several years later that solutions became available. Big data did not even exist. Not as a term or as a technology, definitely not Hadoop or graph search. In fact, Google’s seminal paper on MapReduce had just been published, and it would become the inspiration for Hadoop – something that would take many years before Yahoo picked it up and helped make it real.
Analytics were statistical and slow, and you had to be very explicitly looking for something. Advertising on the web was a modest business. Cold storage was tape or MAID, not vast pools of cheap disks in the cloud at absurdly low price points. None of the Chinese web-cloud guys existed… In truth, at LSI we had not even started looking at or getting to know the web datacenter guys. We assumed they just bought from OEMs…
No one streamed mainstream media – TV and movies – and there were no tablets to stream them to. YouTube had just been purchased by Google. Blu-ray was just getting started and competing with HD-DVD (which I foolishly bought 7 years ago), and integrated GPS’s in your car were a high-tech growth area. The iPhone or Android had not launched, Danger’s Sidekick was the cool phone, flip phones were mainstream, there was no App store or the billions of sales associated with that, and a mobile web browser was virtually useless.
Dell, IBM, and HP were the only real server companies that mattered, and the whole industry revolved around them, as well as EMC and NetApp for storage. Cisco, Lenovo and Huawei were not server vendors. And Sun was still Sun.
7 years from now
So – 7 years from now? That’s hard to predict, so take this with a grain of salt… There are many ways things could play out, especially when global legal, privacy, energy, hazardous waste recycling, and data retention requirements come into play, not to mention random chaos and invention along the way.
Compute-centric to dataflow-centric
Major applications are changing (have changed) from compute-centric to dataflow architectures. That is big data. The result will probably be a decline in the influence of processor vendors, and the increased focus on storage, network and memory, and optimized rack-level architectures. A handful of hyperscale datacenters are leading the way, and dragging the rest of us along. These types of solutions are already being deployed in big enterprise for specialized use cases, and their adoption will only increase with time. In 7 years, the main deployment model will echo what hyperscale datacenters are doing today: disaggregated racks of compute, memory and storage resources.
The datacenter is now being viewed as a profit growth enabler, rather than a cost center. That implies more compute = more revenue. That changes the investment profile and the expectations for IT. It will not be enough for enterprise IT departments to minimize change and risk because then they would be slowing revenue growth.
Customers and vendors
We are in the early stages of a customer revolt. Whether it’s deserved or not is immaterial, though I believe it’s partially deserved. Large customers have decided (and I’m doing broad brush strokes here) that OEMs are charging them too much and adding “features” that add no value and burn power, that the service contracts are excessively expensive and that there is very poor management interoperability among OEM offerings – on purpose to maintain vendor lockin. The cost structures of public cloud platforms like Amazon are proof there is some merit to the argument. Management tools don’t scale well, and require a lot of admin intervention. ISVs are seen as no better. Sure the platforms and apps are valuable and critical, but they’re really expensive too, and in a few cases, open source solutions actually scale better (though ISVs are catching up quickly).
The result? We’re seeing a push to use whitebox solutions that are interoperable and simple. Open source solutions – both software and hardware – are gaining traction in spite of their problems. Just witness the latest Open Compute Summit and the adoption rate of Hadoop and OpenStack. In fact many large enterprises have a policy that’s pretty much – any new application needs to be written for open source platforms on scale-out infrastructure.
Those 3 OEMs are struggling. Dell, HP and IBM are selling more servers, but at a lower revenue. Or in the case of IBM – selling the business. They are trying to upsell storage systems to offset those lost margins, and they are trying to innovate and vertically integrate to compensate for the changes. In contrast we’re seeing a rapid increase planned from self-built, self-architected hyperscale datacenters, especially in China. To be fair – those pressures on price and supplier revenue are not necessarily good for our industry. As well, there are newer entrants like Huawei and Cisco taking a noticeable chunk of the market, as well as an impending growth of ISV and 3rd party full rack “shrink wrapped” systems. Everybody is joining the party.
Storage, cold storage and storage-class memory
Stepping further out on the limb, I believe (but who really knows) that by 2020 storage as we know is no longer shipping. SMB is hollowed out to the cloud – that is – why would any small business use anything but cloud services? The costs are too compelling. Cloud storage is stratified into 3 levels: storage-class memory, flash/NVM and cool/cold bulk disk storage. Cold storage is going to be a very, very important area. You need to save that data, but spend zero power, and zero $ on storing it. Just look at some of the radical ideas like Facebook’s Blu-ray jukebox to address that, which was masterminded by a guy I really like – Gio Coglitore – and I am very glad is getting some rightful attention. (http://www.wired.com/wiredenterprise/2014/02/facebook-robots/)
I believe that pooled storage class memory is inevitable and will disrupt high-performance flash storage, probably beginning in 2016. My processor architect friends and I have been daydreaming about this since 2005. That disruption’s OK, because flash use will continue to grow, even as disk use grows. There is just too much data. I’ve seen one massive vendor’s data showing average servers are adding something like 0.2 hard disks per year and 0.1 SSDs per year – and that’s for the average server including diskless nodes that are usually the most common in hyperscale datacenters. So growth in spite of disruption and capacity growth.
Data will be pooled, and connected by fabric as distributed objects or key/value pairs, with erasure coding. In fact, Object store (key/value – whatever) may have “obsoleted” block storage. And the need for these larger objects will probably also obsolete file as we’re used to it. Sure disk drives may still be block based, though key/value gives rise to all sorts of interesting opportunities to support variable size structures, obscure small fault domains, and variable encryption/compression without wasting space on disk platters. I even suspect that disk drives as we know them will be morphing into cold store specialty products that physically look entirely different and are made from different materials – for a lot of reasons. 15K drives will be history, and 10K drives may too. In fact 2” drives may not make sense anymore as the laptop drive and 15K drive disappear and performance and density are satisfied by flash.
Enterprise becomes private cloud that is very similar structurally to hyperscale, but is simply in an internal facility. And SAN/NAS products as we know them will be starting on the long end of the tail as legacy support products. Sure new network based storage models are about to emerge, but they’re different and more aligned to key/value.
Rack-scale architectures will have taken over clustered deployments. That means pooled resources. Processing will be pools of single socket SoC servers enabling massive clusters, rather than lots of 2- socket servers. These SoCs might even be mobile device SoCs at some point or at least derived from that – the economics of scale and fast cadence of consumer SoCs will make that interesting, maybe even inevitable. After all, the current Apple A7 in the iphone 5S is a dual core, 64-bit V8 ARM at 1.4GHz and the whole iPhone costs as much as mainstream server processor chips. In a few years, an 8 or 16 core equivalent at 1.5GHz or 2GHz is not hard to imagine, and the cost structure should be excellent.
Rapidly evolving open source applications will have morphed into eventually consistent dataflow tasks. Or they will be emerging in-memory applications working on vast data structures in the pooled storage class memory at the rack or larger scale, which will add tremendous monetary value to businesses. Whatever the evolutionary paths – the challenge for the next 10 years is optimizing dataflow as the amount used continues to exponentially grow. After all – data has value in aggregate, so why would you throw anything away, even as the amount we generate increases?
Clusters will be autonomous. Really autonomous. As in a new term I love: “emergent.” It’s when you can start using big data analytics to monitor the datacenter, and make workload/management and data placement decisions in real time, automatically, and the datacenter begins to take on un-predicted characteristics. Deployment will be autonomous too. Power on a pod of resources, and it just starts working. Google does that already.
Layer 2 datacenter network switches will either be disappearing or will have migrated to a radically different location in the rack hierarchy. There are many ways this can evolve. I’m not sure which one(s) will dominate, but I know it will look different. And it will have different bandwidth. 100G moving to 400G interconnect fabric over fiber.
So there you have it. Guaranteed correct…
Different applications and dataflow, different architectures, different processors, different storage, different fabrics. Probably even a re-alignment of vendors.
Predicting the future of the datacenter has not been easy. There have been, and are so many changes happening. The businesses behind them. The scale, the tasks and what is possible to accomplish, the value being monetized, and the architectures and technologies to enable all of these. But at least we have some idea what’s ahead. And it’s pretty different, and exciting.
Tags: 10 gigabit ethernet, 2020, Amazon, Apple, China, Cisco, cloud storage, cold storage, datacenter, Dell, EMC, Facebook, flash, Google, Hadoop, HP, Huawei, hyperscale datacenter, IBM, iPhone, kaleidoscope, Lenovo, NAS, NetApp, non-volatile memory, NVM, Open Compute, OpenStack, rack scale architecture, SAN, SoC, Sun, VMware, YouTube
Open Compute and OpenStack are changing the datacenter world that we know and love. I thought they were having impact. Changing our OEMs and ODM products, changing what we expect from our vendors, changing the interoperability of managing infrastructure from different vendors. Changing our ability to deploy and manage grid and scale-out infrastructure. And changing how quickly and at what high level we can be innovating. I was wrong. It’s happening much more quickly than I thought.
On November 20-21 we hosted LSI AIS 2013. As I mentioned in a previous post, I was lucky enough to moderate a panel about Open Compute and OpenStack – “the perfect storm.” Truthfully? It felt more like sitting with two friends talking about our industry over beer. I hope to pick up that conversation again someday.
The panelists were awesome: Cole Crawford of Open Compute and Chris Kemp of OpenStack. These guys are not only influential. They have been involved from the very start of these two initiatives, and are in many ways key drivers of both movements. These are impressive, passionate guys who really are changing the world. There aren’t too many of us who can claim that. It was an engaging hour that I learned quite a bit from, and I think the audience did too. I wanted to share from my notes what I took away from that panel. I think you’ll be interested.
Goals and Vision: two “open source” initiatives
There were a few motivations behind Open Compute, and the goal was to improve these things.
The goal then, for the first time, is to work backwards from workload and create open source hardware and infrastructure that is openly available and designed from the start for large scale-out deployments. The idea is to drive high efficiency in cost, materials use and energy consumption. More work/$.
One surprising thing that came up – LSI is in every current contribution in Open Compute.
OpenStack layers services that describe abstractions of computer networking and storage. LSI products tend to sit at that lowest level of abstraction, where there is now a wave of innovation. OpenStack had similar fragmentation issues to deal with and its goals are something like:
There is a certain amount of compatibility with Amazon’s cloud services. Chris’s point was that Amazon is incredibly innovative and a lot of enterprises should use it, but OpenStack enables both service providers and private clouds to compete with Amazon, and it allows unique innovation to evolve on top of it.
OpenStack and Open Compute are not products. They are “standards” or platform architectures, with companies using those standards to innovate on top of them. The idea is for one company to innovate on another’s improvements – everybody building on each other’s work. A huge brain trust. The goal is to create a competitive ecosystem and enable a rapid pace of innovation, and enable large-scale, inexpensive infrastructure that can be managed by a small team of people, and can be managed like a single server to solve massive scale problems.
Here’s their thought. Hardware is a supply chain management game + services. Open Compute is an opportunity for anyone to supply that infrastructure. And today, OEMs are killer at that. But maybe ODMs can be too. Open Compute allows innovation on top of the basic interoperable platforms. OpenStack enables a framework for innovation on top as well: security, reliability, storage, network, performance. It becomes the enabler for innovation, and it provides an “easy” way for startups to plug into a large, vibrant ecosystem. And for customers – someone said its “exa data without exadollar”…
As a result, the argument is this should be good for OEMs and ISVs, and help create a more innovative ecosystem and should also enable more infrastructure capacity to create new and better services. I’m not convinced that will happen yet, but it’s a laudable goal, and frankly that promise is part of what is appealing to LSI.
Open Compute and OpenStack are “peanut butter and jelly”
Ok – if you’re outside of the US, that may not mean much to you. But if you’ve lived in the US, you know that means they fit perfectly, and make something much greater together than their humble selves.
Graham Weston, Chairman of the Rackspace Board, was the one who called these two “peanut butter and jelly.”
Cole and Chris both felt the initiatives are co-enabling, and probably co-travelers too. Sure they can and will deploy independently, but OpenStack enables the management of large scale clusters, which really is not easy. Open Compute enables lower cost large-scale manageable clusters to be deployed. Together? Large-scale clusters that can be installed and deployed more affordably, and easily without hiring a cadre of rare experts.
Personally? I still think they are both a bit short of being ready for “prime time” – or broad deployment, but Cole and Chris gave me really valid arguments to show me I’m wrong. I guess we’ll see.
US or global vision?
I asked if these are US-centric or global visions. There were no qualms – these are global visions. This is just the 3rd anniversary of OpenStack, but even so, there are OpenStack organizations in more than 100 countries, 750 active contributors, and large-scale deployments in datacenters that you probably use every day – especially in China and the US. Companies like PayPal and Yahoo, Rackspace, Baidu, Sina Weibo, Alibaba, JD, and government agencies and HPC clusters like CERN, NASA, and China Defense.
Open Compute is even younger – about 2 years old. (I remember – I was invited to the launch). Even so, most of Facebook’s infrastructure runs on Open Compute. Two Wall Street banks have deployed large clusters, with more coming, and Riot Games, which uses Open Compute infrastructure, drives 3% of the global network traffic with League of Legends. (A complete aside – one of my favorite bands to workout with did a lot of that game’s music, and the live music at the League of Legends competition a few months ago: http://www.youtube.com/watch?v=mWU4QvC09uM – not for everyone, but I like it.)
Both Cole and Chris emailed me more data after the fact on who is using these initiatives. I have to say – they are right. It really has taken off globally, especially OpenStack in the fast-paced Chinese market this year.
Book: 4th Paradigm – A tribute to computer science researcher Jim Grey
Cole and Chris mentioned a book during the panel discussion. A book I had frankly never heard of. It’s called the 4th Paradigm. It was a series of papers dedicated to researcher Jim Grey, who was a quiet but towering figure that I believe I met once at Microsoft Research. The book was put together by Gordon Bell, someone who I have met, and have profound respect for. And there are mentions of people, places, and things that have been woven through my (long) career. I think I would sum up its thesis in a quote from Jim Grey near the start of the book:
“We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.”
This is stunningly similar to the very useful big data framework we have been using recently at LSI: ”capture, hold, analyze”… I guess we should have added visualize, but that doesn’t have too much to do with LSI’s business.
As an aside, I would recommend this book for the background and inspiration in why we as an industry are trying to solve many of these computer science problems, and how transformational the impact might be. I mean really transformational in the world around us, what we know, what we can do, and how quickly we can do it – which is tightly related to our CEO’s keynote and the vision video at AIS.
Demos at AIS: “peanut butter and jelly” - and bread?
Ok – I’m struggling for analogy. We had an awesome demo at AIS that Chris and Cole pointed out during the panel. It was originally built using Nebula’s TOR appliance, Open Compute hardware, and LSI’s storage magic to make it complete. The three pieces coming together. Tasty. The Open Compute hardware was swapped out last minute (for safety, those boxes were meant for the datacenter – not the showcase in a hotel with tipsy techies) and were generously supplied by Supermicro.
I don’t think the proto was close to any one of our visions, but even as it stood, it inspireda lot of people, and would make a great product. A short rack of servers, with pooled storage in the rack, OpenStack orchestrating the point and click spawning and tear down of dynamically sized LUNs of different characteristics under the Cinder presentation layer, and deployment of tasks or VMs on them.
We’re working on completing our joint vision. I think the industry will be very impressed when they see it. Chris thinks people will be stunned, and the industry will be changed.
Catalyzing the market… The future may be closer than we think…
Ultimately, this is all about economics. We’re in the middle of an unprecedented bifurcation in IT use. On one hand we’re running existing apps on new, dense enterprise hardware using VMs to layer many applications on few servers. On the other, we’re investing in applications to run at scale across inexpensive clusters of commodity hardware. This has spawned a split in IT vendor business units, product lines and offerings, and sometimes even IT infrastructure management in the datacenter.
New applications and services are needing more infrastructure, and are getting more expensive to power, cool, purchase, run. And there is pressure to transform the datacenter from a cost center into a profit center. As these innovations start, more companies will need scale infrastructure, arguably Open Compute, and then will need an Openstack framework to deploy it quickly.
Whats this mean? With a combination of big data and mobile device services driving economic value, we may be at the point where these clusters start to become mainstream. As an industry we’re already seeing a slight decline in traditional IT equipment sales and a rapid growth in scale-out infrastructure sales. If that continues, then OpenStack and Open Compute are a natural fit. The deployment rate uptick in life sciences, oil and gas, financials this year – really anywhere there is large-scale Hadoop, big data or analytics – may be the start of that growth curve. But both Chris and Cole felt it would probably take 5 years to truly take off.
Time to Wrap Up
I asked Chris and Cole for audience takeaways. Theirs were pretty simple, though possibly controversial in an industry like ours.
Hardware vendors should think about products and how they interface and what abstractions they present and how they fit into the ecosystem. These new ecosystems should allow them to easily plug in. For example, storage under Cinder can be quickly and easily morphed – that’s what we did with our demo.
We should be designing new software to run on distributed scale-out systems in clouds. Chris went on to say their code name was “Maestro” because it orchestrates like in a symphony, bringing things together in a beautiful way. He said “make instruments for the artists out there.” The brain trust. Look for their brushstrokes.
Innovate in the open, and leverage the open initiatives that are available to accelerate innovation and efficiency.
On your next IT purchase, try an RFP with an Open Compute vendor. Cole said you might be surprised. Worst case, you may get a better deal from your existing vendor.
So, Open Compute and Openstack are changing the datacenter world that we know and love. I thought these were having a quick impact, changing our OEMs and ODM products, changing what we expect from our vendors, changing the interoperability of managing infrastructure from different vendors, changing our ability to deploy and manage grid and scale-out infrastructure, and changing how quickly and at what high level we can be innovating. I was wrong. It’s happening much more quickly than even I thought.
Tags: AIS, Alibaba, Amazon, Baidu, big data, CERN, China, China Defense, Chris Kemp, Cole Crawford, datacenter, Facebook, Hadoop, HPC, IT infrastructure, JD, Jim Grey, NASA, Nebula, Networking, Open Compute, OpenStack, PayPal, Rackspace, Riot Games, scale-out cluster, Sina Weibo, Storage, Supermicro, Yahoo
You might be surprised to find out how big the infrastructure for cloud and Web 2.0 is. It is mind-blowing. Microsoft has acknowledged packing more than 1 million servers into its datacenters, and by some accounts that is fewer than Google’s massive server count but a bit more than Amazon.
Facebook’s server count is said to have skyrocketed from 30,000 in 2012 to 180,000 just this past August, serving 900 million plus users. And the social media giant is even putting its considerable weight behind the Open Compute effort to make servers fit better in a rack and draw less power. The list of mega infrastructures also includes Tencent, Baidu and Alibaba and the roster goes on and on.
Even more jaw-dropping is that almost 99.9% of these hyperscale infrastructures are built with servers featuring direct-attached storage. That’s right – they do the computing and store the data. In other words, no special, dedicated storage gear. Yes, your Facebook photos, your Skydrive personal cloud and all the content you use for entertainment, on-demand video and gaming data are stored inside the server.
Direct-attached storage reigns supreme
Everything in these infrastructures – compute and storage – is built out of x-86 based servers with storage inside. What’s more, growth of direct-attached storage is many folds bigger than any other storage deployments in IT. Rising deployments of cloud, or cloud-like, architectures are behind much of this expansion.
The prevalence of direct-attached storage is not unique to hyperscale deployments. Large IT organizations are looking to reap the rewards of creating similar on-premise infrastructures. The benefits are impressive: Build one kind of infrastructure (server racks), host anything you want (any of your properties), and scale if you need to very easily. TCO is much less than infrastructures relying on network storage or SANs.
With direct-attached you no longer need dedicated appliances for your database tier, your email tier, your analytics tier, your EDA tier. All of that can be hosted on scalable, share-nothing infrastructure. And just as with hyperscale, the storage is all in the server. No SAN storage required.
Open Compute, OpenStack and software-defined storage drive DAS growth
Open Compute is part of the picture. A recent Open Compute show I attended was mostly sponsored by hyperscale customers/suppliers. Many big-bank IT folks attended. Open Compute isn’t the only initiative driving growing deployments of direct-attached storage. So is software-defined storage and OpenStack. Big application vendors such as Oracle, Microsoft, VMware and SAP are also on board, providing solutions that support server-based storage/compute platforms that are easy and cost-effective to deploy, maintain and scale and need no external storage (or SAN including all-flash arrays).
So if you are a network-storage or SAN manufacturer, you have to be doing some serious thinking (many have already) about how you’re going to catch and ride this huge wave of growth.
Tags: Alibaba, Amazon, Baidu, cloud computing, DAS, direct attached storage, enterprise, enterprise IT, Google, hyperscale, Microsoft, Open Compute, OpenStack, Oracle, SAP, Tencent, VMware
You may have noticed I’m interested in Open Compute. What you may not know is I’m also really interested in OpenStack. You’re either wondering what the heck I’m talking about or nodding your head. I think these two movements are co-dependent. Sure they can and will exist independently, but I think the success of each is tied to the other. In other words, I think they are two sides of the same coin.
Why is this on my mind? Well – I’m the lucky guy who gets to moderate a panel at LSI’s AIS conference, with the COO of Open Compute, and the founder of OpenStack. More on that later. First, I guess I should describe my view of the two. The people running these open-source efforts probably have a different view. We’ll find that out during the panel.
I view Open Compute as the very first viable open-source hardware initiative that general business will be able to use. It’s not just about saving money for rack-scale deployments. It’s about having interoperable, multi-source systems that have known, customer-malleable – even completely customized and unique – characteristics including management. It also promises to reduce OpEx costs.
Ready for Prime Time?
But the truth is Open Compute is not ready for prime time yet. Facebook developed almost all the designs for its own use and gifted them to Open Compute, and they are mostly one or two generations old. And somewhere between 2,000 and 10,000 Open Compute servers have shipped. That’s all. But, it’s a start.
More importantly though, it’s still just hardware. There is still a need to deploy and manage the hardware, as well as distribute tasks, and load balance a cluster of Open Compute infrastructure. That’s a very specialized capability, and there really aren’t that many people who can do that. And the hardware is so bare bones – with specialized enclosures, cooling, etc – that it’s pretty hard to deploy small amounts. You really want to deploy at scale – thousands. If you’re deploying a few servers, Open Compute probably isn’t for you for quite some time.
I view OpenStack in a similar way. It’s also not ready for prime time. OpenStack is an orchestration layer for the datacenter. You hear about the “software defined datacenter.” Well, this is it – at least one version. It pools the resources (compute, object and block storage, network, and memory at some time in the future), presents them, allows them to be managed in a semi-automatic way, and automates deployment of tasks on the scaled infrastructure. Sure there are some large-scale deployments. But it’s still pretty tough to deploy at large scale. That’s because it needs to be tuned and tailored to specific hardware. In fact, the biggest datacenters in the world mostly use their own orchestration layer. So that means today OpenStack is really better at smaller deployments, like 50, 100 or 200 server nodes.
The synergy – 2 sides of the same coin
You’ll probably start to see the synergy. Open Compute needs management and deployment. OpenStack prefers known homogenous hardware or else it’s not so easy to deploy. So there is a natural synergy between the two. It’s interesting too that some individuals are working on both… Ultimately, the two Open initiatives will meet in the big, but not-too-big (many hundreds to small thousands of servers) deployments in the next few years.
And then of course there is the complexity of the interaction of for-profit companies and open-source designs and distributions. Companies are trying to add to the open standards. Sometimes to the betterment of standards, but sometimes in irrelevant ways. Several OEMs are jumping in to mature and support OpenStack. And many ODMs are working to make Open Compute more mature. And some companies are trying to accelerate the maturity and adoption of the technologies in pre-configured solutions or appliances. What’s even more interesting are the large customers – guys like Wall Street banks – that are working to make them both useful for deployment at scale. These won’t be the only way scaled systems are deployed, but they’re going to become very common platforms for scale-out or grid infrastructure for utility computing.
Here is how I charted the ecosystem last spring. There’s not a lot of direct interaction between the two, and I know there are a lot of players missing. Frankly, it’s getting crazy complex. There has been an explosion of players, and I’ve run out of space, so I’ve just not gotten around to updating it. (If anyone engaged in these ecosystems wants to update it and send me a copy – I’d be much obliged! Maybe you guys at Nebula ? ;-)).
An AIS keynote panel – What?
Which brings me back to that keynote panel at AIS. Every year LSI has a conference that’s by invitation only (sorry). It’s become a pretty big deal. We have some very high-profile keynotes from industry leaders. There is a fantastic tech showcase of LSI products, partner and ecosystem company’s products, and a good mix of proof of concepts, prototypes and what-if products. And there are a lot of breakout sessions on industry topics, trends and solutions. Last year I personally escorted an IBM fellow, Google VPs, Facebook architects, bank VPs, Amazon execs, flash company execs, several CTOs, some industry analysts, database and transactional company execs…
It’s a great place to meet and interact with peers if you’re involved in the datacenter, network or cellular infrastructure businesses. One of the keynotes is actually a panel of 2. The COO of Open Compute, Cole Crawford, and the co-founder of OpenStack, Chris Kemp (who is also the founder and CSO of Nebula). Both of them are very smart, experienced and articulate, and deeply involved in these movements. It should be a really engaging, interesting keynote panel, and I’m lucky enough to have a front-row seat. I’ll be the moderator, and I’m already working on questions. If there is something specific you would like asked, let me know, and I’ll try to accommodate you.
You can see more here.
Yea – I’m very interested in Open Compute and OpenStack. I think these two movements are co-dependent. And I think they are already changing our industry – even before they are ready for real large-scale deployment. Sure they can and will exist independently, but I think the success of each is tied to the other. The people running these open-source efforts might have a different view. Luckily, we’ll get to find out what they think next month… And I’m lucky enough to have a front row seat.
I’ve just been to China. Again. It’s only been a few months since I was last there.
I was lucky enough to attend the 5th China Cloud Computing Conference at the China National Convention Center in Beijing. You probably have not heard of it, but it’s an impressive conference. It’s “the one” for the cloud computing industry. It was a unique view for me – more of an inside-out view of the industry. Everyone who’s anyone in China’s cloud industry was there. Our CEO, Abhi Talwalkar, had been invited to keynote the conference, so I tagged along.
First, the air was really hazy, but I don’t think the locals considered it that bad. The US consulate iPhone app said the particulates were in the very unhealthy range. Imagine looking across the street. Sure, you can see the building there, but the next one? Not so much. Look up. Can you see past the 10th floor? No, not really. The building disappears into the smog. That’s what it was like at the China National Convention Center, which is part of the same Olympics complex as the famous Birdcage stadium: http://www.cnccchina.com/en/Venues/Traffic.aspx
I had a fantastic chance to catch up with a university friend, who has been living in Beijing since the 90’s, and is now a venture capitalist. It’s amazing how almost 30 years can disappear and you pick up where you left off. He sure knows how to live. I was picked up in his private limo, whisked off to a very well-known restaurant across the city, where we had a private room and private waitress. We even had some exotic, special dishes that needed to be ordered at least a day in advance. Wow. But we broke Chinese tradition and had imported beer in honor of our Canadian education.
Sizing up China’s cloud infrastructure
The most unusual meeting I attended was an invitation-only session – the Sino-American roundtable on cloud computing. There were just about 40 people in a room – half from the US, half from China. Mostly what I learned is that the cloud infrastructure in China is fragmented, and probably sub-scale. And it’s like that for a reason. It was difficult to understand at first, but I think I’ve made sense of it.
I started asking why to friends and consultants and got some interesting answers. Essentially different regional governments are trying to capture the cloud “industry” in their locality, so they promote activity, and they promote creation of new tools and infrastructure for that. Why reuse something that’s open source and works if you don’t have to and you can create high-tech jobs? (That’s sarcasm, by the way.) Many technologists I spoke with felt this will hold them back, and that they are probably 3-5 years behind the US. As well, each government-run industry specifies the datacenter and infrastructure needed to be a supplier or ecosystem partner with them, and each is different. The national train system has a different cloud infrastructure from the agriculture department, and from the shipping authority, etc… and if you do business with them – that is you are part of their ecosystem of vendors, then you use their infrastructure. It all spells fragmentation and sub-scale. In contrast, the Web 2.0 / social media companies seem to be doing just fine.
Baidu was also showing off its open rack. It’s an embodiment of the Scorpio V1 standard, which was jointly developed with Tencent, Alibaba and China Telecom. It views this as a first experiment, and is looking forward to V2, which will be a much more mature system.
I was also lucky to have personal meetings with general managers,chief architects and effective CTOs of the biggest cloud companies in China. What did I learn? They are all at an inflexion point. Many of the key technologists have experience at American Web 2.0 companies, so they’re able to evolve quickly, leveraging their industry knowledge. They’re all working to build or grow their own datacenters, their own infrastructure. And they’re aggressively expanding products, not just users, so they’re getting a compound growth rate.
Here’s a little of what I learned. In general, there is a trend to try and simplify infrastructure, harmonize divergent platforms, and deploy more infrastructure by spending less on each unit. (In general, they don’t make as much per user as American companies, but they have more users). As a result they are more cost-focused than US companies. And they are starting to put more emphasis on operational simplicity in general. As one GM described it to me – “Yes, techs are inexpensive in China for maintainence, but more often than not they make mistakes that impact operations.” So we (LSI) will be focussing more on simplifying management and maintainence for them.
Baidu’s biggest Hadoop cluster is 20k nodes. I believe that’s as big as Yahoo’s – and it is the originator of Hadoop. Baidu has a unique use profile for flash – it’s not like the hyperscale datacenters in the US. But Baidu is starting to consume a lot. Like most other hyperscale datacenters, it is working on storage erasure coding across servers, racks and datacenters, and it is trying to make a unified namespace across everything. One of its main interests is architecture at datacenter level, harmonizing the various platforms and looking for the optimum at the datacenter level. In general, Baidu is very proud of the advances it has made, and it has real confidence in its vision and route forward, and from what I heard, its architectural ambitions are big.
JD.com (which used to be 360buy.com) is the largest direct ecommerce company in China and (only) had about $10 billion (US) in revenue last year, with 100% CAGR growth. As the GM there said, its growth has to slow sometime, or in 5 years it’ll be the biggest company in the world. I think it is the closest equivalent to Amazon there is out there, and they have similar ambitions. They are in the process of transforming to a self-built, self-managed datacenter infrastructure. It is a company I am going to keep my eyes on.
Tencent is expanding into some interesting new businesses. Sure, people know about the Tencent cloud services that the Chinese government will be using, but Tencent also has some interesting and unique cloud services coming. Let’s just say even I am interested in using them. And of course, while Tencent is already the largest Web 2.0 company in China, its new services promise to push it to new scale and new markets.
Extra! Extra! Read all about it …
And then there was press. I had a very enjoyable conversation with Yuan Shaolong, editor at WatchStor, that I think ran way over. Amazingly – we discovered we have the same favorite band, even half a world away from each other. The results are here, though I’m not sure if Google translate messed a few things up, or if there was some miscommunication, but in general, I think most of the basics are right: http://translate.google.com/translate?hl=en&sl=zh-CN&u=http://tech.watchstor.com/storage-module-144394.htm&prev=/search%3Fq%3Drobert%2Bober%2BLSI%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26biw%3D1346%26bih%3D619
I just keep learning new things every time I go to China. I suspect it has as much to do with how quickly things are changing as new stuff to learn. So I expect it won’t be too long until I go to China, again…
Tags: Abhi Talwalkar, Alibaba, Amazon, Baidu, China, China Cloud Computing Conference, China National Convention Center, China Telecom, datacenter, Hadoop, hyperscale, JD.com, WatchStor, web 2.0, Yahoo
I was lucky enough to get together for dinner and beer with old friends a few weeks ago. Between the 4 of us, we’ve been involved in or responsible for a lot of stuff you use every day, or at least know about.
Supercomputers, minicomputers, PCs, Macs, Newton, smart phones, game consoles, automotive engine controllers and safety systems, secure passport chips, DRAM interfaces, netbooks, and a bunch of processor architectures: Alpha, PowerPC, Sparc, MIPS, StrongARM/XScale, x86 64-bit, and a bunch of other ones you haven’t heard of (um – most of those are mine, like TriCore). Basically if you drive a European car, travel internationally, use the Internet , if you play video games, or use a smart phone, well… you’re welcome.
Why do I tell you this? Well – first I’m name dropping – I’m always stunned I can call these guys friends and be their peers. But more importantly, we’ve all been in this industry as architects for about 30 years. Of course our talk went to what’s going on today. And we all agree that we’ve never seen more changes – inflexions – than the raft unfolding right now. Maybe its pressure from the recession, or maybe un-naturally pent up need for change in the ecosystem, but change there is.
Changes in who drives innovation, what’s needed, the companies on top and on bottom at every point in the food chain, who competes with whom, how workloads have changed from compute to dataflow, software has moved to opensource, how abstracted code is now from processor architecture, how individual and enterprise customers have been revolting against the “old” ways, old vendors, old business models, and what the architectures look like, how processors communicate, and how systems are purchased, and what fundamental system architectures look like. But not much besides that…
Ok – so if you’re an architect, that’s as exciting as it gets (you hear it in my voice – right ?), and it makes for a lot of opportunities to innovate and create new or changed businesses. Because innovation is so often at the intersection of changing ways of doing things. We’re at a point where the changes are definitely not done yet. We’re just at the start. (OK – now try to imagine a really animated 4-way conversation over beers at the Britannia Arms in Cupertino… Yea – exciting.)
I’m going to focus on just one sliver of the market – but it’s important to me – and that’s enterprise IT. I think the changes are as much about business models as technology.
Hyperscale datacenters drive innovation
I’ll start in a strange place. Hyperscale datacenters (think social media, search, etc.) and the scale of deployment changes the optimization point. Most of us starting to get comfortable with rack as the new purchase quantum. And some of us are comfortable with the pod or container as the new purchase quantum. But the hyperscale dataenters work more at the datacenter as the quantum. By looking at it that way, they can trade off the cost of power, real estate, bent sheet metal, network bandwidth, disk drives, flash, processor type and quantity, memory amount, where work gets done, and what applications are optimized for. In other words, we shifted from looking at local optima to looking for global optima. I don’t know about you, but when I took operations research in university, I learned there was an unbelievable difference between the two – and global optima was the one you wanted…
Hyperscale datacenters buy enough (top 6 are probably more than 10% of the market today) that 1) they need to determine what they deploy very carefully on their own, and 2) vendors work hard to give them what they need.
That means innovation used to be driven by OEMs, but now it’s driven by hyperscale datacenters and it’s driven hard. That global optimum? It’s work/$ spent. That’s global work, and global spend. It’s OK to spend more, even way more on one thing if over-all you get more done for the $’s you spend.
That’s why the 3 biggest consumers of flash in servers are Facebook, Google, and Apple, with some of the others not far behind. You want stuff, they want to provide it, and flash makes it happen efficiently. So efficiently they can often give that service away for free.
Hyperscale datacenters have started to publish their cost metrics, and open up their architectures (like OpenCompute), and open up their software (like Hadoop and derivatives). More to the point, services like Amazon have put a very clear $ value on services. And it’s shockingly low.
Enterprises are paying attention
Enterprises have looked at those numbers. Hard. That’s catalyzed a customer revolt against the old way of doing things – the old way of buy and billing. OEMs and ISVs are creating lots of value for enterprise, but not that much. They’ve been innovating around “stickiness” and “lock-in” (yea – those really are industry terms) for too long, while hyperscale datacenters have been focused on getting stuff done efficiently. The money they save per unit just means they can deploy more units and provide better services.
That revolt is manifesting itself in 2 ways. The first is seen in the quarterly reports of OEMs and ISVs. Rumors of IBM selling its X-series to Lenovo, Dell going private, Oracle trying to shift business, HP talking of the “new style of IT”… The second is enterprises are looking to emulate hyperscale datacenters as much as possible, and deploy private cloud infrastructure. And often as not, those will be running some of the same open source applications and file systems as the big hyperscale datacenters use.
Where are the hyperscale datacenters leading them? It’s a big list of changes, and they’re all over the place.
But they’re also looking at a few different things. For example, global name space NAS file systems. Personally? I think this one’s a mistake. I like the idea of file systems/object stores, but the network interconnect seems like a bottleneck. Storage traffic is shared with network traffic, creates some network spine bottlenecks, creates consistency performance bottlenecks between the NAS heads, and – let’s face it – people usually skimp on the number of 10GE ports on the server and in the top of rack switch. A typical SAS storage card now has 8 x 12G ports – that’s 96G of bandwidth. Will servers have 10 x 10G ports? Yea. I didn’t think so either.
Anyway – all this is not academic. One Wall Street bank shared with me that – hold your breath – it could save 70% of its spend going this route. It was shocked. I wasn’t shocked, because at first blush this seems absurd – not possible. That’s how I reacted. I laughed. But… The systems are simpler and less costly to make. There is simply less there to make or ship than OEMs force into the machines for uniqueness and “value.” They are purchased from much lower margin manufacturers. They have massively reduced maintenance costs (there’s less to service, and, well, no OEM service contracts). And also important – some of the incredibly expensive software licenses are flipped to open source equivalents. Net savings of 70%. Easy. Stop laughing.
Disaggregation: Or in other words, Pooled Resources
But probably the most important trend from all of this is what server manufacturers are calling “disaggregation” (hey – you’re ripping apart my server!) but architects are more descriptively calling pooled resources.
First – the intent of disaggregation is not to rip the parts of a server to pieces to get lowest pricing on the components. No. If you’re buying by the rack anyway – why not package so you can put like with like. Each part has its own life cycle after all. CPUs are 18 months. DRAM is several years. Flash might be 3 years. Disks can be 5 to 7 years. Networks are 5 to 10 years. Power supplies are… forever? Why not replace each on its own natural failure/upgrade cycle? Why not make enclosures appropriate to the technology they hold? Disk drives need solid vibration-free mechanical enclosures of heavy metal. Processors need strong cooling. Flash wants to run hot. DRAM cool.
Second – pooling allows really efficient use of resources. Systems need slush resources. What happens to a systems that uses 100% of physical memory? It slows down a lot. If a database runs out of storage? It blue screens. If you don’t have enough network bandwidth? The result is, every server is over provisioned for its task. Extra DRAM, extra network bandwidth, extra flash, extra disk drive spindles.. If you have 1,000 nodes you can easily strand TBytes of DRAM, TBytes of flash, a TByte/s of network bandwidth of wasted capacity, and all that always burning power. Worse, if you plan wrong and deploy servers with too little disk or flash or DRAM, there’s not much you can do about it. Now think 10,000 or 100,000 nodes… Ouch.
If you pool those things across 30 to 100 servers, you can allocate as needed to individual servers. Just as importantly, you can configure systems logically, not physically. That means you don’t have to be perfect in planning ahead what configurations and how many of each you’ll need. You have sub-assemblies you slap into a rack, and hook up by configuration scripts, and get efficient resource allocation that can change over time. You need a lot of storage? A little? Higher performance flash? Extra network bandwidth? Just configure them.
That’s a big deal.
And of course, this sets the stage for immense pooled main memory – once the next generation non-volatile memories are ready – probably starting around 2015.
You can’t underestimate the operational problems associated with different platforms at scale. Many hyperscale datacenters today have around 6 platforms. If you think they are rolling out new versions of those before old ones are retired they often have 3 generations of each. That’s 18 distinct platforms, with multiple software revisions of each. That starts to get crazy when you may have 200,000 to 400,000 servers to manage and maintain in a lights out environment. Pooling resources and allocating them in the field goes a huge way to simplifying operations.
Alternate Processor Architecture
It didn’t always used to be Intel x86. There was a time when Intel was an upstart in the server business. It was Power, MIPs, Alpha, SPARC… (and before that IBM mainframes and minis, etc). Each of the changes was brought on by changing the cost structure. Mainframes got displaced by multi-processor RISC, which gave way to x86.
Today, we have Oracle saying they’re getting out of x86 commodity servers and doubling down on SPARC. IBM is selling off its x86 business and doubling down on Power (hey – don’t confuse that with PowerPC – which started as an architectural cut-down of Power – I was there…). And of course there is a rash of 64-bit ARM server SOCs coming – with HP and Dell already dabbling in it. What’s important to realize is that all of these offerings are focusing on the platform architecture, and how applications really perform in total, not just the processor.
Let me warp up with an email thread cut/paste from a smart friend – Wayne Nation. I think he summed up some of what’s going on well, in a sobering way most people don’t even consider.
“Does this remind you of a time, long ago, when the market was exploding with companies that started to make servers out of those cheap little desktop x86 CPUs? What is different this time? Cost reduction and disaggregation? No, cost and disagg are important still, but not new.
A new CPU architecture? No, x86 was “new” before. ARM promises to reduce cost, as did Intel.
Disaggregation enables hyperscale datacenters to leverage vanity-free, but consistent delivery will determine the winning supplier. There is the potential for another Intel to rise from these other companies. “
I’ve been travelling to China quite a bit over the last year or so. I’m sitting in Shenzhen right now (If you know Chinese internet companies, you’ll know who I’m visiting). The growth is staggering. I’ve had a bit of a trains, planes, automobiles experience this trip, and that’s exposed me to parts of China I never would have seen otherwise. Just to accommodate sheer population growth and the modest increase in wealth, there is construction everywhere – a press of people and energy, constant traffic jams, unending urban centers, and most everything is new. Very new. It must be exciting to be part of that explosive growth. What a market. I mean – come on – there are 1.3 billion potential users in China.
The amazing thing for me is the rapid growth of hyperscale datacenters in China, which is truly exponential. Their infrastructure growth has been 200%-300% CAGR for the past few years. It’s also fantastic walking into a building in China, say Baidu, and feeling very much at home – just like you walked into Facebook or Google. It’s the same young vibe, energy, and ambition to change how the world does things. And it’s also the same pleasure – talking to architects who are super-sharp, have few technical prejudices, and have very little vanity – just a will to get to business and solve problems. Polite, but blunt. We’re lucky that they recognize LSI as a leader, and are willing to spend time to listen to our ideas, and to give us theirs.
Even their infrastructure has a similar feel to the US hyperscale datacenters. The same only different. ;-)
A lot of these guys are growing revenue at 50% per year, several getting 50% gross margin. Those are nice numbers in any country. One has $100’s of billions in revenue. And they’re starting to push out of China. So far their pushes into Japan have not gone well, but other countries should be better. They all have unique business models. “We” in the US like to say things like “Alibaba is the Chinese eBay” or “Sina Weibo is the Chinese Twitter”…. But that’s not true – they all have more hybrid business models, unique, and so their datacenter goals, revenue and growth have a slightly different profile. And there are some very cool services that simply are not available elsewhere. (You listening Apple®, Google®, Twitter®, Facebook®?) But they are all expanding their services, products and user base. Interestingly, there is very little public cloud in China. So there are no real equivalents to Amazon’s services or Microsoft’s Azure. I have heard about current development of that kind of model with the government as initial customer. We’ll see how that goes.
100’s of thousands of servers. They’re not the scale of Google, but they sure are the scale of Facebook, Amazon, Microsoft…. It’s a serious market for an outfit like LSI. Really it’s a very similar scale now to the US market. Close to 1 million servers installed among the main 4 players, and exabytes of data (we’ve blown past mere petabytes). Interestingly, they still use many co-location facilities, but that will change. More important – they’re all planning to probably double their infrastructure in the next 1-2 years – they have to – their growth rates are crazy.
Often 5 or 6 distinct platforms, just like the US hyperscale datacenters. Database platforms, storage platforms, analytics platforms, archival platforms, web server platforms…. But they tend to be a little more like a rack of traditional servers that enterprise buys with integrated disk bays, still a lot of 1G Ethernet, and they are still mostly from established OEMs. In fact I just ran into one OEM’s American GM, who I happen to know, in Tencent’s offices today. The typical servers have 12 HDDs in drive bays, though they are starting to look at SSDs as part of the storage platform. They do use PCIe® flash cards in some platforms, but the performance requirements are not as extreme as you might imagine. Reasonably low latency and consistent latency are the premium they are looking for from these flash cards – not maximum IOPs or bandwidth – very similar to their American counterparts. I think hyperscale datacenters are sophisticated in understanding what they need from flash, and not requiring more than that. Enterprise could learn a thing or two.
Some server platforms have RAIDed HDDs, but most are direct map drives using a high availability (HA) layer across the server center – Hadoop® HDFS or self-developed Hadoop like platforms. Some have also started to deploy microserver archival “bit buckets.” A small ARM® SoC with 4 HDDs totaling 12 TBytes of storage, giving densities like 72 TBytes of file storage in 2U of rack. While I can only find about 5,000 of those in China that are the first generation experiments, it’s the first of a growing wave of archival solutions based on lower performance ARM servers. The feedback is clear – they’re not perfect yet, but the writing is on the wall. (If you’re wondering about the math, that’s 5,000 x 12 TBytes = 60 Petabytes….)
Yes, it’s important, but maybe more than we’re used to. It’s harder to get licenses for power in China. So it’s really important to stay within the envelope of power your datacenter has. You simply can’t get more. That means they have to deploy solutions that do more in the same power profile, especially as they move out of co-located datacenters into private ones. Annually, 50% more users supported, more storage capacity, more performance, more services, all in the same power. That’s not so easy. I would expect solar power in their future, just as Apple has done.
Here’s where it gets interesting. They are developing a cousin to OpenCompute that’s called Scorpio. It’s Tencent, Alibaba, Baidu, and China Telecom so far driving the standard. The goals are similar to OpenCompute, but more aligned to standardized sub-systems that can be co-mingled from multiple vendors. There is some harmonization and coordination between OpenCompute and Scorpio, and in fact the Scorpio companies are members of OpenCompute. But where OpenCompute is trying to change the complete architecture of scale-out clusters, Scorpio is much more pragmatic – some would say less ambitious. They’ve finished version 1 and rolled out about 200 racks as a “test case” to learn from. Baidu was the guinea pig. That’s around 6,000 servers. They weren’t expecting more from version 1. They’re trying to learn. They’ve made mistakes, learned a lot, and are working on version 2.
Even if it’s not exciting, it will have an impact because of the sheer size of deployments these guys are getting ready to roll out in the next few years. They see the progression as 1) they were using standard equipment, 2) they’re experimenting and learning from trial runs of Scorpio versions 1 and 2, and then they’ll work on 3) new architectures that are efficient and powerful, and different.
Information is pretty sketchy if you are not one of the member companies or one of their direct vendors. We were just invited to join Scorpio by one of the founders, and would be the first group outside of China to do so. If that all works out, I’ll have a much better idea of the details, and hopefully can influence the standards to be better for these hyperscale datacenter applications. Between OpenCompute and Scorpio we’ll be seeing a major shift in the industry – a shift that will undoubtedly be disturbing to a lot of current players. It makes me nervous, even though I’m excited about it. One thing is sure – just as the server market volume is migrating from traditional enterprise to hyperscale datacenter (25-30% of the server market and growing quickly), we’re starting to see a migration to Chinese hyperscale datacenters from US-based ones. They have to grow just to stay still. I mean – come on – there are 1.3 billion potential users in China….
Tags: Alibaba, Amazon, Apple, ARM, Baidu, China, China Telecom, datacenter, Facebook, Google, Hadoop, hard disk drive, HDD, hyperscale, Microsoft, OpenCompute, Scorpio, Shenzhen, Sina Weibo, solid state drive, SSD, Tencent, Twitter