I was asked some interesting questions recently by CEO & CIO, a Chinese business magazine. The questions ranged from how Chinese Internet giants like Alibaba, Baidu and Tencent differ from other customers and what leading technologies big Internet companies have created to questions about emerging technologies such as software-defined storage (SDS) and software-defined datacenters (SDDC) and changes in the ecosystem of datacenter hardware, software and service providers. These were great questions. Sometimes you need the press or someone outside the industry to ask a question that makes you step back and think about what’s going on.
I thought you might be interested, so this blog, the first of a 3-part series covering the interview, shares details of the first two questions.
CEO & CIO: In recent years, Internet companies have built ultra large-scale datacenters. Compared with traditional enterprises, they also take the lead in developing datacenter technology. From an industry perspective, what are the three leading technologies of ultra large-scale Internet data centers in your opinion? Please describe them.
There are so many innovations and important contributions to the industry from these hyperscale datacenters in hardware, software and mechanical engineering. To choose three is difficult. While I would prefer to choose hardware innovations as their big ones, I would suggest the following as they have changed our world and our industry and are changing our hardware and businesses:
Autonomous behavior and orchestration
An architect at Microsoft once told me, “If we had to hire admins for our datacenter in a normal enterprise way, we would hire all the IT admins in the world, and still not have enough.” There are now around 1 million servers in Microsoft datacenters. Hyperscale datacenters have had to develop autonomous, self-managing, sometimes self-deploying datacenter infrastructure simply to expand. They are pioneering datacenter technology for scale – innovating, learning by trial and error, and evolving their practices to drive more work/$. Their practices are specialized but beginning to be emulated by the broader IT industry. OpenStack is the best example of how that specialized knowledge and capability is being packaged and deployed broadly in the industry. At LSI, we’re working with both hyperscale and orchestration solutions to make better autonomous infrastructure.
High availability at datacenter level vs. machine level
As systems get bigger, they have more components and more modes of failure, and maintaining their reliability gets more complex and expensive. As storage is used more, and more aggressively, drives tend to fail. They are simply being used more. And yet there is continued pressure to reduce costs and complexity. By the time hyperscale datacenters had evolved to massive scale – hundreds of thousands of servers in multiple datacenters – they had created solutions for absolute reliability, even as individual systems got less expensive, less complex and much less reliable. This is what has enabled the very low cost structures of the cloud, and made it a reliable resource.
These solutions are well timed too, as more enterprise organizations need to maintain on-premises data across multiple datacenters with absolute reliability. The traditional view that a single server requires 99.999% reliability is giving way to a more pragmatic view of maintaining high reliability at the macro level – across the entire datacenter. This approach accepts the failure of individual systems and components even as it maintains data center level reliability. Of course – there are currently operational issues with this approach. LSI has been working with hyperscale datacenters and OEMs to engineer improved operational efficiency and resilience, and minimized impact of individual component failure, while still relying on the datacenter high-availability (HA) layer for reliability.
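To put rough numbers on that shift, here is a minimal sketch of the arithmetic. It assumes copies of the data fail independently of one another, which real datacenters only approximate, and the 99% node availability and three replicas are illustrative figures of mine, not measurements from any particular datacenter. The function names are hypothetical.

```python
import math


def pooled_availability(node_availability: float, replicas: int) -> float:
    """Chance that at least one of `replicas` copies is reachable.

    Assumes failures are independent (an idealization for illustration).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All copies must be down at the same time for the data to be unavailable.
    return 1.0 - (1.0 - node_availability) ** replicas


def nines(availability: float) -> float:
    """Express availability as a count of nines (0.999 is about 3)."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # Illustrative only: cheap 99% nodes, compared across replica counts.
    for copies in (1, 2, 3):
        a = pooled_availability(0.99, copies)
        print(f"{copies} replica(s): {nines(a):.1f} nines")
```

Under those assumptions, three copies on modest 99% machines already beat a single 99.999% server, which is the intuition behind moving reliability up to the datacenter layer.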
Big data
It’s such an overused term. It’s difficult to believe the term barely existed a few years ago. The gift of Hadoop® to the industry – an open source attempt to copy Google® MapReduce and Google File System – has truly changed our world unbelievably quickly. Today, Hadoop and the other big data applications enable search, analytics, advertising, peta-scale reliable file systems, genomics research and more – even services like Apple® Siri run on Hadoop. Big data has changed the concept of analytics from statistical sampling to analysis of all data. And it has already enabled breakthroughs and changes in research, where relationships and patterns are looked for empirically, rather than based on theories.
Overall, I think big data has been one of the most transformational technologies this century. Big data has changed the focus from compute to storage as the primary enabler in the datacenter. Our embedded hard disk controllers, SAS (Serial Attached SCSI) host bus adaptors and RAID controllers have been at the heart of this evolution. The next evolutionary step in big data is the broad adoption of graph analysis, which integrates the relationship of data, not just the data itself.
CEO & CIO: Due to cloud computing, mobile connectivity and big data, the traditional IT ecosystem or industrial chain is changing. What are the three most important changes in LSI’s current cooperation with the ecosystem chain? How does LSI see the changes in the various links of the traditional ecosystem chain? What new links are worth attention? Please give some examples.
Cloud computing and the explosion of data driven by mobile devices and media have changed, and continue to change, our industry and ecosystem contributors dramatically. It’s true the enterprise market (customers, OEMs, technology, applications and use cases) has been pretty stable for 10-20 years, but as cloud computing has become a significant portion of the server market, it has increasingly affected ecosystem suppliers like LSI.
Timing: It’s no longer enough to follow Intel’s ticktock product roadmap. Development cycles for datacenter solutions used to be 3 to 5 years. But these cycles are becoming shorter. Now, demand for solutions is closer to 6 months – forcing hardware vendors to plan and execute to far tighter development cycles. Hyperscale datacenters also need to be able to expand resources very quickly, as customer demand dictates. As a result they incorporate new architectures, solutions and specifications out of cycle with the traditional Intel roadmap changes. This has also disrupted the ecosystem.
End customers: Hyperscale datacenters now have purchasing power in the ecosystem, with single purchase orders sometimes amounting to 5% of the server market. While OEMs still are incredibly important, they are not driving large-scale deployments or innovating and evolving nearly as fast. The result is more hyperscale design-win opportunities for component or sub-system vendors if they offer something unique or a real solution to an important problem. This also may shift profit pools away from OEMs to strong, nimble technology solution innovators. It also has the potential to reduce overall profit pools for the whole ecosystem, which is a potential threat to innovation speed and re-investment.
New players: Traditionally, a few OEMs and ISVs globally have owned most of the datacenter market. However, the supply chain of the hyperscale cloud companies has changed that. Leading datacenters have architected, specified or even built (in Google’s case) their own infrastructure, though many large cloud datacenters have been equipped with hyperscale-specific systems from Dell and HP. But more and more systems built exactly to datacenter specifications are coming from suppliers like Quanta. Newer network suppliers like Arista have increased market share. Some new hyperscale solution vendors have emerged, like Nebula. And software has shifted to open source, sometimes supported for-pay by companies copying the Red Hat® Linux model – companies like Cloudera, Mirantis or United Stack. Personally, I am still waiting for the first 3rd-party hardware service emulating a Linux support and service company to appear.
Open initiatives: Yes, we’ve seen Hadoop and its derivatives deployed everywhere now – even in traditional industries like oil and gas, pharmacology, genomics, etc. And we’ve seen the emergence of open-source alternatives to traditional databases being deployed, like Cassandra. But now we’re seeing new initiatives like Open Compute and OpenStack. Sure, these are helpful to hyperscale datacenters, but they are also enabling smaller companies and universities to deploy hyperscale-like infrastructure and get the same kind of automated control, efficiency and cost structures that hyperscale datacenters enjoy. (Of course they don’t get fully there on any front, but it’s a lot closer.) This trend has the potential to hurt OEM and ISV business models and markets and establish new entrants – even as we see Quanta, TYAN, Foxconn, Wistron and others tentatively entering the broader market through these open initiatives.
New architectures and new algorithms: There is a clear movement toward pooled resources (or rack scale architecture, or disaggregated servers). Developing pooled resource solutions has become a partnership between core IP providers like Intel and LSI with the largest hyperscale datacenter architects. Traditionally new architectures were driven by OEMs, but that is not so true anymore. We are seeing new technologies emerge to enable these rack-scale architectures (RSA) – technologies like silicon photonics, pooled storage, software-defined networks (SDN), and we will soon see pooled main memory and new nonvolatile main memories in the rack.
We are also seeing the first tries at new processor architectures about to enter the datacenter: ARM 64 for cool/cold storage and web tier and OpenPower P8 for high power processing – multithreaded, multi-issue, pooled memory processing monsters. This is exciting to watch. There is also an emerging interest in application acceleration: general-purpose computing on graphics processing units (GPGPU), regular expression (regex) processors, live stream analytics, etc. We are also seeing the first generation of graph analysis deployed at massive scale in real time.
Innovation: The pace of innovation appears to be accelerating, although maybe I’m just getting older. But the easy gains are done. On one hand, datacenters need exponentially more compute and storage, and they need to operate 10x to 1000x more quickly. On the other, memory, processor cores, disks and flash technologies are getting no faster. The only way to fill that gap is through innovation. So it’s no surprise there are lots of interesting things happening at OEMs and ISVs, chip and solution companies, as well as open source community and startups. This is what makes it such an interesting time and industry.
Consumption shifts: We are seeing a decline in laptop and personal computer shipments, a drop that naturally is reducing storage demand in those markets. Laptops are also seeing a shift to SSD from HDD. This has been good for LSI, as our footprint in laptop HDDs had been small, but our presence in laptop SSDs is very strong. Smart phones and tablets are driving more cloud content, traffic and reliance on cloud storage. We have seen a dramatic increase in large HDDs for cloud storage, a trend that seems to be picking up speed, and we believe the cloud HDD market will be very healthy and will see the emergence of new, cloud-specific HDDs that are radically different and specifically designed for cool and cold storage.
There is also an explosion of SSD and PCIe flash cards in cloud computing for databases, caches, low-latency access and virtual machine (VM) enablement. Many applications that we take for granted would not be possible without these extreme low-latency, high-capacity flash products. But very few companies can make a viable storage system from flash at an acceptable cost, opening up an opportunity for many startups to experiment with different solutions.
Summary: So I believe the biggest hyperscale innovations are autonomous behavior and orchestration, HA at the datacenter level vs. machine level, and big data. These are radically changing the whole industry. And what are those changes for our industry and ecosystem? You name it: timing, end customers, new players, open initiatives, new architectures and algorithms, innovation, and consumption patterns. All that’s staying the same are legacy products and solutions.
These were great questions. Sometimes you need the press or someone outside the industry to ask a question that makes you step back and think about what’s going on. Great questions.
I am sitting in the terminal waiting for my flight home from – yes, you guessed it – China. I am definitely racking up frequent flier miles this year.
This trip ended up centering on resource pooling in the datacenter. Sure, you might hear a lot about disaggregation, but the consensus seems to be: that’s the wrong name (unless you happen to make standalone servers). For anyone else, it’s about a much more flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. I call it “resource pooling,” which is descriptive, but others simply call it rack scale architecture.
It’s been a long week, but very interesting. I was asked to keynote at the SACC conference (Systems Architect Conference China) in Beijing. It was also a great chance to meet 1-on-1 with the CTOs and chief architects from the big datacenters, and visit for a few hours with other acquaintances. I even had the chance to have dinner with the editor in chief of CEO & CIO magazine and CIOs from around Beijing. As always in life, if you’re willing to listen, you can learn a lot. And I did.
Thinking on disaggregation aligns
With CTOs, there was a lot of discussion about disaggregation in the datacenter. There is a lot of aligned thinking on the topic, and it’s one of those occasions where you had to laugh because I think any one of the CTOs keynoting could have given anyone else’s presentation. So what’s the big deal? Resource pooling and rack scale architecture.
I’ll use this trip as an excuse to dig a little deeper into my view on what this means.
First – you need to understand where these large datacenters are in their evolution. They usually have 4 to 6 platforms and 2 or 3 generations of each in the datacenter. That can be 18 different platforms (6 platforms times 3 generations) to manage, maintain, and tune. Worse – they have to plan 6 to 9 months in advance to deploy equipment. If you guess wrong, you’ve got a bunch of useless equipment, and you spent a bunch of money – the size of mistake that will get you fired… And even if you get it right, you’re left with the problem – Do I upgrade servers when the CPU is new? Or at, say, 18 months? Or do I wait until the biggest cost item – the drives – needs to be replaced in 4 or 5 years? That’s difficult math. So resource pooling is about lifecycle management of different types of components and sub-systems. You can optimally replace each resource on its own schedule.
Increasing resource utilization and efficiency
But it’s also about resource utilization and efficiency. Datacenters have multiple platforms because each platform needs a different configuration of resources. I use the term configuration on purpose. If you have storage in your server, it’s in some standard configuration – say, six 3 TByte drives, or 18 raw TBytes. Do you use all that capacity? Or do you leave some space so databases can grow? Of course you leave empty space. You might not even have any use for that much storage in that particular server – maybe you just use half the capacity. After all, it’s a standard configuration. What about disk bandwidth? Can your Hadoop node saturate 6 drives? Probably. It could probably use 12 or maybe even 24. But sorry – it’s a standard configuration. What about latency-sensitive databases? Sure, I can plug a PCIe card in, but I only have 1.6 TByte PCIe cards as my standard configuration. My database is 1.8 TBytes and growing. Sorry – you have to refactor and put it on 2 servers. Or my database is only 1 TByte. I’m wasting 600 GBytes of really expensive resource.
For network resources – the standard configuration gets exactly one 10GE port, maybe. You need more? Can’t have it. You don’t need that much? Sorry – wasted bandwidth capacity. What about standard memory? You either waste DRAM you don’t use, or you starve for more DRAM you can’t get.
But if I have pools of rack scale resources that I can allocate to a standard compute platform – well – that’s a different story. I can configure exactly the amount of network bandwidth, memory, flash high-performance storage, and disk bulk storage. I can even add more configured storage if a database grows, instead of being forced to refactor a database into shards across multiple standard configurations.
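The flash numbers above work out as follows. This is only a back-of-the-envelope sketch: the 1.6 TByte card and the two database sizes come from the example above, while the 100 GByte allocation granule for the pooled case is a made-up figure for illustration, as is the function name.

```python
def units_and_stranded(demand_gb: int, unit_gb: int) -> tuple[int, int]:
    """Return (units needed, capacity bought but unused) in GBytes.

    Capacity can only be bought in whole units of `unit_gb`.
    """
    if demand_gb < 0 or unit_gb <= 0:
        raise ValueError("demand must be >= 0 and unit size must be > 0")
    units = -(-demand_gb // unit_gb)  # ceiling division on integers
    stranded = units * unit_gb - demand_gb
    return units, stranded


CARD_GB = 1600     # the standard 1.6 TByte PCIe card from the text
GRANULE_GB = 100   # hypothetical pooled allocation step (an assumption)

if __name__ == "__main__":
    for db_gb in (1800, 1000):
        fixed = units_and_stranded(db_gb, CARD_GB)
        pooled = units_and_stranded(db_gb, GRANULE_GB)
        print(f"{db_gb} GB database: fixed cards {fixed}, pooled {pooled}")
```

With fixed cards, the 1.8 TByte database needs two cards (and so two servers) and strands 1.4 TBytes, and the 1 TByte database strands the 600 GBytes mentioned above. With a fine-grained pool, both strand little or nothing.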
Pooling resources = simplified operations
So the desire to pool resources is really as much about simplified operations as anything else. I can have standardized modules that are all “the same” to manage, but can be resource configured into a well-tailored platform that can even change over time.
But pooling is also about accommodating how the application architectures have changed, and how much more important dataflow is than compute for so much of the datacenter. As a result there is a lot of uncertainty about how parts of these rack scale architectures and interconnect will evolve, even as there is a lot of certainty that they will evolve, and they will include pooled resource “modules.” Whatever the overall case, we’re pretty sure we understand how the storage will evolve. And at a high level, that’s what I presented in my keynote. (Hey – I’m not going to publicly share all our magic!)
One storage architecture of pooled resources at the rack scale level. One storage architecture that combines boot management, flash storage for performance, and disk storage for efficient bandwidth and capacity. And those resources can be allocated however and whenever the datacenter manager needs them. And the existing software model doesn’t need to change. Existing apps, OS’s, file systems, and drivers are all supported, meaning a change to pooled resource rack scale deployments is de-risked dramatically. Overall, this one architecture simplifies the number of platforms, simplifies the management of platforms, utilizes the resources very efficiently, and simplifies image and boot management. I’m pretty sure it even reduces datacenter-level CapEx. I know it dramatically reduces OpEx.
Yeah – I know what you’re thinking – it’s awesome! (That’s what you thought – right?)
Oh – what about those CIO meetings? Well, there is tremendous pressure to not buy American IT equipment in China because of all the news from the Snowden NSA leaks. As most of the CIOs pointed out, though, in today’s global sourcing market, it’s pretty hard to not buy US IT equipment. So they’re feeling a bit trapped. In a no-risk profession, I suspect that means they just won’t buy anything for a year or so and hope it blows over.
But in general, yep, I think this trip was centered on resource pooling in the datacenter. Sure, you might hear about disaggregation, but there’s a lot of agreement that’s the wrong name. It’s much more about resource pooling for flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. And we aim to be right in the middle. Literally.