I am sitting in the terminal waiting for my flight home from – yes, you guessed it – China. I am definitely racking up frequent flier miles this year.
This trip ended up centering on resource pooling in the datacenter. Sure, you might hear a lot about disaggregation, but the consensus seems to be: that’s the wrong name (unless you happen to make standalone servers). For anyone else, it’s about a much more flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. I call it “resource pooling,” which is descriptive, but others simply call it rack scale architecture.
It’s been a long week, but very interesting. I was asked to keynote at the SACC conference (Systems Architect Conference China) in Beijing. It was also a great chance to meet 1-on-1 with the CTOs and chief architects from the big datacenters, and visit for a few hours with other acquaintances. I even had the chance to have dinner with the CEO /CIO China Magazine editor in chief, and CIO’s from around Beijing. As always in life, if you’re willing to listen, you can learn a lot. And I did.
Thinking on disaggregation aligns
With CTOs, there was a lot of discussion about disaggregation in the datacenter. There is a lot of aligned thinking on the topic, and it’s one of those occasions where you had to laugh because I think anyone of the CTOs keynoting could have given anyone else’s presentation. So what’s the big deal? Resource pooling and rack scale architecture.
I’ll use this trip as an excuse to dig a little deeper into my view on what this means.
First – you need to understand where these large datacenters are in their evolution. They usually have 4 to 6 platforms and2 or 3 generations of each in the datacenter. That can be 18 different platforms to manage, maintain, and tune. Worse – they have to plan 6 to 9 months in advance to deploy equipment. If you guess wrong, you’ve got a bunch of useless equipment, and you spent a bunch of money – the size of mistake that will get you fired… And even if you get it right, you’re left with the problem – Do I upgrade servers when the CPU is new? Or at, say, 18 months? Or do I wait until the biggest cost item – the drives – need to be replaced in 4 or 5 years? That’s difficult math. So resource pooling is about lifecycle management of different types of components and sub-systems. You can optimally replace each resource on its own schedule.
Increasing resource utilization and efficiency
But it’s also about resource utilization and efficiency. Datacenters have multiple platforms because each platform needs a different configuration of resources. I use the term configuration on purpose. If you have storage in your server, it’s in some standard configuration – say, 6 3 TByte drives or 18 raw TBytes. Do you use all that capacity? Or do you leave some space so databases can grow? Of course you leave empty space. You might not even have any use for that much storage in that particular server – maybe you just use half the capacity. After all, it’s a standard configuration. What about disk bandwidth? Can your Hadoop node saturate 6 drives? Probably. It could probably use 12 or maybe even 24. But sorry – it’s a standard configuration. What about latency-sensitive databases? Sure, I can plug a PCIe card in, but I only have 1.6 TByte PCIe cards as my standard configuration. My database is 1.8 TBytes and growing. Sorry – you have to refactor and put on 2 servers. Or my database is only 1 TByte. I’m wasting 600 GBytes of really expensive resource.
For network resources – the standard configuration gets maybe exactly 1 10GE port. You need more? Can’t have it. You don’t need that much? Sorry – wasted bandwidth capacity. What about standard memory? You either waste DRAM you don’t use, or you starve for more DRAM you can’t get.
But if I have pools of rack scale resources that I can allocate to a standard compute platform – well – that’s a different story. I can configure exactly the amount of network bandwidth, memory, flash high- performance storage, and disk bulk storage. I can even add more configured storage if a database grows, instead of being forced to refactor a database into shards across multiple standard configurations.
Pooling resources = simplified operations
So the desire to pool resources is really as much about simplified operations as anything else. I can have standardized modules that are all “the same” to manage, but can be resource configured into a well-tailored platform that can even change over time.
But pooling is also about accommodating how the application architectures have changed, and how much more important dataflow is than compute for so much of the datacenter. As a result there is a lot of uncertainty about how parts of these rack scale architectures and interconnect will evolve, even as there is a lot of certainty that they will evolve, and they will include pooled resource “modules.” Whatever the overall case, we’re pretty sure we understand how the storage will evolve. And at a high level, that’s what I presented in my keynote. (Hey – I’m not going to publicly share all our magic!)
One storage architecture of pooled resources at the rack scale level. One storage architecture that combines boot management, flash storage for performance, and disk storage for efficient bandwidth and capacity. And those resources can be allocated however and whenever the datacenter manager needs them. And the existing software model doesn’t need to change. Existing apps, OS’s, file systems, and drivers are all supported, meaning a change to pooled resource rack scale deployments is de-risked dramatically. Overall, this one architecture simplifies the number of platforms, simplifies the management of platforms, utilizes the resources very efficiently, and simplifies image and boot management. I’m pretty sure it even reduces datacenter-level CapEx. I know it dramatically reduces OpEx.
Yea – I know what you’re thinking – it’s awesome ! (That’s what you thought – right?)
Oh – what about those CIO meetings? Well, there is tremendous pressure to not buy American IT equipment in China because of all the news from the Snowden NSA leaks. As most of the CIO’s pointed out, though, in today’s global sourcing market, it’s pretty hard to not buy US IT equipment. So they’re feeling a bit trapped. In a no-risk profession, I suspect that means they just won’t buy anything for a year or so and hope it blows over.
But in general, yep, I think this trip was centered on resource pooling in the datacenter. Sure, you might hear about disaggregation, but there’s a lot of agreement that’s the wrong name. It’s much more about resource pooling for flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. And we aim to be right in the middle. Literally.