Back in the 1990s, a new paradigm was forced into space exploration. NASA faced big cost cuts. But grand ambitions for missions to Mars were still on its mind. The problem was it couldn't dream and spend big. So the NASA mantra became "faster, better, cheaper." The idea was that the agency could slash costs while still carrying out a wide variety of programs and space missions. This led to some radical rethinks, and some fantastically successful programs that had very outside-the-box solutions. (Bouncing Mars landers, anyone?)
That probably sounds familiar to any IT admin. And that spirit is alive at LSI's AIS – the Accelerating Innovation Summit, our annual gathering of customers and industry pros, coming up Nov. 20-21 in San Jose. Like the people at Mission Control, they all want to make big things happen... without spending too much.
Take technology and line of business professionals. They need to speed up critical business applications. A lot. Or IT staff for enterprise and mobile networks, who must deliver more work to support the ever-growing number of users, devices and virtualized machines that depend on them. Or consider mega datacenter and cloud service providers, whose customers demand the highest levels of service, yet get that service for free. Or datacenter architects and managers, who need servers, storage and networks to run at ever-greater efficiency even as they grow capability exponentially.
(LSI has been working on many solutions to these problems, some of which I spoke about in this blog.)
It's all about moving data faster, better, and cheaper. If NASA could do it, we can too. In that vein, here's a look at some of the topics you can expect AIS to address around doing more work for fewer dollars:
And I think you'll find some astounding products, demos, proofs of concept and future solutions in the showcase too – not just from LSI but from partners and fellow travelers in this industry. Hey – that's my favorite part. I can't wait to see people's reactions.
Since it rethought how to do business in 2002, NASA has embarked on nearly 60 missions. Faster, better, cheaper. It can work here in IT too.
Tags: 12Gb/s SAS, AIS, big data analytics, cloud infrastructure, cloud services, datacenter, flash, flash memory, hyperscale datacenters, NAS, NASA, SAN, SDN, shareable DAS, software-defined networks, sub-20nm flash, triple-level cell flash, VDI, web 2.0
You may have noticed I'm interested in Open Compute. What you may not know is I'm also really interested in OpenStack. You're either wondering what the heck I'm talking about or nodding your head. I think these two movements are co-dependent. Sure, they can and will exist independently, but I think the success of each is tied to the other. In other words, I think they are two sides of the same coin.
Why is this on my mind? Well – I'm the lucky guy who gets to moderate a panel at LSI's AIS conference, with the COO of Open Compute, and the founder of OpenStack. More on that later. First, I guess I should describe my view of the two. The people running these open-source efforts probably have a different view. We'll find that out during the panel.
I view Open Compute as the very first viable open-source hardware initiative that general business will be able to use. It's not just about saving money for rack-scale deployments. It's about having interoperable, multi-source systems that have known, customer-malleable – even completely customized and unique – characteristics, including management. It also promises to reduce OpEx costs.
Ready for Prime Time?
But the truth is Open Compute is not ready for prime time yet. Facebook developed almost all the designs for its own use and gifted them to Open Compute, and they are mostly one or two generations old. And somewhere between 2,000 and 10,000 Open Compute servers have shipped. That's all. But it's a start.
More importantly though, it's still just hardware. There is still a need to deploy and manage the hardware, as well as distribute tasks and load balance a cluster of Open Compute infrastructure. That's a very specialized capability, and there really aren't that many people who can do that. And the hardware is so bare bones – with specialized enclosures, cooling, etc. – that it's pretty hard to deploy in small amounts. You really want to deploy at scale – thousands. If you're deploying a few servers, Open Compute probably isn't for you for quite some time.
I view OpenStack in a similar way. It's also not ready for prime time. OpenStack is an orchestration layer for the datacenter. You hear about the "software-defined datacenter." Well, this is it – at least one version. It pools the resources (compute, object and block storage, network, and memory at some point in the future), presents them, allows them to be managed in a semi-automatic way, and automates deployment of tasks on the scaled infrastructure. Sure, there are some large-scale deployments. But it's still pretty tough to deploy at large scale. That's because it needs to be tuned and tailored to specific hardware. In fact, the biggest datacenters in the world mostly use their own orchestration layer. So that means today OpenStack is really better at smaller deployments, like 50, 100 or 200 server nodes.
The synergy – 2 sides of the same coin
You'll probably start to see the synergy. Open Compute needs management and deployment. OpenStack prefers known, homogeneous hardware, or else it's not so easy to deploy. So there is a natural synergy between the two. It's interesting too that some individuals are working on both... Ultimately, the two Open initiatives will meet in the big, but not-too-big (many hundreds to small thousands of servers) deployments in the next few years.
And then of course there is the complexity of the interaction of for-profit companies and open-source designs and distributions. Companies are trying to add to the open standards – sometimes to the betterment of the standards, but sometimes in irrelevant ways. Several OEMs are jumping in to mature and support OpenStack. And many ODMs are working to make Open Compute more mature. And some companies are trying to accelerate the maturity and adoption of the technologies in pre-configured solutions or appliances. What's even more interesting are the large customers – guys like Wall Street banks – that are working to make them both useful for deployment at scale. These won't be the only way scaled systems are deployed, but they're going to become very common platforms for scale-out or grid infrastructure for utility computing.
Here is how I charted the ecosystem last spring. There's not a lot of direct interaction between the two, and I know there are a lot of players missing. Frankly, it's getting crazy complex. There has been an explosion of players, and I've run out of space, so I've just not gotten around to updating it. (If anyone engaged in these ecosystems wants to update it and send me a copy, I'd be much obliged! Maybe you guys at Nebula?)
An AIS keynote panel – what?
Which brings me back to that keynote panel at AIS. Every year LSI has a conference that's by invitation only (sorry). It's become a pretty big deal. We have some very high-profile keynotes from industry leaders. There is a fantastic tech showcase of LSI products, partner and ecosystem companies' products, and a good mix of proofs of concept, prototypes and what-if products. And there are a lot of breakout sessions on industry topics, trends and solutions. Last year I personally escorted an IBM fellow, Google VPs, Facebook architects, bank VPs, Amazon execs, flash company execs, several CTOs, some industry analysts, database and transactional company execs...
It's a great place to meet and interact with peers if you're involved in the datacenter, network or cellular infrastructure businesses. One of the keynotes is actually a panel of two: the COO of Open Compute, Cole Crawford, and the co-founder of OpenStack, Chris Kemp (who is also the founder and CSO of Nebula). Both of them are very smart, experienced and articulate, and deeply involved in these movements. It should be a really engaging, interesting keynote panel, and I'm lucky enough to have a front-row seat. I'll be the moderator, and I'm already working on questions. If there is something specific you would like asked, let me know, and I'll try to accommodate you.
You can see more here.
Yea – I'm very interested in Open Compute and OpenStack. I think these two movements are co-dependent. And I think they are already changing our industry – even before they are ready for real large-scale deployment. Sure, they can and will exist independently, but I think the success of each is tied to the other. The people running these open-source efforts might have a different view. Luckily, we'll get to find out what they think next month... And I'm lucky enough to have a front-row seat.
Optimizing the work per dollar spent is a high priority in datacenters around the world. But there aren't many ways to accomplish that. I'd argue that integrating flash into the storage system drives the best – sometimes most profound – improvement in the cost of getting work done.
Yea, I know work/$ is a US-centric metric, but replace the $ with your favorite currency. The principle remains the same.
I had the chance to talk with one of the execs responsible for Google's infrastructure last week. He talked about how his fundamental job was improving performance/$. I asked about that, and he explained "performance" as how much work an application could get done. I asked if work/$ at the application was the same, and he agreed – yes – pretty much.
Remember how, as a kid, you brought along a big brother as authoritative backup? OK – so my big brother Google and I agree – you should be trying to optimize your work/$. Why? Well – it could be to spend less, or to do more with the same spend, or to do things you could never do before, or simply to cope with the non-linear expansion in IT demands even as budgets are shrinking. Hey – that's the definition of improving work/$... (And as a bonus, if you do it right, you'll have a positive green impact that is bound to be worth brownie points.)
Here's the point. Processors are no longer scaling the way they used to – sure, there are more threads, but not all applications can use all those threads. Systems are becoming harder to balance for efficiency. And often storage is the bottleneck, especially for any application built on a database. So sure – you can get a 5% or 10% gain, or in the extreme even a 100% gain, in application work done by a server if you're willing to pay enough and upgrade all aspects of the server: processors, memory, network... But it's almost impossible to increase the work of a server or application by 200%, 300% or 400% – for any money.
I'm going to explain how and why you can do that, and what you get back in work/$. So much back that you'll probably be spending less and getting more done. And I'm going to explain how, even if you're risk-averse, you can avoid risk and still get the improvements.
More work/$ from general-purpose DAS servers and large databases
Let me start with a customer. It's a bank, and it likes databases. A lot. And it likes large databases even more. So much so that it needs disks to hold the entire database. Using an early version of an LSI Nytro™ MegaRAID® card, it got 6x the work from the same individual node and database license. You can read that as 600% if you want. It's big. To be fair, that early version had much more flash than our current products, and was much more expensive. Our current products give much closer to a 3x-4x improvement. Again, you can think of that as 300%-400%. Slap a Nytro MegaRAID into your server and it's going to do the work of 3 to 4 servers. I just did a web search and, depending on configuration, Nytro MegaRAIDs are $1,800 to $2,800 online. I don't know about you, but I would have a hard time buying 2 to 3 configured servers plus software licenses for that little, but that's the net effect of this solution. It's not about faster (although you get that). It's about getting more work/$.
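If you want to sanity-check that claim against your own numbers, the arithmetic is simple. Here's a minimal sketch: the 3x work multiplier and the card price come from the paragraph above, while the server and database license costs are hypothetical placeholders you should replace with your own.

```python
# Rough work-per-dollar comparison for the DAS example above.
# The card price comes from the post; the server and database license
# costs are hypothetical placeholders - plug in your own figures.

def work_per_dollar(work_units, total_cost):
    """Work/$ = application work done divided by total spend."""
    return work_units / total_cost

server_cost = 8_000        # assumed cost of one configured server
db_license_cost = 15_000   # assumed per-server database license
nytro_cost = 2_800         # high end of the quoted card price range

# Baseline: 3 servers, 3 licenses, 3 units of work.
baseline = work_per_dollar(3, 3 * (server_cost + db_license_cost))

# Alternative: 1 server + 1 license + 1 card doing ~3x the work.
with_card = work_per_dollar(3, server_cost + db_license_cost + nytro_cost)

print(f"baseline work/$:  {baseline:.6f}")
print(f"with card work/$: {with_card:.6f}")
print(f"improvement:      {with_card / baseline:.1f}x")   # ~2.7x with these inputs
```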
But you also want to feel safe – to know you're absolutely minimizing risk. OK. Nytro MegaRAID is a MegaRAID card. That's overwhelmingly the most common RAID controller in the world; it's used by 9 of the top 10 OEMs and protects tens to hundreds of millions of disks every day. The Nytro version adds private flash caching on the card and stores hot reads and writes there. Writes to the cache use a RAID 1 pair, so if a flash module dies, you're protected. If flash blocks or chip die wear out, the bad blocks are removed from the cache pool and the cache shrinks by that much, but everything keeps operating – it's not like a normal LUN that can't change size. What's more, flash blocks usually finally wear out during the erase cycle, so no data is lost. And as a bonus, you can eliminate the traditional battery most RAID cards use – the embedded flash covers that – so no more annual battery service is needed. This is a solution that will continue to improve work/$ for years and years, all the while getting 3x-4x the work from that server.
More work/$ from SAN-attached servers (without actually touching the SAN)
That example was great – but maybe you don't use DAS systems. Instead, you use a big iron SAN. (OK, not all SANs are big iron, but I like the sound of that expression.) There are a few ways to improve the work from servers attached to SANs. The easiest, of course, is to upgrade the SAN head, usually with a flash-based cache in the SAN controller. This works, and sometimes is "good enough" to cover needs for a year or two. However, the server still needs to reach across the SAN to access data, and it's still forced to interact with other servers' IO streams in deeper queues. That puts a hard limit on the possible gains.
Nytro XD caches hot data in the server. It works with virtual machines. It intercepts storage traffic at the block layer – the same place LSI's drivers have always been. If the data isn't hot, and isn't cached, it simply passes the traffic through to the SAN. I say this so you understand – it doesn't actually touch the SAN. No risk there. More importantly, the hot storage traffic never has to be squeezed through the SAN fabric, and it doesn't get queued in the SAN head. In other words, it makes the storage really, really fast.
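To make that pass-through behavior concrete, here's a toy sketch of the block-layer decision just described: hot blocks get served from server-side flash, everything else goes straight to the SAN. This is purely illustrative – it is not LSI's driver code, the `san` object is a stand-in for the backing volume, and the simple access-count promotion policy is an assumption.

```python
# Toy model of a server-side, block-layer flash cache in front of a SAN.
class BlockCache:
    def __init__(self, san, flash_capacity_blocks, hot_threshold=3):
        self.san = san                      # backing SAN volume (pass-through target)
        self.capacity = flash_capacity_blocks
        self.hot_threshold = hot_threshold
        self.access_counts = {}             # lba -> access count
        self.cache = {}                     # lba -> data held in server-side flash

    def read(self, lba):
        if lba in self.cache:               # cache hit: no SAN round trip
            return self.cache[lba]
        data = self.san.read(lba)           # cold data: pass straight through to the SAN
        self._maybe_promote(lba, data)
        return data

    def write(self, lba, data):
        self.san.write(lba, data)           # the SAN stays authoritative
        if lba in self.cache:
            self.cache[lba] = data          # keep any cached copy coherent

    def _maybe_promote(self, lba, data):
        self.access_counts[lba] = self.access_counts.get(lba, 0) + 1
        if (self.access_counts[lba] >= self.hot_threshold
                and len(self.cache) < self.capacity):
            self.cache[lba] = data          # block is hot enough: keep it in flash
```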
We've typically found work from a server can increase 5x to 10x, and that's been verified by independent reviewers. What's more, the Nytro XD solution only costs around 4x the price of a high-end SAN NIC. It's not cheap, but it's way cheaper than upgrading your SAN arrays, it's way cheaper than buying more servers, and it's proven to let you get far more work from your existing infrastructure. When you need to get more work – way more work – from your SAN, this is a really cost-effective approach. Seriously – how else would you get 5x-10x more work from your existing servers and software licenses?
More work/$ from databases
A lot of hyperscale datacenters are built around databases of a finite size. That may be 1, 2 or even 4 TBytes. If you use Apple's online services for iTunes or iCloud, or if you use Facebook, you're using this kind of infrastructure.
If your datacenter has a database that can fit within a few TBytes (or less), you can use the same approach. Move the entire LUN into a Nytro WarpDrive® card, and you will get 10x the work from your server and database software. It makes such a difference that some architects argue Facebook and Apple cloud services would never have been possible without this type of solution. I don't know, but they're probably right. You can buy a Nytro WarpDrive for as little as a low-end server. I mean low-end. But it will give you the work of 10. If you have a fixed-size database, you owe it to yourself to look into this one.
More work/$ from virtualized and VDI (Virtual Desktop) systems
Virtual machines are installed on a lot of servers, for very good reason. They help improve the work/$ in the datacenter by reducing the number of servers needed and thereby reducing management, maintenance and power costs. But what if they could be made even more efficient?
Wall Street banks have benchmarked virtual desktops. They found that Nytro products deliver these results: support for 2x the virtual desktops, 33% improvement in boot time during boot storms, and 33% lower cost per virtual desktop. In a more general application mix, Nytro increases work per server 2x-4x. And it also gives 2x the performance for virtual storage appliances.
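A quick back-of-envelope shows how doubling desktop density can translate into roughly a third lower cost per desktop once the card is added. All the dollar figures below are hypothetical assumptions; only the 2x density claim comes from the benchmarks above.

```python
# Hypothetical VDI cost-per-desktop check. Only the 2x density figure is
# from the post; every dollar amount here is an assumption.
server_cost = 10_000          # assumed cost of one VDI host
per_desktop_sw_cost = 20      # assumed per-desktop licensing/overhead
nytro_cost = 2_500            # assumed accelerator card cost

def cost_per_desktop(host_cost, desktops):
    return host_cost / desktops + per_desktop_sw_cost

before = cost_per_desktop(server_cost, 75)                 # baseline density
after = cost_per_desktop(server_cost + nytro_cost, 150)    # 2x density with the card

print(f"before: ${before:.0f}/desktop, after: ${after:.0f}/desktop")
print(f"reduction: {100 * (1 - after / before):.0f}%")     # ~33% with these inputs
```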
While that's not as great as 10x the work, it's still a real work/$ value that's hard to ignore. And it's the same reliable MegaRAID infrastructure that's the backbone of enterprise DAS storage.
A real example from our own datacenter
Finally – a great example of getting far more work/$ was an experiment our CIO, Bruce Decock, did. We use a lot of servers to fuel our chip-design business. We tape out a lot of very big, leading-edge-process chips every year. Hundreds. And that takes an unbelievable amount of processing to get what we call "design closure" – that is, a workable chip that will meet performance requirements and yield. We use a tool called PrimeTime that figures out timing for every signal on the chip across different silicon process points and operating conditions. There are tens to hundreds of millions of signals. And we run every active design – tens to hundreds of chips – each night so we can see how close we're getting, and we make multiple runs per chip. That's a lot of computation... The thing is, electronic CAD has been designed to try not to use storage, or it will never finish – just /tmp space. But CAD does use huge amounts of memory for its data structures, and that means swap space on the order of TBytes. These CAD tools usually don't need to run faster: they run overnight, and results are ready when the engineers come in the next day. These are impressive machines: 384G or 768G of DRAM and 32 threads. How do you improve work/$ in that situation? What did Bruce do?
He put LSI Nytro WarpDrives in the servers and pointed /tmp at the WarpDrives. Yep. Pretty complex. I don't think he even had to install new drivers – they're already in the latest OS distributions. Anyway – like I said – complex.
The result? WarpDrive allowed the machines to fully use the CPU and memory with no I/O contention. With WarpDrive, the PrimeTime jobs for static timing closure of a typical design could be done on 15 machines instead of 40. That's each Nytro node doing about 260% of the work of a normal node and license. Remember – those are expensive machines (have you priced 768G of DRAM, and do you know how much specialized electronic design CAD licenses cost?). So the point wasn't to execute faster. That's not necessary. The point was to use fewer servers to do the work. In this case we could do 11 runs per server per night instead of just 4. A single chip design needs more than 150 runs in one night.
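Here's that consolidation math as a quick check. The run counts come from the paragraph above; the 15-vs-40 and 260% figures quoted in the post presumably include some headroom, so the rounding differs slightly.

```python
import math

# Consistency check on the PrimeTime consolidation numbers quoted above.
runs_needed = 150          # runs for one chip design in one night (from the post)
runs_per_node_before = 4   # runs per server per night, /tmp on disk
runs_per_node_after = 11   # runs per server per night, /tmp on a WarpDrive

nodes_before = math.ceil(runs_needed / runs_per_node_before)  # 38 servers (post says ~40)
nodes_after = math.ceil(runs_needed / runs_per_node_after)    # 14 servers (post says ~15)

print(nodes_before, "->", nodes_after, "servers")
print(f"work per node: {runs_per_node_after / runs_per_node_before:.0%}")  # ~275%
```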
To be clear, the Nytro WarpDrives are a lot less expensive than the servers they displace. And the savings go beyond that – less power and cooling, lower maintenance, less admin time and overhead, fewer licenses. That's definitely improved work/$ for years to come. Those Nytro cards are part of our standard flow, and they should probably be part of every chip company's design flow.
So you can improve work/$ no matter the application, no matter your storage model, and no matter how risk-averse you are.
Optimizing the work per dollar spent is a high – maybe the highest – priority in datacenters around the world. And just to be clear – Google agrees with me. There aren't many ways to accomplish that improvement, and almost no ways to dramatically improve it. I'd argue that integrating flash into the storage system is the best – sometimes most profound – improvement in the cost of getting work done. Not so much the performance, but the actual work done for the money spent. And it ripples through the datacenter, from original CapEx, to licenses, maintenance, admin overhead, power and cooling, and floor space, for years. That's a pretty good deal. You should look into it.
For those of you who are interested, I already wrote about flash in these posts:
What are the driving forces behind going diskless?
LSI is green – no foolin'
Tags: Bruce Decock, DAS, datacenter, direct attached storage, enterprise IT, flash, Google, hyperscale datacenter, Nytro MegaRAID, Nytro WarpDrive, Nytro XD, PrimeTime, RAID, SAN, server storage, storage area network, VDI, virtual desktop infrastructure, work per dollar
I am sitting in the terminal waiting for my flight home from – yes, you guessed it – China. I am definitely racking up frequent flier miles this year.
This trip ended up centering on resource pooling in the datacenter. Sure, you might hear a lot about disaggregation, but the consensus seems to be that that's the wrong name (unless you happen to make standalone servers). For everyone else, it's about a much more flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. I call it "resource pooling," which is descriptive, but others simply call it rack scale architecture.
It's been a long week, but very interesting. I was asked to keynote at the SACC conference (Systems Architect Conference China) in Beijing. It was also a great chance to meet 1-on-1 with the CTOs and chief architects from the big datacenters, and visit for a few hours with other acquaintances. I even had the chance to have dinner with the editor in chief of CEO/CIO China Magazine and CIOs from around Beijing. As always in life, if you're willing to listen, you can learn a lot. And I did.
Thinking on disaggregation aligns
With the CTOs, there was a lot of discussion about disaggregation in the datacenter. There is a lot of aligned thinking on the topic, and it's one of those occasions where you had to laugh, because I think any one of the CTOs keynoting could have given anyone else's presentation. So what's the big deal? Resource pooling and rack scale architecture.
I'll use this trip as an excuse to dig a little deeper into my view of what this means.
First – you need to understand where these large datacenters are in their evolution. They usually have 4 to 6 platforms and 2 or 3 generations of each in the datacenter. That can be 18 different platforms to manage, maintain, and tune. Worse – they have to plan 6 to 9 months in advance to deploy equipment. If you guess wrong, you've got a bunch of useless equipment, and you spent a bunch of money – the size of mistake that will get you fired... And even if you get it right, you're left with the problem: Do I upgrade servers when the CPU is new? Or at, say, 18 months? Or do I wait until the biggest cost item – the drives – needs to be replaced in 4 or 5 years? That's difficult math. So resource pooling is about lifecycle management of different types of components and sub-systems. You can optimally replace each resource on its own schedule.
Increasing resource utilization and efficiency
But it's also about resource utilization and efficiency. Datacenters have multiple platforms because each platform needs a different configuration of resources. I use the term configuration on purpose. If you have storage in your server, it's in some standard configuration – say, six 3-TByte drives, or 18 raw TBytes. Do you use all that capacity? Or do you leave some space so databases can grow? Of course you leave empty space. You might not even have any use for that much storage in that particular server – maybe you just use half the capacity. After all, it's a standard configuration. What about disk bandwidth? Can your Hadoop node saturate 6 drives? Probably. It could probably use 12 or maybe even 24. But sorry – it's a standard configuration. What about latency-sensitive databases? Sure, I can plug a PCIe card in, but I only have 1.6-TByte PCIe cards as my standard configuration. My database is 1.8 TBytes and growing. Sorry – you have to refactor and split it across 2 servers. Or my database is only 1 TByte, and I'm wasting 600 GBytes of a really expensive resource.
For network resources, the standard configuration gives you exactly one 10GE port. You need more? Can't have it. You don't need that much? Sorry – wasted bandwidth capacity. What about standard memory? You either waste DRAM you don't use, or you starve for DRAM you can't get.
But if I have pools of rack scale resources that I can allocate to a standard compute platform – well – that's a different story. I can configure exactly the amount of network bandwidth, memory, high-performance flash storage, and bulk disk storage. I can even add more configured storage if a database grows, instead of being forced to refactor the database into shards across multiple standard configurations.
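Here's a toy model of what "configure logically, not physically" might look like: carve a logical node out of rack-level pools instead of buying a fixed box. The pool sizes and the allocation request are made-up numbers, and this isn't any particular vendor's rack scale API.

```python
# Illustrative sketch of allocating a logical node from rack-level resource pools.
rack_pool = {
    "disk_tb": 720,      # bulk disk pool in the rack
    "flash_tb": 48,      # performance flash pool
    "dram_gb": 8192,     # pooled memory (a future step, per the post)
    "net_gbps": 640,     # aggregate network bandwidth
}

def allocate(pool, **request):
    """Reserve resources for one logical node, failing if the pool is short."""
    for resource, amount in request.items():
        if pool.get(resource, 0) < amount:
            raise ValueError(f"not enough {resource} left in the rack")
    for resource, amount in request.items():
        pool[resource] -= amount
    return dict(request)

# A database node that needs a 1.8 TB flash LUN - no refactoring onto two
# servers, and no stranded 600 GB if the database happens to be smaller.
db_node = allocate(rack_pool, flash_tb=1.8, disk_tb=6, dram_gb=256, net_gbps=20)
print(db_node)
print(rack_pool)   # what's left for the next logical node
```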
Pooling resources = simplified operations
So the desire to pool resources is really as much about simplified operations as anything else. I can have standardized modules that are all "the same" to manage, but that can be resource-configured into a well-tailored platform that can even change over time.
But pooling is also about accommodating how application architectures have changed, and how much more important dataflow is than compute for so much of the datacenter. As a result, there is a lot of uncertainty about how parts of these rack scale architectures and interconnects will evolve, even as there is a lot of certainty that they will evolve, and that they will include pooled resource "modules." Whatever the overall case, we're pretty sure we understand how the storage will evolve. And at a high level, that's what I presented in my keynote. (Hey – I'm not going to publicly share all our magic!)
One storage architecture of pooled resources at the rack scale level. One storage architecture that combines boot management, flash storage for performance, and disk storage for efficient bandwidth and capacity. And those resources can be allocated however and whenever the datacenter manager needs them. And the existing software model doesn't need to change. Existing apps, OSes, file systems, and drivers are all supported, meaning a change to pooled resource rack scale deployments is de-risked dramatically. Overall, this one architecture simplifies the number of platforms, simplifies the management of platforms, utilizes the resources very efficiently, and simplifies image and boot management. I'm pretty sure it even reduces datacenter-level CapEx. I know it dramatically reduces OpEx.
Yea – I know what you're thinking – it's awesome! (That's what you thought – right?)
Oh – what about those CIO meetings? Well, there is tremendous pressure not to buy American IT equipment in China because of all the news from the Snowden NSA leaks. As most of the CIOs pointed out, though, in today's global sourcing market it's pretty hard not to buy US IT equipment. So they're feeling a bit trapped. In a no-risk profession, I suspect that means they just won't buy anything for a year or so and hope it blows over.
But in general, yep, I think this trip was centered on resource pooling in the datacenter. Sure, you might hear about disaggregation, but there's a lot of agreement that's the wrong name. It's much more about resource pooling for flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. And we aim to be right in the middle. Literally.
Have you ever seen the old BBC TV show "Connections"? It's a little old now, but I loved how it followed threads through time, and I marveled at the surprising historical depth of important "inventions." I think we need to remember that as engineers and technologists. We get caught up in the short-term tactical delivery of technology. We don't see the sometimes immense ripples our work makes in society – even years later.
I got a flurry of emails yesterday arranging an anniversary get-together in August at the Apple campus. Why? It's the 20th anniversary of the Newton. OK – so this has nothing to do with LSI really, but it does have a lot to do with our everyday lives. More than you think.
So you either know the Newton and think it was a failure (think Trudeau's famous handwriting cartoon), or you don't and you're wondering what the *bleep* I'm talking about. Sometimes things that don't seem very significant early on end up having profound consequences. And I admit, the Newton was a failure: too expensive, not quite good enough, and the world couldn't yet grasp the concept of a general-purpose computer in your hand.
But oh – you could smell the future and get a tantalizing hint of what it would be. Remember – we're talking 1993 here.
First – why does Rob Ober care? It's personal. While I didn't remotely help create the Newton, I did help bring it to market, mature the technology, and set the stage for the future (well – it's not the future any more – it's now). I was at Apple wrapping up the creation of the PowerPC processor and architecture, and the first Power Macs. I have a great memory from around that time of getting the first Power Mac booted. Someone had the great idea of running the beta 68K emulator (to run standard Mac stuff). That was great, it worked, and then someone else said – wait – I have an Apple II emulator for the 68K Mac. So we had the very first PowerPC Mac running 68K code as a Mac to emulate a 6502 as an Apple II... and we played for hours. I also have a very clear memory of that PowerPC Mac standing shoulder-to-shoulder with the Robotron game in the Valley Green 5 building break room. It was a state-of-the-art video game and looked like this.
Yea, that shows you it was a while ago. (But it was a good game.)
A guy named Shane Robison (yea, the same one who became HP CTO and is now CEO of FusionIO) pulled me over to come fix some things on the super-hush Newton program. In the end, I took over responsibility for the processors, custom chips, communication stacks and hardware, plastics and tooling, display, touch screen, power supply, wireless, NiMH and Li-ion batteries... A lot. We pushed the limits of the state of the art on all those fronts. It was a really important, wonderful/terrible part of my career. I learned an amazing amount.
(If you're interested in viewing a Newton from today's perspective, there is a fascinating review here: http://techland.time.com/2012/06/01/newton-reconsidered/)
Let me start with some boring effects. We were using the ARM processor because of its low power. But. It wasn't perfect, and ARM itself was on the edge of insolvency. We invested a sizable chunk of money, and gave it guidance on how to transition from ARM6 to ARM7 to ARM9. ARM is alive today because of that, and the ARM9 is still in hundreds of millions of products. And we also worked with DEC to create the StrongARM processor family, which became XScale at Intel, then went to Marvell, and also bootstrapped Atom, and, and...
The Newton needed non-volatile storage. Disks were immense, expensive and power-hungry. A 2-1/2" disk? Didn't exist. 3-1/2" was small. The only remotely cost-effective technology was called NAND flash, which was fundamentally incompatible with program execution, nightmarish for data storage/retrieval, and unbelievably expensive per bit. I think the early Newtons were 8 MBytes? (That's mega, not giga...) The team figured out how to make that work. Yep – that was the first use of Toshiba NAND for program/data. (I've been playing with flash for storage ever since.)
Then some more interesting things...
I wired the Apple campus with wireless LAN base stations (it would be 6 years until WiFi, and 802.11 wasn't even dreamt up yet), built the wireless LAN receivers into Newtons, gave them to the Apple execs and set up their mail to be forwarded. You couldn't even do that on laptops. We could be anywhere on the campus and instantly receive and send emails. More – we could browse the (rudimentary) web. I also worked with RIM (yea – Research In Motion – Blackberry) and Metricom to use their wireless wide area network technology to give Newtons access to email and the web anywhere in the Bay Area. Quite a few times I was driving to meetings, wasn't sure where to go, so I pulled over, looked up the meeting in my Newton calendar, then checked the address in my browser with MapQuest. 1995. Sound familiar?
We also spent time with FedEx, pitching it on the idea of a Newton-based tablet to manage inventory (integrated bar code scanner), accept signatures on screen with tablet/pen (even the upside-down thing to hand it to the customer), show route maps, and cellularly send all that info back and forth for live tracking. FedEx was stunned by the concept. Sound familiar? I still have the proposal book with industrial designs in my garage. Yes, another Silicon Valley garage. Here's what FedEx rolled out 10 years later... which is ultimately pretty similar to our proposal.
And don't forget object programming. (You remember when OOPS was a high-tech term?) I'm not really a software guy – just not my thing – but I loved programming on the Newton. In 10 minutes you could actually bang out a useful, great-looking program. Personally, I think the world would have been way better off if those object libraries had been folded into the Java object library. Even so, I get a nostalgic feeling when I do iOS programming.
I even built a one-off proto that had cellphone guts inside the plastic of the Newton. (OK – it was chunky, but the smallest phones at the time were HUGE.) I could make phone calls from the contacts or calendar or emails, send and receive SMS messages, and send rudimentary MMS messages before there was such a thing – used just like a very overweight iPhone (OK – more like the big Samsung Galaxy phones). I could even, in a pinch, do data over the GSM network – email, web, etc. It was around that time Nokia came calling and asked about our UI, our OS, our ability to use data over the GSM network... Those talks fell apart, but it was serious enough that I made trips to Nokia's mothership in Helsinki and Tampere a few times. (That's north even for a Canadian boy...)
And then years later I got a phone call from one of the key people at Apple – Mike Culbert (who, sadly, recently passed away) – to ask about cellular/baseband chipsets and solutions. He knew I knew the technology. I introduced him to my friends at Infineon (now Intel Mobile) for a discussion on a mystery project... Those parts ended up in the iPhone. A lot of the same people and technology, just way more advanced...
iPad? Sure. A lot of the same people were involved in a Newton that never saw the light of day: the BIC. Here it is with the iPad. Again – 15 years apart.
And you remember the $100 laptop (OLPC)? As a founding board member, I brought an eMate kids' Newton laptop to show the team early on. And of course the debate on disk vs. flash followed the same path as it had in Newton. Here they are together, separated by more than 10 years. And then of course, OLPC has direct genetic parentage of netbooks, which then led to Ultrabooks... (Did you know at one point Apple was considering joining OLPC and offering Darwin/OSX as the OS? Didn't last long.)
And then there are the people. Off the top of my head, there were founders or key movers of Palm, Xbox, Kindle, Hotmail, Yahoo, Netscape, Android, WebTV (think most set-top boxes), the Danger phone (you remember the Sidekick?), Evernote, Mercedes research and a bunch of others. And some friends who became well-known VCs. And I still have a lot of super-talented friends from that time, many of whom are still at Apple.
Sometimes things that don't seem very significant have profound follow-on consequences. I think we need to remember that as engineers and technologists. We don't see the sometimes immense ripples our work makes in society – even years later. Today we're planting the seeds for all those great things in the future. I admit, the Newton was a failure, but oh – you could smell the future and get a tantalizing hint of what it would be. Remember – we're talking 1993 here.
Tags: 802.11, Android, Apple, Apple II, ARM, BIC, Blackberry, Darwin, DEC, eMate, Evernote, FedEx, FusionIO, Hotmail, HP, Intel, iPad, iPhone, Kindle, Marvell, Mercedes, Metricom, Mike Culbert, MMS, Netscape, Newton, Nokia, object programming, OLPC, Palm, Power Mac, PowerPC, Research in Motion, Robotron, Shane Robison, SMS, StrongARM, Toshiba, Ultrabook, Web TV, Wifi, Xbox, XScale, Yahoo
I've just been to China. Again. It's only been a few months since I was last there.
I was lucky enough to attend the 5th China Cloud Computing Conference at the China National Convention Center in Beijing. You probably have not heard of it, but it's an impressive conference. It's "the one" for the cloud computing industry. It was a unique view for me – more of an inside-out view of the industry. Everyone who's anyone in China's cloud industry was there. Our CEO, Abhi Talwalkar, had been invited to keynote the conference, so I tagged along.
First, the air was really hazy, but I don't think the locals considered it that bad. The US consulate iPhone app said the particulates were in the very unhealthy range. Imagine looking across the street. Sure, you can see the building there, but the next one? Not so much. Look up. Can you see past the 10th floor? No, not really. The building disappears into the smog. That's what it was like at the China National Convention Center, which is part of the same Olympics complex as the famous Bird's Nest stadium: http://www.cnccchina.com/en/Venues/Traffic.aspx
I had a fantastic chance to catch up with a university friend who has been living in Beijing since the '90s and is now a venture capitalist. It's amazing how almost 30 years can disappear and you pick up where you left off. He sure knows how to live. I was picked up in his private limo and whisked off to a very well-known restaurant across the city, where we had a private room and private waitress. We even had some exotic, special dishes that needed to be ordered at least a day in advance. Wow. But we broke Chinese tradition and had imported beer in honor of our Canadian education.
Sizing up China’s cloud infrastructure
The most unusual meeting I attended was an invitation-only session – the Sino-American roundtable on cloud computing. There were just about 40 people in a room – half from the US, half from China. Mostly what I learned is that the cloud infrastructure in China is fragmented, and probably sub-scale. And it's like that for a reason. It was difficult to understand at first, but I think I've made sense of it.
I started asking friends and consultants why, and got some interesting answers. Essentially, different regional governments are trying to capture the cloud "industry" in their locality, so they promote activity, and they promote the creation of new tools and infrastructure for it. Why reuse something that's open source and works if you don't have to, and you can create high-tech jobs? (That's sarcasm, by the way.) Many technologists I spoke with felt this will hold them back, and that they are probably 3-5 years behind the US. As well, each government-run industry specifies the datacenter and infrastructure needed to be a supplier or ecosystem partner with it, and each is different. The national train system has a different cloud infrastructure from the agriculture department, and from the shipping authority, etc... and if you do business with them – that is, you are part of their ecosystem of vendors – then you use their infrastructure. It all spells fragmentation and sub-scale. In contrast, the Web 2.0 / social media companies seem to be doing just fine.
Baidu was also showing off its open rack. It's an embodiment of the Scorpio V1 standard, which was jointly developed with Tencent, Alibaba and China Telecom. Baidu views this as a first experiment, and is looking forward to V2, which will be a much more mature system.
I was also lucky to have personal meetings with general managers, chief architects and effective CTOs of the biggest cloud companies in China. What did I learn? They are all at an inflexion point. Many of the key technologists have experience at American Web 2.0 companies, so they're able to evolve quickly, leveraging their industry knowledge. They're all working to build or grow their own datacenters, their own infrastructure. And they're aggressively expanding products, not just users, so they're getting a compound growth rate.
Here's a little of what I learned. In general, there is a trend to try to simplify infrastructure, harmonize divergent platforms, and deploy more infrastructure by spending less on each unit. (In general, they don't make as much per user as American companies, but they have more users.) As a result, they are more cost-focused than US companies. And they are starting to put more emphasis on operational simplicity in general. As one GM described it to me: "Yes, techs are inexpensive in China for maintenance, but more often than not they make mistakes that impact operations." So we (LSI) will be focusing more on simplifying management and maintenance for them.
Baidu's biggest Hadoop cluster is 20k nodes. I believe that's as big as Yahoo's – and Yahoo is where Hadoop originated. Baidu has a unique use profile for flash – it's not like the hyperscale datacenters in the US. But Baidu is starting to consume a lot. Like most other hyperscale datacenters, it is working on storage erasure coding across servers, racks and datacenters, and it is trying to make a unified namespace across everything. One of its main interests is architecture at the datacenter level, harmonizing the various platforms and looking for the optimum at the datacenter level. In general, Baidu is very proud of the advances it has made, it has real confidence in its vision and route forward, and from what I heard, its architectural ambitions are big.
JD.com (which used to be 360buy.com) is the largest direct ecommerce company in China and (only) had about $10 billion (US) in revenue last year, growing at 100% CAGR. As the GM there said, its growth has to slow sometime, or in 5 years it'll be the biggest company in the world. I think it is the closest equivalent to Amazon out there, and it has similar ambitions. It is in the process of transforming to a self-built, self-managed datacenter infrastructure. It is a company I am going to keep my eye on.
Tencent is expanding into some interesting new businesses. Sure, people know about the Tencent cloud services that the Chinese government will be using, but Tencent also has some interesting and unique cloud services coming. Let's just say even I am interested in using them. And of course, while Tencent is already the largest Web 2.0 company in China, its new services promise to push it to new scale and new markets.
Extra! Extra! Read all about it …
And then there was press. I had a very enjoyable conversation with Yuan Shaolong, editor at WatchStor, that I think ran way over time. Amazingly – we discovered we have the same favorite band, even half a world away from each other. The results are here, though I'm not sure if Google Translate messed a few things up, or if there was some miscommunication, but in general, I think most of the basics are right: http://translate.google.com/translate?hl=en&sl=zh-CN&u=http://tech.watchstor.com/storage-module-144394.htm&prev=/search%3Fq%3Drobert%2Bober%2BLSI%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26biw%3D1346%26bih%3D619
I just keep learning new things every time I go to China. I suspect it has as much to do with how quickly things are changing as with new stuff to learn. So I expect it won't be too long until I go to China, again...
Tags: Abhi Talwalkar, Alibaba, Amazon, Baidu, China, China Cloud Computing Conference, China National Convention Center, China Telecom, datacenter, Hadoop, hyperscale, JD.com, WatchStor, web 2.0, Yahoo
I was lucky enough to get together for dinner and beer with old friends a few weeks ago. Between the 4 of us, we've been involved in or responsible for a lot of stuff you use every day, or at least know about.
Supercomputers, minicomputers, PCs, Macs, Newton, smart phones, game consoles, automotive engine controllers and safety systems, secure passport chips, DRAM interfaces, netbooks, and a bunch of processor architectures: Alpha, PowerPC, Sparc, MIPS, StrongARM/XScale, x86 64-bit, and a bunch of others you haven't heard of (um – most of those are mine, like TriCore). Basically, if you drive a European car, travel internationally, use the Internet, play video games, or use a smart phone, well... you're welcome.
Why do I tell you this? Well – first, I'm name dropping – I'm always stunned I can call these guys friends and be their peers. But more importantly, we've all been in this industry as architects for about 30 years. Of course our talk turned to what's going on today. And we all agree that we've never seen more changes – inflexions – than the raft unfolding right now. Maybe it's pressure from the recession, or maybe unnaturally pent-up need for change in the ecosystem, but change there is.
Changes in who drives innovation, and what's needed. In which companies are on top and on bottom at every point in the food chain, and who competes with whom. In how workloads have shifted from compute to dataflow, how software has moved to open source, and how abstracted code now is from processor architecture. In how individual and enterprise customers have been revolting against the "old" ways, old vendors and old business models. And in what the architectures look like, how processors communicate, how systems are purchased, and what fundamental system architectures look like. But not much besides that...
OK – so if you're an architect, that's as exciting as it gets (you can hear it in my voice – right?), and it makes for a lot of opportunities to innovate and create new or changed businesses. Because innovation is so often at the intersection of changing ways of doing things. We're at a point where the changes are definitely not done yet. We're just at the start. (OK – now try to imagine a really animated 4-way conversation over beers at the Britannia Arms in Cupertino... Yea – exciting.)
I'm going to focus on just one sliver of the market – but it's important to me – and that's enterprise IT. I think the changes are as much about business models as technology.
I'll start in a strange place. Hyperscale datacenters (think social media, search, etc.) and their scale of deployment change the optimization point. Most of us are starting to get comfortable with the rack as the new purchase quantum. And some of us are comfortable with the pod or container as the new purchase quantum. But the hyperscale datacenters work more with the datacenter as the quantum. By looking at it that way, they can trade off the cost of power, real estate, bent sheet metal, network bandwidth, disk drives, flash, processor type and quantity, memory amount, where work gets done, and what applications are optimized for. In other words, they shifted from looking at local optima to looking for the global optimum. I don't know about you, but when I took operations research in university, I learned there was an unbelievable difference between the two – and the global optimum was the one you wanted...
Hyperscale datacenters buy enough (top 6 are probably more than 10% of the market today) that 1) they need to determine what they deploy very carefully on their own, and 2) vendors work hard to give them what they need.
That means innovation used to be driven by OEMs, but now it's driven by hyperscale datacenters – and it's driven hard. That global optimum? It's work/$ spent. That's global work, and global spend. It's OK to spend more, even way more, on one thing if overall you get more done for the dollars you spend.
That's why the 3 biggest consumers of flash in servers are Facebook, Google, and Apple, with some of the others not far behind. You want stuff, they want to provide it, and flash makes that happen efficiently. So efficiently they can often give that service away for free.
Hyperscale datacenters have started to publish their cost metrics, open up their architectures (like Open Compute), and open up their software (like Hadoop and derivatives). More to the point, services like Amazon have put a very clear dollar value on services. And it's shockingly low.
Enterprises have looked at those numbers. Hard. That's catalyzed a customer revolt against the old way of doing things – the old way of buying and billing. OEMs and ISVs are creating lots of value for the enterprise, but not that much. They've been innovating around "stickiness" and "lock-in" (yea – those really are industry terms) for too long, while hyperscale datacenters have been focused on getting stuff done efficiently. The money the hyperscale guys save per unit just means they can deploy more units and provide better services.
That revolt is manifesting itself in 2 ways. The first is seen in the quarterly reports of OEMs and ISVs: rumors of IBM selling its System x business to Lenovo, Dell going private, Oracle trying to shift its business, HP talking of the "new style of IT"... The second is that enterprises are looking to emulate hyperscale datacenters as much as possible, and deploy private cloud infrastructure. And as often as not, those will be running some of the same open source applications and file systems the big hyperscale datacenters use.
Where are the hyperscale datacenters leading them? It's a big list of changes, and they're all over the place.
But they're also looking at a few different things. For example, global namespace NAS file systems. Personally? I think this one's a mistake. I like the idea of file systems/object stores, but the network interconnect seems like a bottleneck. Storage traffic is shared with network traffic, which creates some network spine bottlenecks, creates consistency performance bottlenecks between the NAS heads, and – let's face it – people usually skimp on the number of 10GE ports on the server and in the top-of-rack switch. A typical SAS storage card now has 8 x 12G ports – that's 96G of bandwidth. Will servers have 10 x 10G ports to match? Yea... I didn't think so either.
Anyway – all this is not academic. One Wall Street bank shared with me that – hold your breath – it could save 70% of its spend going this route. It was shocked. I wasn't shocked, because at first blush this seems absurd – not possible. That's how I reacted: I laughed. But... The systems are simpler and less costly to make. There is simply less there to make or ship than OEMs force into the machines for uniqueness and "value." They are purchased from much lower-margin manufacturers. They have massively reduced maintenance costs (there's less to service, and, well, no OEM service contracts). And also important – some of the incredibly expensive software licenses are flipped to open source equivalents. Net savings of 70%. Easy. Stop laughing.
Disaggregation: Or in other words, Pooled Resources
But probably the most important trend coming out of all of this is what server manufacturers are calling "disaggregation" (hey – you're ripping apart my server!) but what architects more descriptively call pooled resources.
First – the intent of disaggregation is not to rip the parts of a server to pieces to get the lowest pricing on the components. No. If you're buying by the rack anyway – why not package so you can put like with like? Each part has its own life cycle, after all. CPUs are 18 months. DRAM is several years. Flash might be 3 years. Disks can be 5 to 7 years. Networks are 5 to 10 years. Power supplies are... forever? Why not replace each on its own natural failure/upgrade cycle? Why not make enclosures appropriate to the technology they hold? Disk drives need solid, vibration-free mechanical enclosures of heavy metal. Processors need strong cooling. Flash wants to run hot. DRAM cool.
Second – pooling allows really efficient use of resources. Systems need slush resources. What happens to a system that uses 100% of physical memory? It slows down a lot. If a database runs out of storage? It blue-screens. If you don't have enough network bandwidth? You wait. The result is that every server is over-provisioned for its task: extra DRAM, extra network bandwidth, extra flash, extra disk drive spindles. If you have 1,000 nodes you can easily strand TBytes of DRAM, TBytes of flash, and a TByte/s of network bandwidth in wasted capacity, all of it always burning power. Worse, if you plan wrong and deploy servers with too little disk or flash or DRAM, there's not much you can do about it. Now think 10,000 or 100,000 nodes... Ouch.
If you pool those things across 30 to 100 servers, you can allocate them as needed to individual servers. Just as importantly, you can configure systems logically, not physically. That means you don't have to be perfect in planning ahead what configurations, and how many of each, you'll need. You have sub-assemblies you slap into a rack and hook up by configuration scripts, and you get efficient resource allocation that can change over time. You need a lot of storage? A little? Higher-performance flash? Extra network bandwidth? Just configure it.
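To see how fast that stranded capacity adds up, here's a rough sizing sketch. The per-node slack figures are assumptions; the point is only how quickly they multiply across 1,000 (or 100,000) nodes.

```python
# Rough sizing of what over-provisioning strands at scale.
# The per-node slack figures below are hypothetical assumptions.
nodes = 1_000
stranded_per_node = {
    "dram_gb": 32,      # assumed unused DRAM per server
    "flash_gb": 200,    # assumed unused flash per server
    "net_gbps": 4,      # assumed idle network bandwidth per server
    "disk_tb": 2,       # assumed empty disk capacity per server
}

for resource, slack in stranded_per_node.items():
    print(f"{resource}: {slack * nodes:,} stranded across {nodes:,} nodes")
# With these inputs: 32 TB of DRAM, 200 TB of flash, 4 Tbit/s of network
# bandwidth and 2 PB of disk sitting idle - and still drawing power.
```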
That's a big deal.
And of course, this sets the stage for immense pooled main memory – once the next generation of non-volatile memories is ready – probably starting around 2015.
You can't overstate the operational problems associated with different platforms at scale. Many hyperscale datacenters today have around 6 platforms, and if you consider that they roll out new versions before old ones are retired, they often have 3 generations of each. That's 18 distinct platforms, with multiple software revisions of each. That starts to get crazy when you have 200,000 to 400,000 servers to manage and maintain in a lights-out environment. Pooling resources and allocating them in the field goes a huge way toward simplifying operations.
Alternate Processor Architecture
It wasn't always Intel x86. There was a time when Intel was an upstart in the server business. It was Power, MIPS, Alpha, SPARC... (and before that IBM mainframes and minis, etc.). Each of those changes was brought on by a changed cost structure. Mainframes got displaced by multi-processor RISC, which gave way to x86.
Today, we have Oracle saying it's getting out of x86 commodity servers and doubling down on SPARC. IBM is selling off its x86 business and doubling down on Power (hey – don't confuse that with PowerPC, which started as an architectural cut-down of Power – I was there...). And of course there is a rash of 64-bit ARM server SoCs coming – with HP and Dell already dabbling. What's important to realize is that all of these offerings are focusing on the platform architecture and how applications really perform in total, not just the processor.
Let me wrap up with an email thread cut/paste from a smart friend – Wayne Nation. I think he summed up some of what's going on well, in a sobering way most people don't even consider.
"Does this remind you of a time, long ago, when the market was exploding with companies that started to make servers out of those cheap little desktop x86 CPUs? What is different this time? Cost reduction and disaggregation? No, cost and disagg are important still, but not new.
A new CPU architecture? No, x86 was “new” before. ARM promises to reduce cost, as did Intel.
Disaggregation enables hyperscale datacenters to leverage vanity-free designs, but consistent delivery will determine the winning supplier. There is the potential for another Intel to rise from these other companies."
I often think about green, environmental impact, and what we‚Äôre doing to the environment. One major reason I became an engineer was to leave the world a little better than when I arrived. I‚Äôve gotten sidetracked a few times, but I‚Äôve tried to help, even if just a little.
The good people in LSI‚Äôs EHS (Environment, Health & Safety) asked me a question the other day about carbon footprint, energy impact, and materials use. Which got me thinking ‚Ä¶ OK ‚Äď I know most people in LSI don‚Äôt really think of ourselves as a ‚Äúgreen tech‚ÄĚ company. But we are ‚Äď really. No foolin‚Äô. We are having a big impact on the global power consumption and material consumption of the IT industry. And I mean that in a good way.
There are many ways to look at this: what we enable datacenters to do, what we enable integrators to do, and the hard-core technology improvements that massively change what it's possible to do.
Back in 2008 I got to speak at the AlwaysOn GoingGreen conference. (I was lucky enough to be on just after Elon Musk – he's a lot more famous now with Tesla doing so well.)
http://www.smartplanet.com/video/making-the-case-for-green-it/305467 (at 2:09 in video)
The massive consumption of IT equipment – and all the ancillary metal, plastic and wiring that goes with it – consumes energy as it's being shipped and moved halfway around the world, and, more importantly, then gets scrapped out quickly. This has been a concern of mine for quite a while. I mean – think about that. As an industry we are generating about 9 million servers a year, and about 3 million of those go into hyperscale datacenters. Many of those are scrapped on a 2, 3 or 4 year cycle – so in steady state, maybe 1 million to 2 million a year are scrapped. Worse – that many servers use an amazing amount of energy (even as they have advanced the state of the art unbelievably since 2008). And frankly, you and I are responsible for using all that power. Did you know thousands of servers are activated every time you make a Google® query from your phone?
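As a rough sanity check, here is the back-of-envelope arithmetic behind that scrap estimate. The inputs are the round numbers above, not measured data.

```python
# The rough scrap arithmetic, spelled out with the blog's own round figures.
hyperscale_servers = 3_000_000      # servers cycling through hyperscale datacenters
refresh_years = (2, 3, 4)           # typical replacement cycles

for years in refresh_years:
    scrapped = hyperscale_servers / years
    print(f"{years}-year cycle: ~{scrapped:,.0f} servers scrapped per year")
# The output spans roughly 0.75M to 1.5M per year, the same ballpark as the
# "1 million to 2 million" quoted above.
```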
I want to take a look at the basic silicon improvements we make, the impact of disk architecture improvements, SSDs, system-level and efficiency improvements, and also where we're going in the near future with eliminating scrap in hard drives and batteries. In reality, it's the massive pressure on work/$ that has forced our hand and made us optimize everything – doing much more work at a lower cost, when a lot of that cost is the energy and material that goes into the products. But the result is a real, profound impact on our carbon footprint that we should be proud of.
Sure, we have a general silicon roadmap where each node enables reduced power, even as some standards and improvements actually increase individual device power. For example, our transition from a 28nm semiconductor process to 14nm FinFET can literally cut the power consumption of a chip in half. But that's small potatoes.
How about Ethernet? It's everywhere – right? Did you know servers often have 4 Ethernet ports, and that there are a matching 4 ports on a network switch? LSI pioneered something called Energy Efficient Ethernet (EEE). We're also one of the biggest manufacturers of Ethernet PHYs – the part that drives the cable – and we come standard in everything from personal computers to servers to enterprise switches. The savings are hard to estimate, because they depend very much on how much traffic there is, but you can realistically save Watts per interface link, and there are often 256 links in a rack. 500 Watts per rack is no joke, and in some datacenters it adds up to 1 or 2 MegaWatts.
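Here is the simple arithmetic behind that rack figure. The per-link savings and the facility size are assumed, illustrative values, since real savings depend heavily on traffic.

```python
# Illustrative EEE savings arithmetic; per-link savings and rack count are
# assumptions, because real figures vary with traffic and facility size.
links_per_rack = 256
watts_saved_per_link = 2.0          # assumed average savings per link
racks = 2_000                       # assumed size of a large facility

watts_per_rack = links_per_rack * watts_saved_per_link        # ~500 W per rack
megawatts_total = watts_per_rack * racks / 1e6                # ~1 MW facility-wide
print(f"~{watts_per_rack:.0f} W per rack, ~{megawatts_total:.1f} MW across {racks:,} racks")
```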
How about something a little bigger and more specific? Hard disk drives. Did you know a typical hyperscale datacenter has between 1 million and 1.5 million disk drives? Each one consumes about 9 Watts, and most have 2 TBytes of capacity. So for easy math, 1 million drives is about 9 MegaWatts (!?) and about 2 Exabytes of capacity (remember – data is often replicated 3 or more times). Data capacities in these facilities need to grow about 50% per year. So if we did nothing, we would need to go from 1 million drives to 1.5 million drives: 9 MegaWatts goes to 13.5 MegaWatts. Wow! Instead, our high-linearity, low-noise PA and read channel designs are allowing drives to go to 4 TBytes per drive. (Sure, the chip itself may use slightly more power, but that's not the point – what it enables is a profound difference.) So to get that 50% increase in capacity we could actually reduce the number of drives deployed, with a net savings of 6.75 MegaWatts. Consider that an average US home, with air conditioning, uses about 1 kiloWatt. That's almost 7,000 homes. In reality they won't get deployed that way, but it will still be a huge savings: instead of buying another 0.5 million drives they would buy 0.25 million drives, with a net savings of 2.2 MegaWatts. That's still HUGE! (Way to go, guys!) How many datacenters are doing that? Dozens. So that's easily 20 or 30 MegaWatts globally. Did I say we saved them money too? A lot of money.
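For anyone who wants to check the drive-count math, here it is spelled out. All the inputs are the paragraph's own round numbers.

```python
# Drive power/capacity arithmetic from the paragraph above.
drives = 1_000_000
watts_per_drive = 9
tb_old, tb_new = 2, 4                      # 2 TB drives today, 4 TB next

# Option A: meet 50% capacity growth by adding more 2 TB drives.
do_nothing_mw = drives * 1.5 * watts_per_drive / 1e6       # 13.5 MW
# Option B: buy 4 TB drives for the new capacity only.
extra_drives = (drives * tb_old * 0.5) / tb_new            # 250,000 new drives
partial_mw = (drives + extra_drives) * watts_per_drive / 1e6
# Option C: the whole fleet on 4 TB drives (the theoretical best case).
full_refresh_mw = (drives * tb_old * 1.5 / tb_new) * watts_per_drive / 1e6

print(f"Do nothing: {do_nothing_mw:.2f} MW")
print(f"New capacity on 4 TB drives: {partial_mw:.2f} MW "
      f"(saves {do_nothing_mw - partial_mw:.2f} MW)")
print(f"Entire fleet on 4 TB drives: {full_refresh_mw:.2f} MW "
      f"(saves {do_nothing_mw - full_refresh_mw:.2f} MW)")
```

Option C reproduces the 6.75 MegaWatt figure and Option B the roughly 2.2 MegaWatt figure quoted above.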
SSDs don't always get the credit they deserve. Yes, they really are fast, and they are awesome in your laptop, but they also end up being much lower power than hard drives. Our controllers were in about half the flash solutions shipped last year. Think tens of millions. If you just assume they were all laptop SSDs (at least half were not), then that's another 20 MegaWatts in savings.
Did you know that in a traditional datacenter, about 30% of the power going into the building is used for air conditioning? It doesn't actually get used by the IT equipment at all – it's used to remove the heat the IT equipment generates. We design our solutions so they can accommodate 40C ambient inlet air (that's a little over 100F… hot). What that means is that the 30% of power used for the air conditioners disappears. Gone. That's not theoretical either. Most of the large social media, search engine, web shopping, and web portal companies are using our solutions this way. That's a 30% reduction in the power of storage solutions globally. Again, it's MegaWatts in savings. And mega money savings too.
But let's really get to the big hitters: improved work per server. Yep – we do that. In fact, adding a Nytro™ MegaRAID® solution will almost always get you 4x the work out of a server. It's a slam dunk if you're running a database. You heard me – 1 server doing the work that previously took 4 servers. Not only is that a huge savings in dollars (especially if you pay for software licenses!) but it's a massive savings in power. You can replace 4 servers with 1, saving at least 900 Watts, and that lone remaining server is actually dissipating less power too, because it's actively using fewer HDDs and using flash for most traffic instead. If you go a step further and use Nytro WarpDrive flash cards in the servers, you can get much more – 6 to 8 times the work. (Yes, sometimes up to 10x, but let's not get too excited.) If you think that's just theoretical again, check your Facebook® account, or download something from iTunes®. Those two services are the biggest users of PCIe® flash in the world. Why? It works cost effectively. And in case you haven't noticed, those two companies like to make money, not spend it. So again, we're talking about MegaWatts of savings – arguably on the order of 150 MegaWatts. Yeah, that's pretty theoretical, because they couldn't really do the same work otherwise, but if you had to do the work in a traditional way, it would be around that.
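A quick sketch of that consolidation math. The per-server wattage here is an assumed round figure; the 4x work multiplier is the one quoted above.

```python
# Rough consolidation arithmetic. Per-server power is an assumed round
# figure; the 4x work multiplier comes from the paragraph above.
watts_per_server = 300          # assumed typical database server under load
servers_before = 4              # servers needed without flash acceleration
servers_after = 1               # one accelerated server doing the same work

watts_saved = (servers_before - servers_after) * watts_per_server
print(f"Retiring {servers_before - servers_after} servers saves ~{watts_saved} W "
      "per consolidated workload, before counting the HDDs the remaining "
      "server no longer has to spin.")
```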
It's hard to be more precise than round numbers at these massive scales, but the numbers are definitely in the right zone. I can say with a straight face that we save the world tens, and maybe even hundreds, of MegaWatts per year. But no one sees that, and not many people even think about it. Still – I'd say LSI is a green hero.
Hey – we're not done by a long shot. Let's just look at scrap. If you read my earlier post on false disk failure, you'll see some scary numbers. (http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/) A normal hyperscale datacenter can expect 40-60 disks per day to be mistakenly scrapped out. That's around 20,000 disk drives a year that should not have been scrapped, from just one web company. Think of the material waste, shipping waste, manufacturing waste, and eWaste issues. Wow – all for nothing. We're working on solutions to that. And batteries. Ugly, recycle-only, heavy-metal eWaste batteries. They are necessary for RAID-protected storage systems, and much of the world's data is protected that way – the battery is needed to save metadata and transient writes in the event of a power failure or server failure. We ship millions a year. (Sorry, mother earth.) But we're working diligently to make that a thing of the past. And that will also result in big savings for datacenters in both materials and recycling costs.
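That 20,000-a-year figure is just the daily rate annualized:

```python
# Annualizing the false-failure scrap rate quoted above.
false_scraps_per_day = (40, 60)     # per hyperscale datacenter
low, high = (d * 365 for d in false_scraps_per_day)
print(f"{low:,} to {high:,} drives needlessly scrapped per year, "
      "i.e. roughly the 20,000 quoted above")
```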
Can we do more? Sure. I know I am trying to get us the core technologies that will help reduce power consumption, raise capability and performance, and reduce waste. But we'll never be done with that march of technology. (Which is a good thing if engineering is your career…)
I still often think about green, environmental impact, and what we're doing to the environment. And I guess in my own small way, I am leaving the world a little better than when I arrived. And I think we at LSI should at least take a moment and pat ourselves on the back for that. You have to celebrate the small victories, you know? Even as the fight goes on.
I want to warn you, there is some thick background information here first. But don't worry. I'll get to the meat of the topic and that's this: ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…
I've been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost and performance have all been concerns to grapple with. Of course, the flash itself is changing too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single-level cell (SLC) to multi-level cell (MLC) to triple-level cell (TLC), and all the variants of these "trimmed" for specific use cases. The spec'd endurance has gone from 1 million program/erase (P/E) cycles to 3,000, and in some cases 500.
It's worth pointing out that almost all the "magic" that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the "requirement" for state-of-the-art single-operation write latency has fallen well below the write latency of the flash itself. (What the..?? Yeah – I'll talk about that in some other blog. But flash is ~1,500µs write latency, while state-of-the-art flash cards are ~50µs.) When I describe the state of the technology it sounds pretty pessimistic. I'm not. We've overcome a lot.
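To see why bigger die hurt bandwidth, here is a small illustrative sketch. The die sizes and per-die program bandwidth are assumed numbers, not datasheet values; the point is only the shape of the trend.

```python
# Why growing die capacity shrinks parallel write bandwidth at a fixed card
# capacity: fewer die means fewer parallel targets to stripe programs across.
# Die sizes and per-die bandwidth below are illustrative assumptions.
card_capacity_gb = 1_024
mb_per_s_per_die = 40               # assumed sustained program rate per die

for die_gb in (16, 32, 64, 128):
    die_count = card_capacity_gb // die_gb
    aggregate = die_count * mb_per_s_per_die
    print(f"{die_gb:>3} GB die -> {die_count:>2} die -> ~{aggregate:,} MB/s aggregate write")
```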
We built our first PCIe card solution at LSI in 2009. It wasn't perfect, but it was better than anything else out there in many ways. We've learned a lot in the years since – both from making them, and from dealing with customers and users, both of our own solutions and our competitors'. We're lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but about what we could have and what we could do. They're generous enough to share what works and what doesn't, what the values of solutions are and what the pitfalls are too. Honestly? It's the hyperscale datacenters in the lead, both practically and in vision.
If you haven't nodded off to sleep yet, that's a long-winded way of saying – things have changed fast, and, boy, we've learned a lot in just a few years.
The most important thing we've learned…
Most importantly, we've learned it's latency that matters. No one is pushing the IOPS limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We've gotten lots of feedback, and one of the biggest things we've learned is this: PCIe flash cards are awesome. They radically change the performance profiles of most applications, especially databases, allowing servers to run efficiently and the actual work done by a server to multiply 4x to 10x (and in a few extreme cases, 100x). So the feedback we get from large users is "PCIe cards are fantastic. We're so thankful they came along. But…" There's always a "but," right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We're not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single-operation latency, increasing cost, etc. That's what we're here for though – right? Solve the impossible?
A quick summary is in order, and it's not looking good: for a given solution, flash is getting less reliable, there is less bandwidth available at a given capacity because there are fewer die, we're driving latency way below the actual write latency of the flash, and we're not satisfied with the best solutions we have, for all the reasons above.
If you think these through enough, you start to converge on one basic path. It turns out we're not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:
One easy answer would be: that's a flash SAN or NAS. But that's not the answer. Not many customers want a flash SAN or NAS for their new infrastructure, and more importantly, all the data is at the wrong end of the straw – the poor server is left sucking hard. Remember, this is flash, and people use flash for latency. Today these SAN-type flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low-bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It's true there is a lot of startup energy in this space. It seems to make sense if you're a startup, because SAN/NAS is what people use today, and there's lots of money spent in that market today. However, it's not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front-bay PCIe SSDs (HDD form factor, NVMe – lots of names) that crowd out your disk drive bays. But they don't fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVMe SSDs you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard-limit power to 9W or 25W per device. And to protect the storage in these devices, you need enough of them to RAID or otherwise protect the data. Once you have enough of them for protection, they give you awesome capacity, IOPS and bandwidth – too much, in fact – but that's not what applications need. They need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash) and maintain low-latency, bottleneck-free physical links to each server… To me that implies:
That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What's your opinion?
Tags: DAS, datacenter, direct attached storage, enterprise IT, flash, hard disk drive, HDD, hyperscale, latency, NAS, network attached storage, NVMe, PCIe, SAN, solid state drive, SSD, storage area network
I've been travelling to China quite a bit over the last year or so. I'm sitting in Shenzhen right now (if you know Chinese internet companies, you'll know who I'm visiting). The growth is staggering. I've had a bit of a trains, planes, automobiles experience this trip, and that's exposed me to parts of China I never would have seen otherwise. Just to accommodate sheer population growth and the modest increase in wealth, there is construction everywhere – a press of people and energy, constant traffic jams, unending urban centers, and most everything is new. Very new. It must be exciting to be part of that explosive growth. What a market. I mean – come on – there are 1.3 billion potential users in China.
The amazing thing for me is the rapid growth of hyperscale datacenters in China, which is truly exponential. Their infrastructure growth has been 200%-300% CAGR for the past few years. It's also fantastic walking into a building in China, say Baidu, and feeling very much at home – just like you walked into Facebook or Google. It's the same young vibe, energy, and ambition to change how the world does things. And it's also the same pleasure – talking to architects who are super-sharp, have few technical prejudices, and have very little vanity – just a will to get to business and solve problems. Polite, but blunt. We're lucky that they recognize LSI as a leader, and are willing to spend time to listen to our ideas, and to give us theirs.
Even their infrastructure has a similar feel to the US hyperscale datacenters. The same only different. ;-)
A lot of these guys are growing revenue at 50% per year, and several are getting 50% gross margin. Those are nice numbers in any country. One has $100s of billions in revenue. And they're starting to push out of China. So far their pushes into Japan have not gone well, but other countries should be better. They all have unique business models. "We" in the US like to say things like "Alibaba is the Chinese eBay" or "Sina Weibo is the Chinese Twitter"… But that's not true – they all have more hybrid, unique business models, so their datacenter goals, revenue and growth have a slightly different profile. And there are some very cool services that simply are not available elsewhere. (You listening, Apple®, Google®, Twitter®, Facebook®?) But they are all expanding their services, products and user base. Interestingly, there is very little public cloud in China, so there are no real equivalents to Amazon's services or Microsoft's Azure. I have heard about current development of that kind of model, with the government as the initial customer. We'll see how that goes.
Hundreds of thousands of servers. They're not the scale of Google, but they sure are the scale of Facebook, Amazon, Microsoft…. It's a serious market for an outfit like LSI. Really, it's a very similar scale to the US market now: close to 1 million servers installed among the main 4 players, and exabytes of data (we've blown past mere petabytes). Interestingly, they still use many co-location facilities, but that will change. More important – they're all planning to probably double their infrastructure in the next 1-2 years. They have to – their growth rates are crazy.
Often 5 or 6 distinct platforms, just like the US hyperscale datacenters: database platforms, storage platforms, analytics platforms, archival platforms, web server platforms…. But they tend to be a little more like the racks of traditional servers that enterprises buy, with integrated disk bays, still a lot of 1G Ethernet, and still mostly from established OEMs. In fact, I just ran into one OEM's American GM, whom I happen to know, in Tencent's offices today. The typical servers have 12 HDDs in drive bays, though they are starting to look at SSDs as part of the storage platform. They do use PCIe® flash cards in some platforms, but the performance requirements are not as extreme as you might imagine. Reasonably low latency and consistent latency are the premium they are looking for from these flash cards – not maximum IOPS or bandwidth – very similar to their American counterparts. I think hyperscale datacenters are sophisticated in understanding what they need from flash, and not requiring more than that. Enterprise could learn a thing or two.
Some server platforms have RAIDed HDDs, but most are direct-mapped drives using a high-availability (HA) layer across the server center – Hadoop® HDFS or self-developed Hadoop-like platforms. Some have also started to deploy microserver archival "bit buckets": a small ARM® SoC with 4 HDDs totaling 12 TBytes of storage, giving densities like 72 TBytes of file storage in 2U of rack. I can only find about 5,000 of those in China, and they're first-generation experiments, but it's the start of a growing wave of archival solutions based on lower-performance ARM servers. The feedback is clear – they're not perfect yet, but the writing is on the wall. (If you're wondering about the math, that's 5,000 x 12 TBytes = 60 Petabytes….)
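For what it's worth, the packing arithmetic works out like this. The nodes-per-2U figure is my assumption, chosen to reach the quoted 72 TBytes per 2U.

```python
# Archival "bit bucket" density math. The 6-nodes-per-2U packing is an
# assumption made to reach the 72 TB per 2U figure quoted above.
hdds_per_node, tb_per_hdd = 4, 3
nodes_per_2u = 6

tb_per_node = hdds_per_node * tb_per_hdd          # 12 TB per ARM node
tb_per_2u = tb_per_node * nodes_per_2u            # 72 TB per 2U of rack
deployed_nodes = 5_000
total_pb = deployed_nodes * tb_per_node / 1_000   # 60 PB across the fleet
print(f"{tb_per_node} TB/node, {tb_per_2u} TB per 2U, ~{total_pb:.0f} PB total")
```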
Power is important everywhere, but here maybe more than we're used to. It's harder to get licenses for power in China, so it's really important to stay within the power envelope your datacenter has. You simply can't get more. That means they have to deploy solutions that do more within the same power profile, especially as they move out of co-located datacenters into private ones. Annually: 50% more users supported, more storage capacity, more performance, more services, all in the same power. That's not so easy. I would expect solar power in their future, just as Apple has done.
Here's where it gets interesting. They are developing a cousin to OpenCompute called Scorpio. It's Tencent, Alibaba, Baidu, and China Telecom driving the standard so far. The goals are similar to OpenCompute, but more aligned to standardized sub-systems that can be co-mingled from multiple vendors. There is some harmonization and coordination between OpenCompute and Scorpio, and in fact the Scorpio companies are members of OpenCompute. But where OpenCompute is trying to change the complete architecture of scale-out clusters, Scorpio is much more pragmatic – some would say less ambitious. They've finished version 1 and rolled out about 200 racks as a "test case" to learn from. Baidu was the guinea pig. That's around 6,000 servers. They weren't expecting more from version 1 – they're trying to learn. They've made mistakes, learned a lot, and are working on version 2.
Even if it's not exciting, it will have an impact because of the sheer size of the deployments these guys are getting ready to roll out in the next few years. They see the progression as: 1) they were using standard equipment; 2) they're experimenting and learning from trial runs of Scorpio versions 1 and 2; and then 3) they'll work on new architectures that are efficient, powerful, and different.
Information is pretty sketchy if you are not one of the member companies or one of their direct vendors. We were just invited to join Scorpio by one of the founders, and would be the first group outside of China to do so. If that all works out, I'll have a much better idea of the details, and hopefully can influence the standards to be better for these hyperscale datacenter applications. Between OpenCompute and Scorpio we'll be seeing a major shift in the industry – a shift that will undoubtedly be disturbing to a lot of current players. It makes me nervous, even though I'm excited about it. One thing is sure – just as server market volume is migrating from traditional enterprise to hyperscale datacenters (25-30% of the server market and growing quickly), we're starting to see a migration from US-based hyperscale datacenters to Chinese ones. They have to grow just to stay still. I mean – come on – there are 1.3 billion potential users in China….
Tags: Alibaba, Amazon, Apple, ARM, Baidu, China, China Telecom, datacenter, Facebook, Google, Hadoop, hard disk drive, HDD, hyperscale, Microsoft, OpenCompute, Scorpio, Shenzhen, Sina Weibo, solid state drive, SSD, Tencent, Twitter