There’s a famous quote that says, “If you can’t explain it simply, you don’t understand it well enough.” That statement is true. But I’m sure many of us have examples of where we’re challenged to explain something we know very well in a simple manner. Take the example of working in technology and explaining what you do to family or friends who are in completely different fields. Can you easily explain virtualization, for example, to your car mechanic? What about explaining the concepts of hybrid cloud to an airline pilot (who has an altogether different view of clouds)?
Usually, the easiest way to explain a complex subject to someone who has never been exposed to it is to put it into terms of examples that they already understand. And so it was that I found myself talking with my wife about the benefits of hybrid cloud in terms of our TiVo DVR.
For years we’ve used TiVo to record shows for ourselves and our kids. TiVo has had a feature called Season Pass that automatically records all episodes of a TV show when they air on a particular channel. Recently, TiVo upgraded the Season Pass into a new feature called OnePass that brings together content on your TiVo with content available via streaming services like Netflix and Amazon.
We were recently at a party where someone recommended we watch a show we had never heard of. When we got home we decided to record it, and we got to see how the OnePass feature worked with this new show. The feature would allow us to record all new episodes of the show directly on our TiVo while also allowing us to use streaming services to catch up on the previous season that we haven’t seen. It does this all from the same familiar interface, grouping the newly recorded episodes and those episodes available from streaming services together into the same view.
You can see an example of what this looks like using my son’s favorite show Chuggington. From the same view, I can choose to stream episodes from multiple sources (in this case Amazon and Netflix, as shown in the lower right) indicated by the three little blue lines next to the name of the show. Or I can watch episodes I already recorded on my TiVo as indicated by the green dot next to each episode. It works great—I get all of the content that I want in one interface that I’m already familiar with.
Lest you think this is just a blog post advertising TiVo, let me bring it back to hybrid cloud. Those of you already familiar with EMC’s Federation Enterprise Hybrid Cloud probably already see the connection here. What TiVo has done with the OnePass feature is provide choice, allowing us to record shows directly on our TiVo (on-premises infrastructure) while also letting us consume content from streaming services (off-premises cloud) all from the same interface. We could make the decision to use paid services like Netflix to catch up on old episodes, or simply record them directly on the TiVo (at no extra cost above our normal cable bill) when they air again.
The hybrid cloud provides that same type of functionality. You can choose to deploy workloads or services on-prem or use public clouds like VMware vCloud Air all from the same interface. Consumers of cloud services usually don’t care where the workload runs; they just want the ability to consume it based on criteria like cost, data protection, recovery levels, or performance, just to name a few. At EMC we talk a lot about how the Federation Enterprise Hybrid Cloud provides choice, and this is an easy-to-understand example of how we provide that choice.
And so I bring it back to where I started. Hybrid cloud can be a complex topic, but I was able to explain it to my wife in terms she is familiar with and she instantly understood the benefits. On that night everybody won—I got to explain the benefits of hybrid cloud in a new way that gave me the idea for this blog post, and we ended up with a new show to watch. Now if she’d ever give me the remote control, I might actually be able to watch it.
This weekend I had the opportunity to take the VMware Certified Professional Delta exam (VCP550D) to renew my VCP5-DCV certification for another two years. I wanted to share my thoughts on this exam specifically as well as the future of online exams.
I’ll echo what I’ve read from other folks regarding this exam in that you need to know more than just what has changed between vSphere 5.x and 5.5. Be prepared to answer questions about vCOPs and VSAN and possibly others too. The exam still feels like a regular VCP exam so if you’ve taken those already you should be familiar with the format, but expect that some questions may fall outside of vSphere specifically.
There are 65 questions in 75 minutes so you don’t have a lot of time to give each of the questions a lot of thought. You’ll need to go in with a good working knowledge of vSphere to have a shot at passing this. Luckily only people with a current VCP can take this exam so most will already have good experience.
My advice if you haven’t already taken the exam:
- VMware has a 7 day retake policy so make sure you take the exam at least 7 days prior to March 10th, 2015. That gives you a shot at taking it again in case things don’t go well.
- Make sure to read the exam blueprint (opens a PDF) to see what might be asked of you.
- Take the practice exam on VMware’s website to get you in the right mindset but make sure you get at least one wrong on purpose. Once you get 100% it will never let you take it again. Side note: this is why I can still take the VCP3 (VI3) practice exam.
The other obviously different aspect of this exam is that you can take it online from the comfort of your web browser. I wasn’t sure what to expect from that but once the exam started it felt very much like I was sitting in a testing center (look and feel), except that the performance was significantly better. Most of the testing centers I’ve used have very old computers and it can take a little while to move between questions. The performance was great, and it was really comfortable taking the exam like this.
You might ask the obvious question – doesn’t this make it easier to cheat? Can’t you just keep another browser window open and search Google for the answers? My thought on that: I dare you to try. My exam had 65 questions in 75 minutes, barely giving you more than 1 minute per question. If the questions were simply “What is the maximum number of virtual machines per ESXi host” type questions then you might be able to do that but that wasn’t what the exam looked like. What you end up with are scenario type questions that take time just to read the question, which would make it all but impossible to try to search the Internet for the answers. If you tried that you would very likely run out of time.
That brings me to my thoughts on taking exams like this in general. I’d love to see this become the standard of how tests are delivered in the future. I think it should be easy enough to make cheating impractical by using scenario type questions, though I admit it would be difficult to prevent a bunch of people from sitting around a computer and taking the test together. For that I think we need to rely on people being honest, which sadly is why I doubt we’ll see this actually end up as the future of the way certification exams are delivered.
The good news (for me) is that I passed the exam so my VCP is now good until 2017. I thought overall the exam was similar to other VCP exams I’ve taken in the past, and delivering it online made it much more convenient for me to take it. Proof of that is I was able to take the exam on a Saturday afternoon instead of having to lose hours during the work day to drive to a testing center.
Good luck to anyone who is still planning on taking the exam!
It seems there is a bit of an uproar about a document Microsoft recently released called Planning a Lync Server 2013 Deployment on Virtual Servers. In that document, Microsoft makes some odd recommendations when virtualizing Lync, such as disabling NUMA on physical servers and not using hyperthreading.
To address these concerns, VMware has published a really nice blog post that addresses the unusual guidance in the document. You can read the blog post by clicking here. The author of the blog post was my co-presenter at VMworld last year and was a technical advisor on my book on virtualizing Microsoft apps, so he has very good credibility in this space. I highly recommend reading it.
I agree with everything Deji says in his blog post but I wanted to add some additional thoughts.
I agree with Deji that we should always size virtual machines based on a host's physical cores, not logical cores (despite how hyperthreaded cores are represented within vSphere). That is true for business critical apps like Lync and basic workloads too. A logical core is not the same as a physical core and we shouldn't treat it that way. And don't forget that even if we don't assign the logical cores to virtual machines, ESXi can still use them when managing its own processes. That can help overall system performance for Lync and all other workloads on the server.
I can't think of a single reason to disable NUMA. Even if the workload doesn't support it, ESXi does and will place VMs within NUMA nodes to increase performance. I almost think the author of the document was confused and meant to say to disable node interleaving, which is how NUMA is referred to in many BIOS settings. Disabling node interleaving = enabling NUMA. A good example of this is Microsoft Exchange, which is not NUMA aware (unlike SQL Server which is). You wouldn't disable NUMA on the ESXi host system or a physical server running Exchange just because the application doesn't support NUMA since Windows itself and ESXi can take advantage of it.
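To make the NUMA point a bit more concrete, here is a minimal Python sketch of the kind of sanity check you might do when sizing a VM: does it fit inside a single NUMA node of the host? The host figures, the one-node-per-socket layout, and the function name are all assumptions made for illustration. This is not how ESXi's scheduler is implemented.

```python
from dataclasses import dataclass


@dataclass
class Host:
    sockets: int            # assumption: one NUMA node per socket
    cores_per_socket: int   # physical cores, not hyperthreads
    memory_gb: int          # total RAM, assumed evenly split across nodes


def fits_in_numa_node(host: Host, vcpus: int, vm_memory_gb: int) -> bool:
    """Return True if the VM's vCPUs and memory fit within one NUMA node."""
    node_memory_gb = host.memory_gb / host.sockets
    return vcpus <= host.cores_per_socket and vm_memory_gb <= node_memory_gb


# Hypothetical host: 2 sockets x 8 cores, 128 GB RAM (64 GB per node)
host = Host(sockets=2, cores_per_socket=8, memory_gb=128)
print(fits_in_numa_node(host, vcpus=4, vm_memory_gb=32))   # small VM
print(fits_in_numa_node(host, vcpus=12, vm_memory_gb=96))  # larger than one node
```

A VM that fits within a node can be kept close to its memory, which is the benefit you give up if NUMA is disabled on the host.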
The author makes a good point about over-committing resources (especially CPU) on Lync servers and the impact to performance. On that I completely agree, though CPU reservations can alleviate that issue. This is true with Lync just like it is true with Exchange Server and the requirement not to exceed 2:1 vCPU:pCPU. It's possible the author is loosely referring to processor affinity or what is sometimes called "CPU pinning." I agree that you shouldn’t mess with processor affinity for Lync, but CPU reservations can be used to guarantee access to CPU resources.
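The sizing rule discussed above (count physical cores only, and watch the vCPU:pCPU ratio) can be sketched in a few lines of Python. The host size and the list of VMs are hypothetical, and the function name is mine; the 2:1 figure is the limit mentioned above.

```python
def vcpu_to_pcpu_ratio(vcpus_per_vm, physical_cores):
    """Ratio of assigned vCPUs to physical cores (hyperthreads are ignored)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host with 16 physical cores (32 logical with hyperthreading)
vms = [4, 4, 8, 8, 4]   # vCPUs assigned to each VM on the host (28 in total)
ratio = vcpu_to_pcpu_ratio(vms, physical_cores=16)
print(f"{ratio:.2f}:1", "OK" if ratio <= 2.0 else "exceeds 2:1")
```

Note that the same VMs measured against the 32 logical cores would look far less committed than they really are, which is why the logical count is left out of the calculation.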
Hopefully between my thoughts and the blog post from VMware you’ll see that there is nothing to worry about when virtualizing Lync. It’s fully supported and you don’t need to change your standard practices in order to make it work.
I’ve never been a prolific blogger like others in the virtualization community, but this year was especially bad. One of the reasons is that I’m in a new role with EMC which has changed my focus a bit. Still focused on virtualization, but less so on the business critical apps that I used to write more frequently about and more on hybrid cloud.
The other major change is my involvement with something called EMC Enterprise Hybrid Cloud, or what we refer to as EHC. EHC is more than just marketing – it’s a powerful combination of hardware and software from EMC, software from VMware, and services from EMC to implement the solution. It’s a fully engineered solution that includes more than just out of the box functionality of the products. EMC has created custom workflows and automation to add value to the solution, including things like Backup as a Service integration with Avamar and storage provisioning and integration with EMC ViPR.
My new role at EMC has me focused on helping to make sure that our Professional Services teams have the skills, training, and experience they need to be successful in delivering projects and EHC has become a big part of that. After all, one of the reasons why EHC is resonating with customers is that it’s a complete solution that includes our services to implement it quickly. How quickly? We can go from nothing to a fully functional cloud environment providing Infrastructure as a Service in just 28 days. Customers tell us a project like that often takes them months or longer and they’re often not as successful as they’d like.
Having a focus on Professional Services and not on specific EMC products, my view of the EMC world may be somewhat skewed. With that said, I think the PS portion of EHC is one of the most important parts of the solution. The value of bringing in an experienced PS team is especially evident here, as EHC is a complex solution that involves numerous technologies. I’ve been personally involved in developing the training paths and enablement plans for all EMC PS resources that deliver EHC solutions and can say the caliber of people we’ve put through is quite high. I’m proud to be a part of the EMC PS team.
I’ll share some links below to give you some more information about EHC if you’re interested. You’ll notice that much of it will focus on the technology and capabilities of the EHC solution and you’ll see very little about EMC’s services to go with this. In my mind the PS portion is just as important as the technology itself, and is one of the key things that makes EHC different from similar solutions from other organizations.
Here’s to hoping I get to blog more next year! Hope everyone has a great new year and a prosperous 2015!
Here are some links for more info:
EHC Solution Overview (opens a PDF): https://www.emc.com/collateral/solution-overview/h12476-so-hybrid-cloud.pdf
First in my 3 part series introducing EHC: https://infocus.emc.com/matt-_liebowitz/introducing-emc-hybrid-cloud-part-1-why-hybrid-cloud/
EHC was recently updated to 2.5.1 and this post by Jim Sanzone does a nice job explaining what’s new: http://projectvirtual.com/?p=301
This week VMware released a blog post and several knowledgebase articles about Transparent Page Sharing (TPS). In the next patch for ESXi, TPS will be disabled by default between virtual machines. TPS is the technology that allows virtual machines running on the same ESXi host to share identical memory pages, allowing ESXi to overcommit memory. This technology has been part of ESX/ESXi for years and is one of the (many) differentiating factors between it and other hypervisors.
VMware made this decision due to independent research that showed that in a highly controlled environment the contents of those shared pages could be read. Though VMware admits it’s highly unlikely that this would be possible in a production environment, being security conscious they made the decision to disable the feature by default. It can be re-enabled if your environment could still benefit (see KB articles above for instructions).
There have been a number of blog posts on this topic already and I don’t want to add to the noise, but I did want to add my perspective. I’ve been working with ESX/ESXi since 2002 so I’ve seen how TPS has been used over the years and can hopefully add some perspective.
Will This Affect Me?
That’s likely the key question you’re asking yourself. One of the things I’ve seen over the years is that many folks don’t really understand how TPS works with modern processors. Using modern processors, ESXi uses large memory pages (2MB) to back all guest pages (even if the guest OS itself doesn’t support large pages) in order to improve performance. ESXi doesn’t attempt to share large pages, so in the vast majority of cases TPS is actually not used. Only when an ESXi host comes under extreme memory pressure does it break large pages into small pages and begin to share them.
I’ll state that again to be clear – unless you consistently over-commit memory on your ESXi hosts, TPS is not actually in use. So does this really affect you? Chances are it probably doesn’t unless you rely heavily on overcommitting memory.
What About Zero Pages?
There’s one area where TPS kicks in regardless of whether the ESXi host is over-committed on memory or not: zero pages. When a VM powers on, most modern operating systems will zero out all memory pages of the allocated memory (meaning write zeros to all memory pages). On ESXi, TPS kicks in and shares those pages immediately right after the VM powers on. It doesn’t matter if memory is over-committed, and it doesn’t wait for the normal 60 minute cycle where TPS scans for identical pages. TPS kicks in and shares the zero pages immediately.
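The general idea of sharing identical pages, zero pages included, can be illustrated with a toy Python sketch. It simply groups small pages by a hash of their contents and counts how many copies would be redundant. The page contents and the function name are made up, and ESXi's actual implementation differs in the details.

```python
import hashlib

PAGE_SIZE = 4096  # small (4 KB) pages; large 2 MB pages are not shared


def shared_page_savings(pages):
    """Count pages that become redundant if one copy is kept per unique content."""
    unique = {hashlib.sha256(page).digest() for page in pages}
    return len(pages) - len(unique)


zero = bytes(PAGE_SIZE)  # a page full of zeros, as written at guest boot

# Two hypothetical VMs with three pages each
vm_a = [zero, zero, b"A" * PAGE_SIZE]
vm_b = [zero, b"A" * PAGE_SIZE, b"B" * PAGE_SIZE]

# Six pages in total, three distinct contents, so three copies are redundant
print(shared_page_savings(vm_a + vm_b))
```

A freshly booted guest is mostly zero pages, which is why the sharing shows up so quickly after power-on.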
As an example, see this screenshot from esxtop and pay close attention to the ZERO and SHRD columns. I just powered on NJMGMT01 and you can see in the highlighted columns that 3.2GB of its allocated 4.0GB are currently being shared (see the SHRD column). Of those pages, 100% of them are zero pages (see the ZERO column).
If that host was already overcommitted on memory, sharing the zero pages and having them slowly get utilized by the OS could give time for ballooning to kick in and reclaim memory from other guests. Without TPS in this scenario, it is likely that ESXi would be forced to swap/compress memory before ballooning can kick in to reclaim unused memory from other guests.
Does this mean there’s a real risk to disabling TPS? Below I go through a number of scenarios where it may or may not make sense to re-enable TPS, but I don’t think zero pages (explicitly) is one of them. If you already make use of memory over-commitment then you’ll want to re-enable it either way. If not, as long as you properly plan your environment then sharing zero pages shouldn’t matter except possibly in the event of an ESXi host failure.
Should I re-enable TPS?
The next question you might be asking yourself is: should I re-enable TPS once it’s disabled by default in the next update? As always it comes down to your individual requirements so there is no “one size fits all” answer. Let’s look at some scenarios to better illustrate the situation.
I’ve worked with a lot of customers over the years to help virtualize business critical applications, and I’ve seen one thing remain consistent: for business critical applications (and most production workloads in general), customers do not overcommit memory. For the most critical workloads, customers will use memory reservations to ensure those workloads always have access to the memory they need.
These days, it’s not uncommon to see ESXi hosts with 256GB to 1TB of RAM so overcommitting memory is an unlikely scenario. I’d say in most cases it’s unnecessary to re-enable TPS for production workloads.
If there’s a scenario where it seems like TPS was created just to solve a specific problem, it would be VDI. Consider a typical VDI environment using non-persistent desktops – there could be 50-100 identical virtual desktops running on each ESXi host. The potential memory savings in this scenario is huge, especially since having high consolidation ratios can help drive down the cost of VDI.
For VDI, I think it makes a lot of sense to re-enable TPS to take advantage of the huge memory savings potential.
In development environments, you’ll often want to increase consolidation ratios since performance is not typically the most important factor. Development environments also have the potential to have many similar workloads (web servers, database servers, etc.) as new versions of applications are tested.
For dev/test environments, I think it makes sense to re-enable TPS.
The home lab is an easy one – most of us don’t have large budgets to support our home labs so memory is tight. In my lab I frequently overcommit memory, so I will be re-enabling TPS after the update.
I only covered a small portion of the potential situations that could be out there. You’ll need to make a decision based on your own individual requirements. Think about what’s most important – maximum performance, consolidation ratios, security, etc.?
Remember – due to the way TPS works with large pages, in most scenarios you’re probably not even using it today. In the event that you need to overcommit memory – such as during an ESXi host failure or DR scenario – ESXi has other methods of memory reclamation such as ballooning and swapping. Swapping is generally the last resort and results in significant performance reduction, but you can offset that a bit by using an SSD and leveraging the swap to host cache feature of ESXi.
I admit I used a lot of words to basically say, “Don’t worry about it – this probably doesn’t affect you anyway” but I wanted to lay out the common scenarios I see at customers. I believe VMware made the right move here disabling TPS out of an abundance of caution even if the scenario to exploit this is unlikely to occur in production.
Are you re-enabling TPS? Feel free to leave a comment on why or why not.
As I type this, I see I haven't written a blog post in over 5 months. I wish I had a good excuse for that but I don't. The truth is it was really a number of different things:
1) Writing 3 books in one and a half years (from Nov. 2012 to April 2014) consumed a huge amount of time and made it hard to find time to blog. Once the books were done I was a bit burned out and needed some time off.
2) I took a new role at EMC at the beginning of 2014 which has taken a lot of my time. It's an exciting new role where I get to help EMC's Professional Services deliver the best virtualization services to our customers. That new role has left me very little time to blog.
3) I have young kids who happily take a lot of my free time.
Another aspect that has changed is that I've changed my virtualization focus as a result of this new role. Instead of focusing mostly on virtualizing business critical applications, my focus is now more on hybrid cloud and the technologies that accompany it. I've been getting up to speed on vCloud Automation Center, vCenter Orchestrator, and a little bit of Application Director. In this new role I cover all of virtualization, but I've been focusing on trying to learn these cloud/automation tools.
As a result I haven't had a lot to blog about. I may start blogging on my experience in learning these applications. I suspect my "pivot" towards cloud automation will be somewhat common in our industry, so my journey may be relevant to others. I'll do my best, but I recognize I'll never be a prolific blogger that makes 6 or 7 posts per month.
To those that are still following this blog, thanks! I still love blogging and even writing this post has been fun. Hopefully I can get around to producing new content soon.
I’m very happy to announce that my new book, VMware vSphere Performance, written with co-authors Rynardt Spies and Christopher Kusek, has been released! It’s available right now in Kindle format and the hard copy will be available on May 12th!
The book is focused on providing guidance for how to design vSphere to get the best performance out of your virtual machines. I wanted the book to be more than a paperback version of vSphere’s own performance best practices paper – we’ve also included tips, tricks, and stories of what we’ve seen in the field with real customers. We also include some tips for troubleshooting performance problems that you may encounter. We hope you’ll find it valuable across current and future versions of vSphere.
I first became involved with this book back in late 2011 as the book’s technical editor. There were numerous delays and issues that aren’t worth getting into, but there was very little progress. I had been emailing back and forth with one of the authors of the book, Jonathon Fitch, trying to help where I could and give guidance whenever possible. It was during those email exchanges that he told me he was fighting cancer.
Fast forward to early 2013, and I’m in the middle of writing Virtualizing Microsoft Business Critical Applications on VMware vSphere. I had been emailing with Jonathon to see if he was making any progress when he told me that his disease had progressed to a point where he didn’t think he’d be able to finish the book. He tried to get into some experimental treatment program but unfortunately he didn’t qualify, and his doctors told him he had just a few weeks left. The news hit me pretty hard even though I didn’t know him well.
I told Jonathon I would take over his chapters, making sure as much of his content remained in the book as possible, and that his name would remain on the cover. (Sybex originally agreed to that but changed their minds and his name is not printed on the cover – I'm not happy about it but I understand their decision.) After not hearing from him for a little while and thinking the worst, I got the bad news that on April 1, 2013, Jonathon had died. His wife told me that he was very proud of being a part of the book and it was important to him.
I didn’t know Jonathon before becoming involved in this book and mostly communicated with him through email. He wasn’t active on Twitter or other social media so he wasn’t well known. Even still, he was a member of our virtualization community and it’s sad to lose one of our own. I tried to keep as much of his original content in the book so that his hard work would survive. We dedicate this book to Jonathon and I sincerely hope that his family is proud of what he was able to accomplish while fighting his disease.
It was a long road to finishing this book – I was still writing the VBCA book and was also involved in Mastering VMware vSphere 5.5, so the idea of taking on another book was not high on my list but I made a commitment to Jonathon so I stuck with it. It wasn’t without its challenges to say the least. I wanted to keep as much of his content in the book as possible, but his chapters were still largely based on the content he’d written in 2011, some of it going as far back as vSphere 4.1. What do you do when you don’t fully understand what he meant in a sentence or paragraph and you can’t ask him to clarify? Or how do you keep the original author’s words intact when he’s writing based on a version of vSphere that’s 3 revisions older than the current version?
Much of the time I felt like Dante from Clerks when he would say, “I’m not even supposed to be here today!” If you’re familiar with Clerks then you’ll know what I mean. Or, see below.
Thankfully I was able to push through the issues and challenges and produce something I think Jonathon would be proud of. I did what I could to keep his words intact while updating the content to be relevant to the most recent version of vSphere. I wish we could have kept his name on the cover, but his hard work paid off and we have all dedicated the book to him. I really hope it brings his family some comfort.
Rather than end on a sad note, instead I’d rather post some pictures of my adorable kids holding a copy of the book.
We’ve all seen those ads on the Internet that promise to teach us “one weird trick” that “<someone> hates” to do something easily. Make money fast? Yep. Lose belly fat? Check. Grow “other” parts of your body? Easy.
I couldn’t resist stealing the headline for something that I’ve seen come up time and time again when virtualizing large applications or monster VMs.
Raise your virtual hand if you’ve heard, or said, something like this in the past:
All VMs should start out with only 1 vCPU.
Most VMs don’t need more than 1 vCPU.
Giving a VM more than 1 vCPU could hurt performance.
I hear this one all the time. I even used to say something like it myself. When I think back to my early days with virtualization I used to say things like, “If a server needs more than 3.6GB of RAM, it probably isn’t a good candidate for virtualization.” (3.6GB was the max RAM supported in a guest OS in ESX 2.x) But is that the right way to approach sizing all virtual machines?
These days, most of the time I hear something like this as a result of a software vendor’s sizing guidance for their application.
“Why should I give that VM 4 vCPUs and 16GB of RAM? There is no way it really needs it!”
The thing to remember here is this: While starting small when sizing VMs is a good practice, the application’s actual requirements should dictate the sizing. That’s the “one weird trick” – understand the application’s requirements before rushing to judge the sizing. Just because you don’t like the idea of giving a VM a lot of resources doesn’t mean it doesn’t actually need them. Trust me – I’m a big believer in starting small and sizing VMs with fewer resources if the resource requirements are not well known or defined. That doesn’t apply to all VMs.
I hear it too often – folks who automatically rush to judge the sizing of a virtual machine just because they think it’s “too big.” We need to get past the mentality of “anything larger than 1 vCPU is bad.”
Before you head to the comments and say something like, “Vendor supplied requirements are BS. Vendor x told me that their application needed y requirements and it doesn’t use even half of that.” hear me out. Is that true sometimes? Definitely. But should that be your default answer? Definitely not. If the application you’re trying to virtualize is currently physical, a good way to determine if the sizing is accurate is to simply measure the existing load. But don’t forget – the current utilization alone is not enough to size the VM. You need the application’s actual requirements. I’ll illustrate that point with a story.
I worked with a customer last year to help them virtualize an important Microsoft SharePoint FAST Search implementation. We ran capacity assessments on the existing physical servers with both VMware Capacity Planner and Microsoft Assessment and Planning Toolkit. Both tools independently came to the same conclusion: based on the current utilization, these servers likely need 2 vCPUs and 8GB of RAM.
When I worked with the FAST architects, they wanted their servers sized with 8 vCPUs and 24GB of RAM. Naturally there was confusion as to why the current physical servers were only using a quarter of those resources. It came down to one thing: requirements.
The customer had a specific requirement to be able to support a certain number of QPS (queries per second) at a certain level of latency. In order to meet those requirements we needed to size the servers with 8 vCPUs and 24GB of RAM. We validated that design by running FAST performance tests at different CPU/memory sizing on the FAST VMs. Each time we were not able to meet the QPS requirements unless it was sized correctly.
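The validation approach described above, testing several sizes and keeping the smallest one that meets the requirement, can be sketched in Python. The test figures below are invented purely for illustration (they are not the customer's numbers), and the names `PerfResult` and `smallest_passing_size` are hypothetical.

```python
from typing import NamedTuple, Optional


class PerfResult(NamedTuple):
    vcpus: int
    memory_gb: int
    qps: float          # measured queries per second
    latency_ms: float   # measured query latency


def smallest_passing_size(results, min_qps, max_latency_ms) -> Optional[PerfResult]:
    """Return the smallest tested size that meets both requirements, or None."""
    passing = [r for r in results
               if r.qps >= min_qps and r.latency_ms <= max_latency_ms]
    return min(passing, key=lambda r: (r.vcpus, r.memory_gb), default=None)


# Invented test results for three candidate VM sizes
results = [
    PerfResult(vcpus=2, memory_gb=8, qps=40, latency_ms=900),
    PerfResult(vcpus=4, memory_gb=16, qps=75, latency_ms=400),
    PerfResult(vcpus=8, memory_gb=24, qps=120, latency_ms=150),
]

# Invented requirement: at least 100 QPS at no more than 200 ms
print(smallest_passing_size(results, min_qps=100, max_latency_ms=200))
```

The point of the sketch is that the requirement, not the utilization of the old servers, decides which size wins.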
What’s the moral of this story? Don’t be afraid of giving a virtual machine the resources it actually requires. Remember – there are tools like vCenter Operations that can tell you, over time, how a VM is using its resources. This provides you the opportunity to right-size the workload if the VM truly doesn’t need the resources.
Understand the vendor stated requirements as well as those of your customer/application owner. If you get a request for a SQL Server (or similar) and your immediate response is, “I’m not giving it that many vCPUs…” stop and make sure you truly understand the application’s requirements. Validate the sizing with performance testing and adjust if necessary. You’ll have happier application owners and better performing applications.
It’s been too long since I’ve blogged anything on a regular basis. Sorry about that. I would love to blame it on the three books I’ve written in the last year but that isn’t the only reason. Hopefully I can turn it around and start writing on a regular basis again.
I’ve been sitting on this one for a while but finally had some time to get it down. When it comes to virtualizing business critical applications, the perception of poor performance from virtualized applications becomes reality whether it is true or not. It might be because organizations tried to virtualize performance intensive applications years ago when the ESX/ESXi platform was less mature and that memory lingers. It’s up to us to help educate those organizations that VMware has improved vSphere considerably and it is now capable of scaling to meet the demands of just about any application.
Fighting perception often means fighting nonsensical things that vendors say. Before I get into this I wanted to say that I consider Microsoft to be very virtualization friendly. They provide excellent guidance for virtualizing many of their products. Unfortunately not all of their guidance is so great.
Take a look at the following guidance on virtualizing FAST Search Server (a component of SharePoint 2010) – bold emphasis is mine:
FAST Search Server 2010 for SharePoint supports virtualization, but for larger deployments, we recommend that you only use virtualization for your test and development environments- not the production environment. Our rationale is as follows:
FAST Search Server 2010 for SharePoint is a heavy user of CPU and IO
Virtual machines only have access to a limited number of CPU cores (Hyper-V = 4, VM ware = 8)
Virtual machines will give 30-40% decrease in overall performance
Yes – it’s bad enough that I’m actually ignoring that they spelled VMware as “VM ware” and focusing on the fact that they say virtual machines will give a 30-40% decrease in overall performance. I’m not sure where they’re getting this data (perhaps from older versions of vSphere or Hyper-V) but it isn’t true today. Having been personally involved in the performance tests of production FAST Search Server deployments, I can say definitively that this is not the case as long as the deployment is sized correctly (like any other app).
My advice to you: If you see a vendor make a claim like this, ask for the data to back it up. Perform your own testing and validate whether it’s true. Make them prove it.
If we all fight this perception maybe vendors will notice and finally stop saying things like this. We all benefit when that happens.
I have to admit - Exchange virtualization is one of my favorite topics. I love talking about it with customers, colleagues, or anyone who will listen. It has a very well known workload profile which makes sizing consistent, it’s virtualization friendly in that it supports vSphere features like vMotion and HA, and I like talking about the challenges and creative solutions that are sometimes required to virtualize it. Yep, Exchange virtualization is my pal.
I’m also very familiar with its support limitations as I’ve written about them in the past (see here). One of the support policies has to do with CPU over-subscription, that is, assigning more virtual CPUs (vCPUs) than there are physical CPUs (pCPUs) in the host. From Microsoft’s support policy page for Exchange 2013 virtualization (though this applies to Exchange 2010 also):
Many hardware virtualization products allow you to specify the number of virtual processors that should be allocated to each guest virtual machine. The virtual processors located in the guest virtual machine share a fixed number of logical processors in the physical system. Exchange supports a virtual processor-to-logical processor ratio no greater than 2:1, although we recommend a ratio of 1:1. For example, a dual processor system using quad core processors contains a total of 8 logical processors in the host system. On a system with this configuration, don't allocate more than a total of 16 virtual processors to all guest virtual machines combined.
Note the section I bolded above – it says that you cannot go beyond a 2:1 vCPU to pCPU ratio and that 1:1 is recommended. Not only are we used to much larger consolidation ratios, but once you factor in DRS automatically moving VMs to balance out utilization, the ratio becomes almost impossible to enforce.
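To make the arithmetic in Microsoft's example concrete, here is a minimal sketch of the ratio check. The function names and the `max_ratio` parameter are my own, invented for illustration; they are not part of any Microsoft or VMware tool.

```python
def vcpu_ratio(total_vcpus: int, logical_processors: int) -> float:
    """Return the virtual-to-logical processor ratio for one host."""
    if logical_processors <= 0:
        raise ValueError("logical_processors must be positive")
    return total_vcpus / logical_processors


def within_exchange_policy(total_vcpus: int, logical_processors: int,
                           max_ratio: float = 2.0) -> bool:
    """True if the vCPUs of all guests combined stay within max_ratio."""
    return vcpu_ratio(total_vcpus, logical_processors) <= max_ratio


# Microsoft's example: dual processor, quad core = 8 logical processors.
logical = 2 * 4
print(within_exchange_policy(16, logical))  # 16 vCPUs is exactly 2:1
print(within_exchange_policy(17, logical))  # one more vCPU exceeds 2:1
```

Keep in mind that the count covers every VM on the host, not just the Exchange VMs, which is why a single DRS migration can push a host over the limit.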
Combine this with the disparity between the amount of RAM and the number of CPUs in a typical host. We have customers that have deployed ESXi hosts that have 16-24 physical CPUs but 1TB of RAM. If you want to virtualize Exchange on those servers, at first glance you’re left with only two possible outcomes:
1) You ignore Microsoft’s support restriction and exceed the 2:1 vCPU/pCPU ratio.
2) You don’t use anywhere near the amount of RAM you have in the host.
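Here is a rough sketch of how much RAM the second outcome can leave idle. The host numbers (24 physical CPUs, 1TB of RAM) come from the example above; the VM shape of 4 vCPUs and 32GB is purely an assumption I picked for illustration, as are the function and parameter names.

```python
def ram_stranded_by_ratio(cores: int, host_ram_gb: int,
                          vm_vcpus: int, vm_ram_gb: int,
                          max_ratio: float = 2.0):
    """Return (vm_count, ram_used_gb, ram_unused_gb) when the vCPU:pCPU
    cap, not memory, decides how many identical VMs fit on the host."""
    max_vcpus = int(cores * max_ratio)
    fit_by_cpu = max_vcpus // vm_vcpus
    fit_by_ram = host_ram_gb // vm_ram_gb
    vm_count = min(fit_by_cpu, fit_by_ram)
    ram_used = vm_count * vm_ram_gb
    return vm_count, ram_used, host_ram_gb - ram_used


# Host from the text: 24 physical CPUs, 1TB (1024GB) of RAM.
# Hypothetical VM shape: 4 vCPUs and 32GB of RAM each.
count, used, unused = ram_stranded_by_ratio(24, 1024, 4, 32)
print(count, used, unused)  # 12 384 640
```

Under those assumptions the 2:1 cap stops you at 12 VMs, and well over half of the host's memory goes unused.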
The good news is there’s a third option that allows you to work around this problem: CPU reservations. CPU reservations in vSphere allow you to guarantee that important virtual machines will have access to the CPU resources they need even if CPU resources are constrained. This works great for virtualizing Exchange for a few reasons:
1) You can effectively exceed the 2:1 vCPU:pCPU ratio for workloads other than Exchange while guaranteeing Exchange VMs access to those CPU resources. Doing so does not violate Microsoft’s support policies – see this great TechEd presentation on virtualizing Exchange where Microsoft’s Jeff Mealiffe discusses this as an option.
2) Hosts that have a disproportionate amount of RAM to CPUs can still be utilized without wasting resources.
The other nice thing about CPU reservations is that when resources are not over-subscribed, CPU resources can be shared evenly with VMs that do not have a CPU reservation. In this way they work differently than memory reservations, which do not allow reserved memory resources to be shared with other VMs even if memory is plentiful. With CPU reservations, when resources are available, all VMs get equal access to the CPU. Once CPU resources become over-subscribed, the VMs with a reservation get guaranteed access to their reserved CPU resources while VMs without CPU reservations compete for whatever is left.
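The behavior described above can be illustrated with a toy model. To be clear, this is not how the ESXi scheduler is implemented: it ignores shares, limits, and scheduling details, and every name and number in it is an assumption made for illustration. It only shows the two cases: with spare capacity every VM gets what it asks for, and under contention the reserved VM is served first.

```python
def allocate_cpu(capacity_mhz, vms):
    """Toy CPU allocation. vms maps name -> (demand_mhz, reservation_mhz).

    Step 1: each VM is guaranteed min(demand, reservation).
    Step 2: leftover capacity is split evenly among VMs that still have
            unmet demand, never giving a VM more than it demands.
    """
    alloc = {name: min(d, r) for name, (d, r) in vms.items()}
    remaining = capacity_mhz - sum(alloc.values())
    if remaining < 0:
        raise ValueError("guaranteed CPU exceeds host capacity")

    unmet = {n for n, (d, _) in vms.items() if alloc[n] < d}
    while unmet and remaining > 1e-9:
        share = remaining / len(unmet)
        for n in sorted(unmet):
            give = min(share, vms[n][0] - alloc[n])
            alloc[n] += give
            remaining -= give
        unmet = {n for n in unmet if vms[n][0] - alloc[n] > 1e-9}
    return alloc


host_mhz = 10_000  # hypothetical host capacity

# Quiet host: total demand (7,000 MHz) fits, so every VM gets its demand
# and the unused part of the Exchange reservation is free for others.
quiet = {"exchange": (3000, 6000), "web": (2000, 0), "batch": (2000, 0)}
print(allocate_cpu(host_mhz, quiet))

# Busy host: total demand (14,000 MHz) exceeds capacity. Exchange keeps
# its reserved 6,000 MHz; the other two split the remaining 4,000 MHz.
busy = {"exchange": (6000, 6000), "web": (4000, 0), "batch": (4000, 0)}
print(allocate_cpu(host_mhz, busy))
```

In the busy case the Exchange VM is unaffected by the over-subscription, while the non-Exchange VMs absorb the shortfall.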
It’s worth mentioning that this post came about mostly as a result of a Twitter conversation that Chris Wahl (@ChrisWahl) brought me into recently. This is one of the reasons why I love Twitter – great conversation that led to sharing of information.
I hope this helps those who are looking to virtualize Exchange and are concerned about the CPU over-subscription limitations. Using CPU reservations in vSphere will let you stay compliant with Exchange support while allowing you to over-subscribe CPU resources for non-Exchange workloads. Everybody wins!