XenServer 6.2 Dynamic Memory and NVIDIA GRID vGPU, Don’t Do It!
Updated 2015.1.13: XenServer 6.5 (aka Creedence) has been released to web, and this issue has been resolved in that release.
Click here to read my updated article for DMC and vGPU on XenServer 6.5
A couple months back I was working with a colleague of mine (Richard Hoffman on LinkedIn or Twitter) on a blue screen issue he identified on a fairly up-to-date XenServer 6.2, XenDesktop, NVIDIA GRID K1/K2 and vGPU deployment. In his case, he was experiencing crashes related to dxgmms1.sys, nvlddmkm.sys, and various other XenServer errors:
xenopsd internal error: Device.Ioemu_failed(“vgpu exited unexpectedly”)
xenopsd internal error: Failure(“Couldn’t lock GPU with device ID 0000:05:00.0”)
Dynamic Memory Control Explained
If you are unfamiliar with XenServer Dynamic Memory Control (DMC), you can review this guide:
Like the memory optimization techniques in vSphere and Hyper-V, DMC lets you set a minimum and a maximum (a low and a high watermark) and then lets the XenServer host manage the allocation to the virtual machine. I have done fairly extensive testing on DMC and its impact on single-server density, and there are definitely some good, bad, and ugly characteristics. The screenshot below is an example of how you would configure Dynamic Memory for a specific virtual machine:
Although my friend and NVIDIA colleague Steve Harpster posted this in the forums several months ago, in my opinion it is not yet common knowledge that on XenServer 6.2 or older you should NOT configure Dynamic Memory in conjunction with NVIDIA GRID vGPU. This has been resolved in XenServer 6.5! This blog post is my attempt to spread awareness, as this can be a very common mistake with unintended consequences. Here’s the explanation from Steve:
Hi Ricard and all.
Just to update everybody on this issue. It turns out that Memory Balooning was enabled on these servers. vGPU today does not support Memory Ballooning. Here is an article on the subject: http://www.citrix.com/content/dam/citrix/en_us/documents/products-solutions/citrix-xenserver-dynamic-memory-control-quick-start-guide.pdf
The reasons are that if you overprovision system memory, graphics performance will take a huge hit when the VMM is paging system memory on behalf of the guests.
Once the servers had Memory Balooning disabled, the VMs seem to be stable now. We hope to support this feature at some point in the future.
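As a rough illustration of the rule Steve describes, here is a minimal Python sketch that flags VMs combining a vGPU with a dynamic memory range. It rests on an assumption: that DMC only has room to balloon a VM when its dynamic minimum is below its dynamic maximum, so a VM with the two pinned to the same value is treated as static. The `VmMemoryConfig` class, its fields, and the sample inventory are hypothetical and are not pulled from XenServer; you would need to populate them from your own environment.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class VmMemoryConfig:
    """Hypothetical summary of one VM's memory settings (values in bytes)."""
    name: str
    dynamic_min: int
    dynamic_max: int
    has_vgpu: bool


def dmc_enabled(vm):
    # Assumption: ballooning is only possible when the dynamic range is non-empty.
    return vm.dynamic_min < vm.dynamic_max


def risky_vms(vms):
    """Return the names of VMs that have a vGPU and a dynamic memory range."""
    return [vm.name for vm in vms if vm.has_vgpu and dmc_enabled(vm)]


# Made-up inventory, for illustration only.
inventory = [
    VmMemoryConfig("cad-ws-01", 8 * GIB, 16 * GIB, has_vgpu=True),    # vGPU + DMC
    VmMemoryConfig("cad-ws-02", 16 * GIB, 16 * GIB, has_vgpu=True),   # vGPU, static
    VmMemoryConfig("office-01", 2 * GIB, 4 * GIB, has_vgpu=False),    # DMC, no vGPU
]

if __name__ == "__main__":
    for name in risky_vms(inventory):
        print(f"{name}: vGPU assigned with a dynamic memory range, set it to static")
```

Only the first VM in the sample inventory is flagged, since it is the only one that has both a vGPU and a gap between its minimum and maximum.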
Looking ahead to 2015 and the forthcoming release of vGPU for vSphere, I would expect these limitations to be cross-platform. vSphere is a little different, as the hypervisor handles this automatically using four memory optimization techniques:
- Transparent page sharing (TPS)—reclaims memory by removing redundant pages with identical content
- Ballooning—reclaims memory by artificially increasing the memory pressure inside the guest
- Hypervisor swapping—reclaims memory by having ESX directly swap out the virtual machine’s memory
- Memory compression—reclaims memory by compressing the pages that need to be swapped out
In vSphere you disable memory optimization by enforcing a reservation, but the concept is the same. Thick vs. thin is an easy way of thinking about it.
Now the real question comes into play. If Dynamic Memory (or a similar technique) is not available and you have to put static memory reservations on your VDI or RDSH VMs to use NVIDIA GRID vGPU, how does that impact your single-server density and scalability? Is this going to throw a monkey wrench into your plans to get X number of users on Y piece of hardware? Do you always plan for 100% memory reservations for each VM, or do you allow the hypervisor to do a fair job of resource allocation?
While I agree that 100% reservations are a good practice to ensure a consistent user experience, monster VDI VMs are becoming more and more common. A couple of years back, two-by-fours (2 vCPU, 4GB RAM) were fairly common. Now I am seeing more and more higher-end knowledge worker use cases with 4×16, 4×32, and even a couple recently that were 8×32. If you thickly allocate 16GB or 32GB per user, you can see how even the beefiest 512GB, 768GB, or 1TB box will get eaten up pretty quickly. And, more often than not, these are persistent VDI VMs due to the nature of the work these folks are doing, so you can't use pooled/non-persistent VDI as a way to over-allocate access to the user population.
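To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes every VM's memory is fully reserved, so the VM count is simply usable host RAM divided by per-VM RAM. The `host_overhead_gb` parameter is a hypothetical placeholder for whatever the hypervisor itself consumes; it defaults to zero, which is optimistic. The sketch looks at memory only and ignores CPU and GPU limits.

```python
def vms_per_host(host_ram_gb, vm_ram_gb, host_overhead_gb=0):
    """Upper bound on fully reserved VMs that fit in host memory."""
    if vm_ram_gb <= 0:
        raise ValueError("vm_ram_gb must be positive")
    # Memory left for guests after the (assumed) hypervisor overhead.
    usable_gb = host_ram_gb - host_overhead_gb
    # Whole VMs only; never report a negative count.
    return max(usable_gb // vm_ram_gb, 0)


if __name__ == "__main__":
    for host_gb in (512, 768, 1024):
        for vm_gb in (16, 32):
            count = vms_per_host(host_gb, vm_gb)
            print(f"{host_gb}GB host, {vm_gb}GB per VM: at most {count} VMs")
```

With zero overhead, a 512GB host tops out at 32 VMs at 16GB each, or 16 VMs at 32GB each, and any real overhead only lowers those numbers.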
I know there’s a bit to think about here, but I want to start the dialog and raise awareness as the world continues to evolve. Hopefully this article was informative about the use of Dynamic Memory with NVIDIA GRID vGPU. More than that, this post should get you thinking about design and sizing implications as you plan your deployment using the best of the best technologies.
As always, if you have any questions, comments, or simply want to leave feedback, please do so in the comments section below. Please help me spread the news in this article so it can become common knowledge. If NVIDIA changes their stance relative to Dynamic Memory, I will be sure to update this article.
Big thanks to Richard Hoffman, Steve Harpster, Jason Southern, and Luke Wignall for sticking with this one to find a resolution.