NVIDIA GRID vGPU Deep Dive! Lessons Learned from the Trenches: Guest Blog Post by Richard Hoffman


Posted Jan 28 2015 by Dane Young with 7 Comments

I mentioned a colleague of mine, Richard Hoffman, in a previous article about XenServer 6.2 Dynamic Memory Control (DMC) and the Blue Screen of Death (BSOD) events it can cause, which he was instrumental in discovering. I’ve also written a newer blog post explaining that this has been resolved in XenServer 6.5, so vGPU should now work with DMC without issue. With a little encouragement, I have convinced Richard to share some of his key findings from a Citrix / NVIDIA GRID project he has been involved in for the last several months. If you have content you would like to contribute as a guest (or regular) blogger on itvce.com, feel free to reach out to me at dane @ itvce.com or on Twitter @youngtech so we can discuss. Without further ado, below is the guest blog post by Richard Hoffman. You can find him on LinkedIn or on Twitter.

I want to share some information for any of you that are getting up to speed on an NVIDIA GRID vGPU project. There are lots of good guides and articles out there and I won’t try to replicate all that information. What I have included is information that was either not documented well or not documented at all.

Here are the subjects that I am covering:
-How are GPUs shared?
-Where does GPU virtualization occur?
-Comparing vGPU and passthrough specifications
-Dynamic Memory Control incompatibility with vGPU in XenServer 6.2
-Disabling ECC
-Direction-specific fans in the GRID cards

How Are GPUs Shared?

We have two ways of presenting the virtual desktop with GPU resources. “Pass-through” presents the entire physical GPU to the virtual desktop, giving you a 1:1 relationship. “Virtual GPU” allows multiple VMs to access the same physical GPU.

Virtual GPUs give each VM a dedicated portion of video RAM. In other words, the vGPUs do not share RAM in any way. The video RAM on the physical GPU is divided up and each portion is dedicated to a particular VM. However, the GPU cores are time-sliced, similar to how a CPU is time-sliced on a hypervisor. So if a GRID GPU has 192 GPU cores (as each GPU on a K1 card does), each virtual desktop gets all 192 cores for a split second, and then the next virtual desktop gets them for a split second, and so on. If there is no contention for GPU cores, then one virtual desktop gets full access to the cores for the duration of the session. If there is contention, then each desktop gets a time-slice of the GPU cores.
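To make the time-slicing idea concrete, here is a minimal Python sketch. It is only an illustration: the real scheduler is proprietary to NVIDIA and runs in hardware, and the round-robin policy, the function name `time_slice` and the VM names are all assumptions made for this example.

```python
from collections import deque


def time_slice(vms, slices):
    """Return which VM owns the whole GPU during each time slice.

    `vms` maps a (hypothetical) VM name to the number of slices of GPU
    work it still has queued. Round-robin is an assumed policy, used
    here only to illustrate the idea of time-slicing.
    """
    # Only VMs that actually have work compete for the GPU.
    queue = deque(name for name, work in vms.items() if work > 0)
    remaining = dict(vms)
    schedule = []
    while queue and len(schedule) < slices:
        name = queue.popleft()
        schedule.append(name)        # this VM gets every core for one slice
        remaining[name] -= 1
        if remaining[name] > 0:
            queue.append(name)       # still has work: back of the line
    return schedule


# No contention: a single busy VM keeps the GPU for every slice.
print(time_slice({"vm1": 3}, 3))
# Contention: two busy VMs alternate, each getting all cores in turn.
print(time_slice({"vm1": 2, "vm2": 2}, 4))
```

Note that video RAM does not appear in the sketch at all. Each vGPU's frame buffer is dedicated, so only the cores take turns.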

Where does GPU virtualization occur?

This answer is not documented anywhere, and I was given conflicting information during my search for it. I was first told that the scheduling of the GPU cores occurs within the hypervisor, but this is incorrect. The virtualization occurs at the hardware level and the technology is proprietary to NVIDIA. Scheduling is handled by the scheduler in the GPU chip itself.

The NVIDIA GRID Manager, installed on XenServer, communicates with the physical GPUs to determine where a VM can be placed. Once the VM is assigned to a physical GPU, the GRID Manager steps out of the way and communication from the NVIDIA driver in the guest OS is direct to the GPU.

I have only seen the diagram below from one presentation and nowhere else online. This slide is unique in that it shows the NVIDIA Kernel Driver residing in Dom0 of XenServer. This driver allows the NVIDIA GRID Manager to communicate with the GRID board and GPUs in order to assign VMs and monitor ongoing usage of the GPUs. This driver is not responsible for graphics delivery between the VMs and the GPUs.

Graphics delivery occurs between the NVIDIA driver installed within the virtual desktop guest OS and the physical GPUs. That is the benefit of the GRID solution over VMware’s current vSGA, which uses a translation or “shim” driver installed on the hypervisor. This direct communication from the NVIDIA driver to the GPU gives better fidelity and less overhead than vSGA.

Diagram Courtesy of NVIDIA

Comparing vGPU and Passthrough Specifications

I also found that NVIDIA’s documentation compares vGPU profiles to each other but does not compare vGPU profiles to passthrough. This chart below shows a comparison of the vGPU profiles but no passthrough specs are included.

From: http://www.nvidia.com/object/virtual-gpus.html

The below chart shows the specs for the GRID cards and those can be used to calculate the specs of the passthrough GPUs.

From: http://www.nvidia.com/content/grid/resources/10268_NVIDIA_GRID_DS_SEP14_US_LR.pdf

The “Total Memory Size” is listed for the K1 and K2 cards as 16GB and 8GB, respectively. This is the memory for the entire card, not the memory allocated to a passthrough GPU. For instance, both a passthrough K1 GPU and a passthrough K2 GPU get 4GB of video RAM. The 16GB of video RAM on the K1 card is divided between its four physical GPUs. The 8GB of video RAM on the K2 card is divided between its two physical GPUs.
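The per-GPU arithmetic can be written out as a short Python sketch. The board totals and GPU counts come straight from the paragraph above; the dictionary layout and the function name `passthrough_memory_gb` are just illustrative choices.

```python
# Board-level figures as stated in the text: total video RAM on the card
# and the number of physical GPUs that share it.
BOARDS = {
    "K1": {"total_memory_gb": 16, "physical_gpus": 4},
    "K2": {"total_memory_gb": 8, "physical_gpus": 2},
}


def passthrough_memory_gb(board):
    """Video RAM that a single passthrough GPU on `board` gets, in GB."""
    spec = BOARDS[board]
    return spec["total_memory_gb"] / spec["physical_gpus"]


for board in BOARDS:
    print(board, passthrough_memory_gb(board), "GB per passthrough GPU")
```

Both boards work out to 4 GB per physical GPU, which is why the passthrough rows for the K1 and K2 list the same amount of video RAM.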

To assist discussing this with clients and end-users, I have combined the above two charts and also added specs for two NVIDIA Quadro GPU cards for physical workstations. See the chart below.

GPU Board	GPU Profile	GPU Cores	Video RAM	Max Displays Per User	Max Resolution Per Display
K1	Passthrough	192 (Dedicated)	4 GB	4	2560 x 1600
K1	K180Q	192 (Time Slice)	4 GB	4	2560 x 1600
K1	K160Q	192 (Time Slice)	2 GB	4	2560 x 1600
K1	K140Q	192 (Time Slice)	1 GB	2	2560 x 1600
K1	K120Q	192 (Time Slice)	512 MB	2	2560 x 1600
K1	K100	192 (Time Slice)	256 MB	2	2560 x 1600
K2	Passthrough	1536 (Dedicated)	4 GB	4	2560 x 1600
K2	K280Q	1536 (Time Slice)	4 GB	4	2560 x 1600
K2	K260Q	1536 (Time Slice)	2 GB	4	2560 x 1600
K2	K240Q	1536 (Time Slice)	1 GB	2	2560 x 1600
K2	K220Q	1536 (Time Slice)	512 MB	2	2560 x 1600
K2	K200	1536 (Time Slice)	256 MB	2	2560 x 1600
Quadro K600 (for physical workstations)	Same core count as K1 Passthrough. Video RAM is less.	192	1 GB	2	DP 1.2: 3840 × 2160 DVI-I DL: 2560 × 1600 DVI-I SL: 1920 × 1200 VGA: 2048 × 1536
Quadro K5000 (for physical workstations)	Same core count and video RAM as K2 Passthrough.	1536	4 GB	4	DP 1.2: 3840 × 2160 DVI-I DL: 2560 × 1600

It’s important to note that the comparison of the Quadro K600 and K5000 cards to the GRID GPUs is really for the core count. The video RAM on a K600 card is 1GB while a K1 passthrough gets 4GB of video RAM. This is shown in the above chart.

Another chart that will help your discussions is below. It shows how the K600 and K5000, used in the chart above, compare with the entire line of NVIDIA Quadro GPU cards. The “K” that precedes the card model number stands for “Kepler.” That is NVIDIA’s current GPU architecture. The cards listed at the bottom of the chart, without a “K,” use NVIDIA’s older architecture, called “Fermi.”

Chart courtesy of NVIDIA

Dynamic Memory Control incompatibility with vGPU in XenServer 6.2

Aside from a post on the GRID forums that I started and the subsequent article that Dane wrote, there is no documentation online that Dynamic Memory Control is incompatible with vGPU in XenServer 6.2. In short, it causes the VMs to blue screen. I understand that this is fixed in XenServer 6.5.

http://blog.itvce.com/2015/01/13/xenserver-6-5-dynamic-memory-and-nvidia-grid-vgpu-now-fixed-in-6-5-go-for-it/

By default, vSphere lets you overcommit RAM to virtual machines. The default behavior in XenServer is to dedicate RAM to each VM. It may be tempting to turn on Dynamic Memory Control to get better user density, but it should not be done with vGPU in XenServer 6.2. Dane’s write-up on this issue is below.

http://blog.itvce.com/2015/01/02/xenserver-dynamic-memory-and-nvidia-grid-vgpu-dont-do-it/
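If you want to audit VMs for this before assigning vGPUs, a check along the following lines may help. It is a sketch under assumptions: it treats a VM as using Dynamic Memory Control when its dynamic minimum is below its dynamic maximum, and it expects you to have already collected the `memory-dynamic-min` and `memory-dynamic-max` values (in bytes) for each VM, for example with `xe vm-param-get`. The function name and sample values are hypothetical.

```python
def dmc_in_use(params):
    """Return True if the VM's memory is allowed to vary.

    `params` is a dict holding the VM's "memory-dynamic-min" and
    "memory-dynamic-max" values in bytes (strings or ints). Treating
    "min below max" as "DMC in use" is an assumption of this sketch.
    """
    dynamic_min = int(params["memory-dynamic-min"])
    dynamic_max = int(params["memory-dynamic-max"])
    return dynamic_min < dynamic_max


# Hypothetical sample values: a fixed 4 GiB VM and a 2-4 GiB dynamic VM.
fixed_vm = {"memory-dynamic-min": "4294967296", "memory-dynamic-max": "4294967296"}
dynamic_vm = {"memory-dynamic-min": "2147483648", "memory-dynamic-max": "4294967296"}

print(dmc_in_use(fixed_vm))    # memory is pinned, safe for vGPU on 6.2
print(dmc_in_use(dynamic_vm))  # memory can balloon, avoid with vGPU on 6.2
```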

Disabling ECC

If you run into an issue where the virtual desktops fail to start and XenServer gives an error that the “vgpu exited unexpectedly,” check if ECC is enabled on your cards. ECC (Error Correcting Code) in the video RAM will cause this error to occur. ECC is not an option on the K1 cards but is on the K2 cards. I encountered some hosts that had one of the three K2 cards with ECC turned on. The cards came like this direct from the OEM. I recommend adding this check to your build steps to ensure ECC is turned off before workloads are put on these cards.

If you run an “nvidia-smi” command on the XenServer, the far right column, under “ECC,” will say, “N/A.” That confirms that it is not applicable for the K1 cards.

Running nvidia-smi on the K2 cards shows that it is appropriately set to zero.

This is the command to turn off ECC.

nvidia-smi -i <ID> -e 0 (where “ID” is the ID that nvidia-smi reports for each GPU. The ID starts at zero and goes up one by one.)

The output will look like the below:
[root@ussclpdvsxen009 ~]# nvidia-smi -i 0 -e 0

Disabled ECC support for GPU 0000:06:00.0.

All done.

Reboot required.
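To add the ECC check to your build steps, something like the following Python sketch could be a starting point. It assumes you have captured CSV output of the form `index, name, ECC mode` from each host, for instance via `nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv,noheader`. Whether that query option is available depends on your driver version, so treat it as an assumption. The sample text and function names are made up for illustration. The script only prints the disable commands from this article and does not run them.

```python
import csv
import io


def gpus_with_ecc_enabled(csv_text):
    """Return the indexes of GPUs whose current ECC mode is 'Enabled'.

    Expects one GPU per line as "index, name, ecc_mode". Anything that
    is not exactly 'Enabled' (e.g. 'Disabled' or 'N/A' on K1 GPUs) is
    treated as not needing a change.
    """
    enabled = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, _name, ecc_mode = (field.strip() for field in row)
        if ecc_mode == "Enabled":
            enabled.append(int(index))
    return enabled


def disable_commands(gpu_indexes):
    """Build the 'nvidia-smi -i <ID> -e 0' command for each GPU index."""
    return ["nvidia-smi -i {} -e 0".format(i) for i in gpu_indexes]


# Hypothetical captured output: one K2 GPU shipped with ECC turned on.
sample = "0, GRID K2, Enabled\n1, GRID K2, Disabled\n2, GRID K1, N/A\n"

for command in disable_commands(gpus_with_ecc_enabled(sample)):
    print(command)
```

Remember that a reboot is required after ECC is disabled, so run the check before workloads are placed on the host.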

Direction-specific fans in the GRID cards

NVIDIA GRID K1 cards are not sensitive to airflow direction. The information below applies primarily to K2 cards.

GRID cards have two variations in the direction that their fans blow. If you are installing or swapping out the cards be sure to note which is which or the cards could overheat if placed in the wrong position in the server. The form factor of the cards is identical so they could easily be confused. Fortunately, there are two ways to determine airflow direction: The white arrow on the front of the card, and the part numbers that are shown below. The white arrow is shown in the image below:

The part numbers also tell you which is which. At the time of my project, the regular part numbers were not printed on the cards, but the “PCB part number” was. I have cross-referenced the two part numbers below, along with the airflow direction for each: right-to-left (R2L) or left-to-right (L2R).

PCB part number 699-52055-0552-311 = Regular part number 900-52055-0020-000 (R2L)

PCB part number 699-52055-0550-311 = Regular part number 900-52055-0010-000 (L2R)
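If you are tracking many cards, the cross-reference above is easy to keep as a small lookup table. The Python sketch below contains only the two part-number pairs listed here. The function name `airflow_for` is an illustrative choice, and any PCB part number not in the table returns `None` rather than a guess.

```python
# PCB part number -> (regular part number, airflow direction),
# taken from the two pairs listed in this article.
AIRFLOW_BY_PCB_PART = {
    "699-52055-0552-311": ("900-52055-0020-000", "R2L"),
    "699-52055-0550-311": ("900-52055-0010-000", "L2R"),
}


def airflow_for(pcb_part_number):
    """Return (regular part number, airflow) or None if the part is unknown."""
    return AIRFLOW_BY_PCB_PART.get(pcb_part_number.strip())


print(airflow_for("699-52055-0552-311"))
print(airflow_for("699-52055-0550-311"))
print(airflow_for("000-00000-0000-000"))  # unknown part number
```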

You cannot retrieve either of the part numbers from running the nvidia-smi CLI tool. The PCB part number is printed on the circuit board shown circled below.

Hopefully this will help make a smooth vGPU project for you!
Cheers,
Richard

I’d like to thank Richard for taking the time to put this content together. Well done pal, glad we could encourage you to write your first blog post! In our little world of Information Technology and virtualization, little nuggets of knowledge like this can go a long way to help fellow brothers and sisters in arms. If you have content you would like to contribute and be a guest (or regular) blogger on itvce.com, feel free to reach out to me at dane @ itvce.com or on Twitter @youngtech so we can discuss. Otherwise, feel free to leave comments, questions, or any feedback for Richard.

Jeff Rutherford, 08-04-2015

Dane, that’s a ton of great info. To date, the problem with VDI’s and high-performance 3D graphics was that GPUs were never designed to be virtualized. NVIDIA tackled this issue with GRID — a GPU designed to be virtualized and serve multiple concurrent users.

If someone finds your article and is still investigating/contemplating implementing NVIDIA GRID, here are a couple of resources that they might want to check out. This white paper – http://bit.ly/1IMEBRr – NVIDIA GRID: Graphics Accelerated VDI with the Visual Performance of a Workstation – explains NVIDIA GRID in more depth. Here’s a side-by-side comparison video of NVIDIA GRID vGPU vs. CPU only – http://bit.ly/1Fke5N7

Again, thanks for all the technical details on implementing NVIDIA GRID.

Jeff Rutherford, commenting on behalf of IDG, NVIDIA, VMware, and Dell

Tobias K, 01-28-2015

Lots of useful information, thank you. Don’t forget that fan speeds should not be allowed to go too low and that you need hefty power supplies to keep things going smoothly (preferably dual, in case one fails, which they do, or someone pulls a power cord out by mistake!). It is also worth mentioning that XenServer 6.5 supports up to 96 vGPUs per server, which is massive, and that vGPUs will now also work in conjunction with regular VMs on XenServer 6.5 Enterprise (in other words, XenDesktop/XenApp is not required). Finally, one must never forget that the GPU does not do _all_ the work, so right-sizing a VM with adequate memory and vCPUs to go along with it is important as well.

Rasmus, 01-28-2015

Great blog! Thank you for sharing. Keep up the good work!

Jason Southern, 01-28-2015

Hi Guys,

Nice article, lots of information in there that we cover in our training events, so great that you’re sharing with the wider world.

For a bit of extra info, it’s only the K2’s that are airflow direction sensitive. K1’s go either way.

And we print a big arrow on the K2 case to show the airflow direction. It’s a lot easier than looking up tiny little part numbers.

Get yourselves to GTC, we’ve got loads of new stuff like this to share with everyone…

- Dane Young, 01-28-2015

  Thanks Jason! Updated the blog post.
  @youngtech