Dust Free! NVIDIA GRID and a GPU Deep Dive: Guest Blog Post by Richard Hoffman
It's been a little while since a colleague of mine, Richard Hoffman, wrote his first blog post, which you can find here. In preparation for GTC 2016 in San Jose next month, Richard wanted to publish the following article. If you have content you would like to contribute and be a guest blogger (or regular) on itvce.com, feel free to reach out to me at dane@itvce.com or on Twitter @youngtech so we can discuss. Without further ado, below is the guest blog post by Richard Hoffman; you can find him on LinkedIn or on Twitter.
I wrote this article about a year ago but only published it for an internal team. In honor of the upcoming NVIDIA GTC 2016 conference this April, I have dusted it off and here it is! I'll explore how a GPU works and cover some lesser-known facts about the NVIDIA GRID vGPU system. To kick it off, let's start with a basic question: how do CPUs and GPUs differ? A CPU has a handful of cores and is optimized for sequential tasks. A GPU has hundreds or thousands of cores and is optimized for parallel tasks. But what is really going on inside the GPU? Does the below diagram of a GPU tell the whole story?
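To make that contrast concrete, here is a minimal CUDA C++ sketch of my own (not from NVIDIA's materials). The CPU function walks the array one element at a time on a single core, while the GPU kernel launches one lightweight thread per element and lets the hardware run them in parallel across its many cores.

#include <cstdio>
#include <cuda_runtime.h>

// CPU version: a single core walks the array sequentially.
void scale_cpu(const float* in, float* out, int n, float k) {
    for (int i = 0; i < n; ++i) out[i] = in[i] * k;
}

// GPU version: each CUDA thread handles one element, and thousands run at once.
__global__ void scale_gpu(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * k;
}

int main() {
    const int n = 1 << 20;                      // about one million elements
    const float k = 2.0f;
    float* h_in = new float[n];
    float* h_out = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    scale_cpu(h_in, h_out, n, k);               // sequential baseline on the CPU

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block, with enough blocks to cover every element.
    scale_gpu<<<(n + 255) / 256, 256>>>(d_in, d_out, n, k);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[42] = %f\n", h_out[42]);
    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;
    return 0;
}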
See the below diagram of the NVIDIA GRID vGPU implementation on a hypervisor. Notice the four nodes under "Timeshared Scheduling." These are the GPU's different engines, and each has a specific function. The first is the 3D engine, which is responsible for the powerful parallel processing. The second is the Copy Engine, which moves data through the GPU system. Third, we have the NVIDIA Encoder, or "NVENC." NVENC is responsible for encoding the graphics stream into the H.264 format. As of May 2015, XenDesktop's ICA protocol is encoded into H.264 by the CPU; Citrix may move that encoding onto the GPU in the future. Finally, the NVIDIA Decoder, or "NVDEC," decodes the graphics stream. For instance, it could decode an H.264 video stream on a physical endpoint with an NVIDIA GPU.
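As a rough illustration of how two of those engines can work at the same time, here is a small CUDA C++ sketch of my own (again, not NVIDIA sample code): the asynchronous copy issued in one stream is serviced by the copy engine while a kernel launched in another stream keeps the GPU cores busy. NVENC and NVDEC are driven through separate SDKs, so they are not shown here.

#include <cstdio>
#include <cuda_runtime.h>

// Busy-work kernel that keeps the GPU cores occupied for a while.
__global__ void busy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.000001f + 0.5f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *h_buf, *d_a, *d_b;
    cudaMallocHost((void**)&h_buf, n * sizeof(float));  // pinned host memory, required for async copies
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // The GPU cores run the kernel in one stream...
    busy_kernel<<<(n + 255) / 256, 256, 0, compute>>>(d_a, n);
    // ...while the copy engine moves data host-to-device in the other stream.
    cudaMemcpyAsync(d_b, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, copy);

    cudaDeviceSynchronize();                            // wait for both engines to finish
    printf("kernel and copy completed\n");

    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_buf);
    return 0;
}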
All four of these GPU engines are time-sliced among the virtual machines on the hypervisor. In the diagram above, there are two virtual machines running virtual GPUs on one physical GPU. In this scenario, the two virtual machines each get 50% of the physical GPU's time. Let's say that these two virtual machines are using a K260Q vGPU profile on a GRID K2 card. Each physical GPU on the K2 card has 1536 GPU cores. NVIDIA calls these "CUDA cores," but they are essentially processing cores. Each of our two virtual machines in this example gets access to all 1536 cores for at least 50% of the time. A fair-sharing model is applied here: a virtual machine gets a minimum of 50% of the GPU cores' time, but if only one virtual machine is active, it gets 100% of the cores' time.
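To put rough numbers on that fair-sharing model, here is a tiny sketch of my own (the 1536-core figure is from above; how many vGPUs actually fit on one physical GPU depends on the profile, as covered below). The point it illustrates is that a vGPU sees all of the cores for a slice of the time, not a slice of the cores all of the time.

#include <cstdio>

int main() {
    const int cuda_cores = 1536;  // cores on one physical GPU of a GRID K2
    for (int active_vgpus = 1; active_vgpus <= 4; ++active_vgpus) {
        // Each active vGPU is guaranteed at least 1/N of the GPU's time,
        // and during its time-slice it has access to all of the cores.
        double min_time_share = 100.0 / active_vgpus;
        printf("%d active vGPU(s): each sees all %d cores for at least %.0f%% of the time\n",
               active_vgpus, cuda_cores, min_time_share);
    }
    return 0;
}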
Shown above the four GPU engines is "Timeshared Scheduling," where the scheduling, or time-slicing, of the physical GPU occurs. The scheduling of the GPU cores is done at the hardware level on the GPU itself, not at the hypervisor level.
The framebuffer, or video RAM, for the GPU is shown in the bottom-right of the above image. A set amount of framebuffer is allocated to each virtual machine at startup. The framebuffer allocation remains static and is not shared. To continue our earlier example, if the above two virtual machines each have a K260Q vGPU profile, then they each have 2GB of dedicated framebuffer. The framebuffer ("graphics memory") for all GRID 1.0 vGPU profiles is shown in the second chart below. (Since the time of writing this, NVIDIA has released their GRID 2.0 product, and I'll have an article written on that soon. The principles of how the GPU works are basically the same between the 1.0 and 2.0 products, although some of the specs and capabilities have improved. This article is based on the GRID 1.0 product.)
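Because the framebuffer is carved up statically at VM startup, it is effectively what sets the vGPU density per physical GPU. Here is a small sketch of my own using the K2's published 4GB-per-physical-GPU figure and the 2GB K260Q profile from the example above; the same arithmetic applies to the other profiles in NVIDIA's chart.

#include <cstdio>

int main() {
    const int per_gpu_framebuffer_mb = 4096;  // framebuffer on one physical GPU of a GRID K2 board
    const int gpus_per_board         = 2;     // the K2 carries two physical GPUs
    const int k260q_framebuffer_mb   = 2048;  // dedicated framebuffer per K260Q vGPU

    int vgpus_per_gpu   = per_gpu_framebuffer_mb / k260q_framebuffer_mb;
    int vgpus_per_board = vgpus_per_gpu * gpus_per_board;

    printf("K260Q: %d vGPUs per physical GPU, %d per K2 board\n",
           vgpus_per_gpu, vgpus_per_board);
    return 0;
}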
The below diagram shows the graphics processing path through an NVIDIA GPU. At the top-left, graphics commands come into the GPU from the CPU. These commands are processed through the GPU cores and on to the framebuffer. If this were a physical workstation with a monitor, the pixels would finally be deposited here in the framebuffer. However, in a scenario like the NVIDIA GRID infrastructure, the graphics must be sent over the network. The journey continues from the framebuffer to NvFBC (Full-frame Buffer Capture) or NVIFR (In-band Frame Readback). NvFBC is better for remote desktop applications where there is one video stream per instance, like Citrix XenDesktop or VMware Horizon View. Contrast this with NVIFR, which supports multiple streams per instance.
After NvFBC, the graphics are encoded by NVENC into H.264 and sent over the network to the physical endpoint. As mentioned above, XenDesktop currently uses the CPU to encode graphics into H.264 but may support encoding on the GPU in the future. VMware's Blast Extreme protocol, which is releasing with Horizon View 7, will use NVENC to encode its video stream.
Image courtesy of AWS
The architecture of NVIDIA's GRID GPUs is essentially the same as the NVIDIA Quadro line of GPUs for workstations. The closest comparisons are the GRID K1 to the Quadro K600 and the GRID K2 to the Quadro K5000. However, the GRID cards have the hardware scheduler to time-slice the GPU cores, whereas the Quadro line of cards does not.
XenDesktop's ICA or "HDX" protocol uses the H.264 codec. H.264 is the same as MPEG-4 AVC (Advanced Video Coding). H.264 is used in Blu-ray, YouTube, Adobe Flash, and Silverlight. It is designed to support low and high bitrates and low and high resolutions. This is a good fit for Citrix's HDX protocol, which is designed to give good performance in a variety of network situations.
H.264 was integrated into XenDesktop when the HDX 3D Pro feature was released in XenDesktop version 7.0. At NVIDIA's GPU Technology Conference in 2015, one of Citrix's product managers referenced H.265 on the horizon for HDX. I have not heard that mentioned anywhere else, though. H.265 is the standard following H.264. It is synonymous with HEVC, or High Efficiency Video Coding. It has roughly double the compression efficiency of H.264. That allows the same quality at half the bitrate, or much higher quality at the same bitrate.
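As a rough, back-of-the-envelope sizing sketch (the per-session bitrate here is a hypothetical number of my own, not Citrix guidance): if H.265 really delivers comparable quality at roughly half the H.264 bitrate, the aggregate bandwidth for a pool of sessions drops by about the same factor.

#include <cstdio>

int main() {
    const double h264_mbps_per_session = 8.0;  // hypothetical per-session H.264 bitrate
    const double h265_mbps_per_session = h264_mbps_per_session / 2.0;  // ~2x compression efficiency
    const int sessions = 50;

    printf("H.264: %.0f Mbps total for %d sessions\n", h264_mbps_per_session * sessions, sessions);
    printf("H.265: %.0f Mbps total for %d sessions\n", h265_mbps_per_session * sessions, sessions);
    return 0;
}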
Another interesting topic is NVIDIA's implementation of a Frame Rate Limiter (FRL) on the vGPU. Because machines using a virtual GPU share the same physical GPU, performance does vary if there is resource contention on the GPU. The Frame Rate Limiter caps the frame rate on vGPU profiles at 60 FPS (frames per second). This is to prevent the frame rate from varying greatly for the end user as other users start and stop using the GPU. Otherwise, end users could wonder why their frame rate is varying between 40 and 300 frames per second, for example. The Frame Rate Limiter should help limit support calls and give a more consistent user experience. The FRL is not implemented on a pass-through GPU because the virtual machine gets the whole GPU dedicated to it.
To clarify, the vGPU profiles ending in "Q," like "K140Q," "K260Q," etc., have the frame rate limited to 60 FPS. The lower-performance vGPU profiles that do not end in "Q," like "K100" and "K200," have the frame rate limited to 45 FPS.
The FRL can be disabled by running the below command on the XenServer host. However, NVIDIA advises that this is not a supported configuration and does not recommend using it in production.
xe vm-param-set uuid=<VM-UUID> platform:vgpu_extra_args="frame_rate_limiter=0"
The FRL only limits the frame render rate. If the GPU is used for other activity, the FRL will not limit that activity.
Vsync is another feature with implications similar to those of the Frame Rate Limiter. Vsync synchronizes the frame rate with the monitor's refresh rate. For example, if you have a 60Hz monitor, Vsync will cap and synchronize the frame rate at 60 FPS. Vsync is really a relic from the physical world because, in a virtual desktop environment, there is no physical monitor on the hypervisor host to sync with. However, Vsync is required for WHQL certification, and some apps may fail if Vsync is not present.
Vsync can be disabled in the NVIDIA Control Panel and this brings up an interesting topic. If you disable the Frame Rate Limiter, as noted above, you may not see that much of a performance gain because Vsync is still throttling the performance. However, if you disable both Vsync and FRL, then you will see much higher frames per second.
Vsync’s default setting is 60 FPS for both pass-through and vGPU.
Many people argue that the human eye cannot perceive the difference in frame rate above 30 FPS. If that is the case, then why set the Frame Rate Limiter to 60 FPS? The reason comes from the fact that the HDX protocol is not synchronized with the Frame Rate Limiter or Vsync. For example, let's say that the Citrix HDX protocol was only able to deliver 30 FPS over its network connection and (hypothetically) the Frame Rate Limiter was set to 30 frames per second. Because HDX is not synced with the FRL or Vsync, motion on the screen could be choppy. However, if the GPU sends 60 frames per second down the pipe, it compensates for HDX being out of sync with the FRL and Vsync, because enough frames still arrive at the endpoint to keep motion smooth.
GPU metrics are also worth covering, as they are not completely straightforward. NVIDIA's nvidia-smi command-line tool is the built-in GPU metrics tool; an example of its output is below. When nvidia-smi is run from the XenServer hypervisor, it can only report on GPU utilization for vGPU instances. It cannot report on GPU utilization for GPU pass-through. This is because the hypervisor has no knowledge of what is happening on the pass-through GPU. To the hypervisor, it is simply a PCI device that is passed directly through to the virtual machine; the hypervisor knows nothing else about the device's function.
GPU-Z is another popular GPU metrics tool. It is installed within the guest OS of the virtual desktop. There is a difference between the metrics it is able to provide for a virtual GPU versus a pass-through GPU. In the screenshots below, vGPU is shown on the left and pass-through GPU is shown on the right. GPU-Z in a vGPU machine lacks several metrics that are present in the pass-through machine. The pass-through machine gets more metrics because of its direct PCI connection to the physical GPU.
It is important to understand that GPU-Z is reporting on the utilization of the entire physical GPU, even when it is running in a vGPU machine. As described earlier, a virtual GPU gets a time-slice of the physical GPU. GPU-Z does not take sharing or time-slicing into account; it reports on any and all usage of the physical GPU. Let's use the same scenario described earlier in this article, where we have two virtual machines, both with a K260Q vGPU profile, running on the same physical GPU. If you run GPU-Z from both machines and GPU-Z reports 90% utilization on both, does that mean the GPU is being utilized at 180%? No, it simply means that GPU-Z is reporting on how much the physical GPU is being utilized. In this example, the virtual machines could each be utilizing the physical GPU at around 45%. Of course, utilization won't always be even across machines unless you are running the same benchmark on both simultaneously.
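To restate that with the numbers from the example (a sketch of my own; the split is only a rough estimate and only holds if both VMs are loading the GPU about equally):

#include <cstdio>

int main() {
    const double gpuz_reading_percent = 90.0;  // GPU-Z inside either VM reports whole-GPU utilization
    const int active_vgpu_vms = 2;             // two K260Q VMs sharing the same physical GPU

    // Both VMs see the same ~90% because GPU-Z measures the physical GPU.
    // If the two VMs are driving the GPU about equally, each is contributing
    // roughly half of that figure.
    double approx_per_vm_percent = gpuz_reading_percent / active_vgpu_vms;
    printf("Each VM is contributing roughly %.0f%% of the physical GPU's %.0f%% utilization\n",
           approx_per_vm_percent, gpuz_reading_percent);
    return 0;
}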
Another topic I will cover briefly is the ratio of video RAM (framebuffer) to system RAM. There should be at least as much system RAM as there is video RAM. For example, a virtual desktop with 4GB of video RAM should have at least 4GB of system RAM. If the system RAM is less than the video RAM, it can create a performance bottleneck.
The RAM and core clock speeds differ between the K1 and K2 cards. If the users of the GPU-enabled virtual desktops have very specific GPU requirements, the below information may help choose the best GRID card for their needs.
RAM:
- K1 = DDR3 @ 891 MHz
- K2 = GDDR5 @ 2.5 GHz
Core Clock Speed:
- K1 = 850 MHz
- K2 = 745 MHz
These specs were pulled from these two NVIDIA documents:
http://www.nvidia.com/content/grid/pdf/GRID_K1_BD-06633-001_v02.pdf
http://www.nvidia.com/content/grid/pdf/GRID_K2_BD-06580-001_v02.pdf
What is CUDA? You may have seen NVIDIA's GPUs listed with a number of "CUDA cores." As mentioned above, CUDA cores are the GPU's processing cores. CUDA itself is NVIDIA's parallel computing platform and programming model. For example, CUDA allows scientists to harness the parallel processing power of the GPU for scientific computation. In the GRID 1.0 product, virtual GPU does not support CUDA. In other words, you cannot run an application on vGPU that requires CUDA communication with the GPU. The GRID 2.0 release does support CUDA in a vGPU session.
Here's why CUDA did not work with vGPU in GRID 1.0: CUDA sends work directly to the GPU, and a CUDA kernel runs until completion. If that work exceeded the user's time-slice of the GPU, the CUDA code continued to run and locked the GPU away from other users. There was not yet a mechanism to preempt the code to enforce the sharing model. Again, though, CUDA does work with vGPU in GRID 2.0.
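To make the run-to-completion problem concrete, here is a small CUDA C++ sketch of my own (illustrative only, not NVIDIA code). Once a kernel like this has been launched, the GPU executes it until it finishes; on GRID 1.0 there was no mechanism to stop it partway through and hand the GPU over for another vGPU's time-slice.

#include <cstdio>
#include <cuda_runtime.h>

// A deliberately long-running kernel: every thread spins through a large loop.
// Once launched, it runs to completion; on GRID 1.0 vGPU there was no way to
// preempt it mid-kernel to enforce the time-slicing between virtual machines.
__global__ void long_kernel(float* data, int n, int iterations) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iterations; ++k) {
            v = v * 1.0000001f + 0.000001f;  // busy work to keep the cores occupied
        }
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The host hands this work to the GPU and can only wait for it to finish.
    long_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 1000000);
    cudaDeviceSynchronize();

    printf("long-running kernel finished\n");
    cudaFree(d_data);
    return 0;
}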
The good news is that applications that rely on CUDA can alternatively be run on a pass-through GPU in GRID 1.0. An interesting example of CUDA requirements is Adobe Premiere Pro. Premiere Pro does not require CUDA except for GPU acceleration in its Mercury Playback Engine. So, users could run Premiere Pro in a vGPU desktop with no issues until they enable the GPU-accelerated Mercury Playback Engine.
If you run into an issue where you suspect that an application may require CUDA, simply change that virtual machine's GPU profile to a pass-through GPU and run the test again. If it succeeds, you will know at a minimum that the issue is specific to vGPU.
I hope that this has been an informative dive into a great technology from NVIDIA! Also a special thanks to Jason Southern for being such a dedicated sharer of GRID knowledge!
-Richard
I’d like to thank Richard for taking the time to put this content together. In our little world of information technology and virtualization, little nuggets of knowledge like this can go a long way to help fellow brothers and sisters in arms. If you have content you would like to contribute and be a guest (or regular) blogger on itvce.com, feel free to reach out to me at dane@itvce.com or on Twitter @youngtech so we can discuss. Otherwise, feel free to leave comments, questions, or any feedback for Richard.
Is there really no way to impose a frame rate limit when using GPU pass-through? It seems like a waste of processing power, and it could potentially produce choppy video delivery. Or, because the time slicing is fast, is this not so much of a concern, at least until you get too many concurrent users?
Hi Tobias,
By the way, congrats on your recent induction to the CTP program!
If I understand correctly, your concern is only about implementing an FRL on a pass-through GPU. In the case of pass-through, the VM gets 100% of the GPU cores for 100% of the time on that physical GPU, so there is no worry about time slicing or concurrent users. For example, a K2 card has two physical GPUs on the board. If one of those GPUs is passed through to a VM, then that physical GPU is already a lost resource in terms of sharing with other VMs. In my opinion, if you've already "lost" that resource (the entire physical GPU), you might as well let that user go nuts with it! (And by "go nuts" I mean have high FPS, of course.)
Does that help?
Cheers,
Richard
Richard is right. The Frame Rate Limiter was introduced within the GRID stack to create a fair balance between users. Since GPU pass-through is one user only, there is no need for an FRL.