Introduction:
In a recent engagement with one of our customers, I received a call that many virtual desktop VMs were going into an unregistered state. This understandably caused panic among employees, given the resistance they already had toward the platform, and required immediate intervention. The environment was running Citrix Virtual Apps and Desktops (formerly XenDesktop) 1811 on VMware vSphere 6.7 U1.
First Problem:
Seeing virtual desktops as Unregistered in Studio is a familiar sight and can be caused by a number of things, such as networking, security, or misconfiguration. In this case, however, the unregistered virtual desktops were showing as frozen VMs in the vCenter/vSphere console, with just a black unresponsive screen. After a restart, each VM became operational again and registered with the DDCs.
That was key to understanding that this was not a security, networking, or misconfiguration issue. The OS was clearly crashing with some type of BSOD that was not visible from the now stuck console. The vCenter console did not show any CPU or memory spikes for any virtual desktop VM, and neither did vRealize Operations. Checking with affected users showed that everyone was using the same standard applications, with the same set of data and instructions, while the issue happened intermittently and only for some users, so a specific process or application was an unlikely cause. Given the above, we determined that this was most likely a storage-related issue of some sort.
A key aspect of troubleshooting intermittent issues is to keep asking questions about behavior. I cannot stress how important that is. Make it clear that these questions are not about the users themselves and carry no consequences for them; they are about the backend, to understand the flow of a certain event. One observation from IT was that this was happening to users who often leave their desks and come back the next day to pick up their work where they left off.
That statement meant that users were not signing off; they were always disconnecting their sessions and continuing work the next day. This in turn meant that no disconnected session timer was configured, which ultimately meant that those virtual desktops were not being restarted for long periods. Checking the catalog configuration showed that Citrix MCS was in use with a 512 MB RAM cache and a 10 GB overflow disk cache.
That was it: the overflow disk cache was being filled up by the user, eventually causing the VM to crash and consequently become unregistered. Restarting the VM cleared the overflow cache disk, which returned the VM to a working state. Fast forward a couple of days, and because users are still not signing off, the virtual desktops are not being restarted (Delivery Group power management does not apply to disconnected sessions). The overflow cache disk fills up again, leaves no space for any write IO, and crashes the VM.
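To make the reasoning concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simplified model in which the cache only grows, at a constant net write rate, until the RAM cache plus the overflow disk are exhausted. The function name and the 150 MB/hour figure are hypothetical and were not measured in this environment; only the 512 MB and 10 GB values come from the catalog configuration.

```python
def hours_until_cache_full(ram_cache_mb, overflow_disk_gb, net_write_mb_per_hour):
    """Estimate the hours of uptime before the write cache is exhausted.

    Simplified model (an assumption, not how the driver really behaves):
    writes accumulate at a constant rate and nothing is freed until restart.
    """
    if net_write_mb_per_hour <= 0:
        raise ValueError("net_write_mb_per_hour must be positive")
    # Total capacity is the RAM cache plus the overflow disk, in MB.
    total_mb = ram_cache_mb + overflow_disk_gb * 1024
    return total_mb / net_write_mb_per_hour


# 512 MB RAM cache and 10 GB overflow disk, as configured on the catalog.
# The write rate of 150 MB/hour is a made-up illustrative number.
hours = hours_until_cache_full(512, 10, 150)
print(f"Cache exhausted after roughly {hours:.1f} hours of uptime")
```

Under these assumed numbers a desktop that is only ever disconnected, never restarted, would run out of cache space in about three days, which lines up with the "fine after a restart, crashed again a few days later" pattern.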
Solution:
As simple as it may sound, always work your way through the basics when troubleshooting any issue, with the prize question in mind: what changed, or is changing, from the as-built, tested, standardized environment? Asking questions and observing is the ultimate way to answer it. Logs are important, no argument there, but in my very humble opinion they are the wrong place to start. Always start by chatting, asking questions, observing, and analyzing, then turn to logs and events. One thing that strikes me about many vendor support cases is that they immediately ask for debug logs without that crucial initial analytical discussion, which is why so many cases take forever to get solved.
I could have rushed to open cases with Citrix and VMware, as my customer recommended, but I clearly explained that an intermittent issue affecting two products would mean weeks of debug logs and vendors blaming each other, with no foreseeable outcome (something I am sure we have all gone through at one point or another in our careers).
The fix is simply to enable the disconnected session timer: if a user session has been disconnected for X minutes, the session is logged off, which allows the DDC to subsequently restart the VM and clear the write cache (WC) disk (provided you have not disabled automatic restart on the delivery group). You can also expand the WC disk, which is recommended to be the same size as the free space on the C: drive of the base image. Even so, if the virtual desktop is never restarted on some planned schedule, the WC disk will eventually fill up and the VM will crash.
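As an illustration of what the disconnected session timer does, the following Python sketch models the decision only. It is not a Citrix API or policy definition; the session dictionaries, field names, user names, and the 60-minute limit are all hypothetical.

```python
from datetime import datetime, timedelta


def sessions_to_log_off(sessions, now, limit_minutes):
    """Return the users whose sessions have been disconnected for too long.

    Each session is a dict with hypothetical keys: "user", "state", and
    "disconnected_at" (a datetime, or None while the session is active).
    """
    limit = timedelta(minutes=limit_minutes)
    return [
        s["user"]
        for s in sessions
        if s["state"] == "Disconnected"
        and s["disconnected_at"] is not None
        and now - s["disconnected_at"] >= limit
    ]


now = datetime(2019, 4, 2, 8, 0)
sessions = [
    # Left the desk yesterday evening and never signed off.
    {"user": "alice", "state": "Disconnected", "disconnected_at": datetime(2019, 4, 1, 17, 30)},
    # Currently working.
    {"user": "bob", "state": "Active", "disconnected_at": None},
    # Stepped away ten minutes ago.
    {"user": "carol", "state": "Disconnected", "disconnected_at": datetime(2019, 4, 2, 7, 50)},
]
print(sessions_to_log_off(sessions, now, limit_minutes=60))
```

Only the long-disconnected session is selected for logoff; once it is logged off, the delivery group can restart that desktop and the write cache starts empty again.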
In the Citrix Virtual Apps and Desktops 1903 release, the MCS IO write-back cache driver emulates the one used by PVS. This means you now have a visible drive where the cache is stored, and you can also use it to store logs or anything else. It also means you can use vRealize Operations or any other monitoring tool to check the used and available space on this drive for your virtual desktops, which would have been very helpful in this situation.
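If you do not have a monitoring tool in place, even a small script run inside the desktop can flag a cache drive that is running low. The sketch below uses Python's standard `shutil.disk_usage`; the drive letter in the comment, the function names, and the 20% threshold are assumptions for illustration, not Citrix defaults.

```python
import shutil


def is_cache_drive_low(total_bytes, free_bytes, min_free_ratio=0.2):
    """Return True when the free space ratio drops below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_ratio


def check_cache_drive(path, min_free_ratio=0.2):
    """Read the usage of the volume holding `path` and apply the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_cache_drive_low(usage.total, usage.free, min_free_ratio)


# Example call, assuming the write cache drive is mounted as D: (hypothetical):
#     if check_cache_drive("D:\\"):
#         print("Write cache drive is below 20% free, schedule a restart")
```

The threshold check is kept separate from the disk read so that it can be reused with numbers reported by whatever monitoring tool you already run.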
With that solved, the next day I received another call: now most VMs were showing as unregistered. Luckily for us, only unused VMs were in that state, so we had to act fast before someone new tried to log in.
Second Problem:
So the VMs were showing fine in vCenter, not stuck or anything else, but showing as unregistered in Studio. Running the usual Scout and other VDA troubleshooting techniques showed nothing wrong. Again, let's start asking questions and observing. One thing we noticed was that connected users were not affected; a VM only became unregistered after its user logged off. Running that test, we also observed that the virtual desktop the user had logged off from was not automatically restarted: no restart appeared in the vCenter events, and checking the VM manually confirmed the same.
Voilà. Let's try to log on to vCenter with the local service account used in the DDC hosting connection for that vCenter infrastructure. Oops, could not log in. In retrospect, LDAP on NetScaler had also stopped working earlier that day, although the domain service account did not appear to be locked and the password was correct. Again, a great observation by the customer: both of these accounts, one domain and one local, were created on the same day, so their passwords might have expired.
In vCenter, the default password expiration policy is 90 days, and the default domain password expiration policy, which applied to the whole forest, was also 90 days.
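The arithmetic behind that observation is trivial but worth writing down, because two accounts created on the same day under the same maximum age will expire on the same day. A minimal Python sketch follows; the creation date is made up, the function name is hypothetical, and treating a maximum age of 0 as "never expires" is this sketch's own convention.

```python
from datetime import date, timedelta


def password_expiry(last_set, max_age_days):
    """Return the date a password expires, or None if it never expires.

    A max_age_days of 0 is treated as "never expires" in this sketch.
    """
    if max_age_days < 0:
        raise ValueError("max_age_days must not be negative")
    if max_age_days == 0:
        return None
    return last_set + timedelta(days=max_age_days)


# Hypothetical creation date shared by both service accounts.
created = date(2019, 1, 10)
vcenter_expiry = password_expiry(created, 90)  # vCenter local account
domain_expiry = password_expiry(created, 90)   # domain LDAP account
print(vcenter_expiry, domain_expiry, vcenter_expiry == domain_expiry)
```

Recording the expiry dates of service accounts in the as-built documentation, or in a calendar reminder, turns this kind of outage into a planned change.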
Solution:
For the vCenter service account, change the password expiration policy for the VMware local account to 0, which means the password never expires. A second approach is to keep the password expiration policy, change the password whenever it is due to expire, and then update the DDC hosting configuration in Studio with the new password.
For the domain LDAP service account, either exempt the account from the password expiration policy, for example by placing it in an OU where that policy does not apply, or change the password every 2 months and update the NetScaler LDAP configuration to match.
Conclusion:
Despite requiring very minimal changes to rectify, these issues had severe repercussions on the environment and affected user productivity, which is always the ultimate aim of any VDI project. That being said, the aim of this post is to emphasize two things. First, as good as you may think you are, without a properly documented configuration approach you will miss something for sure. Second, take the correct approach when troubleshooting an issue rather than rushing to open vendor cases.