Introduction:
I am a big fan of Hyper-Converged Infrastructure (HCI) technology in general, and of Microsoft Storage Spaces Direct specifically, because of the effort Microsoft is putting into integrating its private, hybrid, and public cloud offerings into a full stack of agile, elastic, and scalable software-defined solutions, including but not limited to software-defined storage.
Azure Stack, which constitutes Microsoft's private and hybrid cloud offering, utilizes Storage Spaces Direct "S2D", and I am assuming that now that the underlying Microsoft Azure infrastructure has been upgraded to Server 2016, Microsoft is using S2D or an abstraction of the same technology to deliver VMs and storage services to Azure subscribers.
This gives me great confidence that Microsoft will continue to enhance, optimize, and support its Storage Spaces HCI offering in the future. We have seen this recently with the production release of Windows Admin Center, previously labeled Project Honolulu, which eases the management of S2D clusters, and also with support for hosting an S2D cluster on the public cloud, that is, Microsoft Azure.
Being a VDI guy myself, a use case for S2D would be a fast, reliable, highly available shared repository for user profiles and data hosted on the cloud when utilizing Microsoft Azure to host my VDI infrastructure, along with Citrix Cloud of course. I have blogged about building an S2D cluster on Azure for the same purpose HERE.
In September 2017, Microsoft Azure made news, at least to me, by announcing the public preview of Global VNet Peering in only three regions. This was major news because of the possibilities this technology opens in terms of Disaster Recovery, Multi-Site, and Extended-Site environments, thanks to the ability to peer different regions, not to mention a reliable, low-latency, high-bandwidth internal network between networks hosted in different geographical locations.
In March 2018, Microsoft announced that Global VNet Peering is now generally available for production environments in ten regions listed HERE. “Global VNet Peering enables resources in your virtual network to communicate across Azure regions privately through the Microsoft backbone. VMs across virtual networks can communicate directly without gateways, extra hops, or transit over the public internet. This allows a high-bandwidth, low-latency connection across peered virtual networks in different regions”.
Citrix Cloud with Microsoft Azure for VDI environments is something that just makes sense no matter how you look at it. I have blogged about Citrix XenDesktop "on-premises" Active-Passive and Active-Active high availability considerations HERE and HERE, so when Global VNet Peering became generally available, the first thing that popped into my head was what it would mean for an active-active approach to Citrix VDI infrastructure.
The biggest hassle in any Active-Active approach to VDI was, and still is, maintaining the availability of user profiles and data in a supported manner across different geographical locations, which normally consist of different networks and, more importantly, some form of storage replication. Though many vendors might sell you on the idea that SAN devices across WAN links can be pooled, such as with Dell EMC VPLEX, Microsoft still does not support replicating profiles in an active-active manner no matter the underlying technology, and the consistency of user data will heavily depend on your WAN link latency and bandwidth on top of the SAN device replication mechanism itself.
Stretched sites have completely different considerations when it comes to a multi-site approach, because components in those sites act as if they belong to the same site given that a high-bandwidth, low-latency link exists. These sites combine storage devices in a single pool rather than using clustered replication, and mostly extend the L2 network as well, but they are very difficult to achieve when your DR or remote sites are geographically scattered across different areas and regions.
Given the above, the best possible way to achieve an Active-Active, supported, highly available storage repository for VDI user profiles and data would be to have extended sites that share the same L2 network backbone and storage infrastructure, and operate over a low-latency, high-bandwidth link. Enter Microsoft Azure, Global VNet Peering, and Storage Spaces Direct managed with Windows Admin Center.
What if we host our VDI control plane on Citrix Cloud, which is already highly available, and our VDI resource plane on several Azure regions with networks globally peered, then create a Storage Spaces Direct cluster that spans different resource groups and different regions, which would in theory provide an extended storage repository that is regionally highly available? Add a couple of NetScalers and Cloud Connectors in every Azure region along with Azure Traffic Manager, and you are good to go with an Active-Active "supported" Multi-Site VDI environment.
This sounds out of this world, but here comes the BUT part (I am still going to deploy and test this), and we all know what Jon Snow said: "Everything before the word BUT is horse shit".
One, the latency between Azure regions that are globally VNet peered is around 100ms, at least based on my testing between UK West and West US 2. This latency is high and not fit for a stretched cluster, at least in an on-premises environment, so performance will definitely be impacted. Two, there is no available information from Microsoft on whether S2D over a VNet-peered network between regions is officially supported. Three, I have yet to stress test this, but I am assuming that at 100ms latency, performance will be impacted when accessing SOFS shares hosted on S2D.
Four, you cannot use the S2D Azure template because it does not allow you to build the machines in different resource groups and regions, so it has to be done manually. Five, the configuration is technically possible, and I will demonstrate it below, but because of the lack of supportability information for S2D over Global VNet Peering, I cannot comment on whether the whole scenario is officially supported by Microsoft.
Finally, I also did not find information on whether Microsoft supports S2D on Azure with managed disks or only with storage accounts, so I gave it a shot; managed disks seem to be working fine, though I am not sure of the supportability there either.
Configuration:
I am going to keep this at a bare minimum just for the sake of testing, so here are the components deployed to simulate a stretched 2-node S2D cluster over globally peered networks in different regions:
UK West Region:
- 1 x Resource Group
- 1 x Virtual Network with subnet 192.168.1.0/24
- 1 x Server 2016 “DS2v2”
- 2 x 128GB Premium Disks “Managed Disks”

West US 2 Region:
- 1 x Resource Group
- 1 x Virtual Network with subnet 10.2.1.0/24
- 1 x Server 2016 “DS2v2”
- 2 x 128GB Premium Disks “Managed Disks”
Make sure you have a domain controller in each resource group if it is in a different region, in case Azure experiences a site outage; for this demo I only have one AD. Also mind any errors you see in the screenshots: I have all of this pre-created, so it will show overlapping resources, but you are good to go.
Because this is a 2-node S2D cluster, a 2-way mirror (RAID 1) will be created within the cluster, so out of the 4 x 128GB of raw capacity we should have around 250GB of usable disk that will host a highly available SOFS share for user profiles and data. No need for availability sets here unless you have multiple machines in each resource group, which in a production environment should be the case. Windows Admin Center will be installed and configured to demonstrate the ease of managing an S2D cluster from a simple web management console.
1- Create the UK West Resource Group “mind the error, it’s only because I have it pre-created”:
2- Create UK West Virtual Network:
3- Point UK West Virtual Network DNS to Domain Controller:
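For those who prefer scripting over the portal, steps 1 through 3 can be approximated with the AzureRM PowerShell module that was current at the time of writing. This is a rough sketch: the resource group name, VNet name, and domain controller IP are placeholders, so adjust them to your environment.
# Sketch using the AzureRM module; names and the DNS/domain controller IP are placeholders
Login-AzureRmAccount
# 1- Resource group in UK West
New-AzureRmResourceGroup -Name "S2D-UKWest-RG" -Location "ukwest"
# 2 & 3- Virtual network with subnet 192.168.1.0/24, DNS pointed at the domain controller
$subnet = New-AzureRmVirtualNetworkSubnetConfig -Name "S2D-Subnet" -AddressPrefix "192.168.1.0/24"
New-AzureRmVirtualNetwork -Name "S2D-UKWest-VNet" -ResourceGroupName "S2D-UKWest-RG" -Location "ukwest" -AddressPrefix "192.168.1.0/24" -Subnet $subnet -DnsServer "192.168.1.4"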
4- Create a Server 2016 DataCenter Virtual Machine in UK West:
5- Add 2 x 128GB Premium disks to the UK West Virtual Machine and set its IP to static. Join the VM to the domain and DO NOT initialize the disks after they are added:
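If you would rather script the disks and the static IP as well, this is roughly what it looks like with the AzureRM module. It is only a sketch: the VM, NIC, and disk names are placeholders, and parameter names differ slightly between AzureRM versions. Repeat the same for the West US 2 VM with its own names and region.
# Create and attach 2 x 128GB Premium managed disks, then set the NIC private IP to static (placeholder names)
$rg = "S2D-UKWest-RG"
$vm = Get-AzureRmVM -ResourceGroupName $rg -Name "S2DUKWest"
for ($lun = 0; $lun -le 1; $lun++) {
    $cfg  = New-AzureRmDiskConfig -Location "ukwest" -CreateOption Empty -DiskSizeGB 128 -SkuName Premium_LRS
    $disk = New-AzureRmDisk -ResourceGroupName $rg -DiskName "S2DUKWest-Data$lun" -Disk $cfg
    $vm   = Add-AzureRmVMDataDisk -VM $vm -Name "S2DUKWest-Data$lun" -CreateOption Attach -ManagedDiskId $disk.Id -Lun $lun
}
Update-AzureRmVM -ResourceGroupName $rg -VM $vm
# Flip the private IP allocation of the VM NIC to static
$nic = Get-AzureRmNetworkInterface -ResourceGroupName $rg -Name "S2DUKWest-nic"
$nic.IpConfigurations[0].PrivateIpAllocationMethod = "Static"
Set-AzureRmNetworkInterface -NetworkInterface $nic
# Do NOT initialize the new disks inside the guest; Enable-ClusterS2D will claim them later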
6- Create the West US 2 Resource Group “mind the error, it’s only because I have it pre-created”:
7- Create West US 2 Virtual Network:
8- Point the West US 2 Virtual Network DNS to the Domain Controller (I will point to my AD in a different region, but ideally you should have a DC in every location, so point to that one and add the AD in the other site as secondary DNS):
9- Create a Server 2016 Datacenter Virtual Machine in West US 2:
10- Add 2 x 128GB Premium disks to the West US 2 Virtual Machine and set its IP to static. Join the VM to the domain and DO NOT initialize the disks after they are added:
11- Go to the UK West Virtual Network and configure Global VNet Peering:
12- Go to the West US 2 Virtual Network and configure Global VNet Peering (note that the status is now Connected, so machines in different regions and different networks communicate through the Microsoft backbone over private IPs without any gateway intervention):
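The same peering can be scripted from PowerShell. A sketch with the AzureRM module, using the placeholder VNet and resource group names from above:
# Peer both directions; the peering only shows Connected once both sides exist
$ukVnet = Get-AzureRmVirtualNetwork -Name "S2D-UKWest-VNet" -ResourceGroupName "S2D-UKWest-RG"
$usVnet = Get-AzureRmVirtualNetwork -Name "S2D-WestUS2-VNet" -ResourceGroupName "S2D-WestUS2-RG"
Add-AzureRmVirtualNetworkPeering -Name "UKWest-to-WestUS2" -VirtualNetwork $ukVnet -RemoteVirtualNetworkId $usVnet.Id
Add-AzureRmVirtualNetworkPeering -Name "WestUS2-to-UKWest" -VirtualNetwork $usVnet -RemoteVirtualNetworkId $ukVnet.Id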
13- It’s time to configure the Storage Spaces Direct cluster on both servers. I will be using a file share quorum hosted on my AD for the Failover Cluster (don’t worry, no shared storage needed here), but in production make sure to use a Cloud Witness quorum, which is relatively very cheap, easy to set up, and officially supported (I don’t want a storage account left in my subscription to clean up later, so I am going the easy way).
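For reference, this is roughly what the file share witness looks like. A sketch: the share name, path, and domain controller name are placeholders, and the quorum command can only run after the cluster is created in step 15.
# On the domain controller: create and share the witness folder (the cluster name object S2D$ needs access)
New-Item -Path "C:\S2DWitness" -ItemType Directory
New-SmbShare -Name "S2DWitness" -Path "C:\S2DWitness" -FullAccess 'diyar\S2D$', 'diyar\Domain Admins'
# On one of the S2D nodes, AFTER the cluster exists (step 15):
Set-ClusterQuorum -FileShareWitness "\\DC\S2DWitness"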
14- Run the following PowerShell commands from one of the machines or a third machine (I am running this from my domain controller); change $nodes to the hostnames of all S2D hosts. This installs File Services and Failover Clustering on all S2D nodes:
$nodes = ("S2DUKWest.diyar.online", "S2DUSWest.diyar.online")
icm $nodes {Install-WindowsFeature Failover-Clustering -IncludeAllSubFeature -IncludeManagementTools}
icm $nodes {Install-WindowsFeature FS-FileServer}
15- Run the following PowerShell commands to validate the cluster (this can take some time; ignore any warnings in the outcome) and create a cluster with no storage and a specific cluster IP. The cluster will be created with one IP belonging to one of the virtual network subnets, and a secondary DHCP IP will be added by the cluster configuration automatically so that the cluster remains reachable if it fails over to the second machine, which is hosted on the other virtual network:
Test-Cluster -node $nodes
New-Cluster -Name S2D -Node $nodes -NoStorage -StaticAddress <IP address>
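Since the two nodes sit in different subnets, you can also give New-Cluster one static address per subnet up front instead of relying on the automatic DHCP address for the second network. A sketch with hypothetical free addresses from each subnet:
# One cluster IP per subnet (placeholder addresses; pick unused ones in 192.168.1.0/24 and 10.2.1.0/24)
New-Cluster -Name S2D -Node $nodes -NoStorage -StaticAddress 192.168.1.10, 10.2.1.10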
16- After creating the storage account for the cloud quorum, issue the following command on one of the S2D VMs to configure it (I won’t for now; I used a file-share-based quorum for the sake of this post):
Set-ClusterQuorum -CloudWitness -AccountName <StorageAccountName> -AccessKey <StorageAccountAccessKey>
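If you do go the cloud witness route, creating the storage account and retrieving its key can be scripted as well. An AzureRM sketch: the resource group, account name, and region are placeholders, and the key output shape varies slightly between AzureRM versions.
# Standard locally redundant storage account for the cloud witness (ideally in a third region)
New-AzureRmStorageAccount -ResourceGroupName "S2D-Witness-RG" -Name "s2dcloudwitness01" -SkuName Standard_LRS -Location "northeurope"
$key = (Get-AzureRmStorageAccountKey -ResourceGroupName "S2D-Witness-RG" -Name "s2dcloudwitness01")[0].Value
# Run on one of the S2D nodes (paste the key value if running from a different machine)
Set-ClusterQuorum -CloudWitness -AccountName "s2dcloudwitness01" -AccessKey $key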
17- Run the following PowerShell command on one of the S2D VMs to inform Failover Clustering that this is a Storage Spaces Direct cluster and to configure the disk pools accordingly:
Enable-ClusterS2D
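Azure VMs expose no cache devices for S2D to claim, so if Enable-ClusterS2D complains about the cache, disabling it is the common workaround in guest and cloud deployments. This is hedged guidance; I have not verified it is required for this exact setup.
# Hedged alternative for VMs without cache devices
Enable-ClusterS2D -CacheState Disabled -Confirm:$false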
18- Now that an S2D storage pool has been created, we need to create a virtual disk configured as a 2-way mirror, since we only have 2 nodes. This disk is going to host the file services for our SOFS:
New-Volume -StoragePoolFriendlyName S2D* -FriendlyName S2D -FileSystem CSVFS_REFS -UseMaximumSize
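With only two nodes, New-Volume picks a two-way mirror automatically, but you can be explicit about the resiliency and size if you prefer. A sketch; the 200GB figure is just an example that leaves some pool headroom:
# Explicit two-way mirror with a fixed size instead of -UseMaximumSize
New-Volume -StoragePoolFriendlyName S2D* -FriendlyName S2D -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror -Size 200GB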
19- Now that the virtual disk has been created, it is time to create the Scale-Out File Server and the SMB file share:
Add-ClusterScaleOutFileServerRole -Name S2DSOFS -Cluster S2D
New-Item -Path C:\ClusterStorage\Volume1\Shares -ItemType Directory
New-SmbShare -Name Profiles -Path C:\ClusterStorage\Volume1\Shares
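For a profile share you will also want to tighten the default share permissions. A quick sketch assuming a hypothetical diyar\VDI-Users group (add -ScopeName S2DSOFS if the cmdlets complain about scope ambiguity):
# Grant the VDI users group full access and drop the default Everyone entry (group name is hypothetical)
Grant-SmbShareAccess -Name Profiles -AccountName 'diyar\VDI-Users' -AccessRight Full -Force
Revoke-SmbShareAccess -Name Profiles -AccountName Everyone -Force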
20- Now we have a highly available share “\\S2DSOFS\Profiles“ that spans two different geographical sites but shows up as a single repository, while S2D handles the backend storage pooling. Let’s take a look at what all of this looks like from a GUI perspective:
The reason the second IP is down is that the current owner of the cluster is the node in the 10.X.X.X virtual network, so this is normal. When the cluster fails over to the second node, the second cluster IP, which is currently down, will come up and the failed node’s cluster IP will go down.
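You can see the two IP address resources and this OR-style dependency on the cluster name from PowerShell as well; a quick read-only check:
# List the cluster IP address resources and the Cluster Name dependency expression
Get-ClusterResource | Where-Object { $_.ResourceType -like "IP Address" } | Get-ClusterParameter -Name Address
Get-ClusterResourceDependency -Resource "Cluster Name"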
21- Let’s try to fail over and fail back both the SOFS role and the storage owner:
22- What if we simulate an unplanned failure of the cluster owner node, making sure both the role and the storage are owned by that specific node? (Failover was very fast and no outage was noticed):
23- Time to test performance, and what better way to do it than Diskspd (if deploying on-premises, VMFleet can also be used, as well as the Microsoft Windows Hardware Lab Kit):
In simple terms, diskspd will create a 1GB file and use 2 threads to issue 64K random IO with 70% reads and 30% writes for 2 minutes while measuring IOPS, latency, and MB/s.
diskspd.exe -b64K -c1G -t2 -r -o2 -d120 -Sr -ft -L -w30 \\S2DSOFS\Profiles\S2D.S2D
Latency is still very high, so although technically everything is working fine, reads and writes are going to suffer when users start accessing this for their profiles and data. I would not recommend this for any production environment until the latency is sorted out; other regions may have different latency and bandwidth, but we need some input from Microsoft on this. Adding more GBs to the managed disks will add IOPS, but network latency will still be the bottleneck.
Update: Testing latency between UK West and UK South, latency is 3ms, which means that from a pure S2D perspective this is fully supported and production ready. Will Microsoft support S2D over peered VNets? As of now, no idea… Though UK West and UK South are paired regions and geographically close, so I honestly do not see any reason for this not being supported.
24- Install Windows Admin Center and add the Storage Spaces Direct cluster (Windows Admin Center cannot be installed on a domain controller), and remember that only Edge and Chrome are supported:
Time to update those 2016 machines with the April cumulative update… Even after updating, trying to connect to the S2D cluster throws the same error. It turns out that S2D HCI management through Windows Admin Center is only supported on the Server 2019 preview as of now. Let’s check single-server management:
Cool Overview Console:
Manage Files and Folders:
Manage Firewall Rules:
Manage Local Users & Groups:
Manage Network Settings:
Direct PowerShell:
Manage Processes:
Manage Registry:
Manage Roles and Features:
Manage Services:
Manage Storage:
Manage Updates:
Manage Failover Cluster:
Change the identity provider to Azure AD (in order to manage Azure IaaS VMs, either create a VPN and connect to the internal IP of the VM, or directly add the VM using its public IP). Integrating with Azure also allows Windows Admin Center to protect VMs via Azure Site Recovery, and more integration will surely follow:
Conclusion:
I only hope that we reach a stage where globally peered Microsoft Azure VMs have sub-5ms latency, though I understand that is easier said than done; imagine the amount of traffic Azure deals with every second… Let me know what you think and how we can build on top of this. I am currently experimenting with Citrix Cloud, Citrix StoreFront/NetScaler, and Azure Application Gateway with multi-site peering.
Salam.