Microsoft S2D on Dell EMC XR2 2-Node Azure Stack HCI Server 2019


Introduction

Microsoft Storage Spaces Direct, commonly known as Microsoft S2D, is a Converged Infrastructure (CI) and Hyper-Converged Infrastructure (HCI) product available in Server 2016 and Server 2019 Datacenter as part of Microsoft's Software Defined Datacenter offering. It uses locally attached drives on stand-alone x86 servers in a supported configuration to create a shared pool of usable, highly available Software Defined Storage (SDS) as part of a Microsoft Failover Cluster.

Storage Spaces started as a technology that offered software RAID for Windows desktops and servers (introduced with Windows 8 and Server 2012) and then evolved into Storage Spaces Direct on Server 2016 and later 2019 Datacenter as part of the Microsoft Azure Stack offering, given that Azure Stack is built on Microsoft S2D Software Defined Storage. Although under-rated in the HCI world, Microsoft S2D has gained significant ground both independently as an HCI solution and as part of the Azure Stack hybrid cloud offering. S2D is an in-kernel feature of Windows Server, so needless to say that as an HCI product it only works with Microsoft Hyper-V.

~ Practice What You Preach ~

If you have read any of my previous blogs, especially on Nutanix and VxRail/vSAN, you would know that I am not a fan of the cliché sales techniques practiced by many in the market. I do not believe there exists a product that is superior in all respects to its competitors, nor do I believe there is a product that fits every customer's requirements. A product needs to be pitched and sold based on a solid understanding of specific requirements, so any differentiating features between two given products that are not part of any requirement should not be part of the assessment.

Although trivial compared to other projects, this was the very first time I was able to sell two different HCI solutions to the same customer serving different requirements, and I couldn't feel more proud to be honest. Despite all the odds, be they technical, political, or financial, the proposed solution was cost effective, covered all requirements, delivered the expected outcome, and beat all competition.


Microsoft S2D !

You have probably never heard of Microsoft S2D, for a couple of reasons, chief among them that Microsoft does not target marketing and selling this product independently, nor do partners for that matter. In Server 2019, the best release of S2D since its inception, S2D is part of the Datacenter license, so if you are already paying for Server 2019 Datacenter, Microsoft is already making all the money it needs from you regardless of whether you use S2D or not. Also note that the required supported server hardware is not sold or marketed by Microsoft, so there is no benefit for them in that area; they only sell software when it comes to HCI. Last but not least, most of the supported hardware vendors that offer Microsoft S2D Ready Nodes (more on that soon) already have their own HCI solution either directly or indirectly, so they are not interested in spending the bucks to market or the effort to sell such a solution.

That being said, Microsoft S2D is an enterprise HCI solution, and in Server 2019 it got a big boost of enterprise-supported features that now cover a large segment of any customer requirement. I am not saying it is the best, well it could be, but that depends on your workloads, requirements, budget, desired outcome, constraints, and so on … I am a big fan of Microsoft S2D because, first, it is hardware independent and, second, most customers already pay for Microsoft Datacenter licenses anyway, so why not take full benefit of the products offered within. Don't be surprised to hear that Microsoft S2D is being used in more than 10,000 clusters worldwide, which would account for hundreds of thousands of VM workloads.

S2D advancements in Server 2019 such as deduplication support for ReFS, an extended maximum storage pool size, new performance and health checks, support for an externally hosted file share as quorum witness, and a true two-node cluster made the solution a lot more enterprise usable. The best move Microsoft made, not only for S2D but for servers and clusters in general, is the free Microsoft Windows Admin Center, which allows any administrator to configure, operate, and monitor S2D clusters in a very simple and efficient manner without having to resort to PowerShell or even SCVMM for that matter. S2D builds on Failover Clustering, so combining it with Windows Admin Center gives you an enterprise HCI cluster with a built-in hypervisor (Hyper-V) and a centralized web management tool (Windows Admin Center).

S2D Hardware …

Microsoft S2D is hardware sensitive, specifically on the storage HBA, network card, and disks, so those are areas that cannot be compromised on. You are welcome to use any server hardware vendor as long as all of the hardware components are supported by Microsoft to run S2D, which can be validated in the Microsoft Windows Server Catalog. Basically, the network needs to be a minimum of 10GB and the recommendation is 25GB with RDMA, either RoCE or iWARP (the second being my choice given that RDMA will then work without specific server or switch configuration). The storage controller needs to be a SAS pass-through HBA, not a RAID HBA; this is very critical, as JBOD or RAID0 configurations on a RAID controller are not supported and will give very bad performance, so make sure the HBA is SAS pass-through. The disks must be enterprise SSDs, not consumer SSDs; I will discuss the required capacity and endurance later, but make sure the SSDs are enterprise class. S2D has CPU and memory overhead depending on the cluster size, enabled features such as dedup, the RDMA network, and the workloads, but then again, if the HBA, network, and disks are covered, we are good to go performance and support wise.
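Before committing to a build, a quick PowerShell sanity check on each candidate node can confirm the NICs expose RDMA and the disks come through as raw pass-through devices; a minimal sketch (output will obviously vary per hardware):

# Quick pre-checks on each node
Get-NetAdapterRdma | Format-Table Name, Enabled          # RDMA-capable NICs, if any
Get-PhysicalDisk | Select-Object FriendlyName, BusType, MediaType, CanPool
# BusType should read SAS/SATA/NVMe (not RAID) and CanPool should be True for the intended S2D capacity disks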

If you want to understand how S2D deals with caching, fault tolerance, efficiency, sizing, and other HCI related features, then I suggest you spend some time reading the documentation, which is surprisingly quite good and comprehensive. For example, in an all-flash configuration a minimum of 4 disks per server/node is required. Caching is not required but is recommended for write-intensive workloads, with NVMe being a good example of high-endurance write caching disks; 4GB of RAM is required for every 1TB of cache disk. The number of nodes in a cluster determines the fault tolerance options supported, such as mirroring for 2 nodes or parity for 4 nodes and up, and the resulting usable resources accordingly. Many other considerations come into play based on the workloads that will run on the cluster and how to size it accordingly, so make sure you are well informed on the subject before delving into procuring an S2D solution.

Why Microsoft S2D ?

The reason I opted for Microsoft S2D in this particular case, and not our default VxRail/vSAN or Nutanix, is that our customer was building an infrastructure for ships/vessels that were only allowed to physically host rugged servers and rack components due to policy and regulation. The Dell XR2 was the obvious choice from a physical server perspective, but it did not come from Dell EMC as an S2D Ready Node. S2D Ready Nodes are pre-built by the manufacturer with components that are supported by Microsoft and offer a single line of support for the whole node, which in our case would be Dell EMC; aside from that, they are no different from any other server offering. Conceptually, if you procure any hardware vendor's server and make sure all of its components are S2D certified, you end up with an S2D ready node while only losing the single line of support, which means you call the hardware vendor for any hardware issue and Microsoft for any software issue.

Aside from the "rugged" requirement, the customer workloads were very standard, so a single host could suffice, yet for HA we needed a second server. The reason this is important is that Microsoft is the only enterprise HCI vendor that can support a true 2-node HCI cluster without requiring a third server component, given that these vessels are completely isolated environments. S2D still requires a quorum, but unlike other vendors such as VMware vSAN and Nutanix, the quorum can be hosted on an external file share as of Server 2019, which essentially means that if your router supports CIFS/SMB shares you can use that as your quorum witness, while with other solutions you need to procure a third server, which must be rugged as well, so its price is quite high and it is technically not needed to run resources.

Server 2019 also introduced nested resiliency, additional resiliency types for 2-node clusters that allow the cluster to sustain failures in both fault domains (both servers) at the same time, for example a disk failing in one server and a disk failing in the other, but of course this comes at the cost of usable disk capacity. With standard two-way mirroring we get 50% usable capacity and can sustain one full server failure, but not component failures in both servers at the same time; with the newly introduced nested two-way mirroring we can sustain failures of components in both servers but get 25% usable capacity, while with nested mirror-accelerated parity we get around 40% with a bit of a performance hit. We chose to go with standard two-way mirroring and assigned a hot spare disk on each node, which made more sense cost, performance, and availability wise.
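For reference, and purely as a sketch since we did not use it in this build, a nested two-way mirror volume is created by first defining a storage tier with four data copies and then carving the volume from that tier; the pool, tier, and volume names below are illustrative:

# Sketch only - nested two-way mirror on a 2-node cluster (we used standard two-way mirror instead)
New-StorageTier -StoragePoolFriendlyName S2D* -FriendlyName NestedMirror -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 4
New-Volume -StoragePoolFriendlyName S2D* -FriendlyName Volume01 -FileSystem CSVFS_ReFS -StorageTierFriendlyNames NestedMirror -StorageTierSizes 500GB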

XR2 S2D Goodies :

Lots of vendors jumped in, since there are a lot of vessels, but none of them, without any exception, could come up with a 2-node HCI cluster without some kind of cloud witness (not an option because of latency and internet availability) or an on-premises quorum in the form of a dedicated server, which also had to be rugged. Microsoft S2D was up to the task from a software perspective, but we had to make sure that the XR2 hardware would be certified to run it. S2D Ready Nodes from Dell EMC do not come in a rugged format, so the XR2 was a must, yet it does not come out of the box as an S2D ready node, so we went down the path of specifying the right hardware components for the XR2 to be S2D certified.


Network wise we opted for dual 10GB NICs (back-to-back connectivity, which is supported for a 2-node S2D cluster) because the workload requirements were very standard and the available resources on the servers heavily under-utilized. I would always recommend RDMA capable adapters, such as the Dell-offered adapters that support iWARP or the Mellanox 10/25GB adapters that support RoCE. The built-in dual 1GB NICs were used for workload traffic in a Switch Embedded Teaming (SET) configuration, while the 10GB NICs were dedicated to S2D traffic on dedicated subnets with no gateway (not routed, since it is only 2 nodes connected back-to-back).


HBA wise we opted for the HBA330 12Gbps SAS HBA controller (non-RAID), which is supported by Microsoft for S2D and is an option with Dell servers. This is not a standard option with the Dell EMC XR2, so we had to involve the OEM team, yet it is doable given that you install this HBA as a customer kit, for which I have included a video below. Boot wise we used the BOSS controller with 2 M.2 disks in RAID1, which hosted the physical server OS and the domain controller VMs.


Disk wise we went with 6 x 480GB read-intensive SSDs per node; the bare minimum is four since this is an all-flash SSD cluster, and caching was not required since we don't have write-intensive applications. One disk on every node was labeled as a hot spare, since these are remote vessels and we needed an extra layer of fault tolerance against any disk failure in the cluster.
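As a rough back-of-the-envelope check (my own numbers, not an official sizing): 6 x 480GB per node is about 2.88TB raw per node, or 5.76TB across the two nodes. Two-way mirroring halves that to roughly 2.88TB usable, and reserving one 480GB disk per node as a hot spare takes out about another 480GB of mirrored capacity, leaving around 2.4TB, which is why the 2.2TB volume created later fits comfortably while still leaving some free pool space for rebuilds and metadata.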


Quorum wise we went with a Synology EDS14, which is a very small rugged SD-card NAS and was a perfect fit to be honest. The quorum witness is very small (a couple of MBs), so the 32GB SD-card was far more than required, and the whole box, on top of being rugged, was very cheap. This was connected to the 1GB network on the vessel, which the servers' management NICs connect to as well. A preconfigured spare EDS14 was provided on the vessel as well, to be connected in case the quorum device failed for any reason in the open sea.


Configuration

HBA330 Installation

The provided HBA330 customer kit comes with a riser that does not fit into the XR2 server, so I had to improvise a bit and install it on the riser that came with the default H730 RAID controller being replaced. The video is uncut and unedited, so bear with me on this one as we figure out how to swap the HBAs.

Storage Controllers

Verify that the M.2 disks are configured in RAID1 on the BOSS controller, which is done by default when you have 2 disks. Also verify that the HBA330 is working in pass-through mode and that all locally attached disks are visible.


Domain Controllers

Since Active Directory is a prerequisite for building an S2D Microsoft Failover Cluster, we needed to first ensure that domain controllers are available prior to the cluster build, and second that they do not get affected if anything goes wrong with the Failover Cluster itself. On each server's RAID1 M.2 boot disks we built local Hyper-V VMs that acted as domain controllers, so DC1 was hosted on server 1 and DC2 was hosted on server 2, both on local storage with no HA configured since DCs do not require that. Just remember to set both VMs in Hyper-V settings to automatically start when the servers boot, regardless of the VM's state at the time of the physical server shutdown.
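For reference, the auto-start behaviour can also be set from PowerShell on each host; a minimal sketch using the DC1/DC2 VM names from this build:

# On server 1 (repeat on server 2 for DC2) - always start the DC VM when the host boots
Set-VM -Name DC1 -AutomaticStartAction Start -AutomaticStartDelay 3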


S2D Configuration

Make sure that your Server 2019 is fully up to date, this is very important, and that both servers are on the same patch level, which should absolutely be the latest. Remember that S2D in Server 2019 is a Datacenter feature, so make sure that your licenses are in order. Much of the initial configuration I did manually and the rest through PowerShell, so first let me briefly state what was done manually. Also make sure that all server hardware vendor drivers and BIOS are updated to the latest versions; on Dell EMC servers this is done from the Lifecycle Controller, so update everything to the latest and make sure drivers are correctly installed on the hosts.

I renamed the NICs on each node and assigned a different internal subnet range to every S2D NIC. I have two 10GB NICs on every server that are dedicated to S2D traffic and live migration, so name them and provide an IP and subnet for each NIC on a different subnet (one that is not routed, since it is a 2-node cluster and these are only required for S2D inter-communication). On Server-1 I renamed the 10GB NICs and assigned each an IP from a different subnet, and on Server-2 I did the exact same thing; it is always better to use the same NIC names on both servers and to make sure that the interconnected NICs between the servers have IPs on the same subnet.
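For reference, the same renaming and addressing can be scripted; the original adapter names below are hypothetical, and the 10.20.36.x range for the second pair is my assumption based on the VLAN 36 tag applied later:

# Sketch for Server-1 (adjust names/IPs for Server-2) - rename the two 10GB NICs and assign non-routed IPs
Rename-NetAdapter -Name "SLOT 3 Port 1" -NewName S2D1   # original adapter name is hypothetical
Rename-NetAdapter -Name "SLOT 3 Port 2" -NewName S2D2
New-NetIPAddress -InterfaceAlias S2D1 -IPAddress 10.20.35.1 -PrefixLength 24   # no gateway, no DNS
New-NetIPAddress -InterfaceAlias S2D2 -IPAddress 10.20.36.1 -PrefixLength 24   # assumed subnet for the VLAN 36 pair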


The hosts were joined to the domain and restarted, and the DC VMs were set to auto start with a 3-second delay on each Hyper-V server. The rest of the configuration was PowerShell based, so here are the commands and an explanation of each:

Disable Virtual Machine Queue (VMQ) on management/VMs 1GB NICs ( Run on Both Nodes ).

Disable-NetAdapterVmq -Name NIC1,NIC2

Create Switch Embedded Teaming (SET) on management/VMs 1GB NICs ( Run on Both Nodes ).

Make sure to assign an IP address to the newly created SET management NIC after its creation; I find it easier to do so manually. Note that I am not using any VLANs here as the 1GB network is flat.

New-VMSwitch -Name Mgmt -NetAdapterName NIC1, NIC2 -EnableEmbeddedTeaming $True -AllowManagementOS $True
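If you prefer to script the management IP as well, something like the following would do it; the host IP, gateway, and DNS values are illustrative (the management subnet appears to be 10.20.34.0/24 based on the cluster and witness addresses used later):

# Illustrative addressing for the SET management vNIC on Server-1
New-NetIPAddress -InterfaceAlias "vEthernet (Mgmt)" -IPAddress 10.20.34.11 -PrefixLength 24 -DefaultGateway 10.20.34.1
Set-DnsClientServerAddress -InterfaceAlias "vEthernet (Mgmt)" -ServerAddresses 10.20.34.21, 10.20.34.22   # hypothetical DC VM addresses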

Enable NUMA Node Assignment on S2D 10GB NICs ( Run on Both Nodes ).

A very good read on the subject can be found here; basically, this controls which NIC is bound to which CPU NUMA node. This is not required with RDMA, but it really helps to configure it if you have a standard 10GB NIC. First check how many NUMA nodes are available and apply the NodeId accordingly.

Get-VMHostNumaNode

Set-NetAdapterAdvancedProperty -Name "S2D1" -RegistryKeyword '*NumaNodeId' -RegistryValue '0'

Set-NetAdapterAdvancedProperty -Name "S2D2" -RegistryKeyword '*NumaNodeId' -RegistryValue '1'

Set VLANs on the S2D NICs ( Run on Both Nodes ).

Make sure to assign an IP from the same subnet and VLAN to the interconnected NICs on both nodes. So in our case, S2D1 on Server-1 is connected to S2D1 on Server-2, both tagged with VLAN 35 and with an IP from the range 10.20.35.X/24, while the S2D2 pair is tagged with VLAN 36. No gateway or DNS.

Set-NetAdapter -Name S2D1 -VlanID 35 -Confirm:$False

Set-NetAdapter -Name S2D2 -VlanID 36 -Confirm:$False

Enable Jumbo Frames on S2D NICs ( Run on Both Nodes ).

Get-NetAdapterAdvancedProperty -Name S2D1 -RegistryKeyword "*jumbopacket" | Set-NetAdapterAdvancedProperty -RegistryValue 9014

Get-NetAdapterAdvancedProperty -Name S2D2 -RegistryKeyword "*jumbopacket" | Set-NetAdapterAdvancedProperty -RegistryValue 9014

Configure Receive Side Scaling (RSS) processor assignment on the S2D NICs, checking the current layout first ( Run on Both Nodes ).

Get-NetAdapterRSS

Set-NetAdapterRSS S2D1 -BaseProcessorNumber 2 -MaxProcessors 2 -MaxProcessorNumber 4

Set-NetAdapterRSS S2D2 -BaseProcessorNumber 6 -MaxProcessors 2 -MaxProcessorNumber 8

Add the required roles for S2D and restart ( Run on Both Nodes ).

Install-WindowsFeature Hyper-V, Failover-Clustering, RSAT-Clustering-Powershell, Hyper-V-PowerShell

Set the Interface Metric for Mgmt and S2D NICs ( Run on Both Nodes ).

Set-NetIPInterface -InterfaceAlias "vEthernet (Mgmt)" -InterfaceMetric 1

Set-NetIPInterface -InterfaceAlias "S2D1" -InterfaceMetric 2

Set-NetIPInterface -InterfaceAlias "S2D2" -InterfaceMetric 2

Test the nodes for Cluster Compatibility, Validation, and Support ( Run on One Node ).

Test-Cluster -Node S2D-01,S2D-02 -Include "Storage Spaces Direct", "Inventory", "Network", "System Configuration"

Check the report and make sure that everything is green, otherwise Microsoft might not support the cluster configuration ( Run on One Node ).


Create the Microsoft Failover Cluster ( Run on One Node ).

New-Cluster -Name S2D-Cluster -Node S2D-01,S2D-02 -NoStorage -StaticAddress 10.20.34.16

Configure a Share on Synology NAS or any device and add the share as Cluster Quorum ( Run on One Node ).

Set-ClusterQuorum -FileShareWitness \\10.20.34.7\Witness


Enable Storage Spaces Direct ( Run on One Node ).

Enable-ClusterS2D
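Once enabled, it is worth verifying what was created before building any volumes; a quick check could look like this (output will vary):

# Verify the auto-created pool, the disks it claimed, and any background storage jobs
Get-StoragePool -FriendlyName S2D*
Get-StoragePool -FriendlyName S2D* | Get-PhysicalDisk | Select-Object FriendlyName, MediaType, Usage, HealthStatus
Get-StorageJob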

Set one disk in every node as hot spare ( Run on One Node ).

Get-PhysicalDisk

Get-PhysicalDisk 1006 | Set-PhysicalDisk -Usage HotSpare

Get-PhysicalDisk 2006 | Set-PhysicalDisk -Usage HotSpare

Change the Cluster network settings for the S2D NICs to "Cluster Only" ( Run on One Node ).


Make sure to also change the Live Migration settings by right-clicking Networks in Failover Cluster Manager and selecting only the S2D NICs.
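Both settings can also be applied from PowerShell if you prefer; this is only a sketch, and the cluster network names below assume the networks were renamed to match the NICs, which may differ in your environment:

# Role 1 = cluster only, Role 3 = cluster and client (network names are assumptions)
(Get-ClusterNetwork -Name "S2D1").Role = 1
(Get-ClusterNetwork -Name "S2D2").Role = 1
(Get-ClusterNetwork -Name "Mgmt").Role = 3
# Exclude everything except the S2D networks from live migration
Get-ClusterResourceType -Name "Virtual Machine" | Set-ClusterParameter -Name MigrationExcludeNetworks -Value ([String]::Join(";", (Get-ClusterNetwork | Where-Object { $_.Name -notlike "S2D*" }).Id))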


Create a CSV ReFS Volume ( Run on One Node ).

New-Volume -FriendlyName "S2D-CSV" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName S2D* -ResiliencySettingName Mirror -Size 2.2TB


If you want to use the HCI cluster directly as a file server, then create a volume without the CSVFS file system and add a file server role to the cluster, as sketched below.
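A minimal sketch of that approach; the volume name, role name, clustered disk name, and IP are all hypothetical:

# Sketch only - non-CSV ReFS volume plus a clustered file server role
New-Volume -FriendlyName "FS-Data" -FileSystem ReFS -StoragePoolFriendlyName S2D* -ResiliencySettingName Mirror -Size 500GB
Add-ClusterFileServerRole -Name FS01 -Storage "Cluster Virtual Disk (FS-Data)" -StaticAddress 10.20.34.20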

Also make sure to set the Hyper-V default VM and virtual disk folders to the new C:\ClusterStorage\S2D-CSV path so that VMs are created in the S2D pool by default.
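On each node this is a one-liner, for example:

# Point new VMs and virtual disks at the CSV by default ( Run on Both Nodes )
Set-VMHost -VirtualMachinePath "C:\ClusterStorage\S2D-CSV" -VirtualHardDiskPath "C:\ClusterStorage\S2D-CSV"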

A volume called Cluster Performance History is created by default with Server 2019 S2D; it is used by Windows Admin Center to pull S2D related performance history data, which is really important, so make sure not to delete this volume.

S2D As-Built


S2D Test

The right way to test the performance and scalability of an S2D HCI cluster is to run VMFleet. For a long time I have run S2D in my lab on an unsupported configuration and most of the time I got very bad performance, so this time I was very pleased to see this screen. Copy/paste is not the right metric to test S2D, but I was honestly relieved it was different from my lab.


Other tests included planned downtime, such as draining the cluster roles from one node and restarting the server manually, and unplanned downtime, such as removing a disk from a server or just powering one off forcefully. All supported fault tolerance scenarios were tested and the results were as expected.
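For the planned downtime test, the drain and resume can also be driven from PowerShell; a minimal sketch, waiting for storage repair jobs to finish before touching the second node:

# Drain roles off node 1, reboot it, resume it, then wait for resync to complete
Suspend-ClusterNode -Name S2D-01 -Drain -Wait
Restart-Computer -ComputerName S2D-01 -Force
Resume-ClusterNode -Name S2D-01 -Failback Immediate
Get-StorageJob   # repeat until all repair jobs show Completed before testing the other node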

Windows Admin Center

Installing Windows Admin Center is a very simple and straightforward job, so that is not something I will cover here, but I wanted to show you some very cool options available for S2D through Windows Admin Center.


Many tasks can be accomplished from Windows Admin Center when managing independent servers or an S2D cluster, such as creating/deleting volumes, replacing physical disks, monitoring performance, migrating to Azure and much more, all for free …

Conclusion

Microsoft S2D, specifically on Server 2019, has proven itself to be a worthy contender in the HCI market and should be marketed/sold as such, not only by Microsoft but by partners as well. Partners need to open up different options for their customers in an effort to embrace the right technology for the right requirement, and customers need to practice due diligence when assessing the HCI products available in the market.

May the Peace, Mercy, and Blessing of God Be Upon You

15 thoughts

  1. hi, thanks for the review!

    I also tried to do “Nested Mirroring” for two nodes, have you tried that?
    Would be cool to get any feedback.

    1. Hi, I haven’t tried nested mirroring because the requirements didn’t dictate it, definitely I will let you know once I have tried it. Thanks.

  2. Hi great article, my Windows hosts have 1TB M.2 drive for OS and a number of SATA for the storage pool. Since the Windows OS doesn’t need 1TB, I wish I can partition the M.2 and give the OS 200G and the rest to the Storage Pool as cache or storage but look like S2D only take physical disk.
    Do you have any trick or any suggestion for better utilize the M.2 space.

    1. Thank you. No, Azure Stack will need to utilize the full disk since it is going to be formatted and joined to the pool, so it cannot use just a portion of it. Anyhow, you would need 2 M.2 drives for caching, so only one won't do you any good.

      1. Ah, to clarify I have 4 hosts, each host has 1x M.2 and 4x SATA. The 16 SATA were pooled where as the 4 M.2 are for Windows OS and it is a pity a lot of fast storage is not being used.

  3. Saad,
    Would you recommend separating the physical nodes of the S2D into different locations or keeping them together?
    Thanks

    1. Hi Khalil, I would not recommend that because of possible latency issues, and S2D is latency sensitive as with all HCI. If you want DR/HA, you need to configure two local clusters and then use Storage Replica or Hyper-V Replica, but I wouldn't recommend a stretched cluster for S2D.

  4. Really interesting article. I can’t say I had come across S2D and it does seem like it ticks a lot of boxes for a cost effective “DiY” HCI – not least because at least some customers I’ve spoken to are licensing for Server Datacenter simply due to the Windows guests they are running (irrespective of hypervisor).

    On RDMA, I didn’t see that you mentioned DCB on the switch (unless you direct connected your 2 nodes?) RoCE/iWARP capable NIC *plus* lossless ethernet network was what I had as the pre-reqs for RDMA – do you agree?

    1. Hi Allan, 100% true, most enterprises out there with any kind of hypervisor not just Hyper-V are Datacenter licensed because they use more than 2 VMs on every server which adds to the cost effectiveness of the solution. For this configuration, it was direct connected so no switches in between hence no DCB but even with a bigger cluster or switch connected nodes, I always recommend iWARP especially on Chelsio since it does not require any switch or server additional network configuration. Yes definitely for RDMA, RoCE/iWARP capable NICs and at least a 25GB network is recommended although 10GB would do as well but that all depends on workloads and so on …

  5. Great write up.

    I’ve setup a S2D for the first time myself, and in your opinion, will it make any sense to make a SET Switch for the Storage NICs as well..? I have 2x25GB interconnected NICs on each server and I’m considering using a SET Team for this.

    Or will it always be better to interconnect the 2 NICs on each server and put i.e. NIC1 on server 1 and NIC1 on server 2 in the same VLAN..? I can’t really find any best practice doc on this on the Microsoft S2D documentation (which is really good!).

    Thanks and keep up the good work.

    1. If you are doing a 3-node or larger S2D or Azure Stack HCI cluster nowadays, then yes, do a SET since you won't be able to interconnect the nodes directly, but if you are doing a 2-node cluster then the best approach is to interconnect them without a SET, on different VLANs (which are just conceptual/logical VLANs to separate traffic), to avoid the overhead of and storage dependency on SET. I totally agree, the Microsoft documentation is great. Thanks.

  6. Hello Saadallah, Thank you for the detailed write up on this. Just starting the planning stages of my project. my question for you is. what is the model/serial number of the BOSS controller you used for the 2 X M2 drives?
    Thank you, Michael

    1. Hi Michael, I honestly don't remember and don't have access to the BOM anymore. What I normally do is use the configuration manager from the hardware vendor to make sure that there is space on the board for additional controllers (since it would give an error on the BOQ if there isn't), and any BOSS controller supporting RAID1 would work here; I am sure not many options are available. Which vendor are you going with?
