If you would like to read any of the other chapters of this blog series, click the links below:
- Part 1 – Introduction and Licensing
- Part 2 – Architecture and Hardware
- Part 3 – Data Availability
- Part 4 – Fault Domains and Stretched Clusters
- Part 5 – Failure Events
- Part 6 – Compression, Deduplication and QoS
- Part 7 – Monitoring and Reporting
Now that I’ve covered the high-level architecture of vSAN, it’s time to look at how data is structured within the vSAN datastore and how vSAN uses multiple hosts within a cluster to provide data redundancy. And let’s not forget one of the biggest benefits of vSAN: all of this is configured using policies defined at the software layer.
Storage Policy-Based Management (SPBM)
When it comes to administering traditional storage arrays, you typically create volumes or LUNs and present these to vSphere, where they are formatted with VMFS ready for virtual machine placement. It’s at this volume level that many settings are configured, such as RAID protection and available capacity, along with services such as thin provisioning, replication, compression and encryption. This approach can increase the administrative overhead when deploying a new workload, as you need to know in advance which volume to place the virtual machine on. In some larger organisations storage configuration is looked after by a different set of administrators, which can add a further layer of complexity. With vSAN you are free to consume these features and services on a per-VM basis rather than living within the confines of a LUN management model.
Storage policies are created within vCenter and are assigned to virtual machines and their objects, such as the virtual disk (vmdk file). If the requirements of a workload or virtual machine change, the policy can simply be changed on the fly. You won’t need to shut the machine down first or perform any form of storage migration.
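To make the idea concrete, here is a minimal sketch, in Python rather than anything vSAN-specific, of a storage policy treated as data that travels with the VM instead of the LUN. The class, policy and vmdk names here are all hypothetical; this is not the SPBM API.

```python
from dataclasses import dataclass


# Hypothetical model of a storage policy's rules - not the vSAN or SPBM API.
@dataclass
class StoragePolicy:
    name: str
    failures_to_tolerate: int    # pFTT: number of failures the data must survive
    fault_tolerance_method: str  # "mirroring" or "erasure coding"
    stripe_width: int = 1        # number of disk stripes per object


gold = StoragePolicy("Gold", failures_to_tolerate=2, fault_tolerance_method="mirroring")
silver = StoragePolicy("Silver", failures_to_tolerate=1, fault_tolerance_method="erasure coding")

# Policies are assigned per VM or per vmdk, not per LUN.
assignments = {
    "sql01.vmdk": gold,
    "fileserver01.vmdk": silver,
}

# If a workload's requirements change, you simply swap the policy on the fly -
# no storage migration or downtime required.
assignments["fileserver01.vmdk"] = gold
```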
Data Availability
In part 1 of this series I mentioned that vSAN doesn’t use any form of hardware RAID to protect data; instead, data protection is configured at the software layer. The primary level of failures to tolerate (pFTT) setting within a storage policy determines the number of failures the vSAN cluster can tolerate whilst still maintaining data availability. The most commonly configured pFTT setting is 1. This means the cluster will be able to tolerate the loss of a single cache or capacity device, network card or host and still maintain data availability.
As an example, with a pFTT policy of 1 the data is distributed across the hosts as two copies of the vmdk file, each on a separate host, plus a witness component on a third host to act as a tiebreaker in the event of a network failure. Because the data is effectively mirrored, the capacity consumed on the vSAN datastore is twice the original vmdk size. For example, if the virtual machine has a 100GB disk it will consume 200GB of capacity on the datastore. It is possible to configure different levels of protection if some workloads are more critical than others, and because this is all policy driven it’s very easy to achieve. One important consideration is that a greater pFTT level consumes more capacity and requires a greater number of hosts. The table below gives an example of the consumed capacity and number of hosts required for each pFTT level.
| pFTT Level | Method | Hosts Required | Capacity Overhead | Capacity Consumed for a 100GB vmdk |
|---|---|---|---|---|
| 1 | Mirror | 3 | 2x | 200GB |
| 2 | Mirror | 5 | 3x | 300GB |
| 3 | Mirror | 7 | 4x | 400GB |
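If you prefer to see the maths, here is a short Python sketch of the mirroring figures in the table above. It assumes the rules described in this post: a mirror keeps pFTT + 1 full copies of the data and needs 2 × pFTT + 1 hosts, with the extra hosts holding witness components.

```python
def mirror_requirements(vmdk_gb, pftt):
    """Capacity and host count for a mirrored object at a given pFTT level."""
    copies = pftt + 1          # full copies of the data
    hosts = 2 * pftt + 1       # data copies plus witness components
    return hosts, f"{copies}x", vmdk_gb * copies


for pftt in (1, 2, 3):
    hosts, overhead, consumed_gb = mirror_requirements(100, pftt)
    print(f"pFTT {pftt}: {hosts} hosts, {overhead} overhead, {consumed_gb}GB consumed")

# pFTT 1: 3 hosts, 2x overhead, 200GB consumed
# pFTT 2: 5 hosts, 3x overhead, 300GB consumed
# pFTT 3: 7 hosts, 4x overhead, 400GB consumed
```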
As you can see, increasing the pFTT level above 1 has quite a dramatic effect on storage efficiency within vSAN. With the release of vSAN 6.2, new features were added to help address this.
Erasure Coding
This feature is only available with all-flash deployments and requires a minimum of 4 hosts. It’s important to note the cluster size does not need to be a multiple of 4.
Erasure coding is similar to how RAID 5 and RAID 6 operate within a traditional storage array, but remember we are not configuring any form of hardware RAID; this is all configured using storage policies. Erasure coding provides the same level of redundancy as a mirror but with a lower storage overhead, making it more efficient. As with hardware RAID, the data is split into multiple chunks and distributed over multiple hosts, with parity data also written to protect against data loss.
RAID5 erasure coding requires a minimum of 4 hosts and provides a pFTT of 1. The data is split over 3 hosts with a fourth host holding the parity data.
RAID6 erasure coding requires a minimum of 6 hosts and provides a pFTT of 2. The data is split over 4 hosts with parity data written to a further two hosts.
The table below shows how erasure coding compares to mirroring using our 100GB virtual machine.
| pFTT Level | Method | Hosts Required | Capacity Overhead | Capacity Consumed for a 100GB vmdk |
|---|---|---|---|---|
| 1 | RAID 5 | 4 | 1.33x | 133GB |
| 2 | RAID 6 | 6 | 1.5x | 150GB |
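The same comparison in code, assuming the layouts described above (RAID 5 as 3 data segments plus 1 parity, RAID 6 as 4 data segments plus 2 parity):

```python
def mirror_capacity_gb(vmdk_gb, pftt):
    return vmdk_gb * (pftt + 1)


def erasure_coding_capacity_gb(vmdk_gb, pftt):
    # pFTT 1 -> RAID 5 (3 data + 1 parity), pFTT 2 -> RAID 6 (4 data + 2 parity)
    data, parity = {1: (3, 1), 2: (4, 2)}[pftt]
    return vmdk_gb * (data + parity) / data


for pftt in (1, 2):
    print(
        f"pFTT {pftt}: mirroring {mirror_capacity_gb(100, pftt):.0f}GB, "
        f"erasure coding {erasure_coding_capacity_gb(100, pftt):.0f}GB"
    )

# pFTT 1: mirroring 200GB, erasure coding 133GB
# pFTT 2: mirroring 300GB, erasure coding 150GB
```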
An important consideration when choosing which method of protection to use is the additional overhead generated by erasure coding, commonly referred to as I/O amplification. During normal operations there is no amplification of read I/O; however, write I/O is amplified because the parity data needs to be updated every time new data is written. The process can be described as follows:
- read the part of the fragment that needs to be modified
- read the relevant parts of the old parity data to recalculate their values
- combine the old values with the new data to calculate the new parity
- write the new data
- write the new parity
For RAID 5 this results in 2 reads and 2 writes on the storage; for RAID 6, 3 reads and 3 writes. This means there is more network traffic between nodes during write operations compared to mirroring. Additionally, if a node fails, I/O amplification can also occur during read operations. Don’t let this put you off: because flash devices can provide a substantial number of IOPS, I/O amplification is often less of a concern when weighed against the capacity savings over mirroring. There is no one-size-fits-all approach, and with vSAN you are free to pick and choose which workloads require high performance (mirroring) and which require capacity efficiency (erasure coding). Remember, this is all driven by storage policies that you control and can change on the fly.
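Here is a rough sketch of the back-end I/O generated by a single front-end write, based on the read-modify-write steps above. Mirroring pays only in extra writes (one per copy), while erasure coding adds reads for the parity update. This is a simplification that ignores caching and full-stripe writes.

```python
def backend_io_per_write(method, pftt=1):
    """Approximate back-end reads/writes for one front-end write."""
    if method == "mirroring":
        return {"reads": 0, "writes": pftt + 1}   # one write per data copy
    if method == "raid5":
        return {"reads": 2, "writes": 2}          # read old data + parity, write both back
    if method == "raid6":
        return {"reads": 3, "writes": 3}          # read old data + two parities, write all three
    raise ValueError(f"unknown method: {method}")


for method, pftt in [("mirroring", 1), ("raid5", 1), ("raid6", 2)]:
    print(method, backend_io_per_write(method, pftt))
```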
Hopefully you now have a good understanding of how data is protected within a vSAN cluster using storage policies. I’ll wrap up this post with a look at how data is stored within the vSAN datastore and a brief look at striping.
When you create a virtual machine on a vSAN datastore it is made up of several objects, which include the following:
- VM home (vmx file)
- VM swap file
- Virtual disk (vmdk)
- Delta disk (snapshot deltas)
Each of these objects can be made up of one or more components, determined by the settings configured in the storage policy and by the size of the object. The maximum size of a component is currently 255GB, so anything larger than this is split into multiple smaller components. For example, an 800GB vmdk file will be split into 4 components. The resulting components may reside on the same physical disk.
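As a quick illustration, here is that split as a calculation, assuming the 255GB maximum component size mentioned above and ignoring any extra components created by policy settings:

```python
import math

MAX_COMPONENT_GB = 255


def components_for(object_size_gb):
    """Number of components an object is split into based on size alone."""
    return max(1, math.ceil(object_size_gb / MAX_COMPONENT_GB))


print(components_for(100))  # 1
print(components_for(800))  # 4, as in the 800GB vmdk example above
```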
Striping
It is possible to configure the number of disk stripes per object setting (or stripe width) in the storage policy. This stripes the data over multiple capacity devices, similar to RAID 0, and before you change this setting there are a couple of limitations to be aware of. The first is that the stripe width cannot be greater than the number of capacity devices available. The second is that striped components from the same object cannot reside on the same physical device. Now, you may have seen me mention RAID 0 and think that’s bad: if I lose a disk my data’s gone. Well, don’t forget the pFTT policy setting ensures there are always multiple copies of the data located across different hosts. Striping may result in greater performance, but this is not guaranteed.
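To illustrate the first of those limitations, here is a minimal sketch of the kind of check involved. The function name and numbers are my own and this is not how vSAN itself validates a policy; the second limitation is a placement rule vSAN enforces when it distributes components.

```python
def validate_stripe_width(stripe_width, capacity_devices_in_cluster):
    """Reject a stripe width greater than the number of capacity devices available."""
    if stripe_width > capacity_devices_in_cluster:
        raise ValueError(
            f"stripe width {stripe_width} exceeds the "
            f"{capacity_devices_in_cluster} capacity devices available"
        )


validate_stripe_width(3, capacity_devices_in_cluster=8)     # OK
# validate_stripe_width(12, capacity_devices_in_cluster=8)  # would raise ValueError
```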
Part four will cover fault domains and stretched clusters that can further enhance the redundancy of virtual machines.