Step 1 gave the rationale for my choices. Step 2 is about setting up GlusterFS on three nodes. I’m assuming you’re comfortable in a Linux shell. If not, there are lots of free online classes. Look at linux.com.
You can read about GlusterFS here. For those who don’t want to click, it’s a scalable network file system. It follows the folder/file paradigm we’re all used to and can be configured to be highly available. I chose it because it’s fairly easy to set up, understand, and maintain. Support for it is also built into Kubernetes.
This section applies to all nodes:
The first step is to partition, write a file system, and then mount that partition. The recommendation is to use XFS for the file system, but XFS is not supported on some of my SBCs. So we’ll use EXT4. That doesn’t seem to cost us anything material. XFS is supported to 500 TB and EXT4 to 50 TB. If you need 500 TB, then you should be looking for an enterprise solution, not a homelab solution.
There are lots of tutorials on partitioning your hard drive. It’s just a few keystrokes. For me, it’s
sudo fdisk /dev/nvme0n1. Then
n to create a new partition, followed by ENTER a few times to accept the defaults. Write it out with
w. Your partition will now show up as something like
/dev/nvme0n1p1. Of course, I’m starting with blank storage media. If it’s not blank, partitioning will erase your data.
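Since partitioning a disk that already holds data is destructive, it can be worth checking first. Here is a minimal sketch, assuming a Linux system where /proc/partitions lists block devices with the name in the fourth column. The helper name existing_partitions and its optional second argument (a file to read instead of /proc/partitions) are my own inventions for illustration.

```shell
# Hypothetical guard: list partitions that already exist on a disk by
# reading a /proc/partitions-style table (fourth column = device name).
existing_partitions() {
    local disk table
    disk="$(basename "$1")"
    table="${2:-/proc/partitions}"
    awk -v d="$disk" '
        # Disks whose name ends in a digit (nvme0n1) use a "p" before the
        # partition number; others (sda) append the number directly.
        BEGIN { sep = (d ~ /[0-9]$/) ? "p" : "" }
        index($4, d) == 1 && substr($4, length(d) + 1) ~ ("^" sep "[0-9]+$") { print $4 }
    ' "$table"
}

# Prints nothing when the disk has no partitions yet.
if [ -r /proc/partitions ]; then
    existing_partitions /dev/nvme0n1
fi
```

If this prints any names, the disk is not blank, and running fdisk on it deserves a second thought.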
Now create the file system with
sudo mkfs.ext4 /dev/nvme0n1p1. It will take a few seconds.
After that, we need to create a directory and mount the new partition to it. To keep things simple, we create a directory called
/gfs/brick01 with the command
sudo mkdir -p /gfs/brick01. Why ‘brick01’? GlusterFS uses directories called bricks to create volumes. This way, the name gives hints as to what’s going on.
Mounting the partition involves adding a line to
/etc/fstab. First, change to root with
sudo su. Then change fstab with
echo '/dev/nvme0n1p1 /gfs/brick01 ext4 defaults 1 2' >> /etc/fstab
Next mount everything with
mount -a && mount.
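One thing to watch: appending with echo adds a duplicate line every time you rerun it. As a minimal sketch, the hypothetical helper below (add_fstab_entry is my name, not a standard tool) only appends the entry when the mount point is not already listed. Its optional third argument is a file to use instead of /etc/fstab, so you can try it on a scratch file first.

```shell
# Hypothetical helper: append an fstab entry only if its mount point is not
# already present. Arguments: full entry, mount point, optional fstab path.
add_fstab_entry() {
    local entry="$1"
    local mountpoint="$2"
    local fstab="${3:-/etc/fstab}"

    # awk compares the second field exactly and skips comment lines.
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "already present: $mountpoint"
    else
        printf '%s\n' "$entry" >> "$fstab"
        echo "added: $mountpoint"
    fi
}

# Try it against a scratch file rather than the real /etc/fstab.
scratch="$(mktemp)"
add_fstab_entry '/dev/nvme0n1p1 /gfs/brick01 ext4 defaults 1 2' /gfs/brick01 "$scratch"
add_fstab_entry '/dev/nvme0n1p1 /gfs/brick01 ext4 defaults 1 2' /gfs/brick01 "$scratch"
cat "$scratch"
```

The second call leaves the file alone, so the scratch file ends up with a single line.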
Exit back to your normal user with
exit.
Now we install and start GlusterFS. This is a series of commands:
sudo apt install software-properties-common -y
sudo apt-add-repository ppa:gluster/glusterfs-7
sudo apt update
sudo apt install glusterfs-server -y
sudo systemctl enable glusterd.service
sudo systemctl start glusterd.service
At this point, you should have GlusterFS running on all three nodes.
This section is to be done on the first node:
We need to tell GlusterFS where all the nodes are. Since I’m running pfSense as my gateway/DHCP/DNS server, I can use hostnames instead of IP addresses. So here’s what I did:
sudo gluster peer probe kube2
sudo gluster peer probe kube3
Do this on another node:
sudo gluster peer probe kube1
If you don’t do this, then the other nodes will only know kube1 by its IP address, not its hostname. I don’t know that it would hurt anything, but this is safe.
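If you have more than a couple of nodes, the probes can be wrapped in a loop. This is a minimal sketch. The probe_peers function and the DRY_RUN variable are hypothetical names of my own. With DRY_RUN left at its default of 1, it only prints the commands and touches nothing.

```shell
# Hypothetical helper: probe every peer named on the command line.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
probe_peers() {
    local peer
    for peer in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo gluster peer probe $peer"
        else
            sudo gluster peer probe "$peer" || return 1
        fi
    done
}

# On kube1: probe the other two nodes.
probe_peers kube2 kube3
# On kube2 or kube3: probe kube1 back.
probe_peers kube1
```

Set DRY_RUN=0 once the printed commands look right.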
Do this on all nodes:
sudo gluster peer status
You should see something like:
Setting up a volume:
There are some decisions to be made here. GlusterFS has several storage modes:
- Distributed: files are distributed across bricks with no replication. If even a single brick goes down, whatever files are on that brick are gone. But this is the fastest operational mode. Maybe you use this to cache something.
- Replicated: files are copied n times across the bricks in a volume. This makes for slower writes but high availability. It uses the most space because every file is written multiple times.
- Distributed Replicated: like the name says. Multiple copies of files are written to separate bricks. You get fast reads and high availability. It uses the same amount of space as Replicated.
- Dispersed: files are broken into pieces, then written across bricks in a volume so they’re recoverable up to a specified level of failure. This is like RAID 5 or 6. It gets complicated deciding the optimal amount of redundancy. There is some added latency for both reads and writes as well as increased CPU cycles. This makes sense when you’re using mechanical drives or your storage is much slower than your network. In my case, the flash storage is faster than the network.
- Distributed Dispersed: this is the more space-efficient version of Distributed Replicated. Distributed Replicated writes copies of whole files while Distributed Dispersed writes fragments of them. I can see this making sense in an enterprise where you have a super-fast network, are using slow mechanical drives, and have very large files. That’s not me.
Distributed Replicated is the best choice for me.
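To make the space trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes every brick is the same size and ignores file system overhead. The function name usable_gb and its argument order are hypothetical. The fourth argument is the replica count for distributed-replicated, or the number of redundancy bricks for dispersed.

```shell
# Hypothetical helper: usable capacity in GB for a volume made of
# equal-sized bricks. Arguments: mode, brick count, brick size in GB, and
# (where it applies) the replica count or dispersed redundancy count.
usable_gb() {
    local mode="$1" bricks="$2" size="$3" n="${4:-1}"
    case "$mode" in
        distributed)            echo $(( bricks * size )) ;;
        replicated)             echo "$size" ;;                     # every brick holds a full copy
        distributed-replicated) echo $(( bricks / n * size )) ;;    # n = replica count
        dispersed)              echo $(( (bricks - n) * size )) ;;  # n = redundancy bricks
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

usable_gb distributed 3 500               # 1500
usable_gb replicated 3 500                # 500
usable_gb distributed-replicated 6 500 2  # 1500
usable_gb dispersed 3 500 1               # 1000
```

The numbers are only as good as the equal-brick assumption, but they show why replication is the most expensive way to buy availability.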
On all nodes:
Create a directory for the volume with:
sudo mkdir -p /gfs/brick01/gv0
On any node:
sudo gluster volume create gv0 replica 3 arbiter 1 transport tcp kube1:/gfs/brick01/gv0/ kube2:/gfs/brick01/gv0/ kube3:/gfs/brick01/gv0/
Let’s explain what’s going on here. Intuitively, two copies of each file (replica 2) should be enough, but that leads to a problem with split brain: the two bricks can get out of sync, and we cannot tell which one is correct. So we need three replicas, but I don’t really want three full copies of each file. Consequently, we set up the third replica as an arbiter. This replica only stores the file name and metadata. So if there is divergence between bricks, we have enough info to figure out which one is correct.
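Brick order matters in that command: with replica 3 arbiter 1, the third brick listed becomes the arbiter. The sketch below builds the same command from a list of hosts so the order is explicit. The volume_create_cmd name is hypothetical, and the function only prints the command rather than running it.

```shell
# Hypothetical helper: print the create command for a replica 3 arbiter 1
# volume. Arguments: volume name, brick parent directory, then three hosts.
# The last host listed supplies the arbiter brick.
volume_create_cmd() {
    local volume="$1" brick_dir="$2" cmd host
    shift 2
    if [ "$#" -ne 3 ]; then
        echo "need exactly three hosts (two data bricks, one arbiter)" >&2
        return 1
    fi
    cmd="sudo gluster volume create $volume replica 3 arbiter 1 transport tcp"
    for host in "$@"; do
        cmd="$cmd $host:$brick_dir/$volume"
    done
    echo "$cmd"
}

# kube3 ends up as the arbiter here because it is listed last.
volume_create_cmd gv0 /gfs/brick01 kube1 kube2 kube3
```

Putting the node with the least storage last is a reasonable choice, since the arbiter holds no file data.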
Now we start the volume with
sudo gluster volume start gv0. Check to see if the start succeeded with
sudo gluster volume info. You should see something like:
Test it out with:
sudo mkdir -p /mnt/disk
sudo mount -t glusterfs kube1:/gv0 /mnt/disk
for i in `seq -w 1 100`; do echo 'test' > /mnt/disk/copy-test-$i; done
So that will create one hundred files in /mnt/disk with copy-test starting each name. Count them with
ls -lA /mnt/disk/copy-test* | wc -l. It should return with 100. Run
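ls -lA /gfs/brick01/gv0/copy-test* | wc -l on kube1, kube2, and kube3. The two data bricks should return 100. The arbiter brick keeps only file names and metadata, so expect it to list the same 100 names as zero-byte files rather than real copies. If that’s the case, then GlusterFS is up and running.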
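If you would rather not eyeball the counts, the check can be scripted. This is a minimal sketch. The check_count helper is a hypothetical name, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: count directory entries whose names start with a
# prefix and compare the count against what we expect.
check_count() {
    local dir="$1" prefix="$2" expected="$3" actual
    actual=$(find "$dir" -maxdepth 1 -name "${prefix}*" | wc -l | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $dir has $actual"
    else
        echo "MISMATCH: $dir has $actual, expected $expected"
        return 1
    fi
}

# Small local demonstration in a scratch directory.
demo="$(mktemp -d)"
for i in 1 2 3; do echo 'test' > "$demo/copy-test-$i"; done
check_count "$demo" copy-test- 3

# On a client with the volume mounted:
#   check_count /mnt/disk copy-test- 100
# On each node, pass the brick path and the count you expect for that brick:
#   check_count /gfs/brick01/gv0 copy-test- 100
```

The non-zero return on a mismatch makes it easy to chain this into a larger health check.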