This is an automated archive made by the Lemmit Bot.

The original was posted on /r/selfhosted by /u/ElGatoPanzon on 2024-11-13 15:26:57+00:00.


Scheduling and clustering have been a thing for a long time now. There are solutions like Docker Swarm, Nomad and the massive k8s to schedule containers across multiple nodes. 2 years ago I was still manually provisioning and setting up services outside of Docker and giving all my servers cute names. But I wanted to up my game and go with a more “cattle, not pets” solution. And oh boy, it sent me down a huge rabbit hole, but I finally got there.

So 2 years ago I set out to build what I call the “I don’t care where it is” setup. It has the following goals:

  1. Provision and extend a server cluster without manual intervention (for this I used Ansible and wrote 50+ Ansible roles)
  2. Automatic HTTP/HTTPS routing and SSL (for this I used Consul, a custom script to generate nginx configs, and another script to generate certs from Consul data)
  3. Schedule a Docker container job to run on one or more servers in the cluster (for this I went with Nomad, and it’s great; a minimal job-submission sketch follows this list)
  4. Nodes need to be 100% ephemeral (essentially, every single node needs to be disposable and re-creatable with a single command, without worry)
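
To give an idea of point #3: scheduling a container is basically handing Nomad a job spec and letting it pick a node. Here’s a minimal sketch against Nomad’s HTTP API; the job name, image, datacenter and agent address are placeholders, not my actual setup:

```python
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumed local Nomad agent

# Minimal Docker job: Nomad decides which node it ends up on.
job = {
    "Job": {
        "ID": "whoami",                       # placeholder job name
        "Name": "whoami",
        "Datacenters": ["dc1"],
        "Type": "service",
        "TaskGroups": [{
            "Name": "web",
            "Count": 1,
            "Tasks": [{
                "Name": "whoami",
                "Driver": "docker",
                "Config": {"image": "traefik/whoami:latest"},
                "Resources": {"CPU": 100, "MemoryMB": 64},
            }],
        }],
    }
}

# Register the job via Nomad's jobs endpoint.
req = urllib.request.Request(
    f"{NOMAD_ADDR}/v1/jobs",
    data=json.dumps(job).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns an evaluation ID on success
```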

Regarding point #2, I know that Traefik exists and I used it for this solution for a year. However, it had one major flaw: you cannot run multiple instances of Traefik doing ACME certs, for two reasons: Let’s Encrypt rate limits, and the fact that Traefik’s ACME storage cannot be shared between instances. For a long time Traefik was a bottleneck in the sense that I couldn’t scale it out even if I wanted to. So I ultimately wrote my own solution with nginx and Consul, generating certs with certbot and feeding them to multiple nginx instances.
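
The nginx/Consul script is conceptually simple. Here’s a stripped-down sketch of the idea (not my actual script: the “http” tag convention, domain, and paths are placeholders). It asks Consul for the healthy instances of each service and renders upstream/server blocks from them:

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"          # local Consul agent (placeholder)
OUT = "/etc/nginx/conf.d/services.conf"   # generated config (placeholder path)

def consul_get(path):
    with urllib.request.urlopen(f"{CONSUL}{path}") as resp:
        return json.loads(resp.read())

blocks = []
# /v1/catalog/services returns every registered service and its tags.
for name, tags in consul_get("/v1/catalog/services").items():
    if "http" not in tags:   # made-up convention: only expose services tagged "http"
        continue
    # Only instances passing their health checks end up in the upstream.
    instances = consul_get(f"/v1/health/service/{name}?passing")
    servers = [
        f"    server {i['Service']['Address'] or i['Node']['Address']}:{i['Service']['Port']};"
        for i in instances
    ]
    if not servers:
        continue
    upstream = f"upstream {name} {{\n" + "\n".join(servers) + "\n}\n"
    vhost = (
        f"server {{\n"
        f"    listen 443 ssl;\n"
        f"    server_name {name}.example.lan;\n"
        f"    ssl_certificate     /etc/letsencrypt/live/{name}.example.lan/fullchain.pem;\n"
        f"    ssl_certificate_key /etc/letsencrypt/live/{name}.example.lan/privkey.pem;\n"
        f"    location / {{ proxy_pass http://{name}; }}\n"
        f"}}\n"
    )
    blocks.append(upstream + vhost)

with open(OUT, "w") as f:
    f.write("\n".join(blocks))
# The real version then reloads nginx (nginx -s reload) and re-runs on Consul changes.
```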

Where I was ultimately stuck, however, was #4. Non-persistent workloads are a non-issue: they don’t persist data, so they can show up on any node. My first solution (and the one I used for a long time) was essentially running all my deployments on NFS mounts: the deployment data lived on a handful of nodes, and a web of NFS mounts gave every worker node in the cluster access to it. And it worked great! Until I lost a node, deployments couldn’t reach that storage anymore, and it brought down half my cluster.

I decided it was time to tackle #4 again. Enter the idea of the Distributed File System, or DFS for short. Technically NFS is a DFS, but what I specifically wanted was a replicating DFS, where a deployment’s data exists on multiple nodes and gets automatically replicated. That way, if a node stops working, the deployment’s data still exists somewhere else and it can be rescheduled and come back up without data loss.

MooseFS changed the game for me, here’s how:

1. I no longer need RAID storage

It took me a while to come to this conclusion, and I’d like to explain it because I believe this could save a lot of money and decrease hardware requirements. MooseFS uses “chunkservers”: daemons running on a server that offer local storage to the cluster for storing chunks. These chunkservers can use any number of storage devices of any type and any size, and it does not need to be 2 or more per node. In fact, MooseFS isn’t meant to sit on top of a typical RAID; it wants JBOD (disks passed as-is to the system).

In my eyes, 1 node with a 2-disk ZFS mirror or mdraid gives you redundancy against 1 failure. In the MooseFS world, 2 nodes each with 1 disk is the same setup: MooseFS handles replicating chunks to the 2 chunkservers running on those nodes, but it’s even better, because you can lose an entire node and the other node still has all the chunks. With RAID, if you lose the node, it doesn’t matter whether it had 2 disks or 200 disks, they are all down!
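
The replication level itself is just a per-path attribute you set from any client mount. A minimal sketch with a placeholder path, using the standard MooseFS client tools:

```python
import subprocess

MFS_PATH = "/mnt/mfs/deployments"  # placeholder path on the MooseFS client mount

# Ask the cluster to keep 2 copies of every chunk under this path,
# i.e. the "2 nodes with 1 disk each" equivalent of a 2-disk mirror.
subprocess.run(["mfssetgoal", "-r", "2", MFS_PATH], check=True)

# Check what the cluster will actually maintain.
subprocess.run(["mfsgetgoal", "-r", MFS_PATH], check=True)
```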

2. I no longer care about disk failures or complete node failures

This is what I recently discovered after migrating my whole cluster to MooseFS. Deployment data exists on the cluster, every node has access to the cluster, and deployments show up on any worker (I don’t even care where they are, they just work). I lost a 1TB NVMe chunkserver yesterday. Nothing happened. MooseFS complained about a missing chunkserver, but quickly rebalanced the chunks to other nodes to maintain the minimum replication level I set (3). Nothing happened!

I still have a dodgy node in rotation. For some reason, it randomly ejects NVMe drives (either 0 or 1), or locks up. And that has been driving me insane the last few months. Because, until now, whenever it died half my deployments died and it really put a dent in the cluster. Now, when it dies, nothing happens and I don’t care. I can’t stress this enough. The node dies, deployments are marked as lost, and instantly scheduled on another node in the cluster. They just pick right back up where they left off because all the data is available.

3. Expanding and shrinking the cluster is so easy

With RAID the requirements are pretty strict. You can put a 4TB and a 2TB disk in a RAID, and you only get 2TB. Or you can make a RAIDZ2 with, say, 4 disks, and you cannot expand the number of disks in the pool, only their sizes, and they all have to be the same size.

Well, not with MooseFS. Whatever storage you have can be used for chunk storage, down to the MB. Here’s an example I went through while testing and migrating. I set up chunkservers on some microSDs and USB sticks and started putting data on them: 3 chunkservers, one with a 128GB USB stick, one with a 64GB microSD, and one with about 80GB free. It started filling them up evenly with chunks, until the 64GB one filled and then it put most of the chunks on the other 2. With replication level 2, that was fine. Then I wanted replication level 3, so I picked up 3 cheap 256GB USB sticks, added one to each node, and marked the previous 3 chunkservers for removal. This triggered migrations of chunks to the 3 new USB sticks. Eventually I added more of my real nodes’ storage to the cluster, concluded the USB sticks were too slow (high read/write latency), and marked all 3 for removal. It migrated all the chunks to the rest of the storage I had added. I was adding storage in TBs at that point: a 1TB SSD, then one of my 8TB HDDs, then a 2TB SSD. Adding and removing is not a problem!
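
Marking a disk for removal is a one-character change in the chunkserver’s disk list. A rough sketch of how I’d script it (paths are placeholders, and it assumes the stock config location):

```python
import subprocess

HDD_CFG = "/etc/mfs/mfshdd.cfg"    # the chunkserver's disk list (default path)
RETIRING = "/mnt/usb-256gb"        # placeholder: the disk I want drained

# Prefixing a path with '*' in mfshdd.cfg marks that disk for removal;
# the cluster then replicates its chunks elsewhere before you pull it.
with open(HDD_CFG) as f:
    lines = f.read().splitlines()

with open(HDD_CFG, "w") as f:
    for line in lines:
        if line.strip() == RETIRING:
            line = "*" + line.strip()
        f.write(line + "\n")

# Tell the chunkserver to re-read its disk list.
subprocess.run(["mfschunkserver", "reload"], check=True)
```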

4. With MooseFS I can automatically cache data on SSDs, and automatically back up to HDDs

MooseFS offers something called storage classes. You define a handful of properties that apply per path in the cluster, by giving labels to each chunkserver and then specifying how to use them:

  • Creation label: which chunkservers are used when files are being written/created
  • Storage label: which chunkservers the chunks are stored on (and kept) after they are fully written
  • Trash label: when deleting a file, which chunkservers hold the trash
  • Archive label: when the archive period passes for the file, which chunkservers hold the chunks

To get a “hot data” setup (everything is written to and read from SSDs as long as it was accessed or modified within X time), the storage class is configured to prefer creating and keeping data on SSDs, and to archive it to HDDs after a certain time such as 24 hours, 3 days or 7 days.
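
A sketch of what that class looks like with MooseFS’s admin tools; the labels, copy counts, class name and path are placeholders, and the count-prefix label syntax plus the unit of -d are from my reading of the manual, so check `man mfsscadmin` before copying:

```python
import subprocess

# Chunkserver labels are set per chunkserver in mfschunkserver.cfg,
# e.g. "LABELS = SSD" on the NVMe/SSD nodes and "LABELS = HDD" on the rest.

# Hot-data class: create and keep chunks on SSD-labelled chunkservers,
# move them to HDD-labelled ones once the archive delay passes.
subprocess.run([
    "mfsscadmin", "create",
    "-C", "2SSD",   # where new chunks are written (2 copies on SSD servers)
    "-K", "2SSD",   # where chunks are kept while hot
    "-A", "2HDD",   # where chunks go once archived
    "-d", "7",      # archive delay (days, as I read the -d flag)
    "hot",          # class name (placeholder)
], check=True)

# Apply it recursively to a path on the mounted filesystem.
subprocess.run(["mfssetsclass", "-r", "hot", "/mnt/mfs/deployments"], check=True)
```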

In addition to this, the storage and archive labels include an additional replication target. I have a couple of USB HDDs connected and set up as chunkservers, but they are not used for the deployments’ data; they are specifically for backups, and they have labels which are included in the storage class. This ensures that the important data I apply the storage class to gets its chunks replicated to those USB HDDs too, but the cluster won’t read from them because it’s set to prefer the labels of the other active chunkservers. The end result: automatic, local, instant, rsync-style backups!
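
The backup variant is the same idea with the BACKUP label appended to the keep/archive expressions. Again a hedged sketch with made-up names and my reading of the label-expression syntax:

```python
import subprocess

# Identical to the hot class, but every chunk also gets one extra copy on the
# chunkservers labelled BACKUP (the USB HDDs). "2SSD,BACKUP" is my reading of
# the manual: two copies on SSD-labelled servers plus one on a BACKUP server.
subprocess.run([
    "mfsscadmin", "create",
    "-C", "2SSD",
    "-K", "2SSD,BACKUP",
    "-A", "2HDD,BACKUP",
    "-d", "7",
    "important",                      # class name (placeholder)
], check=True)

subprocess.run(["mfssetsclass", "-r", "important", "/mnt/mfs/important"], check=True)
```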

The problems and caveats of using a DFS

There are some differences and caveats to such a deployment. It isn’t free, resource-wise. It requires a lot more network activity, and I’m lucky most of my worker nodes have 2x 2.5GbE NICs. Storage access speed is network-bound, so you don’t get NVMe speeds even if the cluster were made up entirely of NVMe storage: a 2.5GbE link tops out at roughly 300 MB/s, and what you actually get is whatever the network can handle, minus overhead.

There is one single point of failure with the GPL version of the MooseFS cluster: the master server. Currently I have it running on my main gateway node, which also runs my nginx, accepts incoming traffic and handles routing to the right nodes. So if that node goes down, my entire cluster is inaccessible. Luckily there’s another type of daemon called a metalogger, which continuously logs the metadata from the master and acts as an instant backup. Any metalogger server can easily be turned into a running master, for disaster recovery.
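
For reference, my understanding of that promotion procedure, written out as a rough sketch; the file naming and steps are from my memory of the MooseFS docs, so treat it as a checklist rather than something to run blindly:

```python
import glob
import os
import subprocess

# Rough disaster-recovery sketch: turning a metalogger into the master.
MFS_DATA = "/var/lib/mfs"  # default data dir for both master and metalogger

# 1. The metalogger stores its copies with an _ml suffix
#    (metadata_ml.mfs.back, changelog_ml.*.mfs); the master expects the
#    same files without the suffix.
for path in glob.glob(os.path.join(MFS_DATA, "*_ml.*")):
    os.rename(path, path.replace("_ml.", "."))

# 2. Start the master with -a so it rebuilds current metadata from the
#    backup plus the replayed changelogs.
subprocess.run(["mfsmaster", "-a"], check=True)

# 3. Repoint the cluster (chunkservers, mounts, the mfsmaster DNS name)
#    at this node.
```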

Did I mention it’s a hybrid cluster made up of both local and remote nodes? I have certain workloads running in the cloud and others running locally, all reaching each other over a zero-trust WireGuard VPN. All of my deployments bind to WireGuard’s wg0 and cannot be reached on the public (or even local LAN) addresses, and everything travels over WireGuard, even the MooseFS cluster traffic.
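
Binding to wg0 just means binding to wg0’s address instead of 0.0.0.0. A tiny sketch of the idea (port and interface name are placeholders, and it assumes Linux with iproute2’s JSON output):

```python
import json
import socket
import subprocess

# Resolve wg0's IPv4 address.
out = subprocess.run(["ip", "-j", "-4", "addr", "show", "wg0"],
                     capture_output=True, text=True, check=True).stdout
wg0_ip = json.loads(out)[0]["addr_info"][0]["local"]

# Listen only on the WireGuard address: nothing arriving on the public
# or LAN interfaces can reach this socket.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((wg0_ip, 8080))  # port is a placeholder
srv.listen()
print(f"listening on {wg0_ip}:8080 (wg0 only)")
```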

It’s been just over 2 years but I finally reached enlightenment with this setup!