RAID vs. Software-defined storage (SDS)?

John Richard@lemmy.world · 1 year ago

RAID vs. Software-defined storage (SDS)?

terribleplan@lemmy.nrd.li · 1 year ago

Based on the hardware you have I would go with ZFS (using TrueNAS would probably be easiest). Generally with such large disks I would suggest using at least 2 parity disks, but seeing as you only have 4 that means you would lose half your storage to be able to survive 2 disk failures. Reason for the (at least) 2 parity disks is (especially with identical disks) the risk of failure during a rebuild after one failure is pretty high since there is so much data to rebuild and write to your new disk (like, it will probably take more than a day).

Can’t talk much about backup as I just have very little data that I care enough about to backup, and just throw that into cloud object storage as well as onto my local high-reliability storage.

I have tried many different solutions, so will give you a quick overview of my experiences, thoughts, and things I have heard/seen:

Single Machine

Do this unless you have to scale beyond one machine

ZFS (on TrueNAS)

It’s great, with a few exceptions.
Uses data checksums so it can detect bitrot when performing a “scrub”.
Super simple to manage, especially with the ~~FreeNAS~~ TrueNAS GUI.
Can run some services via Jails and/or plugins
It only works on a single machine, which became a limiting factor for me.
It can’t add disks one at a time, you have to add an entire vdev (another set of drives in RAID-Z or whatever you choose).
You have to upgrade all disks in a vdev to use higher capacity disks.
Has lots of options for how to use disks in vdevs:
- Stripe (basically RAID-0, no redundancy, only for max performance)
- Mirror (basically RAID-1, same data on every disk in the vdev)
- RAID-Zx (basically RAID-5, RAID-6, or <unnamed raid level better than 6>, uses x # of disks for parity, meaning that many disks can be lost)
ZFS send seems potentially neat for backups, though I have never used it

MDADM

It’s RAID, just in your linux kernel.
Has been in the kernel for years, is quite reliable. (I’ve been using it for literally years on a few different boxes as ZFS on Linux was less mature at the time)
You can make LVM use it mostly transparently.
I would probably run ZFS for new installs instead.

BTRFS

Can’t speak to this one with personal experience.
Have heard it works best on SSDs, not sure if that is the case any more.
The RAID offerings used to be questionable, pretty sure that isn’t the case any more.

UnRaid

It’s a decently popular option.
It lets you mix disks of different capacity, and uses your largest disk for parity
Can just run docker containers, which is great.
Uses a custom solution for parity, so likely less battle-hardened and less eyes on it vs ZFS or MDAM.
Parity solution reminds me of RAID-4, which may mean higher wear on your parity drive in some situations/workloads.
I think they added support for more than one parity disk, so that’s neat.

Raid card

Capabilities and reliability can vary by vendor
Must have battery backup if you are using write-back for performance gains
Seemingly have fallen out of favor to JBODs with software solutions (ZFS, BTRFS, UnRaid, MDADM)
I use the PERCs in my servers for making a RAID-10 pool out of local 2.5in disks on some of my servers. Works fine, no complaints.

JBOD

Throwing this in here as it is still mostly one machine, and worth mentioning
You can buy basically a stripped down server (just a power supply and special SAS expander card) that you can put disks in, and lets you connect that shelf of storage to your actual server
May let you scale some of those “Single Machine” solutions beyond the number of drive bays you have.
Is putting a number of eggs in one basket as far as hardware goes if the host server dies, up to you to decide how you want to approach that.

Multi-machine

Ceph

Provides block (RBD), FS, and Object (S3) storage interfaces.
Used widely by cloud providers
Companies I’ve seen run it often have a whole (small) team just to build/run/maintain it
I had a bad experience with it
- Really annoying to manage (even with cephadm)
- Broke for unclear reasons, while appearing everything was working
- I lost all the data I put into during testing
- My experience may not be representative of what yours would be

SeaweedFS

Really neat project
Combines some of the best properties of replication and erasure coding
- Stores data in volume files of X size
- Read/Write happens on replica volumes
- Once a volume fills you can set it as read only and convert it to erasure coding for better space efficiency
- This can make it harder to reclaim disk space, so depending on your workload may bot be right for you
Has lots of storage configuration options for volumes to tolerate machine/rack/row failures.
Can shift cold data to cloud storage and I think even can back itself up to cloud storage
Can provide S3, WebDAV, and FUSE storage natively
Very young project
Management story is not entirely figured out yet
I also lost data while testing this, though root cause there was unreliable hardware

Tahoe LAFS

Very brief trial
Couldn’t wrap my head around it
Seems interesting
Seems mostly designed for storing things reliably on untrusted machines, so my use case was probably not ideal for it.

MooseFS/LizardFS

Looked neat and had many of the features I want
Some of those features are only on (paid) MooseFS Pro or LizardFS (seemingly abandoned/unmaintained)

Gluster

Disks can be put into many different volume configurations depending on your needs
- Distributed (just choose a disk for each file, no redundancy)
- Replicated (store every file on every disk, very redundant, wastes lots of space, only as much space as the smallest disk)
- Distributed Replicated (Distributed across Replicated sets, add X disks as a Replicated set of disks, choose one of the replica sets and store the file on every disk in that set, is how you scale Replicated disks, each replica can only be as big as the smallest member disk, you must add X disks at a time)
- Dispersed (store each file across every disk using X disks for parity, tolerates X disk failures, only as much space as the smallest disk * (number of disks - X), means you are only losing X disks worth of parity)
- Distributed Dispersed (Distributed across Dispersed sets, add X disks as a Dispersed set of disks with Y parity, choose one of the disperse sets and store each file across its X disks using Y disks for parity, is how you scale Dispersed disks, each disperse only has as much space as the smallest disk * (X - Y), you must add X disks at a time)
Also gets used by enterprises
Anything but dispersed stores full files on a normal filesystem (vs Ceph using its own special filesystem, vs Seaweed that stores things in volume files) meaning in a worst case recovery scenario you can read the disks directly.
Very easy to configure
I am using it now, my testing of it went well
- Jury is still out on what maintenance looks like

Kubernetes-native

Consider these if you’re using k8s. You can use any of the single-machine options (or most of the multi-machine options) and find a way to use them in k8s (natively for gluster and some others, or via NFS). I had a lot of luck just using NFS from my TrueNAS storage server in my k8s cluster.

Rook

Uses Ceph under the hood
Used it very briefly and it seemed fine.
I have heard good things, but am skeptical given my previous experience with Ceph

Longhorn

Project by the folks at Rancher/SUSE
Replicates the volume
Worked well enough when I was running k8s with some light workloads in it
Only seems to provide block storage, which I am much less interested in.

OpenEBS

Never used it myself
Only seems to provide block storage, which I am much less interested in.

caboosen00b@sh.itjust.works · 1 year ago

Wow this is an amazing explanation! I’ve been starting to lean towards UnRAID but I didn’t realize their fault tolerance wasn’t as vetted as my cohorts believe. I’ve been taking a painful option to keep an lvm on ubuntu for my drives, but I’m trying to get out of my mistake and build something for offsite so I can feel comfortable with redoing it all.

terribleplan@lemmy.nrd.li · 1 year ago

I mean, lots of people use UnRAID without problems. Lots of people use single disks without problems (until it all fails catastrophically). Some people use ZFS/MDADM/etc and do have problems. My biggest point is that UnRAID uses a custom and proprietary solution, as such it has had less eyes on it and a shorter history than some other options like ZFS or MDADM. BTRFS is also pretty young (and has had some serious issues with reliability in recent memory), but is open source and has gotten pretty reliable because of that. Which is better? Who knows.

There are some really nice things about UnRAID. It is one of the few solutions that handles mixed disk sizes well. Further, like some other solutions I mention, everything but the parity drive(s) are just regular filesystems, so even with a “total loss” you only lose the data from the disks that fail. If the X parity drives + 1 data drive fail you only lose the stuff on that 1 failed data drive. If you lose X + 1 data drives, you lose all of the data from those X + 1 data drives, and your parity drives can’t do anything to help you.

I am more than happy to share my thoughts and experiences. I can’t guarantee that my experiences and impressions are up to date as it’s been years since I’ve used UnRAID at this point, and I haven’t followed it super closely. Do your own research, hopefully these posts serve as a decent starting point and list of topics to get you wondering about. If you find a silver bullet that doesn’t involve paying come company millions of dollars or me hiring an engineer to run it let me know.

Good luck!

John Richard@lemmy.world · 1 year ago

Thanks so much for the very detailed reply. I think at this point I’m conflicted between using TrueNAS or going all in and trying SDS. I’m leaning towards SDS primarily because I want to build experience, but heck maybe I’ll end up doing both and testing it out first and see what clicks.

I’ve setup Gluster before for OpenStack and had a pretty good experience, but at the time the performance was substantially worse than Ceph (however it may have gotten much better). Ceph was really challenging and required a lot of planning when I used it last in a previous role, but it seems like Rook might solve most of that. I don’t really care about rebuild times… I’m fine if it takes a day or two to recover the data as long as I don’t lose any.

As long as I make sure to have an offsite backup/replica somewhere then I guess I can’t go too wrong. Thanks for explaining the various configurations of Gluster. That will be extremely helpful if I decide to go that route, and if performance can be tuned to match Ceph then I probably will.