July 12, 2020

Shared filesystems across Kubernetes namespaces with Rook

⚠️ A word of warning

Let’s get something out of the way first: if you’re thinking of doing this in the enterprise, you have big problems that are not technical. Doing this is a clear anti-pattern for both scalability and isolation.

Why on earth

Like a lot of people online, I love Linux. And as such, I use BitTorrent to share Linux images. As your needs as a Linux image consumer grow, and as you share the joy of Linux with your friends and family, you might end up deploying multiple chained applications in different Kubernetes namespaces that pull those sweet sweet ISOs. You add metadata, normalize names for easy tracking, and finally you share those Linux images with Plex. >_>

In the enterprise this would be an excellent use-case for something like Apache Airflow and object storage. New data flowing across gates at a regular interval, manifests and ledgers kept updated at every step.

In the household, there’s no way I’m dealing with Airflow, retention rules, cleanup jobs across multiple filesystems, or the load this would place on my network both in data transit and through Ceph replication.

At home, we’ll be lazy and use CephFS’s natural abilities as the nice cluster filesystem it is. If you intend to use applications like Sonarr or Radarr, those are optimized to lifecycle your media on a single filesystem and use the magic of hard links.

But there’s a catch. While CephFS volumes are very well supported through the Container Storage Interface (CSI), CSI provisions volumes that can be shared within a namespace, but not across namespaces. A given pod might be using the same storage pool and metadata servers of a particular CephFS shared filesystem as a pod from another namespace, but it cannot escape its own subvolume. The two pods’ subvolumes are rooted in parallel under the same parent (a subvolume group in CephFS speak). In other words, one pod cannot see another’s data since they are jailed to separate subvolumes.
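You can see this jailing for yourself on any dynamically provisioned claim. The commands below are only an illustration (my-claim and my-namespace are placeholders for one of your own PVCs); the bound PV’s spec.csi.volumeAttributes will point at the per-claim subvolume it is confined to.

$ PV="$(kubectl -n my-namespace get pvc my-claim -o jsonpath='{.spec.volumeName}')"
$ kubectl get pv "${PV}" -o yaml | grep -A 10 'volumeAttributes:'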

How to “properly” achieve an actual shared filesystem across namespaces is not quite well documented… because it violates pretty much every (limited) principle of isolation Kubernetes is supposed to provide through CSI. It’s a big fat hack… and we’ll abuse it!

Figuring out the kludge

If you dig into how to use the CSI driver for static provisioning, and add Ceph into the mix, you’ll eventually stumble on this PR, and this gist. They describe the mechanics to achieve statically provisioned Persistent Volumes (PVs) and Persistent Volume Claims (PVCs).

This is only half the battle: the gist shows you how to do it, but doesn’t tell you what it does.

Here are the takeaways from my trying to fit this round peg of a recipe into the square hole that was my mind.

  • kubectl describe is your best friend when setting this up. Whenever you are playing with unfamiliar Kubernetes mechanisms, most objects will report how their interdependencies are resolving. Calling describe on pods that hang waiting for volumes is a lifesaver (see the example after this list).

  • One friend is not enough though; tailing the logs of the CSI plugins will tell you the other half of the story. I personally like using stern for that.

    $ stern -t csi-cephfsplugin -n rook-ceph
    [ ... error messages will start illuminating both you and your screen ]
    

    There you will find the reason why volumes might be hanging. Failing that, it will at least give you strings to search for in the Rook and Ceph issue trackers.

  • These PVs and PVCs have a one-to-one relationship. The notion of ReadWriteMany does not apply to them as API objects. You will need one PV for one PVC for one namespace. That means you’ll need a pair of PV and PVC in every namespace where you want to share the storage.

  • All these shenanigans assume you will be manually managing CephFS subvolume groups and subvolumes. This is rather straightforward, but if you have never used the ceph tooling, brace yourself for a bit of reading.
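As promised in the first bullet, here is roughly what the describe workflow looks like. The pod name is a placeholder; the claim name matches the manifests further down.

$ kubectl -n somenamespace describe pod some-hanging-pod
[ ... check the Events section for mount errors coming back from the CSI driver ]
$ kubectl -n somenamespace describe pvc somenamespace-somepurpose-cephfs
[ ... the Events here tell you whether the claim ever bound to its PV ]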

Getting it done

You’ll start by pre-provisioning the filesystem internals as described in the direct-mount tool article.
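For reference, the mount step from inside the direct-mount pod looks roughly like the sketch below. Treat it as a sketch only: the location of ceph.conf and the keyring inside the pod, and the filesystem name passed to mds_namespace, all depend on your cluster and Rook version.

# A sketch, not gospel: adjust paths and the filesystem name for your cluster.
mon_endpoints="$(grep mon_host /etc/ceph/ceph.conf | awk '{print $3}')"
admin_secret="$(grep key /etc/ceph/keyring | awk '{print $3}')"

# Mount the whole CephFS filesystem at /mnt.
mount -t ceph -o "mds_namespace=cephfs,name=admin,secret=${admin_secret}" "${mon_endpoints}:/" /mnt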

Create a subvolume group to hold your totally not sketchy pseudo-CSI volumes.

ceph fs subvolumegroup create myfs testGroup

Create a sized subvolume. The size must be provided in bytes, so let’s use GNU coreutils’ numfmt for the conversion.

# 64 GiB in IEC notation; numfmt converts it to a byte count.
_size_in_GiB=64G
_size_in_bytes="$(numfmt --from=iec <<< "${_size_in_GiB}")"

ceph fs subvolume create myfs testSubVolume testGroup --size="${_size_in_bytes}"
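If you want to double-check where that subvolume landed, Ceph can tell you its path. The command below uses the same placeholder names as above, passing the group through the documented --group_name flag.

ceph fs subvolume getpath myfs testSubVolume --group_name testGroup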

Now let’s find our subvolume group. (Unlike the Rook instructions, I mounted my CephFS filesystem to /mnt as per FHS v3; you do you.)

$ ceph fs subvolumegroup ls cephfs
[
    {
        "name": "mediaStorage" # <<<--- I created this guy
    }, 
    {
        "name": "_deleting"
    }, 
    {
        "name": "csi"
    }
]
$ cd /mnt/volumes/mediaStorage
$ df -h .
Filesystem                                                Size  Used Avail Use% Mounted on
10.43.112.140:6789,10.43.164.39:6789,10.43.79.128:6789:/  1.7T  870G  795G  53% /mnt

Be aware that from this point on, we’ve said goodbye to all the niceties provided by CSI. So we’ll create a “root” for the different pods to share data from, as well as folders for our separate namespaces.

$ mkdir folder-01 # Always use creative and scalable naming
$ cd folder-01
$ mkdir {plex,radarr,sonarr,transmission}

And let’s make sure to avoid permission problems, since filesystem permissions are enforced on the mounted volumes.

$ chown -Rv 1000:1000 ./* # Always use location-specific globbing, kids.
changed ownership of 'plex/' from root:root to 1000:1000
changed ownership of 'radarr/' from root:root to 1000:1000
changed ownership of 'sonarr/' from root:root to 1000:1000
changed ownership of 'transmission/' from root:root to 1000:1000

Setting up the “secret” sauce

As we set up our PVs, we’ll be explicitly telling Kubernetes which CSI driver to use, and how to use it. To make sure the CSI plugin can connect to Ceph, we need to provide it with a secret that the Rook manifests don’t already provide in the right format. Failure to do so will have you practicing the debugging skills from the Figuring out the kludge section.

In your Rook namespace, provision the following:

---
apiVersion: v1
kind: Secret
metadata:
  name: rook-csi-cephfs-static-provisioner
type: Opaque
data:
  userID: ""
  userKey: ""

This secret is empty on purpose. Once your cluster is provisioned, you just need to copy the values over from the Rook-managed secret rook-csi-cephfs-provisioner into userID and userKey.
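Here is a sketch of that copy, assuming the Rook-managed secret stores its credentials under the adminID and adminKey keys (key names have moved around between Rook versions, so verify yours first with kubectl -n rook-ceph get secret rook-csi-cephfs-provisioner -o yaml).

# Assumption: the source secret uses the adminID / adminKey keys.
ADMIN_ID="$(kubectl -n rook-ceph get secret rook-csi-cephfs-provisioner -o jsonpath='{.data.adminID}')"
ADMIN_KEY="$(kubectl -n rook-ceph get secret rook-csi-cephfs-provisioner -o jsonpath='{.data.adminKey}')"

# The values are already base64 encoded, so they can be patched straight into .data.
kubectl -n rook-ceph patch secret rook-csi-cephfs-static-provisioner \
  -p "{\"data\":{\"userID\":\"${ADMIN_ID}\",\"userKey\":\"${ADMIN_KEY}\"}}"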

Meet the potatoes

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: somenamespace-somepurpose-cephfs-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 128Gi
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static-provisioner
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: "cephfs"
      staticVolume: "true"
      rootPath: /volumes/mediaStorage/folder-01/somenamespace
    volumeHandle: somenamespace-somepurpose-cephfs-pv
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  claimRef:
    name: somenamespace-somepurpose-cephfs
    namespace: somenamespace

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: somenamespace-somepurpose-cephfs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 128Gi
  storageClassName: ""
  volumeMode: Filesystem
  volumeName: somenamespace-somepurpose-cephfs-pv

This is essentially a cleaned-up version of this gist. The key takeaway from these manifests is simply that you’ll need to provision a pair of them for every namespace that should see the shared filesystem.

Do note that your PV’s spec.claimRef must specify the target namespace. Failure to set this will result in a lot of digging, very few error messages, and a lot of swearing.
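For completeness, here is a minimal sketch of a pod consuming that claim from inside somenamespace; the image and mount path are placeholders.

---
apiVersion: v1
kind: Pod
metadata:
  name: shared-fs-demo
  namespace: somenamespace
spec:
  containers:
    - name: demo
      image: busybox  # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared-media
          mountPath: /data
  volumes:
    - name: shared-media
      persistentVolumeClaim:
        claimName: somenamespace-somepurpose-cephfs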

A few more gotchas

  • The storage request set on shared filesystem PVs and PVCs is essentially ignored, and is not enforced by quotas. Managing the use of any shared filesystem is non-trivial and should be done through other means such as monitoring and alerting on the Prometheus metrics emitted.
  • PVCs of this nature cannot be mounted more than once to a given pod. This means that you can’t mount both the root and a subpath from the same PVC to a given pod.

© Alexis Vanier 2020