Learnings from self-hosting
The first “side” computer I owned was a Raspberry Pi Zero, back in 2022. I don’t remember exactly why I bought it, but it was probably because I felt like experimenting with a remote motor-controlled door opener – for when I am lazy?
Time went by, and when I started living alone, birds began playing in front of my window on a mango tree. I wanted to share this with my friends and family, but I couldn’t video call everyone every time a bird came around. ngrok came to my rescue, allowing me to expose a small 2MP Raspberry Pi camera stream.
Humans are greedy creatures – give them ‘x,’ and they’ll ask for ‘2x.’ Give me a Raspberry Pi Zero, and I’d ask for a Raspberry Pi Zero. (Get it? ;))
Jokes aside, I wanted to build something faster that could handle other workloads besides being a proxy for a video stream. The Zero throttled heavily at noon due to the Indian summer heat.
Also, the idea of a computer science major with three years of professional experience who had never assembled a computer didn’t sit right with me.
Without spending a lot of money, I assembled this box for under 130 USD. The secondhand market in India isn’t very reliable, otherwise I would have aimed for at least an 8-core, 3 GHz processor. Note that this build doesn’t have storage yet. I tried to salvage whatever I could from the laptop I used in college. The HDD was the only usable component; the RAM was soldered in, and I might find a use for the display someday.
When the full build was assembled and I powered on the system, much to my surprise, the familiar old Ubuntu login screen from my university days appeared. It had been about three years since I last logged in. The wallpaper brought back memories of hackathons, unemployment, and poverty.
The software I host
The following is a list of the services I installed and why (a sample Docker deployment sketch follows the list):
- Miniflux : As mentioned in this blog, I have been using Miniflux locally on weekends to follow some technical blogs and check up on pending podcasts.
- Calibre : I onboarded this to keep track of my ePubs and read them online. Nowadays, I try to read on a Kindle or paperback to save some screen time.
- Actual : For personal finance, a lot of people recommend Firefly-III. But it’s written in Laravel (PHP), which I’m barely familiar with, so I use Actual instead (written in JavaScript); it has a sleek, modern UI and supports offline editing with syncing.
- Trilium : A note-taking service that I use the most.
- There are many options available if you’re looking for a self-hosted open-source note-taking app like Obsidian, Trilium, Joplin, TiddlyWiki, Logseq, or Roam.
- But if you are looking for a tool that meets the same conditions I considered, i.e.,
- open-source (unlike Obsidian)
- service (not an app, so not Obsidian or TiddlyWiki)
- supports knowledge graphs (unlike Joplin or TiddlyWiki)
- After pruning apps in each step, we’re only left with Logseq and Trilium. Among these two, Trilium releases software more frequently and is easier to host/migrate than Logseq.
- Changedetection.io : This is a service I never thought I would need or use. You give it HTTP endpoints, and it’ll give you a diff after polling at certain intervals. I use Change Detection to track job postings on government websites.
- Gitea : I have written in detail about why I use Gitea and how it taught me something about how a git repository server works here. More recently, I realized that Gitea supports registries, so I’ve been using it as my local Docker registry.
- Nextcloud : By far, my biggest concern that led me to self-host is personal photographs being sent to Google/Apple.
- If you’re to self-host personal photos and videos, you would have the following requirements:
- Open source, stable releases
- App to continuously monitor a phone’s /Photos and sync with the self-hosted server.
- In this space, you have two or three very good options and some average ones, such as Nextcloud, Owncloud, and Immich.
- You are always compromising while making a choice.
- Immich is new and less stable but has a slick UI and could be a major drop-in replacement for Google Photos in terms of UX.
- Nextcloud is stable, decently fast but is bloated with many features like a calendar server, document server; basically, it’s not just a backup tool but a full-fledged replacement for Google Drive.
- I have deployed both. Nextcloud is used as the primary backup, and Immich is used as an experimental deployment (but boy is it slick!). With both of these, the app support (especially on iOS) is solid, and I have gone through four major Nextcloud releases. I haven’t gone through any Immich migrations, so I can’t vouch for it yet. From what I have read on Reddit, the migrations are sometimes very drastic but are well-documented by the author.
- Many other people also set up streaming servers for music and video, like Plex, along with media managers like Sonarr and Radarr.
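Most of these run as Docker containers on the home server (more on the volumes in the backup section). As a rough sketch of what a deployment looks like, here is how Miniflux with its Postgres database might be brought up; the image names follow the official docs, while the passwords, ports, and volume names are placeholders:

```bash
# Minimal sketch of running one of the listed services (Miniflux) under Docker.
# Passwords, ports, and volume names below are placeholders.
docker network create miniflux-net
docker volume create miniflux-db

# Postgres backing database, on a named volume so the data survives container upgrades
docker run -d --name miniflux-db --network miniflux-net \
  -e POSTGRES_USER=miniflux -e POSTGRES_PASSWORD=change-me \
  -v miniflux-db:/var/lib/postgresql/data \
  postgres:16

# Miniflux itself, reachable on port 8080 of the home server
docker run -d --name miniflux --network miniflux-net \
  -e DATABASE_URL="postgres://miniflux:change-me@miniflux-db/miniflux?sslmode=disable" \
  -e RUN_MIGRATIONS=1 \
  -e CREATE_ADMIN=1 -e ADMIN_USERNAME=admin -e ADMIN_PASSWORD=change-me-too \
  -p 8080:8080 \
  miniflux/miniflux:latest
```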
Security
When exposing your device (and thus your network) to ingress from the internet, you should tread with caution. With exposed TCP ports, someone can exploit vulnerabilities in your software and do bad things to your data (which you so preciously want to keep out of the reach of the big companies). They can also launch DDoS attacks and interrupt your usage.
Networking and security could be the most crucial and time-consuming steps in your self-hosting journey. Startups use the cloud to avoid this cost, and rightly so. For personal use cases, I don’t have ‘services’ to expose to the public internet, just my static blog website which I host on GitHub Pages. Other than that, all of my services initially stayed in my network…
…until I had to leave home and use it from other networks! I could use something like ngrok or Cloudflare tunnels. However, anyone could still access the HTTP ports. That’s where VPNs come in. I learned about WireGuard and the tools built on top of it (wg-easy, Tailscale, Netbird, Zerotier).
Let me plug OCI (Oracle Cloud Infrastructure), the only ‘Free Forever’ bit of cloud you can use. The free tier is very generous here (1-core CPU, 1 GB RAM + 200 GiB of storage: enough to host a VPN).
With the cloud VM, one can:
- Reverse port-forward with SSH and expose local ports through an HTTP proxy on the VM (see the sketch after this list).
- Set up WG remotely and expose local services such that anyone on the VPN can access them.
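As a sketch of the first option: with a hypothetical VM at vm.example.com, SSH remote forwarding can publish a local service (say, something listening on port 8080 at home) on a port of the VM:

```bash
# On the home server: make the VM's port 9090 tunnel back to local port 8080.
# vm.example.com, the user, and the port numbers are placeholders.
ssh -N -R 9090:localhost:8080 ubuntu@vm.example.com

# By default sshd binds remote-forwarded ports to the VM's localhost only.
# To reach them from the internet, either set "GatewayPorts yes" in the VM's
# /etc/ssh/sshd_config, or run an HTTP proxy (e.g., nginx) on the VM that
# listens publicly and forwards to 127.0.0.1:9090.
```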
I tried with vanilla WG, but one can explore wg-easy and other abstractions that make configuration easier. The first time I tried setting it up, it cost me one full day, and it was probably not worth it. Maintaining WireGuard is hard, so I’d recommend using something like Tailscale to set up your VPN. If you don’t like someone else managing your network, there’s also Headscale, an open-source implementation of the Tailscale control server.
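For reference, this is roughly what the vanilla WG setup on the cloud VM looks like; the subnet, port, and keys are placeholders, and tools like wg-easy or Tailscale automate exactly this bookkeeping:

```bash
# On the cloud VM (as root): generate a key pair for the WireGuard "hub"
wg genkey | tee server.key | wg pubkey > server.pub

# Minimal /etc/wireguard/wg0.conf; 10.8.0.0/24 is an arbitrary VPN subnet
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
Address    = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <contents of server.key>

# One [Peer] block per device (home server, laptop, phone, ...)
[Peer]
PublicKey  = <home server's public key>
AllowedIPs = 10.8.0.2/32
EOF

# Each client config points its Endpoint at <VM public IP>:51820 and lists the
# server as its peer; remember to open UDP 51820 in the cloud firewall rules.
wg-quick up wg0
systemctl enable wg-quick@wg0   # keep the tunnel up across reboots
```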
Since the VPN is a swappable piece of the setup, even if Tailscale starts acting fishy (slacking on security, charging hefty prices for its offerings), you can always pull the plug on it and fall back to WG on a remote VM.
With the current setup, if one wants to access my service, they have to:
- Create a Tailscale account / log in.
- Request to join my network / wait for my invite.
- Connect to my network from a client device (which also requires my permission as the network admin).
Tip: While using Tailscale, please disable SSH access via public IPs.
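One way to implement that tip, assuming ufw as the firewall and the default tailscale0 interface name, is to allow SSH only over the tailnet and deny it on the public interface (keep an existing session open while testing so you don’t lock yourself out):

```bash
# Allow SSH only via the Tailscale interface, deny it everywhere else
sudo ufw allow in on tailscale0 to any port 22 proto tcp
sudo ufw deny 22/tcp
sudo ufw enable
sudo ufw status verbose   # verify the rules before closing your current session
```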
People who expose their services to the internet without a VPN also practice some of the following things:
- Route traffic through Cloudflare
- Even if your ISP gives you a public IP, use reverse port forwarding
- Put a layer of authentication, such as Authentik, in front of services
- Run intrusion-prevention tools like Fail2Ban (see the sketch after this list)
- Place honeypots in various parts of the system
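As an example of the Fail2Ban point above, a minimal jail for sshd could look like this; the thresholds are arbitrary and worth tuning:

```bash
sudo apt install fail2ban

# Minimal /etc/fail2ban/jail.local: ban an IP for an hour after 5 failed SSH logins
sudo tee /etc/fail2ban/jail.local > /dev/null <<'EOF'
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h
EOF

sudo systemctl restart fail2ban
sudo fail2ban-client status sshd   # inspect the jail and any currently banned IPs
```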
Security is an ongoing journey. Maybe that’s why the big companies charge so much to keep your data safe.
Backup
There’s a reason companies like Google/Apple/Backblaze charge 5-10+ USD a month for cloud storage: it’s a pain to back up data from the modern collection of devices.
Phone makers charge 100+ USD for 64 GB of additional storage, while for the same price you can get 10x the capacity in consumer SSDs, and even more per dollar with enterprise SSDs/HDDs on the used market.
If you learn how to back up data properly, over a couple of decades you’ll save a lot of money just by never having to buy additional mobile storage :)
So why are backups hard/expensive?
Well, because SSDs wear out, and thus you have to create redundancy. The big companies might have the luxury of keeping 60-70 snapshots of your S3 buckets very cheaply. But for an individual, the setup cost alone is approximately 7,000 USD (70 x 100 USD per 1 TB SSD), so you have to strike a balance between cost and redundancy.
A practical backup strategy for the common man is as follows:
- The 3-2-1 backup rule is easy to implement and is what I use. It says you should always have 3 copies of your data, on 2 different storage media, with 1 copy offsite (in case your house burns down).
- Additionally, every few weeks, make sure the copies are intact and not corrupted. If one is, replace that storage medium.
- Now let’s briefly talk about a case study:
- You take pictures from your phone. The local storage on the phone is the first copy.
- Your home server has some spare storage and you copy/paste this image data to the home server every weekend.
- You also have some Dropbox/Backblaze storage and back up this home-server computer there.
- At the end of the week, one copy still remains on the phone, another on the computer, another on Dropbox.
- When your phone runs out of storage, you’ll have to get another SSD and move your phone’s data there.
How the 3-2-1 backup policy helps: even if one storage medium fails due to unforeseen circumstances:
- Phone gets stolen/out of order
- Home server storage crashes and dies
- Dropbox goes bankrupt and you can’t retrieve your data.
You can still react, since there’d be 2 other places where you can find your data. In case all three happen at the same time? May the god(s) guide you in your quest :)
Software
Now that we’ve decided on how to plan our backups, let’s look at some tools we can use to perform these backups on a Linux system.
I’ll explain the tools that I use for your reference.
- I use Nextcloud to sync the important bits of my phone, like pictures and documents, to the home server. Since Nextcloud runs in Docker, it’s safe to back up the Docker volumes themselves. There are separate volumes for blob data and the Nextcloud database.
- Now, there are other services on my home server that generate data, like Trilium, Immich, Miniflux, etc. Fortunately, I run them as Docker containers with Docker volumes, so it’s safe for me to back up all the Docker volumes together. But there are some other data directories on my computer, so I keep an aggr.sh script that copies all the directories to a common place (a sketch follows this list).
- Once you have everything in the same directory, the next step is to use Restic to move this data to a cloud/offsite backup. There are alternatives to Restic, of course; you can pick your tool. I have some 200GB in OCI as part of the free tier, which is attached to a VM with configured SSH.
- Internally, Restic transfers data over SSH (SFTP) in my configuration. One has to configure IdentityFile for the Host in .ssh/config; there’s no other way to tell Restic to use a specific identity file (i.e., you can’t pass it in the backup command).
- I’m yet to configure retention for Restic. However, Restic uses incremental storage, so I’m not sure how much space retention would actually save.
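For illustration, here’s roughly what that pipeline looks like end to end. aggr.sh is just my name for the script; the staging path, the Host alias, and the repository location on the OCI VM are placeholders:

```bash
#!/usr/bin/env bash
# aggr.sh (sketch): gather everything worth backing up into one staging directory,
# then push it to the Restic repository on the OCI VM over SFTP.
set -euo pipefail

STAGING=/srv/backup-staging
mkdir -p "$STAGING"

# Docker named volumes live here by default; stop write-heavy containers first if needed
rsync -a /var/lib/docker/volumes/ "$STAGING/docker-volumes/"
rsync -a /home/me/other-data-dirs/ "$STAGING/misc/"

# Restic's sftp backend has no flag for picking an SSH key, so ~/.ssh/config
# needs a Host entry along these lines:
#   Host oci-backup
#     HostName <VM public IP>
#     User ubuntu
#     IdentityFile ~/.ssh/id_ed25519

export RESTIC_PASSWORD_FILE=/root/.restic-password
# One-time setup: restic -r sftp:oci-backup:/backup/restic-repo init
restic -r sftp:oci-backup:/backup/restic-repo backup "$STAGING"

# If/when retention gets configured, it would look roughly like:
# restic -r sftp:oci-backup:/backup/restic-repo forget --keep-daily 7 --keep-weekly 4 --prune
```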
Maintenance
Just deploying this software isn’t enough. Periodically, you have to update it with bug fixes and security patches, especially if you’re exposing your network to the public internet.
I had an instance where I was trying to update Nextcloud, and it needed an upgrade from Postgres 14 to 16. I was using pg_upgrade, but in the process I messed up the existing data directory. Because my backups were in OCI, I quickly fetched the latest Restic snapshot and rectified the mess.
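The recovery itself was just Restic’s standard restore flow, roughly the following (with the same placeholder repository path as above):

```bash
# List the available snapshots in the remote repository
restic -r sftp:oci-backup:/backup/restic-repo snapshots

# Restore the most recent snapshot into a scratch directory, then copy the
# Postgres volume back into place before starting the container again
restic -r sftp:oci-backup:/backup/restic-repo restore latest --target /srv/restore
```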
Some software with a lot of dependencies, like Trilium, gets version updates frequently. I think one has to spend about 4 hours every month to stay up to date with the latest versions of their deployed stack.
I also spend around 2 hours every week taking backups of the system.
Due to the additional chores that come with every piece of self-hosted software, one must be very selective and exercise restraint when deciding to host a shiny new service. If you haven’t used a service for a while, consider removing it from the system altogether. The more software you deploy on your home server, the more maintenance work you’ll have; my advice is to keep your home server lean.
Next steps
Some themes I want to explore in the next few months include:
- I am moving to Germany. I wonder how I can continue to access my system as I’ll physically be away. Some options I have are:
- Taking the whole server (not practical). The cost of shipping ~= cost of hardware assembly.
- Taking only the HDD and building a new, cheap, amd64 system.
- Getting a new HDD and running it with a Raspberry Pi Zero there, pulling the existing backups from OCI and rewriting the scripts.
- Even if I get my hardware, it’ll be hard to replicate the tooling and dependencies. That’s what I want to fix. It shouldn’t be hard to get a new system working.
- One can get deterministic environments with NixOS or Ansible by setting things up declaratively. However, this method would take a lot of time.
- Alternatively, one can run everything in Proxmox VMs and back up the whole VM, then install Proxmox on the new hardware I set up there and restore the VM from a backup.
- One of the major concerns with my home network is TP-Link camera data being proxied to Chinese servers. The reviews of the camera and service are great, but I don’t trust them enough. I am exploring a project on GitHub called go2rtc: so far it’s able to generate an HTML dashboard to view the camera feeds (a minimal config sketch follows this list). Ideally, I would also like to schedule recordings, for which I plan to explore another project called Frigate, a more sophisticated home-automation platform that allows face recognition and scheduled recordings.
- In the mid-future, I’d also like to improve/upgrade the server hardware itself.
- The current specs aren’t great:
- i3-3rd gen processor
- 8GB DDR4 memory
- a 1 TB HDD (also the /boot drive), which is ~7 years old.
- I have to wait ~15 minutes until all the restart: always Docker containers are up and running. If any two major tasks run at the same time, e.g., an Immich file upload and a Changedetection poll, CPU usage reaches 100% and things start to throttle.
- My next build is planned to have the following specifications:
- At least an i5-12th gen CPU
- 16 GB memory
- a 256 GB boot drive
- ~2-3 TB of storage with ZFS.
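For the curious, the go2rtc experiment so far amounts to pointing it at the camera’s RTSP stream. This is a rough sketch of my understanding of the setup; the camera address, credentials, stream name, and paths are placeholders, so double-check them against the project’s README:

```bash
# Hypothetical camera IP/credentials; adjust the RTSP URL for your own camera.
mkdir -p /srv/go2rtc
cat > /srv/go2rtc/go2rtc.yaml <<'EOF'
streams:
  mango_tree_cam: rtsp://user:password@192.168.1.50:554/stream1
EOF

# The web dashboard listens on port 1984 by default
docker run -d --name go2rtc --network host \
  -v /srv/go2rtc:/config \
  alexxit/go2rtc
```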
The r/selfhosted and r/homelab communities on Reddit are great places to hang out in. The people are more inclusive than in other communities, and they’re also more active, since the hobby is relatively cheap (it’s free to self-host things) compared to others like electronics, keyboards, pens, watches, etc.
I encourage you to self-host some of the services you use on a daily basis. It’s fun to know that your data is really “personal,” and no tech giant is thinking of 100 different ways to make money off of it.
Please also talk to me if you’re a small or mid-sized company looking to adopt self-hosted alternatives to the software you subscribe to. Some of the clear benefits are:
- Self-hosted software is cheap compared to managed/paid options.
- Most self-hosted software is also open-source. That means you can fork it and modify it as per your needs.
- You own your data. Host it inside a VPN, and it’s very hard for an intruder to even have a network path to your data, let alone steal it.
Thanks for reading this post all the way through. Please write to me if you set something up after being inspired by it. I’d also love to help you out if you get stuck.
May your home labs run forever. Good luck!