On changing how I store source code
Context
It’s no news that LLMs now perform some tasks at near-human levels. We cannot rule out the possibility that future models will generate code with far more context than a human can hold, and so perform these tasks a lot better than us. The few ridiculous answers we get from today’s models will only become rarer.
The only way your intellectual property stays yours, and not the models’, is to keep it from becoming training data for them.
Now consider “source code generation” as a job that some of us do. What are the largest sources of this data? Yes, it’s the source code we humans write, plus the docs and source of upstream dependencies.
As long as our code is open source, it’s fair game for LLMs to train on it. But what stops GitHub (cough! Microsoft, cough! OpenAI) from feeding anonymized private repositories to these models as training data in the future?
Self-hosting
I have been self-hosting things for a little under a year now: mostly my photos and documents, plus a few services such as note-taking and changedetection.io.
To really own your code, you have to self-host it and not rely on the moral compass of organizations like GitHub, BitBucket, etc.
To self-host a replacement for any software or web service X that you depend on too much:
- search the web for “selfhosted alternative to X” and pick one (Y)
- pray that Y has friendly docs and a helpful user community.
r/selfhosted can be a nice place to start if you want to self-host some things.
I started with a Raspberry Pi Zero worth $20, but the sky’s the limit.
Self-hosting source code storage
Coming back to self-hosting your source code, there are a few good alternatives.
- GitLab - Popular FOSS project, written in Ruby on Rails, with some enterprise options. Over the years, GitLab has grown into a full-fledged dev/deploy platform rather than just a place to store code.
- Gitea - FOSS, no ‘enterprise’ offering, entirely maintained by the community. Written in Go.
- sourcehut - Open source, written mostly in Python with some Go. Commercialised, but probably isn’t revenue-driven.
There are others but these seem like the most popular options to me.
Now if you work by yourself on most of your projects, the only things you need from your git hosting service are
- storage for your source code
- CI/CD
- backups of that storage
Gitea seemed like the best option for me: I’m comfortable with Go, so I can tweak the code to my needs, and it covers the requirements above while staying nimble.
It literally took running a single binary to host my first codebase on Gitea. The other options didn’t seem this friendly to deploy.
Self-hosting Gitea
One can deploy Gitea as a binary, with Docker, as a systemd service, etc. The full list is in the “Installation” section of the Gitea docs.
Here’s a sanitized version of the docker-compose.yml file I run for Gitea. If you don’t specify DB credentials, Gitea uses SQLite3.
version: "3"

networks:
  gitea:
    external: false

services:
  server:
    image: gitea/gitea:1.19.0
    container_name: gitea
    environment:
      - USER_UID=1020
      - USER_GID=1020
    restart: always
    networks:
      - gitea
    volumes:
      - ./gitea:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
      - /home/git/.ssh:/data/git/.ssh
    ports:
      - "6565:3000"
      - "127.0.0.1:222:22"
For backups, I can always back up the ./gitea directory manually. I use restic to take backups, but that deserves a write-up of its own.
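The manual route can be sketched with plain tar. This is only a stand-in and not my restic setup: the function name `backup_gitea_dir` and the archive naming are made up for illustration, and it assumes the data directory is the ./gitea volume from the compose file above.

```shell
#!/bin/sh
# Sketch: archive a Gitea data directory into a timestamped tarball.
# backup_gitea_dir is a hypothetical helper, not part of Gitea or restic.

backup_gitea_dir() {
    src_dir=$1      # e.g. ./gitea (the volume mounted at /data)
    dest_dir=$2     # where the archives should go

    if [ ! -d "$src_dir" ]; then
        echo "no such directory: $src_dir" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/gitea-$(date +%Y%m%d-%H%M%S).tar.gz"

    # -C keeps the paths inside the archive relative to the parent of src_dir.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Print the archive path so the caller can copy it somewhere else.
    echo "$archive"
}
```

With SQLite in the mix, it is probably safer to stop the container first (`docker compose stop server`), run something like `backup_gitea_dir ./gitea /mnt/backups` (the destination is a placeholder), and start it again, so the database file is not copied mid-write.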
A tricky thing about git servers is configuring SSH, since it’s safer and easier to interact with git servers over SSH. The problem is discussed in full here, but I’ll attach a TL;DR in case you want one.
The basic idea is for Gitea to take control of the session from the host’s SSH server whenever a user performs an SSH-based operation. What Gitea does is the following:
- When a user adds an SSH key in the Gitea web UI, Gitea appends it to .ssh/authorized_keys on the host, prefixed by command=[some_command].
- When the user runs git clone git@ip:user/repo, the SSH server checks /home/git/.ssh/authorized_keys to match the public key and runs the prefixed command some_command.
- From there, Gitea handles the subsequent authentication and authorization.
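To make the hand-off concrete, here is a toy forced-command handler. It is not Gitea’s code; it only shows what a program named in command= gets to work with. sshd puts whatever the client asked to run into the SSH_ORIGINAL_COMMAND environment variable, and for a clone over SSH git asks for something like git-upload-pack 'user/repo'. The function name `handle_git_ssh` and the key-id argument are assumptions made for the sketch.

```shell
#!/bin/sh
# Toy forced-command handler: NOT Gitea's implementation, just the shape of it.
# handle_git_ssh and the key-id argument are hypothetical names.

handle_git_ssh() {
    key_id=$1                           # baked into the authorized_keys entry
    original=${SSH_ORIGINAL_COMMAND:-}  # what the client actually asked to run

    if [ -z "$original" ]; then
        echo "interactive shells are not allowed" >&2
        return 1
    fi

    verb=${original%% *}                # first word, e.g. git-upload-pack
    case "$verb" in
        git-upload-pack|git-upload-archive) access=read ;;
        git-receive-pack) access=write ;;
        *)
            echo "unsupported command: $verb" >&2
            return 1
            ;;
    esac

    # Strip the verb and the single quotes git puts around the path.
    repo=${original#* }
    repo=${repo#\'}
    repo=${repo%\'}

    # A real server would now look up the key's owner and check permissions.
    echo "key=$key_id access=$access repo=$repo"
}
```

The point of the key-id argument is that the SSH server has already matched the public key by the time the handler runs, so the handler only needs to know which key it was and what was requested.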
One other chore to take care of when hosting with docker containers is as follows.
- Mount the host’s .ssh dir into the Docker container so that the host can SSH into the container.
- Add the host’s public key to /home/git/.ssh/authorized_keys so that a ‘shim’ can be set up between the host and the Docker container.
- The shim is a bash executable with a command like the one below. The port (2222 here) has to match the host port that docker-compose.yml maps to the container’s port 22.
ssh -p 2222 git@127.0.0.1 "SSH_ORIGINAL_COMMAND=\"$SSH_ORIGINAL_COMMAND\" $0 $@"
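Here is the shim command wrapped in a function so it can be dry-run without a container. This is a sketch under assumptions: the port, user and address mirror the command above, while `forward_to_container`, `SSH_BIN` and `SHIM_PATH` are names I made up for illustration. It also assumes the shim sits at the same path on the host as the gitea binary inside the container, so that $0 points at something runnable on the other side.

```shell
#!/bin/sh
# Sketch of the host-side shim. SSH_BIN and SHIM_PATH are hypothetical knobs
# added so the forwarding can be inspected with SSH_BIN=echo.

forward_to_container() {
    ssh_bin=${SSH_BIN:-ssh}
    shim_path=${SHIM_PATH:-$0}   # in a real shim file this is simply $0

    # Re-run the same command line inside the container, passing along
    # what the git client originally asked for.
    "$ssh_bin" -p 2222 git@127.0.0.1 \
        "SSH_ORIGINAL_COMMAND=\"${SSH_ORIGINAL_COMMAND:-}\" $shim_path $*"
}

# In an actual shim file the last line would be: forward_to_container "$@"
```

Running it with SSH_BIN=echo prints the command line that would be sent to the container instead of opening a connection, which is a cheap way to check the quoting before touching authorized_keys.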
The full set of instructions to get SSH working can be found here.
I have onboarded 2 private repositories to the local Gitea and am planning to experiment with some CI/CD later this week. :) The only downside to self-hosting source code: the nice green GitHub contributions chart is green-no-mo!