tl;dr: Hosting a private git repo on your home machine is a lot of fun. It’s not meant to replace other forms of storage and collaboration but to complement them. This assumes you can already SSH into your home hardware remotely.
What problem are we trying to solve?
Before jumping into any DIY tech project, it’s important to have a clear understanding of what problem you are trying to solve and/or what the end result should look like. Putting in this time up front will save you hours of googling and potentially lots of money spent on Amazon, Newegg, etc.
Situation: I have a number of data science projects I’m working on. These projects come with rather large data sets and at least one Jupyter notebook per project.
Problem: I work on these projects across multiple devices and need them all to stay in sync. Also, this is just work I’m doing on my own to get better at data science and not necessarily something I want to have out there publicly at all times.
End Result: A solution that is as simple, secure, and affordable as possible, and that lets me work on my projects and keep everything in sync.
What options do we have?
Now that we understand the end result we’re going for, let’s list out some options that come to mind:
- External Storage: USB stick or external hard drive
- Cloud Storage: Dropbox or equivalent cloud storage service
- GitHub: Git platform provider
- Self Hosted Git Repo: Similar to GitHub but only using hardware I own and/or control
External Storage
- Risky (no redundancy)
- No Collaboration
A USB stick or external hard drive is by far going to be the simplest way to go. Just plug that device into whatever machine you’re working on and there’s your stuff. The storage is also affordable.
The risk is also insane. All your work lives on that one thumb drive. What if you lose it? You can also forget about collaboration: you are the only one who can work on your data/code.
Cloud Storage
- Probably Affordable
- Limited Collaboration
- Possibly not enough storage
Dropbox and its equivalents are great. You have a folder on your machine that is a local copy, and it actively syncs to storage in the cloud and trickles down to every other machine where you have the client installed. Depending on how much data you need to store, they can be very affordable as well.
Collaboration might be limited though. Depending on the tier of account you have, you may only get a limited number of installs. Also, how they handle the same document being worked on simultaneously may vary.
GitHub
- Great collaboration
- Redundancy (independent of other changes)
- Limited collaboration on private repos
- Space might not be adequate
GitHub is great. It allows you (and other people) to work on the same project simultaneously across different machines. Changes made on other machines don’t affect your work until you choose to pull them in.
Depending on the plan you have, you may be limited in the number of collaborators you can have on a private repository. If you don’t need a private repo, this could be a non-issue for you. Also, depending on the size of your data sets, their storage plans might not be adequate either.
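Before comparing storage plans, it helps to know how big a project actually is. Here is a rough sketch for measuring that; the function names `project_size_bytes` and `largest_files` are just ones I made up for illustration, and it simply counts every regular file under a folder.

```python
from pathlib import Path


def project_size_bytes(root: str) -> int:
    """Total size, in bytes, of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def largest_files(root: str, n: int = 5):
    """The n largest files under root as (size_in_bytes, path) pairs."""
    files = [(p.stat().st_size, str(p)) for p in Path(root).rglob("*") if p.is_file()]
    # Tuples sort by size first, so reverse order puts the biggest files on top
    return sorted(files, reverse=True)[:n]
```

Calling `project_size_bytes(".")` from inside a project folder gives a number you can hold up against whatever limit a provider advertises, and `largest_files(".")` usually points straight at the data sets responsible.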
Host Your Own Private Git Repository
- You control every part of the process
- Storage is only limited by what you can provide
- You are responsible for every part of the process
- You’re paying for the equipment and the bandwidth
Hosting a private git repo can happen in two different ways: in the cloud, or on your own hardware (probably at home). We’ll focus on using your own hardware. The cloud option is similar, but you give up some degree of control in exchange for not having to maintain the hardware.
By setting up a private git repo on a machine at home, you get a redundant copy of whatever you’re working on, and you and anyone you give access to can collaborate on it.
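To make that a bit more concrete, here is a minimal sketch of the two commands involved, written as a small Python helper that only builds the command lines and does not run anything. The user `me`, the host `home.example.com`, and the path `/srv/git/ds-project.git` are placeholders for illustration, not details of my setup. `git init --bare` and `git clone` over an `ssh://` URL are standard git; everything else here is an assumption.

```python
import shlex


def ssh_remote_url(user: str, host: str, repo_path: str, port: int = 22) -> str:
    """Build an ssh:// URL that git accepts as a remote."""
    # repo_path is an absolute path on the home machine, e.g. /srv/git/foo.git
    port_part = "" if port == 22 else f":{port}"
    return f"ssh://{user}@{host}{port_part}{repo_path}"


def setup_commands(user: str, host: str, repo_path: str, port: int = 22) -> dict:
    """Return the commands to run as strings, without executing them."""
    url = ssh_remote_url(user, host, repo_path, port)
    return {
        # Run once on the home machine; a bare repo has no working tree
        "on_home_machine": f"git init --bare {shlex.quote(repo_path)}",
        # Run on each laptop or desktop that should stay in sync
        "on_each_device": f"git clone {shlex.quote(url)}",
    }
```

For example, `setup_commands("me", "home.example.com", "/srv/git/ds-project.git")` returns one command to run on the home machine and one to run on each device. After that, the usual `git push` and `git pull` keep everything in sync.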
It’s as private as you want it to be. Depending on your threat model, it may also be “more secure” than a cloud provider. For example, breaching your Dropbox account could be as simple as stuffing credentials from an unrelated breach, whereas gaining access to a machine on your home network could prove more difficult (think “security through obscurity”, plus SSH key-only login).
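If you want to double-check the key-only part, here is a rough sketch that reads the text of an OpenSSH `sshd_config` and reports whether password logins are switched off. It assumes OpenSSH’s documented behavior that the first value found for a keyword wins and that `PasswordAuthentication` defaults to `yes`. The function name is hypothetical, and the sketch deliberately ignores `Include` files and anything inside `Match` blocks, so treat it as a quick sanity check and not an audit.

```python
def password_login_disabled(sshd_config_text: str) -> bool:
    """Return True if the global config turns off SSH password logins."""
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no settings
        # Keyword and value may be separated by whitespace or a single '='
        parts = line.replace("=", " ", 1).split()
        keyword = parts[0].lower()  # sshd keywords are case-insensitive
        if keyword == "match":
            break  # assumption: only look at global settings before any Match block
        if keyword == "passwordauthentication" and len(parts) > 1:
            # The first occurrence is the one sshd uses
            return parts[1].lower() == "no"
    return False  # directive absent: the OpenSSH default is "yes"
```

On most Linux machines you would feed it the contents of `/etc/ssh/sshd_config`. It only looks at this one directive, so other login methods (keyboard-interactive, for instance) need a separate look.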
All that sounds great, but you’ve now taken on the burden of maintaining the hardware hosting the repo and are paying for the bandwidth to send data back and forth. You’re also losing the convenience of a Dropbox-esque cloud provider (Git has more steps). And on GitHub you could simply add collaborators or make your entire repo public if you want more people to see it, whereas giving random people access to your home machine probably isn’t a great idea.
External storage was just too clunky and too risky for me, and it allowed no way to collaborate with anyone else. Cloud storage is great and the storage plans weren’t too pricey, but I just didn’t like the idea of all my stuff syncing in real time, or the security risk that comes with a Dropbox account. I use GitHub for projects I want to share publicly. These can be Jupyter notebooks I want to walk people through or software projects I’m working on. For private stuff though, the storage limitations were just too much for me.
Setting up my own private repo on my home machine was a lot of fun and comes with a certain “cool” factor. All the above methods and services still have a place in my life. This is just one more way to augment my workflow.
So, final thought, who is this good for?
- People that want complete control over what they are working on.
- People working with large data sets they want to share with just themselves or a small team of people.
- Anyone with a DIY attitude and a few extra minutes to kill.
Stay tuned for a video where I show you how I set up my private repo on my home machine and walk through my personal use cases.
Here are some resources I found very helpful during this process for hosting a private git repo: