I’m finishing up at the library in just over a week, so this will be my last post to LibraryTechNZ, and I intend it to be mercifully brief.
But I want to touch base with you all, to thank you for being such a great audience, and to say how much I've enjoyed this foray into blogging on digital libraries.
To recap, my "primary responsibility" over the past couple of years has been the digitisation of the Donald McLean Papers and the development of a website to host them. The project, our approach, and the results are described in some detail in the presentation that David Colquhoun and I gave at LIANZA last year. Check it out if you’re into that kind of thing.
But what I want to talk about today is version control, and how it can bring you peace and serenity.
Subversion, git and mercurial are all free, open source version control systems. I’m going to describe Subversion, but feel free to use git or mercurial if it pleases you.
First, I guess I should admit that not everyone needs version control in their day-to-day life. In fact, if you never work with data or metadata, you can sign off right here.
The rest of you either already know about the magic of version control, or I’m doing you one huge favour by cluing you in, right now. Either way, read on.
With version control, you never need to fear questions like:
- Where are the latest versions of all the files? (A: "They're in the subversion repository.")
- Can I edit them? (A: "Sure! Just check them out, make your changes and commit them back in, with a note describing the change.")
- Where are the versions of all the files signed off by the steering group? (A: "Just check out the steering-group-approved-march-09 tag.")
- What changes have been made since then, and why, and by whom? (A: "View the log and do a diff.")
- Can we make a slightly different version of the files for xyz? (A: "Of course! Just create a branch.")
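For the command-line inclined, here is roughly what sits behind each of those answers. Treat it as a sketch only: the repository URL, the folder names and the xyz branch name are made up for illustration, and the Python below just assembles and prints the svn commands rather than running them.

```python
import shlex

# Hypothetical repository URL and tag name; substitute your own.
REPO = "https://svn.example.org/library"
TAG = "steering-group-approved-march-09"


def answers(repo=REPO, tag=TAG):
    """Map each question above to an svn command line that answers it."""
    return {
        "latest files": ["svn", "checkout", f"{repo}/trunk", "working-copy"],
        "commit an edit": ["svn", "commit", "working-copy",
                           "-m", "Describe the change"],
        "approved files": ["svn", "checkout", f"{repo}/tags/{tag}",
                           "approved-copy"],
        "view the log": ["svn", "log", "--verbose", f"{repo}/trunk"],
        "what changed": ["svn", "diff", f"{repo}/tags/{tag}", f"{repo}/trunk"],
        "new branch": ["svn", "copy", f"{repo}/trunk", f"{repo}/branches/xyz",
                       "-m", "Branch for xyz"],
    }


# Print the commands instead of running them, so this is safe to try anywhere.
for question, argv in answers().items():
    print(f"{question:>15}: {shlex.join(argv)}")
```

Graphical clients such as TortoiseSVN offer the same operations through menus, so none of this requires living in a terminal.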
Version control is just so elegant and right that when you say "the files are all in the subversion repository" it's like saying "the water is in the tap", except it's better because you decided to put the water there and it's obviously where it belongs.
So how does it work?
- To start with, you'll need a "subversion repository". If there isn't one available to you in your organisation, you can ask your technical people to set one up, find a hosted solution or even go ahead and install one yourself on your PC.
- Then you import your files into the repository. From that moment on, you can breathe a sigh of relief and say "All the files are under version control. ftw."
- You then check out your files to a local working directory. I’m a Mac user, so I tend to check out files to a folder on my cluttered desktop, but you can put them wherever you like.
- Now that you have a local working copy of the files, you can edit them and work with them just like you always have. The only thing is that every time you make a significant change to a file, you should commit the change back to the repository. There’s no law about how often you do this but it’s Good Practice to commit your changes frequently.
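If you do install a repository on your own PC, the four steps above might look something like this. It is a minimal sketch, assuming the Subversion command-line tools (svnadmin and svn) are installed; the folder names are hypothetical, and the code only builds and prints the commands.

```python
from pathlib import Path


def first_steps(repo_dir, source_dir, work_dir):
    """Return the commands for: create, import, check out, commit."""
    # A repository on your own disk is addressed with a file:// URL.
    trunk = (Path.cwd() / repo_dir).as_uri() + "/trunk"
    return [
        ["svnadmin", "create", str(repo_dir)],
        ["svn", "import", str(source_dir), trunk, "-m", "Initial import"],
        ["svn", "checkout", trunk, str(work_dir)],
        ["svn", "commit", str(work_dir), "-m", "Describe what changed and why"],
    ]


# Hypothetical folder names, printed rather than executed.
for argv in first_steps("my-repo", "my-files", "my-working-copy"):
    print(argv)
```

To actually run the steps, each list could be handed to Python's subprocess.run, or you can simply type the equivalent commands yourself.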
With this small investment of effort, you can achieve magic. Because unlike in space, under version control nothing is ever lost™. Provided you back up your repository, of course.
What kinds of files belong in version control?
Both text files (.txt, .xml, .conf, .rtf, etc.) and binary files (.doc, .odt, etc.) can be kept under version control. With text files you can see exactly which lines of the file changed and how, whereas with binary files you only know that the file changed.
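To make that difference concrete, here is a small sketch using Python's standard difflib module. It is not Subversion's own machinery, and the sample file contents are invented; it just shows the kind of answer you can expect for each kind of file.

```python
import difflib


def text_changes(old, new):
    """Return the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


def binary_changed(old_bytes, new_bytes):
    """For binary content, all we can usefully say is whether it differs."""
    return old_bytes != new_bytes


# Made-up sample data for illustration.
old_xml = "<title>Letter to McLean</title>\n<date>1850</date>\n"
new_xml = "<title>Letter to McLean</title>\n<date>1851</date>\n"

for line in text_changes(old_xml, new_xml):
    print(line)

print(binary_changed(b"\x00\x01\x02", b"\x00\x01\x03"))
```

For the text file you get the exact line that was removed and the line that replaced it; for the binary one you get a yes or no.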
In the McLean Papers Digitisation Project we used version control for:
- all the TEI (Text Encoding Initiative) XML full-text transcriptions and translations
- the prototype delivery system scripts in PHP
- a snapshot of the MySQL database as a mysqldump .sql file
- the database schemas
- the Solr configuration files and schema
- the Java Tomcat delivery system and configuration
- the Apache reverse proxy config httpd.conf
- various XSL files to do a variety of unholy things
- and much, much more!
Every time you commit, the repository gets a new revision number. Previous versions of a file can still be retrieved if you ask for them by revision number, but by default you get the latest version.
You can delete files, but you can also retrieve the previous, undeleted version, if you ask for it by revision number. This means nothing is ever lost, but things don't get cluttered either.
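If it helps to see the idea in miniature, here is a toy model in Python. It is emphatically not how Subversion stores anything internally; the class, its methods and the file name are all invented for illustration. It only mimics the behaviour described above: numbered revisions, latest by default, and deleted files still retrievable from earlier revisions.

```python
class ToyRepository:
    """A toy in-memory model of revisions; not how Subversion stores data."""

    def __init__(self):
        # Revision 0 is the empty repository.
        self.revisions = [{}]

    @property
    def head(self):
        """The number of the latest revision."""
        return len(self.revisions) - 1

    def commit(self, changes):
        """Apply {path: content} changes; content None deletes the path."""
        snapshot = dict(self.revisions[-1])
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self.revisions.append(snapshot)
        return self.head

    def get(self, path, revision=None):
        """Return a file's content at a revision (the latest by default)."""
        if revision is None:
            revision = self.head
        return self.revisions[revision].get(path)


repo = ToyRepository()
r1 = repo.commit({"letter.xml": "first draft"})
r2 = repo.commit({"letter.xml": "second draft"})
r3 = repo.commit({"letter.xml": None})  # delete the file

print(repo.get("letter.xml"))      # None: gone from the latest revision
print(repo.get("letter.xml", r1))  # first draft
print(repo.get("letter.xml", r2))  # second draft
```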
It's common to structure your repository as three top-level folders:
- trunk
- branches
- tags
The trunk is where the latest main version is kept. If you just want the latest config files or whatever you're using the repository for, trunk is the natural place to go.
The branches are where you or others can take a version of your files and develop a new version for some purpose, without affecting the trunk.
The tags are where you record a particular set of your files as being a "release" or an approved version. In the example above, the steering group approved a set of the configuration files in March. We still need to be able to retrieve the exact files they approved, as well as being able to work on the files to fix all the issues they missed, and keep track of these changes. When we tag a set of files as steering-group-approved-march-09, we capture a moment in time forever, allowing anyone to download those exact files even while we continue to develop and change them.
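In Subversion, trunk, branches and tags are ordinary folders in the repository, and a tag is by convention just a copy of trunk that nobody edits afterwards. A minimal sketch of setting up the layout and tagging the approved files, assuming a hypothetical repository URL and, as before, only building the commands rather than running them:

```python
import shlex

# Hypothetical repository URL; substitute your own.
REPO = "https://svn.example.org/library"


def create_layout(repo=REPO):
    """Command to create the three conventional top-level folders."""
    dirs = [f"{repo}/{name}" for name in ("trunk", "branches", "tags")]
    return ["svn", "mkdir", *dirs, "-m", "Create trunk, branches and tags"]


def create_tag(name, repo=REPO):
    """Command to record the current trunk under tags/<name>."""
    return ["svn", "copy", f"{repo}/trunk", f"{repo}/tags/{name}",
            "-m", f"Tag {name}"]


print(shlex.join(create_layout()))
print(shlex.join(create_tag("steering-group-approved-march-09")))
```

Creating a branch works the same way, with branches/ in place of tags/ as the destination.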
Interested? Go read the excellent and free online book and discover for yourself the peace of knowing where all your files are, what's been changed, when and by whom, and of being able to work on your files at the same time as other people without messing things up for each other.
Subversion, for a better world. Bye for now.