Brian of Auckland has asked about the New Zealand Web Harvest 2010: “How much of the data has been analysed, catalogued or made available… Any stats?”
All good questions. “I'm sure there is a lot of interest :-)” he adds.
Analysis
This prompt has caused me to stop making excuses, and start analysing. This is more complicated than you might think, because there’s just so much data. Even the log files and summary reports are too large to work with easily.
Luckily, I still have the scripts I used in 2008, so the first pass is fairly easy. (These scripts don’t examine the data itself; they examine the reports generated from the harvest result by the Internet Archive.) I’ve now verified and written up this summary for 2010.
My colleague Gillian has taken this report and started doing side-by-side comparisons with the 2008 data. I’ve summarised her findings below, and here’s a more detailed breakdown (Link to follow).
Statistics
The following table provides a summary of the different website harvests in 2008 and 2010.
Here's a bit more detail on the .NZ part of the harvest.
What does this tell us? The obvious thing is that the 2010 harvest ran longer and gathered more data. That doesn’t necessarily mean the internet was any bigger by then, though, because we made a lot of changes as a result of feedback following the 2008 harvest and consultation prior to the 2010 harvest.
The first major change was that the 2010 harvest had much better seeds: we had access to the Zone files for .nz, .com, .net and .org, and therefore believe we have much better coverage of the registered domains.
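To make the idea of seeding from a zone file a little more concrete, here is a minimal sketch in Python. It is not the script we used: the function name `seeds_from_zone`, the sample data, and the choice to turn each delegated domain into an `http://` seed URL are all assumptions for illustration. It also assumes a simplified zone file in which every record sits on one line and starts with a fully-qualified owner name.

```python
def seeds_from_zone(zone_text: str) -> list[str]:
    """Derive crawl seeds from a simplified zone file (hypothetical helper).

    Assumes one record per line, each starting with a fully-qualified
    owner name. Only owners that have an NS record are treated as
    registered domains.
    """
    domains = set()
    for raw in zone_text.splitlines():
        line = raw.split(";", 1)[0]  # drop trailing comments
        # Skip blank lines, continuation lines and directives like $ORIGIN
        if not line.strip() or line[0].isspace() or line.startswith("$"):
            continue
        fields = line.split()
        # The record type sits between the owner name and the record data
        if len(fields) >= 3 and "NS" in (f.upper() for f in fields[1:-1]):
            domains.add(fields[0].rstrip(".").lower())
    # One seed per domain (assumption: the site root is a reasonable seed)
    return [f"http://{d}/" for d in sorted(domains)]


# Made-up sample data, not taken from any real zone file
SAMPLE_ZONE = """$ORIGIN nz.
; sample records
example.co.nz. 3600 IN NS ns1.example.net.
example.co.nz. 3600 IN NS ns2.example.net.
Another.org.nz. 3600 IN NS ns1.another.org.nz.
www.example.co.nz. 3600 IN A 192.0.2.1
"""

if __name__ == "__main__":
    for seed in seeds_from_zone(SAMPLE_ZONE):
        print(seed)
```

Because the domains are collected in a set, a domain with several name servers still produces only one seed.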
The second major change was that we honoured the robots.txt protocol (except when downloading images and similar elements embedded in web pages). This means that many websites were crawled less heavily than may have been the case in 2008, when we ignored robots.txt (unless specifically requested otherwise) to get a more complete crawl.
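For readers who like to see a policy spelled out, the 2010 rule can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only an illustration of the rule as described above, not the harvester's actual configuration: the function name `may_fetch`, the `embedded` flag, the user-agent string and the sample robots.txt are all assumptions.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str,
              embedded: bool = False) -> bool:
    """Decide whether to fetch a URL under the 2010-style policy (a sketch).

    Embedded resources (images and similar page elements) are fetched
    regardless of robots.txt; everything else must be allowed by it.
    """
    if embedded:
        # Assumption: the caller already knows the URL is an embedded element
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt for a hypothetical site
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    agent = "example-harvester"  # hypothetical user-agent name
    print(may_fetch(SAMPLE_ROBOTS, agent, "http://example.co.nz/index.html"))
    print(may_fetch(SAMPLE_ROBOTS, agent, "http://example.co.nz/private/a.html"))
    print(may_fetch(SAMPLE_ROBOTS, agent, "http://example.co.nz/private/logo.png",
                    embedded=True))
```

Under the 2008 approach the robots.txt check would effectively have been skipped for every URL, which is why individual sites may have been crawled more heavily then.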
To summarise, we think the 2010 crawl had greater coverage than the first, but the specific websites harvested were in many cases less complete.
Some anecdotal comments
While we haven’t made a systematic study of the data, I believe the second harvest provides good coverage of the .nz domain in 2010 (whereas the 2008 harvest was patchy), and that .nz simply was significantly bigger in 2010 than in 2008 (but we’ll probably never know how much bigger, or even if such things can be measured).
Gillian and the harvesting team currently have access to both harvests. As is always the case with web archiving, the quality of the harvested websites varies. Some are complete and can be viewed properly. Others lack content because of technical limitations of the harvester, or because the website owners have excluded the harvester with robots.txt files. In selective web harvesting these problems are often resolved by tailoring the profiles for each website, or contacting the website owner. In domain harvesting this isn’t possible, due to the sheer quantity of data and the speed of the harvest.
Anecdotally, the 2010 harvest seemed to do a much better job of avoiding spider traps, thanks to advances in harvesting practice and to the changed robots policy.
Cataloguing
There’s none. Our selective harvests are individually catalogued and available online, but as yet we have no catalogue record for the domain harvests.
Making it available
The harvests are currently only available to selected staff members in the Library. There are a lot of legal (and also technical) issues that have to be addressed before we can provide public access, and while we’ve been able to run the harvests and secure the results, we haven’t had the resources to have a serious tilt at these access challenges.
As an interim measure we’re discussing bringing the 2008 and 2010 domain harvests together into one access point, and making them available within the Library's reading rooms when the Molesworth Street building re-opens in 2012.
The next stage would be to provide public online access, and we’re every bit as excited about that prospect as the many people who email us to request it!
Thursday, April 7, 2011
Comparing the 2008 and 2010 New Zealand Web Harvests
Posted by Gordon Paynter at 8:42 PM
Tags: Gordon Paynter, internet, web harvest