Thursday, April 7, 2011

Comparing the 2008 and 2010 New Zealand Web Harvests

Brian of Auckland has asked about the New Zealand Web Harvest 2010: “How much of the data has been analysed, catalogued or made available… Any stats?”

All good questions. “I'm sure there is a lot of interest :-)” he adds.

Analysis

This prompt has caused me to stop making excuses, and start analysing. This is more complicated than you might think, because there’s just so much data. Even the log files and summary reports are too large to work with easily.

Luckily, I still have the scripts I used in 2008, so the first pass is fairly easy. (These scripts don’t examine the data itself, they examine the reports generated from the harvest result by the Internet Archive.) I’ve now verified and written up this summary for 2010.
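For the curious, here's a rough sketch of the kind of first-pass roll-up those scripts do. It isn't the actual script, and it assumes a hosts report laid out as whitespace-separated columns of URL count, byte count and host name, which may not match the real report layout exactly.

#!/usr/bin/env python
"""Rough sketch: summarise a harvest hosts report by top-level domain.

Assumes a whitespace-separated report with columns: url-count, byte-count,
host. The real report format may differ -- adjust the column indices to suit.
"""
import sys
from collections import defaultdict

def summarise(report_path):
    urls_by_tld = defaultdict(int)
    bytes_by_tld = defaultdict(int)
    with open(report_path) as report:
        for line in report:
            fields = line.split()
            if len(fields) < 3 or not fields[0].isdigit():
                continue  # skip headers and malformed lines
            url_count, byte_count, host = int(fields[0]), int(fields[1]), fields[2]
            tld = host.rsplit('.', 1)[-1].lower()  # e.g. 'nz', 'com'
            urls_by_tld[tld] += url_count
            bytes_by_tld[tld] += byte_count
    return urls_by_tld, bytes_by_tld

if __name__ == '__main__':
    urls, size = summarise(sys.argv[1])
    for tld in sorted(urls, key=urls.get, reverse=True):
        print('%-10s %12d URLs %15d bytes' % ('.' + tld, urls[tld], size[tld]))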

My colleague Gillian has taken this report and started doing side-by-side comparisons with the 2008 data. I’ve summarised her findings below, and here’s a more detailed breakdown (Link to follow).

Statistics

The following table provides a summary of the different website harvests in 2008 and 2010.

Here's a bit more detail on the .NZ part of the harvest.

What does this tell us? The obvious thing is that the 2010 harvest ran longer and gathered more data. That doesn't necessarily mean the internet was any bigger by then, though, because we made a lot of changes as a result of feedback following the 2008 harvest and consultation prior to the 2010 harvest.

The first major change was that the 2010 harvest had much better seeds, because we had access to the zone files for .nz, .com, .net and .org, and we therefore believe we have much better coverage of the registered domains.
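To give a flavour of what "better seeds" means in practice, here's a simplified sketch of how registered domains might be pulled out of a zone file to build a seed list. It isn't what we actually ran, and real zone files have directives and quirks it deliberately ignores.

"""Sketch: turn a DNS zone file into a crawl seed list.

Assumes a plain master-format zone file; only NS records are used to find
delegated (i.e. registered) domains. Real zone files have $ORIGIN/$TTL
directives, continuation lines and other record types this ignores.
"""
import sys

def seeds_from_zone(zone_path):
    domains = set()
    with open(zone_path) as zone:
        for line in zone:
            line = line.split(';', 1)[0].strip()   # strip comments
            if not line or line.startswith('$'):   # skip directives
                continue
            fields = line.split()
            if 'NS' in fields:                     # delegation record
                domains.add(fields[0].rstrip('.').lower())
    # One seed per domain; the crawler discovers the rest by following links.
    return sorted('http://%s/' % d for d in domains)

if __name__ == '__main__':
    for seed in seeds_from_zone(sys.argv[1]):
        print(seed)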

The second major change was that we honoured the robots.txt protocol (except when downloading images and similar elements embedded in web pages). This means that many websites were crawled less heavily than may have been the case in 2008, when we ignored robots.txt (unless specifically requested otherwise) to get a more complete crawl.
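In crawler terms, the policy looks something like the sketch below. This is an illustration, not the configuration we actually used: ordinary page fetches consult robots.txt, but embedded elements are allowed through so that the pages we do keep render properly. The user-agent string is made up for the example.

"""Sketch of the 2010 robots policy: honour robots.txt for pages, but
allow embedded elements (images, CSS, ...) needed to render a page we
have already decided to keep. Not the actual crawler configuration."""
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "NLNZHarvester"  # hypothetical user-agent string

_parsers = {}  # cache one robots.txt parser per host

def robots_allow(url):
    host = urlparse(url).netloc
    if host not in _parsers:
        rp = robotparser.RobotFileParser("http://%s/robots.txt" % host)
        rp.read()
        _parsers[host] = rp
    return _parsers[host].can_fetch(USER_AGENT, url)

def should_fetch(url, is_embedded_element=False):
    # Embedded elements are fetched regardless, so harvested pages stay complete.
    if is_embedded_element:
        return True
    return robots_allow(url)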

To summarise, we think the 2010 crawl had greater coverage than the first, but the specific websites harvested were in many cases less complete.

Some anecdotal comments

While we haven’t made a systematic study of the data, I believe the second harvest provides good coverage of the .nz domain in 2010 (whereas the 2008 harvest was patchy), and that .nz was simply significantly bigger in 2010 than in 2008 (though we’ll probably never know how much bigger, or even whether such things can be measured).

Gillian and the harvesting team currently have access to both harvests. As is always the case with web archiving, the quality of the harvested websites varies. Some are complete and can be viewed properly. Others lack content because of technical limitations of the harvester, or because the website owners have excluded the harvester with robots.txt files. In selective web harvesting these problems are often resolved by tailoring the profiles for each website, or contacting the website owner. In domain harvesting this isn’t possible, due to the sheer quantity of data and the speed of the harvest.

Anecdotally, the 2010 harvest seemed to do a much better job of avoiding spider traps, thanks both to advances in harvesting practice and to the changed robots policy.

Cataloguing

There’s none. Our selective harvests are individually catalogued and available online, but as yet we have no catalogue record for the domain harvests.

Making it available

The harvests are currently only available to selected staff members in the Library. There are a lot of legal (and also technical) issues that have to be addressed before we can provide public access, and while we’ve been able to run the harvests and secure the results, we haven’t had the resources to have a serious tilt at these access challenges.

As an interim measure we’re discussing bringing the 2008 and 2010 domain harvests together into one access point, and making them available within the Library's reading rooms when the Molesworth Street building re-opens in 2012.

The next stage would be to provide public online access, and we’re every bit as excited about that prospect as the many people who email us to request it!

7 comments:

Rotovegas said...

Hi Gordon, many thanks for this and for the work both you and Gillian have done. Certainly appreciate what a large task it is to manage and process the harvests. Interesting to see that the number of requests was significantly down for .ac, although the actual data increased. I'm guessing that's because of the number of robots.txt files on university servers. Was the decision to honour the protocol this time a political one, or a practical one - e.g. you picked up a lot of junk last time...

Regards
-Brian of Auckland

Gordon Paynter said...

Hi Brian: I just spent 30 minutes typing a detailed response, but blogger swallowed it! Here's the short version.

The changes between the 2008 and 2010 harvests of ac.nz are complex: we found more domains, but fewer hosts on those domains; we made fewer requests, but received more data in response to those requests.

We don't/can't know the precise effect of the robots.txt changes on ac.nz. However, my hunch is that robots.txt changes probably did have a significant effect on the "junk" harvested, particularly from ac.nz. For example, in 2008 I know we harvested a good chunk of a university library catalogue (oops!) and at least one test server that effectively duplicated a production site (doh!). Honouring robots.txt may have saved us from these mistakes.

Finally, the robots.txt decision was complex, and while I can't remember all the costs and benefits, these are detailed in a paper that is online (I think).

However, one practical argument that sticks with me is that we estimated beforehand that we could harvest the same quantity of data regardless of whether we honoured or ignored robots.txt (i.e. the robots-open part of the NZ web is larger than we can gather). Therefore, there isn't a huge disadvantage to honouring robots.txt (as long as we ensure that for each page we do harvest we always get all the embedded elements, so the quality of those pages is not compromised).

Note this argument doesn't apply to selective harvesting. And I don't know if I'd make it for every harvest: mature programmes like the one at the BNF adjust scope and crawl parameters (such as robots policy) in subsequent domain harvests, and I can see value in that.

I hope that answers a few of your questions.

Gordon

Anonymous said...

Hi!

Thanks for that post. What do you do if no Heritrix report is available for some jobs? Do you ignore those jobs in the statistics, or do you have tools to rebuild the report files from the collected (W)ARC files?

Gordon Paynter said...

Hi Anon: We're not working from the Heritrix reports, we're working from a summary "hosts" report that IA created (I would guess they either generate it as they create the CDX index, or generate it from the CDX file at the end).

We could regenerate these reports from the CDX index, and in 2008 we did in fact generate some reports ourselves from the CDX index.

However, your point remains: we are depending on IA to generate the CDX index and hosts report for us, rather than doing our own analysis of the raw data.

At some point we'll have to go back to the WARC files and regenerate all the derived data from them. However, the indexing process alone can take weeks of CPU time (which is why we have asked IA to do it for us).
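For illustration only, here's a rough sketch of the kind of per-host roll-up that can be produced from a CDX index. Field positions vary between CDX flavours, so this assumes the common layout where the third column is the original URL.

"""Rough sketch: roll a CDX index up into per-host capture counts.
Assumes a space-separated CDX where the third column is the original URL;
real CDX files begin with a ' CDX ...' header describing the actual layout."""
import sys
from collections import Counter
from urllib.parse import urlparse

def hosts_from_cdx(cdx_path):
    captures = Counter()
    with open(cdx_path) as cdx:
        for line in cdx:
            if line.startswith(' CDX'):
                continue  # header line describing the field layout
            fields = line.split()
            if len(fields) < 3:
                continue
            host = urlparse(fields[2]).netloc.lower()
            if host:
                captures[host] += 1
    return captures

if __name__ == '__main__':
    for host, count in hosts_from_cdx(sys.argv[1]).most_common(20):
        print('%9d  %s' % (count, host))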

Gordon

Anonymous said...

Is there a list of those downloaded domains?

Shayne said...

It's impressive to see there was a 100% jump in domains from 2008 to 2010. I would love to see the stats for 2011.

fukt i källare said...

Very informative article. Well written, dear sir.