Tuesday, December 2, 2008

New Zealand Web Harvest 2008 Briefing

As you may have read already, the National Library recently hosted a meeting of ICT sector representatives to discuss the New Zealand Web Harvest 2008.

The meeting was attended by the Library, DNC, Internet NZ, CityLink, LIAC, NZNOG, and MED (apologies: ISPANZ).

At the meeting I presented some background to the New Zealand Web Harvest 2008, and to the issues that arose before, during and after the harvest, which I’m sharing here.

Facts and figures

The harvest collected 105 million URLs over ten days. It downloaded about 4.1 terabytes of data, which compresses down to slightly less than 3 terabytes. Up to 50,000 URLs were taken from each host in ac.nz, cri.nz, govt.nz and maori.nz, and up to 10,000 from each of the other hosts.
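To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal terabytes and a crawl that ran evenly around the clock for the full ten days, neither of which is stated above, and it treats "slightly less than 3 terabytes" as an upper bound of 3.

```python
# Rough arithmetic on the headline figures; assumptions are noted inline.
URLS = 105_000_000          # URLs collected
DAYS = 10                   # crawl duration
DOWNLOADED_TB = 4.1         # uncompressed, assumed decimal terabytes
COMPRESSED_TB = 3.0         # upper bound for "slightly less than 3"

# Assumes the crawl ran continuously at an even rate.
urls_per_second = URLS / (DAYS * 24 * 60 * 60)

# Average uncompressed size of one downloaded URL, in kilobytes.
avg_kb_per_url = DOWNLOADED_TB * 1e12 / URLS / 1e3

# Compressed size as a fraction of the downloaded size (at most this value).
compression_ratio = COMPRESSED_TB / DOWNLOADED_TB

print(f"{urls_per_second:.0f} URLs per second on average")
print(f"{avg_kb_per_url:.0f} KB per URL on average")
print(f"compressed to at most {compression_ratio:.0%} of the original size")
```

On those assumptions the harvest averaged a little over 120 URLs per second across all hosts, at roughly 39 KB per URL.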

Rationale

Social responsibility. The National Library of New Zealand has a social responsibility to preserve New Zealand's social and cultural history, be it in the form of books, newspapers and photographs, or of websites, blogs and YouTube videos. An increasing amount of New Zealand's documentary heritage is only available online, so the public benefit from the safe, long-term preservation of New Zealand's online heritage is incalculable.

Legal mandate. This project is being undertaken under the Library's legislative mandate for 'collecting, preserving, and protecting documents, particularly those relating to New Zealand, and making them accessible for all the people of New Zealand, in a manner consistent with their status as documentary heritage and taonga'.

Selective harvesting programme. We have been selectively harvesting websites for several years, and have been using the Web Curator Tool in the current legal environment since January 2007. In that time, we have attempted about 3,000 harvests. Our current selective harvesting focuses on the general election, with several hundred harvests of political party websites, blogs, and other sites of interest scheduled during October and November.

International context. The Library is a member of the International Internet Preservation Consortium. The recent member survey includes 24 national libraries, 14 of which already crawl their entire national domain. Most also perform selective harvests.

Setup

Goals. The goals of the crawl were twofold: to seek as broad a selection of hosts as possible, and to crawl more deeply into hosts in ac.nz, cri.nz, govt.nz and maori.nz.

Scope. The scope of the crawl was limited to hosts in the nz TLD, plus a selection of other hosts that met our criteria for harvesting, which are based on the legal deposit legislation.
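As an illustration only, a scope rule of this shape could be sketched as below. This is not the configuration used for the harvest: the function name and the EXTRA_HOSTS set are hypothetical, and the single entry in that set is a placeholder rather than a host that was actually selected.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the hand-selected hosts outside the nz TLD.
EXTRA_HOSTS = {"example.com"}


def in_scope(url, extra_hosts=EXTRA_HOSTS):
    """Return True if the URL's host is in the nz TLD or on the extra list."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match on the final label so that "nz.example.org" is not treated as nz.
    return host.endswith(".nz") or host in extra_hosts


print(in_scope("http://www.example.govt.nz/about"))   # in the nz TLD
print(in_scope("http://example.com/nz-news"))         # on the extra list
print(in_scope("http://www.example.nz.example.org/")) # neither
```

Checking the end of the hostname, rather than searching for "nz" anywhere in the URL, keeps look-alike hosts out of scope.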

Seeds. Seeds were gathered from our own past crawls, from the Internet Archive’s past crawls, and from a test crawl in early October.

Robots policy. There is not a lot of research out there quantifying what is lost by a harvest that honours robots.txt. However, our experience with selective harvests is that honouring robots.txt leads to a significant reduction in the quantity and quality of material harvested. (Almost all archival crawls ignore robots.txt to some degree.)

Crawling

Hardware, software and network. The harvest was performed with Heritrix, the Internet Archive’s open-source archival-quality crawler. The crawler’s capabilities guided a lot of our decisions. Heritrix was deployed across five high-end Linux servers, and made requests from two IP addresses, both in California.

Shaping and polling. The Internet Archive’s crawl engineers designed the crawl to request 100 URLs at a time from each host (500 for ac.nz, cri.nz, govt.nz and maori.nz) and regularly reviewed the maximum number of URLs downloaded per host to ensure a broad crawl.
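The per-host shaping can be pictured with a small sketch. This is not Heritrix code or the engineers' actual settings: the names are hypothetical, the batch sizes are the ones quoted in this paragraph, and the caps are the 50,000 and 10,000 figures quoted in the facts and figures, which may have been adjusted during the crawl.

```python
from collections import Counter
from urllib.parse import urlsplit

# Second-level domains that were crawled more deeply.
DEEP_SUFFIXES = (".ac.nz", ".cri.nz", ".govt.nz", ".maori.nz")


def is_deep(host):
    """True for hosts under ac.nz, cri.nz, govt.nz or maori.nz."""
    return host.lower().endswith(DEEP_SUFFIXES)


def batch_size(host):
    """URLs requested from a host at a time."""
    return 500 if is_deep(host) else 100


def url_cap(host):
    """Maximum URLs downloaded from a host over the whole crawl."""
    return 50_000 if is_deep(host) else 10_000


class HostBudget:
    """Tracks how many URLs have been fetched per host (hypothetical helper)."""

    def __init__(self):
        self.fetched = Counter()

    def allow(self, url):
        """Count the URL against its host and report whether it fits the cap."""
        host = (urlsplit(url).hostname or "").lower()
        if self.fetched[host] >= url_cap(host):
            return False
        self.fetched[host] += 1
        return True


budget = HostBudget()
print(batch_size("www.example.govt.nz"), url_cap("www.example.govt.nz"))
print(batch_size("www.example.co.nz"), url_cap("www.example.co.nz"))
print(budget.allow("http://www.example.co.nz/page"))
```

Capping each host separately is what keeps a handful of very large sites from consuming the crawl, which is how a limit like this supports the goal of a broad harvest.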

Continuous monitoring. The crawl engineers monitored for problems like crawler traps and web spam, and were sent a copy of all email feedback so they could take immediate action.

Future harvest considerations

Robots policy. We got a lot of feedback about our robots.txt policy. This policy was determined (prior to harvest) after consultation with our peers in the IIPC and with the Internet Archive. The policy was reviewed daily during the harvest in response to feedback, and will be again when we analyse the harvest result.

Communications. We could have communicated better with webmasters, and have agreed that for our next harvest we will work with the industry representatives to publicise the harvest and inform the groups affected. The industry group assembled will be central to future communications for web harvests.

International traffic. Webmasters pointed out that the harvester was based in California, so its requests incurred international traffic charges for them. Several technical solutions to the problem have been suggested, including locating a harvester in New Zealand. We are investigating these options with the Internet Archive and network operators.

What’s next?

The harvested data will be returned to the Library in December on SATA disks, following a period of quality review. We will then start the long process of analysing the data and deciding what to do with it.

A few major issues loom on the horizon. The first is preservation: how do we keep this stuff safe? We will investigate this issue in the new year, though we’re unlikely to resolve it in the short or medium term. A second major issue is public access. There are several policy issues around privacy and take-down requests to be sorted out first, and we are not currently planning to offer public access to the harvest until these are sorted out.

Once our analysis of the data is complete we look forward to sharing the findings - watch this space.

Many thanks to DNC, Internet NZ, CityLink, LIAC, NZNOG, MED, and ISPANZ for discussing the project with us.

1 comment:

mike forbes said...

It's great to see the lessons learnt post-harvest.

I'm looking forward to seeing the results, and having the wider IT community involved in the next crawl.

Thanks for sharing the details!