Wednesday, February 3, 2010

Reminder: Consultation on 2010 web harvest

A reminder that the closing date for submissions on the 2010 Web Harvest Options Paper is 9am Monday 8 February 2010.

You can read our original blog post here, or the full announcement on the National Library website.

Thank you to those who have already sent in your feedback. We've also received some questions since releasing the paper:

How/where is the data being stored, and how securely?

Data collected in selective and whole-of-domain web harvests is stored on-site at the National Library of New Zealand, and backed up in several off-site (New Zealand) locations.

Can larger sites submit their sitemaps to the crawler to make the harvesting less intensive?

At the time of the 2008 harvest, the Heritrix software used by the Internet Archive for the harvest had difficulty understanding sitemaps. We're looking into whether this has improved.

Will you harvest password-protected websites?

No. We will only archive the publicly available pages of a website.

Was all the data harvested in 2008 from computers located in New Zealand?

No. A .nz URL does not mean that the computer hosting the website is physically in New Zealand. Our current estimates of where the data in the 2008 harvest was hosted (based on bytes downloaded) are as follows:

Hosted in New Zealand: 73%
Hosted in the USA: 17%
Hosted in Australia: 4%
Hosted in an unknown location: 1%

Four more countries (Germany, Fiji, Bulgaria, United Kingdom) hosted 1% of the data each, and no other country hosted more than 0.5% of the total.
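For anyone curious how a "share by bytes downloaded" figure like those above can be worked out, here is a minimal sketch. It is not the code used for the 2008 harvest: the function name, the record format and the sample numbers are all made up for illustration, and it assumes each download has already been tagged with a hosting location.

```python
from collections import defaultdict


def share_by_location(records):
    """Return each location's percentage share of the total bytes downloaded.

    `records` is an iterable of (location, num_bytes) pairs. This record
    format is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    for location, num_bytes in records:
        totals[location] += num_bytes

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing downloaded, so there are no shares to report

    return {loc: 100.0 * b / grand_total for loc, b in totals.items()}


# Made-up sample data, not figures from the harvest.
sample = [("NZ", 500), ("US", 200), ("NZ", 250), ("AU", 50)]
shares = share_by_location(sample)

for location, percent in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"Hosted in {location}: {percent:.0f}%")
```

Note that the shares are weighted by bytes, not by number of websites, so a handful of very large sites hosted overseas can shift the percentages noticeably.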

If you have any questions, fire them through to web-harvest-2010 AT natlib.govt.nz

Thanks heaps,

Courtney and Gordon

Gordon Paynter (Programme Manager Digitisation) and Courtney Johnston (Web Manager) are the New Zealand Web Harvest 2010 team. You can contact us via web-harvest-2010 AT natlib.govt.nz


6 comments:

Anonymous said...

Where is the list(s) of what was actually harvested?

What non-.nz TLDs are in that list?

Courtney Johnston said...

Hi Anon

There is an awful lot of data, and we have tried to put as much generally-useful summary information as we can in the options paper. Some people have asked us for specific information to prepare their feedback and we have provided it when we can (again, as a summary report). Could you please be more specific about what data you'd like to see (beyond what is in the Appendix in the options paper)?

Anonymous said...

A simple list of the 363,279 hosts harvested

Anonymous said...

I guess I missed where that list of the 363,279 hosts harvested was published. Can I request it under the Official Information process? Where would I go to do that? Do you have an online form?

Gordon Paynter said...

Hi Anon:

Mea culpa, I kind of lost your request in the WoD consultation washup.

I have the data, but I need to find a place to put it, and our website admin is out of the office today. I'll try and do it tomorrow. I'll post the original list of "seed URLs" (that we started with), and the list of hosts that we wound up taking content from.

Hope that helps, and sorry about the delay.

Gordon

PS: I'm told that any request to a government department for information is an OIA request. So technically, you already made one.

Courtney Johnston said...

Hi Anon

The list of hosts is now available for download on the Web Harvest project page on our website (at the end of the Q and A).