Wednesday, October 15, 2008

2008 Web Harvest - Let us know how we can make it better for you

So – the Library is conducting the 2008 web harvest and we’ve been picking up some pain around the fact that we’re not honouring the robots.txt protocol: see ‘Life of Andrew’ and ‘Doing Nothing’ for example, or this wave of tweets.

First up: our intentions are good!

We’re not sucking up the internet for fun. This is an archival harvest, and the intent is to preserve the content we harvest so people in the future can find it and use it.

For years we’ve been collecting, preserving and making accessible books, letters, diaries, photos, paintings, newspapers – you name a physical format, we’ve probably got it. We do this so that New Zealand’s documentary heritage is safely preserved, and accessible to all New Zealanders. People use our collections for all sorts of things, from making films to writing family histories.

A change in our legislation in 2003 means we’re now responsible for collecting and preserving NZ’s online documentary heritage as well. We all know that the nature of the web is easy come, easy go. The web harvest is our attempt to capture a snapshot of the internet at this moment in time, so that people can access and use it in the future, the same way they currently do our physical collections.

This is why (after many drawn-out discussions) we’re not following the robots.txt protocol. The robots.txt files currently block many URLs – if we honoured them, we’d only get a slice of the internet, not the whole pie we’re trying to preserve.
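For anyone unfamiliar with the protocol: a site's robots.txt file lists the paths that crawlers are asked to stay out of. The sketch below (the rules, URLs and code are made up for illustration – this is not our harvester) shows how a crawler that honours the protocol decides what to skip, and therefore what an archival crawl that honours it would miss:

```python
# Minimal illustration of the robots.txt exclusion protocol.
# The rules and URLs are invented for the example.
from urllib import robotparser

rules = """
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

for url in ("http://example.co.nz/about.html",
            "http://example.co.nz/private/diary.html"):
    allowed = parser.can_fetch("*", url)
    print(url, "->", "fetch" if allowed else "skip (blocked by robots.txt)")
```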

Second up: let us know, we’ll make it stop

We know this policy can cause problems for websites, either by overloading them with too much traffic, or by following links that cause problems.

If the crawler is causing problems for your site, please email us at web-harvest-2008@natlib.govt.nz. We’ll stop or modify the harvest asap, and then follow up with you about how we can crawl your site in a way that works better for you.

Thanks to serenecloud and James McGlinn who have both already been in touch & had the crawler fixed to their satisfaction.

We're trying to reach out on this matter, so it'd be very much appreciated if you could pass this on to anyone else you know who's concerned. We've also sent this (longer) message out on listservs.


Posted on behalf of Gordon Paynter, the technical analyst who’s leading this harvest for the National Library.

23 comments:

Mike Riversdale said...

Interesting stance.
Isn't that a little like saying, "We want to keep a record of people's physical diaries and therefore can read over people's shoulders"?

I get and support the intention.
However, breaking a fundamental rule of the WWW will probably mean you'll get blocked in more direct ways, now and in the future.

Oh, and how are you handling NZ content that is not hosted in NZ? Just curious.

Gordon Paynter said...

Hi Mike:

Yes, in many cases it is similar to collecting people's diaries (national libraries do that too). One of the interesting ethical issues that web harvesting throws up is that, legally, posting material on the web is seen as "publication", but the people doing it might think of it more as "communication" and expect some level of privacy. We will have to think about this more before we make the harvest publicly available, but for now our focus is capturing as much at-risk material as we can.

We are gathering a small sample of NZ content that is not hosted in New Zealand, but it is very difficult to detect this reliably in an automated fashion, so our selection is hand-vetted. The Czech National Library is doing some neat work on automating this process – using whois lookups and looking for Czech phone numbers, names, language and email addresses on web pages – which we might be able to exploit in the future (if we ever do another harvest).
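Purely to sketch the idea (this is not our code, and the patterns and threshold below are invented for illustration), that kind of heuristic, adapted to New Zealand, might look roughly like this:

```python
# Illustrative-only heuristics for spotting NZ content hosted outside .nz,
# along the lines of the Czech approach described above. The patterns and
# the score threshold are invented; real selection is still hand-vetted.
import re

NZ_PHONE = re.compile(r"\+64[\s-]?\d")      # +64 country code
NZ_EMAIL = re.compile(r"@[\w.-]+\.nz\b")    # .nz email address
NZ_TERMS = {"aotearoa", "wellington", "auckland", "whanau", "kiwi"}

def looks_like_nz_content(page_text: str) -> bool:
    text = page_text.lower()
    score = 0
    if NZ_PHONE.search(text):
        score += 1
    if NZ_EMAIL.search(text):
        score += 1
    if sum(term in text for term in NZ_TERMS) >= 2:
        score += 1
    return score >= 2

print(looks_like_nz_content("Ring us on +64 4 123 4567 or email info@example.org.nz"))
```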

Gordon

mike forbes said...

I'm curious as to how you are collecting these domains..

where are you getting the list from?

Courtney Johnston said...

hey Mike

Gordon's left for the day, but he'll post a reply to your question tomorrow morning. Just to clarify - the list of domains; do you mean all the ones that are being harvested, or the not-hosted-in-NZ ones?

mike forbes said...

Well, either.

I guess I'm wondering
a) how you're getting the list of names to harvest,
and
b) how you tell the difference between sites hosted in New Zealand, and those hosted overseas with .nz domain names.

Mike Riversdale said...

I think Mike F has asked the question I was trying to ask - for instance my work site (http://www.miramarmike.co.nz) is actually a Google-hosted platform in (I assume) the US. And the opposite probably applies - international content hosted in NZ - but on a smaller scale.

And, of course, NZ content that's on *any* site in the world - Flickr immediately springs to mind (discussions as well as the actual photos)

Back to "putting it onto the Web is legally publication", I'm not sure that answers the question. If the rule is do not index this (using robots.txt) then I don't see how "it's published" changes it. This isn't books, this is the web and the rule (de fact, understood, community) rule still applies.

But hey, I also understand you've a civic duty to perform and appreciate the dilemma and that you're open to talking about it.

Tim Snadden said...

You say 'let us know, we'll make it stop'. By creating a robots.txt file the site owner *has* let you know and you are ignoring their wishes. Bad decision, bad net citizen.

no sleep till midnight said...

Gordon's left for the day, but he'll post a reply to your question tomorrow morning.

maybe Gordon could get the internet at home

dave said...

I don't think trying to apply a paper model to the internet is practical.

I run several large NZ websites containing literally tens of millions of indexable pages and unknown quantities of dynamic pages (search results etc). Thousands of pages change daily, total content is several hundred gigabytes, and a lot of that is video and imagery.

Do you intend to download all of the content from all of my sites? (They are all within .co.nz.)

I see aship in the harbour said...

interesting, the non-response to issues raised here

Gordon Paynter said...

Hi Mike and Mike:

I’ve added your questions (about where the list of names comes from, and how we find websites hosted inside and outside New Zealand) to a FAQ page, which you can find here:
http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs

It’s all a bit complex, but I hope that covers it off.

Gordon

Gordon Paynter said...

Hi Dave:

I have also answered your question about very large sites directly in the FAQ. To provide a little more detail, we are trying to make the crawl broad rather than deep, so if your website is very large, then the chances are we won’t capture it all.

However, we don’t yet know how deep we can get into large websites. Consider this: we’re currently aiming to harvest 100 million URLs from just over 300,000 hosts. On the face of it that’s an average of 333 URLs per host. However, a lot of hosts will be empty, or redirects, or small. Here’s another case study: in a recent domain harvest by the BNF (the French National Library), about half the .fr domains had 10 URLs or fewer, and only about 0.04% were crawled beyond 10,000 URLs (see “Legal deposit of the French Web” at http://iwaw.net/08/). It therefore seems unlikely we will be harvesting tens of millions of URLs from your servers.
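To make the arithmetic explicit (a back-of-the-envelope sketch only; applying the BNF percentage to our own host count is purely illustrative):

```python
# Rough arithmetic for the figures above.
target_urls = 100_000_000            # URLs we aim to harvest
hosts = 300_000                      # hosts in scope (rounded)

print(round(target_urls / hosts))    # about 333 URLs per host on average

# If .nz hosts were distributed like BNF's .fr harvest, only ~0.04% of
# them would be crawled beyond 10,000 URLs:
print(round(hosts * 0.0004))         # on the order of 120 hosts
```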

Gordon

Courtney Johnston said...

Link to the Web Harvest FAQs.

Anonymous said...

2. The crawl engineers used several available services (eg the Internet Archives Wayback Index) to look up the names of hosts that are physically in New Zealand but not registered in the nz domain.

You state this, but in my mind this doesn't actually sound possible.

How can the internet archives wayback index tell you if a host is in New Zealand or not?

Anonymous said...

October 16, 2008 at 9:12 PM, Tim Snadden said...
"You say 'let us know, we'll make it stop'. By creating a robots.txt file the site owner *has* let you know and you are ignoring their wishes."

As a web publisher myself, I wholeheartedly agree with this viewpoint. The robots.txt file is there to control the release of information, to prevent direct grabbing of pages and files NOT intended for general distribution and indexing, and basically to let 'agents' know what they are allowed to grab and what they are required to leave alone.

I for one will be blocking your harvesting bot directly, even if it means turning down your 'offer' of posterity.

Also, it would be prudent to allow the 'harvestees' to request that you delete any harvested content, since it has been scraped without implicit consent.

Thank you for bringing to light your utter disregard of our wishes.

Andrew McMillan said...

If you're picking on sites physically hosted in NZ then scraping them from a source address that was also within New Zealand would make it a lot easier for some of us. Not to mention cheaper.

I know that the National Library has a Citylink connection, so I can't understand why this harvest isn't using it. Why are you exporting all of our content to San Jose, and charging us for that?

Cheers,
Andrew.

Boris said...

I guess we must remember that NLNZ has the legal right to collect your site - in full

You had a chance to have a conversation with them about that in the consultation that occurred on the National Library of New Zealand (Te Puna Mātauranga o Aotearoa) Act 2003 (see http://www.legislation.govt.nz/act/public/2003/0019/latest/whole.html?search=ts_act_National+Library+of+New+Zealand+(Te+Puna+M%C4%81tauranga+o+Aotearoa)+Act+2003#DLM191962).

I guess the conversations and submissions are buried in a vault - but they will sometime be unveiled, along with the various harvests that the NLNZ is doing.

Anonymous said...

It's very disappointing that National Library is choosing to ignore good internet citizenship in order to achieve its ends. It's disingenuous to ask "how we can make it better for you" when electronic documents are being collected with the latent threat of a $5,000 fine for any publisher that does not comply with making the documents available (see s40 of the National Library Act 2003, http://legislation.govt.nz/act/public/2003/0019/latest/DLM192266.html?search=ts_all%40act%40bill%40regulation_national+library ).

I'll co-operate because I can't afford that fine, but don't expect me to like it. I note that the Internet Archive, to their credit, honours robots.txt (http://www.archive.org/about/exclude.php).

Mike Riversdale said...

Thanks for being a part of this obviously thorny discussion and for adding to the FAQ.

When you guys and gals come to do the next crawl, maybe some prior discussion and a heads-up would alleviate some of the slightly bad feeling being generated. Working together is always better than finding out after the event.

Mike Riversdale said...

@boris - aha, there was discussion ... darn, missed it at the time - did anyone submit?

Gordon Paynter said...

Hi anonymous:

You note that "the Internet Archives Wayback Index) to look up the names of hosts that are physically in New Zealand" does not seem possible... and its not.

Something got mixed up, and this is meant to refer to the Alexa web search API. We'll update the page.

Gordon

Jason Franks said...

Sigh... If they crawl more than a couple of GB from my site, or if they take down the server, they will be receiving an invoice from me for the international bandwidth used and for loss of business. Robots.txt is there for a reason!!!

Gordon Paynter said...

Hi Andrew:

We've posted a response to a couple more frequently asked questions to the FAQ, including explaining why the harvester is off-shore and why we didn't notify webmasters in advance. I think these are both areas for improvement should we do another harvest.

Gordon