<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'><id>tag:blogger.com,1999:blog-7346520062335584992.post6536862248220855389..comments</id><updated>2008-10-21T14:15:09.380+13:00</updated><title type='text'>Comments on LibraryTechNZ: 2008 Web Harvest - Let us know how we can make it ...</title><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://librarytechnz.natlib.govt.nz/feeds/6536862248220855389/comments/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html'/><author><name>National Library of New Zealand</name><uri>http://www.blogger.com/profile/05067703181520460430</uri><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>23</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-7141452066490174544</id><published>2008-10-21T14:15:00.000+13:00</published><updated>2008-10-21T14:15:00.000+13:00</updated><title type='text'>Hi Andrew:We've posted a response to a couple more...</title><content type='html'>Hi Andrew:&lt;BR/&gt;&lt;BR/&gt;We've posted a response to a couple more frequently asked questions to the &lt;A HREF="http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs" REL="nofollow"&gt;FAQ&lt;/A&gt;, including explaining why the harvester is off-shore and why we didn't notify webmasters in advance. 
I think these are both areas for improvement should we do another harvest.&lt;BR/&gt;&lt;BR/&gt;Gordon</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7141452066490174544'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7141452066490174544'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224551700000#c7141452066490174544' title=''/><author><name>Gordon Paynter</name><uri>http://www.blogger.com/profile/13375515204887559709</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='01047955513003965780'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-2954756489731746980</id><published>2008-10-21T14:12:00.000+13:00</published><updated>2008-10-21T14:12:00.000+13:00</updated><title type='text'>Sigh... If they crawl more than a couple of GB fro...</title><content type='html'>Sigh... If they crawl more than a couple of GB from my site, or if they take down the server, they will be receiving an invoice from me for the international bandwidth used and for loss of business. 
Robots.txt is there for a reason!!!</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/2954756489731746980'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/2954756489731746980'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224551520000#c2954756489731746980' title=''/><author><name>Jason Franks</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-3898145631028847859</id><published>2008-10-21T14:09:00.000+13:00</published><updated>2008-10-21T14:09:00.000+13:00</updated><title type='text'>Hi anonymous:You note that "the Internet Archives ...</title><content type='html'>Hi anonymous:&lt;BR/&gt;&lt;BR/&gt;You note that &lt;I&gt;"the Internet Archives Wayback Index) to look up the names of hosts that are physically in New Zealand"&lt;/I&gt; does not seem possible... and it's not.&lt;BR/&gt;&lt;BR/&gt;Something got mixed up, and this is meant to refer to &lt;I&gt;the Alexa web search API&lt;/I&gt;. 
We'll update the page.&lt;BR/&gt;&lt;BR/&gt;Gordon</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/3898145631028847859'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/3898145631028847859'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224551340000#c3898145631028847859' title=''/><author><name>Gordon Paynter</name><uri>http://www.blogger.com/profile/13375515204887559709</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='01047955513003965780'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-6393842272746779840</id><published>2008-10-21T10:47:00.000+13:00</published><updated>2008-10-21T10:47:00.000+13:00</updated><title type='text'>@boris - aha, there was discussion ... darn, misse...</title><content type='html'>@boris - aha, there was discussion ... 
darn, missed it at the time - did anyone submit?</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6393842272746779840'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6393842272746779840'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224539220000#c6393842272746779840' title=''/><author><name>Mike Riversdale</name><uri>http://www.blogger.com/profile/00112999693425305730</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-7226368280840527728</id><published>2008-10-21T10:45:00.000+13:00</published><updated>2008-10-21T10:45:00.000+13:00</updated><title type='text'>Thanks for being a part of this obviously thorny d...</title><content type='html'>Thanks for being a part of this obviously thorny discussion and for adding to the FAQ.&lt;BR/&gt;&lt;BR/&gt;When you guys and gals come to do the next crawl maybe some prior discussion and heads-up would alleviate some of the slightly bad feeling being generated. 
Working together is always better than finding out after the event.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7226368280840527728'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7226368280840527728'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224539100000#c7226368280840527728' title=''/><author><name>Mike Riversdale</name><uri>http://www.blogger.com/profile/00112999693425305730</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-6069622475521642848</id><published>2008-10-20T21:41:00.000+13:00</published><updated>2008-10-20T21:41:00.000+13:00</updated><title type='text'>It's very disappointing that National Library is c...</title><content type='html'>It's very disappointing that National Library is choosing to ignore good internet citizenship in order to achieve its ends. It's disingenuous to ask "how we can make it better for you" when electronic documents are being collected with the latent threat of a $5,000 fine for any publisher that does not comply with making the documents available (see s40 of the National Library Act 2003, http://legislation.govt.nz/act/public/2003/0019/latest/DLM192266.html?search=ts_all%40act%40bill%40regulation_national+library ).&lt;BR/&gt;&lt;BR/&gt;I'll co-operate because I can't afford that fine, but don't expect me to like it. 
I note that the Internet Archive, to their credit, honours robots.txt (http://www.archive.org/about/exclude.php )</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6069622475521642848'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6069622475521642848'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224492060000#c6069622475521642848' title=''/><author><name>Anonymous</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-6563788126711203152</id><published>2008-10-20T21:14:00.000+13:00</published><updated>2008-10-20T21:14:00.000+13:00</updated><title type='text'>I guess we must remember that NLNZ has the legal r...</title><content type='html'>I guess we must remember that NLNZ has the legal right to collect your site - in full&lt;BR/&gt;&lt;BR/&gt;You had a chance to have a conversation with them about that in the consultation that occurred about &lt;B&gt;National Library of New Zealand (Te Puna Mātauranga o Aotearoa) Act 2003&lt;/B&gt;  (see &lt;A HREF="http://www.legislation.govt.nz/act/public/2003/0019/latest/whole.html?search=ts_act_National+Library+of+New+Zealand+(Te+Puna+M%C4%81tauranga+o+Aotearoa)+Act+2003#DLM191962" 
REL="nofollow"&gt;http://www.legislation.govt.nz/act/public/2003/0019/latest/whole.html?search=ts_act_National+Library+of+New+Zealand+(Te+Puna+M%C4%81tauranga+o+Aotearoa)+Act+2003#DLM191962&lt;/A&gt;&lt;BR/&gt;&lt;BR/&gt;I guess the conversations and submissions are buried in a vault - but sometime will be unveiled along with the various harvests that the nlnz is doing</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6563788126711203152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6563788126711203152'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224490440000#c6563788126711203152' title=''/><author><name>Boris</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-190920027560443834</id><published>2008-10-20T18:59:00.000+13:00</published><updated>2008-10-20T18:59:00.000+13:00</updated><title type='text'>If you're picking on sites physically hosted in NZ...</title><content type='html'>If you're picking on sites physically hosted in NZ then scraping them from a source address that was also within New Zealand would make it a lot easier for some of us.  Not to mention cheaper.&lt;BR/&gt;&lt;BR/&gt;I know that National Library has a Citylink connection.  I can't understand why this is not using it?  
Why are you exporting all of our content to San Jose, and charging us for that?&lt;BR/&gt;&lt;BR/&gt;Cheers,&lt;BR/&gt;Andrew.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/190920027560443834'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/190920027560443834'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224482340000#c190920027560443834' title=''/><author><name>Andrew McMillan</name><uri>http://andrew.mcmillan.net.nz/</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-4509320290200439658</id><published>2008-10-20T17:44:00.000+13:00</published><updated>2008-10-20T17:44:00.000+13:00</updated><title type='text'>October 16, 2008 at 9:12 PM, Tim Snadden said... "...</title><content type='html'>October 16, 2008 at 9:12 PM, Tim Snadden said... &lt;BR/&gt;"&lt;I&gt;You say 'let us know, we'll make it stop'. By creating a robots.txt file the site owner *has* let you know and you are ignoring their wishes.&lt;/I&gt;"&lt;BR/&gt;&lt;BR/&gt;As a web publisher myself, I whole-heartedly agree with this viewpoint. 
The "Robots.txt" file is there to control release of information, prevent direct grabbing of pages and files NOT for general distribution and indexing, and basically to let 'agents' know what they are allowed to grab and what they are required to leave alone.&lt;BR/&gt;&lt;BR/&gt;I for one will be blocking your harvesting bot directly, even if it means turning down your 'offer' of posterity.&lt;BR/&gt;&lt;BR/&gt;Also, it would be prudent to allow the 'harvestees' to request that you delete any harvested content, since it has been scraped without implicit consent.&lt;BR/&gt;&lt;BR/&gt;Thank you for bringing to light your utter disregard of our wishes.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/4509320290200439658'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/4509320290200439658'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224477840000#c4509320290200439658' title=''/><author><name>Anonymous</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-5258528244175839189</id><published>2008-10-20T17:05:00.000+13:00</published><updated>2008-10-20T17:05:00.000+13:00</updated><title type='text'>2. The crawl engineers used several available serv...</title><content type='html'>2. 
The crawl engineers used several available services (eg the Internet Archives Wayback Index) to look up the names of hosts that are physically in New Zealand but not registered in the nz domain.&lt;BR/&gt;&lt;BR/&gt;You state this, but in my mind this doesn't actually sound possible.&lt;BR/&gt;&lt;BR/&gt;How can the internet archives wayback index tell you if a host is in New Zealand or not?</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5258528244175839189'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5258528244175839189'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224475500000#c5258528244175839189' title=''/><author><name>Anonymous</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-7166103803972226084</id><published>2008-10-20T16:55:00.000+13:00</published><updated>2008-10-20T16:55:00.000+13:00</updated><title type='text'>Link to the Web Harvest FAQs.</title><content type='html'>Link to the &lt;A HREF="http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs" REL="nofollow"&gt;Web Harvest FAQs&lt;/A&gt;.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7166103803972226084'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7166103803972226084'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224474900000#c7166103803972226084' title=''/><author><name>Courtney Johnston</name><uri>http://www.blogger.com/profile/13465703476413455843</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='02720902840122581826'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-8709292228243134415</id><published>2008-10-20T16:32:00.001+13:00</published><updated>2008-10-20T16:32:00.001+13:00</updated><title type='text'>Hi Dave: I have also answered your question about ...</title><content type='html'>Hi Dave: &lt;BR/&gt;&lt;BR/&gt;I have also answered your question about very large sites directly in the FAQ. To provide a little more detail, we are trying to make the crawl broad rather than deep, so if your website is very large, then the chances are we won’t capture it all. &lt;BR/&gt;&lt;BR/&gt;However, we don’t yet know how deep we can get into large websites.  Consider this: we’re currently aiming to harvest 100 million URLs from just over 300,000 hosts. On the face of it that’s an average of 333 URLs per host. However, a lot of hosts will be empty, or redirects, or small. 
Here’s another case study though: in a recent domain harvest by the BNF (French National Library), about half the .fr domains had 10 URLs or fewer, and only about 0.04% were crawled beyond 10,000 URLs (see “Legal deposit of the French Web” on http://iwaw.net/08/). It therefore seems unlikely we will be harvesting tens of millions of URLs from your servers.&lt;BR/&gt;&lt;BR/&gt;Gordon</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8709292228243134415'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8709292228243134415'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224473520001#c8709292228243134415' title=''/><author><name>Gordon Paynter</name><uri>http://www.blogger.com/profile/13375515204887559709</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='01047955513003965780'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-5842279928343881056</id><published>2008-10-20T16:32:00.000+13:00</published><updated>2008-10-20T16:32:00.000+13:00</updated><title type='text'>Hi Mike and Mike:I’ve added your questions (about ...</title><content type='html'>Hi Mike and Mike:&lt;BR/&gt;&lt;BR/&gt;I’ve added your questions (about where the list of names comes from, and how we find websites hosted inside and outside New Zealand) to a FAQ page, which 
you can find here:&lt;BR/&gt;http://www.natlib.govt.nz/about-us/news/20-october-2008-web-harvest-faqs&lt;BR/&gt;&lt;BR/&gt;It’s all a bit complex, but I hope that covers it off.&lt;BR/&gt;&lt;BR/&gt;Gordon</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5842279928343881056'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5842279928343881056'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224473520000#c5842279928343881056' title=''/><author><name>Gordon Paynter</name><uri>http://www.blogger.com/profile/13375515204887559709</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='01047955513003965780'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-7736817349249544109</id><published>2008-10-18T03:11:00.000+13:00</published><updated>2008-10-18T03:11:00.000+13:00</updated><title type='text'>interesting the non response to issues raised here...</title><content type='html'>interesting the non response to issues raised here</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7736817349249544109'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/7736817349249544109'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224252660000#c7736817349249544109' title=''/><author><name>I see aship in the harbour</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-8230174373327032227</id><published>2008-10-17T09:04:00.000+13:00</published><updated>2008-10-17T09:04:00.000+13:00</updated><title type='text'>I don't think trying to apply a paper model to the...</title><content type='html'>I don't think trying to apply a paper model to the internet is practical.&lt;BR/&gt;&lt;BR/&gt;I run several large NZ websites containing literally tens of millions of indexable pages and unknown quantities of dynamic pages (search results etc).  Thousands of pages change daily and total content is several hundred gigabytes, and a lot of that is video and imagery.&lt;BR/&gt;&lt;BR/&gt;Do you intend to download all of the content from all of my sites?  
(they are all within the .co.nz)</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8230174373327032227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8230174373327032227'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224187440000#c8230174373327032227' title=''/><author><name>dave</name><uri>http://www.blogger.com/profile/05654861944179220597</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-1097387439048779328</id><published>2008-10-16T22:06:00.000+13:00</published><updated>2008-10-16T22:06:00.000+13:00</updated><title type='text'>Gordon's left for the day, but he'll post a reply ...</title><content type='html'>Gordon's left for the day, but he'll post a reply to your question tomorrow morning.&lt;BR/&gt;&lt;BR/&gt;maybe Gordon could get the internet at home</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1097387439048779328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1097387439048779328'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224147960000#c1097387439048779328' 
title=''/><author><name>no sleep till midnight</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-8511791236420625012</id><published>2008-10-16T21:12:00.000+13:00</published><updated>2008-10-16T21:12:00.000+13:00</updated><title type='text'>You say 'let us know, we'll make it stop'. By crea...</title><content type='html'>You say 'let us know, we'll make it stop'. By creating a robots.txt file the site owner *has* let you know and you are ignoring their wishes. Bad decision, bad net citizen.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8511791236420625012'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8511791236420625012'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224144720000#c8511791236420625012' title=''/><author><name>Tim Snadden</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' 
type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-917021729962528287</id><published>2008-10-16T20:01:00.000+13:00</published><updated>2008-10-16T20:01:00.000+13:00</updated><title type='text'>I think Mike F has asked the question I was trying...</title><content type='html'>I think Mike F has asked the question I was trying to ask - for instance my work site (http://www.miramarmike.co.nz) is actually a Google hosted platform in (I assume) the US. And the opposite probably applies - international content hosted in NZ - but on a smaller scale.&lt;BR/&gt;&lt;BR/&gt;And, of course, NZ content that's on *any* site in the world - Flickr immediately springs to mind (discussions as well as the actual photos)&lt;BR/&gt;&lt;BR/&gt;Back to "putting it onto the Web is legally publication", I'm not sure that answers the question. If the rule is do not index this (using robots.txt) then I don't see how "it's published" changes it. This isn't books, this is the web and the (de facto, understood, community) rule still applies.&lt;BR/&gt;&lt;BR/&gt;But hey, I also understand you've a civic duty to perform and appreciate the dilemma and that you're open to talking about it.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/917021729962528287'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/917021729962528287'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224140460000#c917021729962528287' title=''/><author><name>Mike Riversdale</name><uri>http://www.blogger.com/profile/00112999693425305730</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' 
href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-1690754295325473986</id><published>2008-10-16T17:26:00.000+13:00</published><updated>2008-10-16T17:26:00.000+13:00</updated><title type='text'>Well, either.I guess i'm wondering a) how you're g...</title><content type='html'>Well, either.&lt;BR/&gt;&lt;BR/&gt;I guess i'm wondering &lt;BR/&gt;a) how you're getting the list of names to harvest, &lt;BR/&gt;and &lt;BR/&gt;b) how you tell the difference between sites hosted in new zealand, and those hosted overseas with .nz domain names.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1690754295325473986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1690754295325473986'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224131160000#c1690754295325473986' title=''/><author><name>mike forbes</name><uri>http://www.blogger.com/profile/10429321058652694673</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' 
type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-1334900880720889747</id><published>2008-10-16T17:22:00.000+13:00</published><updated>2008-10-16T17:22:00.000+13:00</updated><title type='text'>hey MikeGordon's left for the day, but he'll post ...</title><content type='html'>hey Mike&lt;BR/&gt;&lt;BR/&gt;Gordon's left for the day, but he'll post a reply to your question tomorrow morning. Just to clarify - the list of domains; do you mean all the ones that are being harvested, or the not-hosted-in-NZ ones?</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1334900880720889747'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/1334900880720889747'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224130920000#c1334900880720889747' title=''/><author><name>Courtney Johnston</name><uri>http://www.blogger.com/profile/13465703476413455843</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='02720902840122581826'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-8382126856314521939</id><published>2008-10-16T17:14:00.000+13:00</published><updated>2008-10-16T17:14:00.000+13:00</updated><title type='text'>I'm curious as to how you are collecting these dom...</title><content type='html'>I'm curious as to how 
you are collecting these domains..&lt;BR/&gt;&lt;BR/&gt;where are you getting the list from?</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8382126856314521939'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/8382126856314521939'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224130440000#c8382126856314521939' title=''/><author><name>mike forbes</name><uri>http://www.blogger.com/profile/10429321058652694673</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-6094978572054463375</id><published>2008-10-16T17:03:00.000+13:00</published><updated>2008-10-16T17:03:00.000+13:00</updated><title type='text'>Hi Mike:Yes, in many cases it is similar to collec...</title><content type='html'>Hi Mike:&lt;BR/&gt;&lt;BR/&gt;Yes, in many cases it is similar to collecting people's diaries (national libraries do that too). One of the interesting ethical issues that web harvesting throws up is that, legally, posting stuff on the web is seen as "publication", while the people doing it might think of it more as "communication" and expect some level of privacy. 
We will have to think about this more before we make the harvest publicly available, but for now our focus is capturing as much at-risk material as we can.&lt;BR/&gt;&lt;BR/&gt;We are gathering a small sample of NZ content that is not hosted in New Zealand, but it is very difficult to detect this reliably in an automated fashion. Our selection is therefore hand-vetted. The Czech national library is doing some neat work on automating this process using whois lookups, and looking for Czech phone numbers, names, language and email addresses on web pages that we might be able to exploit in the future (if we ever do another harvest).&lt;BR/&gt;&lt;BR/&gt;Gordon</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6094978572054463375'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/6094978572054463375'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224129780000#c6094978572054463375' title=''/><author><name>Gordon Paynter</name><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry><entry><id>tag:blogger.com,1999:blog-7346520062335584992.post-5435754642515524362</id><published>2008-10-16T16:36:00.000+13:00</published><updated>2008-10-16T16:36:00.000+13:00</updated><title type='text'>Interesting stance.Isn't that a little like saying...</title><content type='html'>Interesting stance.&lt;BR/&gt;Isn't that a little like saying, "We want to keep a record of 
people's physical diaries and therefore can read over people's shoulders"?&lt;BR/&gt;&lt;BR/&gt;I get and support the intention.&lt;BR/&gt;However, breaking a fundamental rule of the WWW will probably mean you'll get blocked in more direct ways, now and in the future.&lt;BR/&gt;&lt;BR/&gt;Oh, and how are you handling NZ content that is not hosted in NZ? Just curious.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5435754642515524362'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7346520062335584992/6536862248220855389/comments/default/5435754642515524362'/><link rel='alternate' type='text/html' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html?showComment=1224128160000#c5435754642515524362' title=''/><author><name>Mike Riversdale</name><uri>http://www.blogger.com/profile/00112999693425305730</uri><email>noreply@blogger.com</email></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://librarytechnz.natlib.govt.nz/2008/10/2008-web-harvest-let-us-know-how-we-can.html' ref='tag:blogger.com,1999:blog-7346520062335584992.post-6536862248220855389' source='http://www.blogger.com/feeds/7346520062335584992/posts/default/6536862248220855389' type='text/html'/></entry></feed>