After two weeks of harvesting we've collected about 115 million URLs. Our target is 130 million, so we should wrap things up in a few more days.
As previously noted, the harvest has run much more slowly than expected because of the measures we have taken to protect hosts that share IP addresses.
As a result of our decision to (largely) comply with robots.txt instructions, the harvest is much shallower for some websites, but much deeper for others. Our current limit is 70,000 URLs for .com and .co.nz sites, and 90,000 URLs for other .nz sites. (The 2008 harvest was capped at 50,000 for .govt.nz and 20,000 for others.)
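For readers who want to see how the current limits map onto hostnames, here is a minimal sketch in Python. It is an illustration only, not the crawler's actual configuration: the function name `url_cap_for_host` and the constants are hypothetical, and returning `None` for hosts outside .com and .nz is an assumption, since the post does not say how such hosts are treated.

```python
from typing import Optional

# Limits quoted in the post for the 2010 harvest.
CAP_COM_AND_CO_NZ = 70_000   # .com and .co.nz sites
CAP_OTHER_NZ = 90_000        # all other .nz sites


def url_cap_for_host(host: str) -> Optional[int]:
    """Return the URL limit for a hostname (hypothetical helper).

    Returns None for hosts that are neither .com nor .nz; that
    behaviour is an assumption, not something stated in the post.
    """
    name = host.lower().rstrip(".")
    # Check .co.nz before the general .nz rule so it gets the lower cap.
    if name.endswith(".com") or name.endswith(".co.nz"):
        return CAP_COM_AND_CO_NZ
    if name.endswith(".nz"):
        return CAP_OTHER_NZ
    return None


if __name__ == "__main__":
    for example in ("example.co.nz", "www.example.govt.nz", "example.com"):
        print(example, url_cap_for_host(example))
```

The order of the checks matters: .co.nz hosts also end in .nz, so the more specific suffix has to be tested first.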
We still estimate the final collection will be between 4 and 5 TB compressed.
Update: The crawl will be stopped at 7 AM Saturday, June 5 NZST (12 noon Friday, June 4 PDT).
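The two stop times above refer to the same instant. A quick way to check the conversion is sketched below, assuming fixed offsets of UTC+12 for NZST and UTC-7 for PDT; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumed): NZST is UTC+12, PDT is UTC-7.
NZST = timezone(timedelta(hours=12), "NZST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Announced stop time: 7 AM Saturday, June 5, 2010, NZST.
stop_nzst = datetime(2010, 6, 5, 7, 0, tzinfo=NZST)

# The same instant expressed in PDT.
stop_pdt = stop_nzst.astimezone(PDT)

if __name__ == "__main__":
    print(stop_nzst.isoformat())
    print(stop_pdt.isoformat())
```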
Monday, May 31, 2010
Web Harvest 2010: Two weeks
Posted by Gordon Paynter at 11:18 AM
Tags: Gordon Paynter, web harvest