Thursday, May 27, 2010

2010 whole of domain harvest extended to 2 June

We've completed the first pass of the 2010 whole of domain harvest. The initial harvest collected approximately 100 million URLs, which is fewer than our target of 130 million.

We have taken measures this year to protect hosts that share IP addresses, and this has caused the harvest to proceed more slowly than in 2008. For this reason, we are extending the harvest period to 2 June 2010. Over the next week we will also be conducting a 'patch crawl' to improve the quality of the harvest. We will make sure that we have captured the homepage (slash page) of every host, and that all sites nominated for inclusion are well captured.

During the period of the first harvest (12-25 May) we received a small number of notifications from website owners who had problems with the harvester's treatment of their robots.txt file. If you spot any problems, or have any questions, please let us know by completing our Feedback form.
