How much can you download in a week?
After seven days of harvesting we've collected over 2.6 TB of data from in excess of 50 million URLs. The current average crawl rate is 149 URLs per second.
We now estimate the final collection will be between 4 and 5 TB compressed (compared with about 3 TB compressed in 2008).
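For a rough sense of scale, the figures above can be turned into per-URL and per-second averages. This is only a back-of-envelope sketch: it treats "over 2.6 TB" and "in excess of 50 million" as exact values, assumes decimal terabytes, and the `summarize` helper is a name made up for this illustration.

```python
# Figures quoted in the post. Both are lower bounds ("over", "in excess of"),
# so the results are approximations only.
BYTES_COLLECTED = 2.6e12       # 2.6 TB, assuming decimal terabytes
URLS_COLLECTED = 50_000_000
DAYS = 7

SECONDS = DAYS * 24 * 60 * 60  # length of the harvest so far, in seconds


def summarize(total_bytes, urls, seconds):
    """Return simple averages over the whole harvest period."""
    return {
        "urls_per_second": urls / seconds,
        "kb_per_url": total_bytes / urls / 1e3,
        "mb_per_second": total_bytes / seconds / 1e6,
    }


stats = summarize(BYTES_COLLECTED, URLS_COLLECTED, SECONDS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Averaged over the whole week this works out to roughly 83 URLs per second and about 52 KB per URL. That is lower than the current rate of 149 URLs per second, which would fit a crawl that has sped up since the first days, though the numbers alone can't confirm that.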
On a technical level, everything is going well, except that a hard disk failed over the weekend and we lost a log file. No data was lost because all downloaded content is immediately backed up to a data repository. We hope to recover the log file too, when it's all over.
There's so much data coming in that it is hard to track exactly what is being harvested in real time, but here's the top ten reported media types:
- [#urls] [mime-types]
- 31,567,006 text/html
- 6,908,737 image/jpeg
- 1,642,548 image/gif
- 563,463 image/png
- 510,400 application/pdf
- 311,524 text/xml
- 247,833 text/plain
- 196,715 text/css
- 178,576 application/rss+xml
- 123,638 no-type
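To put the list in proportion, here is a small sketch that computes each type's share. The counts are copied from the list above; the shares are relative to the top ten only, not to everything harvested, and the helper names `shares` and `group_total` are made up for this illustration.

```python
# Counts copied from the top-ten list in the post.
TOP_TEN = {
    "text/html": 31_567_006,
    "image/jpeg": 6_908_737,
    "image/gif": 1_642_548,
    "image/png": 563_463,
    "application/pdf": 510_400,
    "text/xml": 311_524,
    "text/plain": 247_833,
    "text/css": 196_715,
    "application/rss+xml": 178_576,
    "no-type": 123_638,
}


def shares(counts):
    """Return each type's fraction of the summed counts."""
    total = sum(counts.values())
    return {mime: n / total for mime, n in counts.items()}


def group_total(counts, prefix):
    """Sum the counts of every type whose name starts with the prefix."""
    return sum(n for mime, n in counts.items() if mime.startswith(prefix))


total = sum(TOP_TEN.values())
for mime, frac in shares(TOP_TEN).items():
    print(f"{TOP_TEN[mime]:>12,}  {frac:6.1%}  {mime}")
print(f"{total:>12,}  total across the top ten")
print(f"{group_total(TOP_TEN, 'image/'):>12,}  all image/* types")
```

On these numbers, HTML makes up roughly three quarters of the top-ten URLs and the three image types together a little over a fifth.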
3 comments:
Cheers Gordon!
I wonder what kind of stuff is in the (non-RSS) XML? I wonder if that includes any XHTML?
If I had to guess, I'd guess that a lot of the non-RSS xml pages are also feeds. What's the MIME type for an Atom feed?
Gordon
I think I saw that 149 URLs per second on several servers...