After two weeks of harvesting we've collected about 115 million URLs. Our target is 130 million, so we should wrap things up in a few more days.
As previously noted, the harvest has run much more slowly than expected because of the measures we have taken to protect hosts that share IP addresses.
As a result of our decision to (largely) comply with robots.txt instructions, the harvest is much shallower for some websites, but much deeper for others. Our current limit is 70,000 URLs for .com and .co.nz, and 90,000 URLs on other .nz sites. (The 2008 harvest was capped at 50,000 for .govt.nz and 20,000 for others.)
We still estimate the final collection will be between 4 and 5 TB compressed.
Update: The crawl will be stopped at 7AM Saturday June 5th NZST / 12 noon Friday, June 4th PDT.
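As a rough check on the two stop times quoted above, here is a minimal Python sketch that converts the New Zealand time to US Pacific time. It assumes fixed offsets (NZST is UTC+12, PDT is UTC-7) rather than looking anything up in a time zone database, and the variable names are ours, for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets, with no daylight-saving lookup.
NZST = timezone(timedelta(hours=12), "NZST")  # New Zealand Standard Time, UTC+12
PDT = timezone(timedelta(hours=-7), "PDT")    # Pacific Daylight Time, UTC-7

# Stop time as announced in New Zealand time.
stop_nz = datetime(2010, 6, 5, 7, 0, tzinfo=NZST)

# The same instant expressed in US Pacific time.
stop_pdt = stop_nz.astimezone(PDT)

print(stop_nz.strftime("%A %d %B %Y %H:%M %Z"))
print(stop_pdt.strftime("%A %d %B %Y %H:%M %Z"))
```

The two printed lines describe the same instant, which is why the crawl stops on a Saturday in New Zealand but on a Friday in the US.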
Monday, May 31, 2010
Web Harvest 2010: Two weeks
Posted by Gordon Paynter at 11:18 AM
0 comments
Tags: Gordon Paynter, web harvest
Friday, May 28, 2010
The Source: news about digital libraries and library innovations from around the web
Introducing The Source
Archives in Web 2.0: New Opportunities
From the Ariadne website
Archives are using Web 2.0 applications in a context that allows for new types of interaction, new opportunities regarding institutional promotion, new ways of providing their services and making their heritage known to the community. Applications such as Facebook, Flickr and YouTube are already used by cultural organisations that interact in the informal context of Web 2.0. This article aims to describe how Web 2.0 can work as a virtual extension for archives and other cultural organisations, by identifying impacts and benefits resulting from the use of Web 2.0 applications together with some goals and strategies of such use.
The ABC of copyright
From the UNESCO website
This booklet is intended to provide to all who are concerned with the creation, circulation and transfer of knowledge replies to certain questions they may have on the subject of copyright. It has no other objective than to clarify a complicated subject by translating legal language into language that can be easily understood by everyone.
Cultural capital: A manifesto for the future
From the Museums, Libraries & Archives Council (MLA) website
The publication shows how investing in culture and heritage can help Britain's social and economic recovery from recession. It demonstrates with facts and figures that a fifteen-year period of investment has created a public appetite for culture that continues to grow, and that the arts, heritage, museums, libraries and archives make a strong contribution to the economic and social well-being of Britain.
Usability inspection of digital libraries
From the Ariadne website
Usability studies and digital library development are not often intertwined due to the existing cultural model in system development. Usability issues are likely to be addressed post-hoc or as a priori assumptions. Recent initiatives have advanced usability studies in terms of information environment development. However, significant work is still required to address the usability of new services arising from the trends in social networking and Web 2.0.
The JISC-funded project, Usability and Contemporary User Experience in Digital Libraries (UX2.0), contributes to this general body of work by enhancing a digital library through a development and evaluation framework centred on usability and contemporary user experience. Part of the project involves usability inspection and research on contemporary user experience techniques. This article highlights the findings of the usability inspection work recently conducted and reported by UX2.0. The report provided a general impression of digital library usability; notwithstanding, it revealed a range of issues, each of which merits a systematic and vigorous study. The discussion points outlined here provide a resource generally useful for the JISC Community and beyond.
The global information technology report 2009–2010
From the World Economic Forum website
This report measures the extent to which 133 economies from both the developed and developing worlds leverage ICT advances for increased growth and development through the methodological framework of the Networked Readiness Index. A number of essays and case studies on sustainability and best practices in networked readiness are featured, together with a comprehensive data section - including detailed profiles for each economy covered and data tables.
Posted by Maria Nagelkerke at 11:38 AM
0 comments
Tags: Archives, copyright, culture, digital libraries, heritage, information technology, TheSourceNLNZ
Thursday, May 27, 2010
2010 whole of domain harvest extended to 2 June
We've completed the first pass of the 2010 whole of domain harvest. The initial harvest collected approximately 100 million URLs, which is fewer than our target of 130 million.
We have taken measures this year to protect hosts that share IP addresses, and this has caused the harvest to proceed more slowly than in 2008. For this reason, we are extending the harvest period to 2 June 2010. Over the next week we will also be conducting a 'patch crawl' to improve the quality of the harvest. We will be ensuring we have captured the homepage/slash page of every host, and that all sites that were nominated for inclusion were well-captured.
During the period of the first harvest (12-25 May) we received a small number of notifications from website owners who had problems with the harvester's treatment of their robots.txt file. If you spot any problems, or have any questions, please let us know by completing our Feedback form.
Posted by Courtney Johnston at 4:48 PM
0 comments
Tags: Courtney Johnston, Gordon Paynter, web harvest
Friday, May 21, 2010
Web Harvest 2010: One week
How much can you download in a week?
After seven days of harvesting we've collected over 2.6TB of data from in excess of 50 million URLs. The current average crawl rate is 149 URLs per second.
We now estimate the final collection will be between 4 and 5 TB compressed (compared with about 3TB compressed in 2008).
On a technical level, everything is going well, except that a hard disk failed over the weekend and we lost a log file. No data was lost because all downloaded content is immediately backed up to a data repository. We hope to recover the log file too, when it's all over.
There's so much data coming in that it is hard to track exactly what is being harvested in real time, but here's the top ten reported media types:
- [#urls] [mime-types]
- 31,567,006 text/html
- 6,908,737 image/jpeg
- 1,642,548 image/gif
- 563,463 image/png
- 510,400 application/pdf
- 311,524 text/xml
- 247,833 text/plain
- 196,715 text/css
- 178,576 application/rss+xml
- 123,638 no-type
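To put the list above in proportion, here is a minimal Python sketch that totals the ten counts and prints each type's share of that subtotal. The counts are copied from the list; the dictionary name and the output layout are our own choices for illustration.

```python
# Counts copied from the top-ten list above (one week into the harvest).
top_ten = {
    "text/html": 31_567_006,
    "image/jpeg": 6_908_737,
    "image/gif": 1_642_548,
    "image/png": 563_463,
    "application/pdf": 510_400,
    "text/xml": 311_524,
    "text/plain": 247_833,
    "text/css": 196_715,
    "application/rss+xml": 178_576,
    "no-type": 123_638,
}

# Subtotal for the ten listed types only, not for the whole harvest.
total = sum(top_ten.values())

# Print each type with its count and its share of the subtotal.
for mime, count in sorted(top_ten.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mime:<22}{count:>12,}{count / total:>8.1%}")

print(f"{'top-ten subtotal':<22}{total:>12,}")
```

The ten types add up to roughly 42 million URLs, so the percentages are shares of that subtotal rather than of the 50 million plus URLs collected overall.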
Posted by Gordon Paynter at 9:50 AM
3 comments
Tags: Gordon Paynter, web harvest
Friday, May 14, 2010
The Source: news about digital libraries and library innovations from around the web
Introducing The Source
Content development in an indigenous digital library: A case study in community participation [page 30] (Note: PDF)
From the IFLA Publications website
This paper presents a case study in community participation in developing content for a digital library of local indigenous knowledge. Description of the programme highlights interaction between the library, the community and the technology used. Implementation challenges, results and lessons learnt are discussed and benefits to the community pointed out. In providing an online, contextually-based information service to local communities, public libraries in Africa will ensure future-oriented access to cultural heritage resources through 21st century information communication technologies (ICTs). The potential to reduce the digital divide will be enhanced and African communities will be introduced to the global information society.
The Digital Divide: Assessing organisations’ preparations for digital preservation (Note: PDF)
From the Preservation and Long-term Access through NETworked Services (Planets) website
This white paper is based on the findings of a Planets survey of two hundred organisations, mainly European archives and libraries, to investigate their digital preservation activities and needs. It summarises the survey results, discusses key digital preservation topics, and highlights the steps needed to tackle the challenges of retaining access to our digital information in the medium and long term.
National e-Strategies for development: Global status and perspectives 2010 (Note: PDF)
From the International Telecommunication Union website
This report provides a high-level update and an overview of the progress countries have made in their effort to develop national e-strategies, ICT strategies and sectoral e-strategies, analysing as well the extent to which ICT have been incorporated into poverty reduction strategies and other national development plans. In order to provide a broad analysis of ICT strategies, this report describes strategic approaches of national e-strategies and provides three examples of national ICT strategies, detailing their evolution over time.
The report identifies at least 161 economies (84 percent) that have already met the WSIS target of having a national ICT strategy in place by 2010. It also indicates areas where existing national e-strategies could be improved, such as their strategic orientation and their integration into national development plans and poverty reduction strategies. Based on the analysis of sectoral e-strategies, the report also emphasises the need for more comprehensive sectoral e-strategies that take full advantage of the potential ICT have for the economy and society. Finally, the appendix provides the reader a comprehensive list of national ICT strategies developed by ITU Member States.
The information presented in this report comes largely from the WSIS stocktaking, an extensive online research initiative conducted by the International Telecommunication Union (ITU), which brings together national ICT and sectoral e-strategies of ITU’s Member States, as well as publications by the five UN Regional Commissions.
Bridging between libraries and information and communication technologies for development [page 70] (Note: PDF)
From the IFLA Publications website
The International Federation of Library Associations and Institutions (IFLA), the Bill & Melinda Gates Foundation (Global Libraries initiative), and the Technology & Social Change Group (TASCHA), at the University of Washington Information School, believe that the library and ICTD fields are at a point in their evolutions where each may be able to provide significant value to the other. They have organised a series of ‘bridging’ convenings to bring together interested stakeholders in both fields to advance activities that will realize tangible benefits for the two communities. Libraries and ICTD share an interest in the use of technology to achieve their ultimate goals. While their contexts come from very different histories and intentions, there are many areas of commonality that are worth exploring as possible collaborative efforts.
A two-level view of the fields is proposed, starting with the overall characteristics that determine the character of each field as a necessary context for thinking about possible intersections, and ending with a proposal for exploration of potential areas for joint work at a more practical level. Possible projects in the areas of user services, training and technology are suggestions for further investigation.
Posted by Maria Nagelkerke at 11:57 AM
0 comments
Tags: digital divide, e-strategies, ICTs, indigenous knowledge, TheSourceNLNZ
Wednesday, May 12, 2010
Reminder: The 2010 Web Harvest begins tomorrow
It is less than a day until the Library undertakes its second New Zealand Web Harvest.
Last week we ran a successful test crawl, harvesting the ‘slash page’ or ‘home page’ of each host, along with the robots.txt file and selected embedded material.
In total, the slash crawl requested 1,607,547 URLs from 805,246 hosts, and downloaded 6.6 gigabytes (6,638,914,295 bytes) of data. It found 13,271 robots.txt files (that's about one robots.txt file per 60 hosts, a much lower ratio than in 2008).
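The ratios in the paragraph above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted there; the variable names are ours.

```python
# Figures quoted for the slash crawl.
urls_requested = 1_607_547
hosts = 805_246
bytes_downloaded = 6_638_914_295
robots_files = 13_271

# How many hosts there are for each robots.txt file found.
hosts_per_robots = hosts / robots_files

# Average number of URLs requested per host.
urls_per_host = urls_requested / hosts

# Download size in decimal gigabytes (10^9 bytes) and binary gibibytes (2^30 bytes).
gigabytes = bytes_downloaded / 10**9
gibibytes = bytes_downloaded / 2**30

print(f"hosts per robots.txt file: {hosts_per_robots:.1f}")
print(f"URLs requested per host:   {urls_per_host:.2f}")
print(f"downloaded: {gigabytes:.1f} GB ({gibibytes:.2f} GiB)")
```

Note that the 6.6 gigabyte figure uses decimal gigabytes; measured in binary units the same download is a little smaller.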
The harvest proper starts on 12 May, and will run to approximately 25 May. Because the harvest is conducted by the US-based Internet Archive, it will begin on 12 May US time. Our user agent string (NLNZHarvester2010) will start appearing in NZ website owners’ logs from 8AM on 13 May New Zealand time.
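For website owners wondering how robots.txt rules might apply to our user agent string, here is a minimal sketch using the robots.txt parser in Python's standard library. This is an assumption-laden illustration: the robots.txt content and the example.com URLs are hypothetical, the `allowed` helper is ours, and Python's parser is not the harvester's own implementation, so the harvester's behaviour may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; not taken from any real host.
ROBOTS_TXT = """\
User-agent: NLNZHarvester2010
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if Python's standard parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs on a reserved example domain.
checks = [
    ("NLNZHarvester2010", "http://www.example.com/index.html"),
    ("NLNZHarvester2010", "http://www.example.com/private/report.html"),
    ("OtherBot", "http://www.example.com/index.html"),
]

for agent, url in checks:
    print(f"{agent:<20}{url:<45}{allowed(agent, url)}")
```

In this hypothetical file the harvester has its own section, so only the `/private/` rule applies to it, while every other crawler falls through to the catch-all section that blocks the whole site.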
Don’t forget: if you have any questions, or want to talk to us about how your site is harvested, drop us a line using our Feedback Form.
And if you want to let us know about a site that’s not on the .nz domain but that you think should be collected, you can fill out our Nomination Form. We'll be accepting nominations for the next week or so.
Posted by Gordon Paynter at 1:06 PM
0 comments
Tags: Courtney Johnston, Gordon Paynter, web harvest
Friday, May 7, 2010
The Source: news about digital libraries and library innovations from around the web
Introducing The Source
Gutenberg 2.0
From the Harvard Magazine website
Increasingly, in the scientific disciplines, information ranging from online journals to databases must be recent to be relevant, so Harvard University’s Widener Library’s collection of books, its miles of stacks, can appear museum-like. Likewise, Google’s massive project to digitise all the books in the world will, by some accounts, cause research libraries to fade to irrelevance as mere warehouses for printed material. The skills that librarians have traditionally possessed seem devalued by the power of online search, and less sexy than a Google query launched from a mobile platform. “People want information ‘anytime, anyplace, anywhere,’” says Helen Shenton, the former head of collection care for the British Library who is now deputy director of the Harvard University Library. Users are changing - but so, too, are libraries. The future is clearly digital.
Yet if the format of the future is digital, the content remains data. And at its simplest, scholarship in any discipline is about gaining access to information and knowledge.
Publishing: The revolutionary future
From the New York Review of Books website
The transition within the book publishing industry from physical inventory stored in a warehouse and trucked to retailers, to digital files stored in cyberspace and delivered almost anywhere on earth as quickly and cheaply as e-mail, is now underway and irreversible. This historic shift will radically transform worldwide book publishing, the cultures it affects and on which it depends. Meanwhile, for quite different reasons, the genteel book business is already on edge, suffering from a gambler’s unbreakable addiction to risky, seasonal best sellers, many of which don’t recoup their costs, and the simultaneous deterioration of backlist, the vital annuity on which book publishers had in better days relied for year-to-year stability through bad times and good. The crisis of confidence reflects these intersecting shocks, an overspecialised marketplace dominated by high-risk ephemera and a technological shift orders of magnitude greater than the momentous evolution from monkish scriptoria to movable type launched in Gutenberg’s German city of Mainz six centuries ago.
Developing virtual worlds: The interplay of design, communities and rationality
From the First Monday website
This paper examines the evolution of virtual worlds from the developer’s perspective. It asks two questions: What are the motivations of developers? What are the specific challenges of the governance of user–generated content? User–created virtual worlds may be characterized according to their degree of design or emergence. On one end is the ‘designer as god’ perspective and on the other is the unforeseeable and perpetually emergent ‘user creativity.’ Utilising a theoretically derived sample of virtual worlds, we illustrate how governance is complicated by designers contending with three major issues. In general, across all three worlds, developers had to come to grips with the limits of their ability to design virtual worlds for premeditated outcomes. Secondly, communities forming within worlds, as opposed to atomised users, are central to the (creative) building, usage and governance of virtual worlds. Developers have a range of choices for how to interact with communities ranging from arm’s length monitoring to engagement. Thirdly, developers have to manage instrumentally rational aspects of their business, which can lead to tensions with the design and community goals, and, ultimately, lead to the failure of a world’s business model. A fuller accounting of governance will have to accommodate the complex interplay between purposeful design, emergent community, and the logic of the marketplace.
Engineering the web's third decade
From the Communications of the Association for Computing Machinery (ACM) website
As Web technologies move beyond two-way interactive capabilities to facilitate more dynamic and pervasive experiences, the Web is quickly advancing toward its third major upgrade.
Posted by Maria Nagelkerke at 9:52 AM
0 comments
Tags: digitisation, libraries, publishing, TheSourceNLNZ, virtual worlds, Web 3.0