Tuesday, April 9, 2013

New homes for your LibraryTech blogging needs

As you may notice from the date stamps, there's not a lot of fresh material on this blog.

Much of what would have been posted here is now showing up on the DigitalNZ blog, or in the National Library blog's Library Tech category.

Drop us a line in the comments over there, or at @NLNZ or @DigitalNZ if there's anything you want us to cover.

Posts on this blog will be ported over to the National Library site when we've got a spare moment. Cheers, all!

Thursday, April 7, 2011

Comparing the 2008 and 2010 New Zealand Web Harvests

Brian of Auckland has asked about the New Zealand Web Harvest 2010: “How much of the data has been analysed, catalogued or made available… Any stats?”

All good questions. “I'm sure there is a lot of interest :-)” he adds.


This prompt has caused me to stop making excuses, and start analysing. This is more complicated than you might think, because there’s just so much data. Even the log files and summary reports are too large to work with easily.

Luckily, I still have the scripts I used in 2008, so the first pass is fairly easy. (These scripts don’t examine the data itself, they examine the reports generated from the harvest result by the Internet Archive.) I’ve now verified and written up this summary for 2010.

My colleague Gillian has taken this report and started doing side-by-side comparisons with the 2008 data. I’ve summarised her findings below, and here’s a more detailed breakdown (Link to follow).


The following table provides a summary of the different website harvests in 2008 and 2010.

Here's a bit more detail on the .NZ part of the harvest.

What does this tell us? The obvious thing is that the 2010 harvest ran longer and gathered more data, but that doesn’t necessarily mean the internet was any bigger by then, because we made a lot of changes as a result of feedback following the 2008 harvest and consultation prior to the 2010 harvest.

The first major change was that the 2010 harvest had much better seeds: we had access to the zone files for .nz, .com, .net and .org, and therefore believe we have much better coverage of the registered domains.
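To make "better seeds" concrete: zone files list every registered name under a top-level domain, and those names can be reduced to a seed list for the crawler. Here's a minimal sketch of that idea (my own helper, not the Library's actual tooling, and it naively assumes registered domains are the last three labels):

```python
# Sketch: extract unique registered domains from DNS zone file records
# to use as crawl seeds. Not the Library's actual tooling.
def seeds_from_zone(lines, zone="nz"):
    domains = set()
    for line in lines:
        line = line.split(";")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name = line.split()[0].rstrip(".").lower()
        if name.endswith("." + zone):
            # Naive assumption: the registered domain is the last three
            # labels (e.g. www.natlib.govt.nz -> natlib.govt.nz).
            domains.add(".".join(name.split(".")[-3:]))
    return sorted(domains)

records = [
    "www.natlib.govt.nz. 3600 IN A 192.0.2.1",
    "natlib.govt.nz. IN NS ns1.example.net.",
    "example.com. IN A 192.0.2.2",
    "; a comment line",
]
print(seeds_from_zone(records))  # ['natlib.govt.nz']
```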

The second major change was that we honoured the robots.txt protocol (except when downloading images and similar elements embedded in web pages). This means that many websites were crawled less heavily than may have been the case in 2008, when we ignored robots.txt (unless specifically requested otherwise) to get a more complete crawl.
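The robots.txt check itself is a simple per-URL test. Here's a sketch using Python's standard urllib.robotparser (the crawler name is hypothetical, and the real harvester's implementation of course differs):

```python
from urllib.robotparser import RobotFileParser

# Rules as they might appear in a site's robots.txt file.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A robots-honouring harvester asks before each fetch:
print(rp.can_fetch("ExampleHarvester", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("ExampleHarvester", "http://example.com/index.html"))           # True
```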

To summarise, we think the 2010 crawl had greater coverage than the first, but the specific websites harvested were in many cases less complete.

Some anecdotal comments

While we haven’t made a systematic study of the data, I believe the second harvest provides good coverage of the .nz domain in 2010 (whereas the 2008 harvest was patchy), and that .nz simply was significantly bigger in 2010 than in 2008 (but we’ll probably never know how much bigger, or even whether such things can be measured).

Gillian and the harvesting team currently have access to both harvests. As is always the case with web archiving, the quality of the harvested websites varies. Some are complete and can be viewed properly. Others lack content because of technical limitations of the harvester, or because the website owners have excluded the harvester with robots.txt files. In selective web harvesting these problems are often resolved by tailoring the profiles for each website, or contacting the website owner. In domain harvesting this isn’t possible, due to the sheer quantity of data and the speed of the harvest.

Anecdotally, the 2010 harvest seemed to do a much better job of avoiding spider traps, thanks to advances in harvesting practice and to the changed robots policy.


Cataloguing

There’s none. Our selective harvests are individually catalogued and available online, but as yet we have no catalogue record for the domain harvests.

Making it available

The harvests are currently only available to selected staff members in the Library. There are a lot of legal (and also technical) issues that have to be addressed before we can provide public access, and while we’ve been able to run the harvests and secure the results, we haven’t had the resources to have a serious tilt at these access challenges.

As an interim measure we’re discussing bringing the 2008 and 2010 domain harvests together into one access point, and making them available within the Library's reading rooms when the Molesworth Street building re-opens in 2012.

The next stage would be to provide public online access, and we’re every bit as excited about that prospect as the many people who email us to request it!

Thursday, February 17, 2011

Results of our twitter user survey

We've been tweeting away since way back in 2008 and in that time have sent out over 1,800 tweets about quirky items in our collections. Highlights have included lobsterotica, as well as running battles with other institutions over who has the coolest collection items.

We've also gained over 3,400 followers who hang on with bated breath for our next gem from the collections. We follow a fairly predictable pattern of around two tweets a day, and although it seems to have been working well, we wanted to find out a little more about what our followers thought, as well as a bit more about them.

How did we do this? Through twitter again, naturally. It took all of a few minutes to come up with a four-question survey, which we put up on Survey Monkey, and another few seconds to put out a tweet with a link and a quick one-liner grovelling for our hordes of dedicated followers to tell us more.

The beauty of twitter is its instantaneousness. Within minutes we had people retweeting our survey, as well as people asking if we were planning on stopping – we're not!

Overall we had a pretty good response with around 70 people coming back within a few hours – thanks everyone!

First up we asked where our followers were from. Unsurprisingly the large majority were from New Zealand with only 20% being from overseas. Of those overseas it was a split between Australia and the States with a handful from Britain.
Where are you from?
Next up we were keen to find out if people were from other libraries or cultural institutions, or just liked what we do. Again, unsurprisingly, there were a large number of followers from other galleries, libraries, archives and museums; however, over half of the respondents just like seeing the cool stuff from our collections. Cheers guys!
Are you in
We also thought it'd be cool to find out if people use other National Library services. Well over half use a combination of online and onsite services, which is great. The more interesting stat, however, was that almost 40% of people don't use any of our other services. While some would take this to be an area of concern, it actually shows how great the power of twitter can be: we're able to use twitter to give our collections far wider exposure than previously, and through this are reaching people who would never have seen our collections, or perhaps even been aware of what we do.

Do you use other National Library services?
Last up we asked if there was anything that we should be doing differently. In hindsight we could have asked this slightly better, as there were a lot of people who wanted us to keep doing what we do but also talk about other National Library things going on. The awesome news was that we seem to be doing a good job: just twice a day works really well for us and is sustainable over the long term.

We mainly tweet cool stuff from our collections twice a day. Should we
One thing that did come through strongly in the comments was people's desire to know more about National Library goings-on. We deliberately stay away from mixing this sort of news into this channel, as we'd find it would drive away everyone who just wants to see quirky items from our collections.
The good news is that we actually do have several other twitter channels that would help give a broader view of the library.

The Services to Schools team run an account @L2_S2S which is aimed at teachers and school librarians. They also have several blogs that talk about children's literature and innovation in school libraries.

The Alexander Turnbull Library, @AlexArchivists, tweet about general digital archival interests and the Aotearoa People's Network Kaharoa, @PeepsNetwork, tweet about their service and other IT related things.

Lastly, @DigitalNZ have also been on Twitter for quite some time and talk about the DigitalNZ service as well as all things digitisation.

As yet there are no plans for a newsy-type twitter account with general National Library goings-on; however, this may change as we look towards moving back into our building in 2012.
Thanks heaps to everyone who took a minute to help us out with this.

Matt O'Reilly

Friday, January 28, 2011

The Source: news about digital libraries and library innovations from around the web

Introducing The Source

Perceptions of libraries, 2010: Context and community (Note: PDF)

From the OCLC website

This new OCLC report, a sequel to the 2005 ‘Perceptions of Libraries and Information Resources’, provides updated information and new insights into information consumers and their online habits, preferences, and perceptions. Particular attention has been paid to how the current economic downturn has affected information-seeking behaviours and how those changes are reflected in the use and perception of libraries.
The report explores:
  • Technological and economic shifts since 2005
  • Lifestyle changes Americans have made during the recession, including increased use of the library and other online resources
  • How a negative change to employment status impacts use and perceptions of the library
  • Perceptions of libraries and information resources based on life stage
The report is based on U.S. data from an online survey conducted by Harris Interactive on behalf of OCLC. OCLC analysed and summarised the results in order to produce the report.

Friday, January 21, 2011

The Source: news about digital libraries and library innovations from around the web

Introducing The Source

Growing the pie: Increasing the level of cultural philanthropy in Aotearoa New Zealand (Note: PDF)

From the website of the Ministry for Culture & Heritage

For centuries, culture and private philanthropy have been inextricably linked. Early in the first century AD, the Roman poet Horace dedicated his first poem in Odes:I to his patron, Maecenas. The great painters of the European Renaissance were supported by wealthy individuals and rulers of states – both secular and religious. In pre-European Māori history, those with creative gifts were nurtured by their iwi or hapū. In modern Aotearoa New Zealand, the generosity of philanthropists over the decades has played a critical role in the growth of this nation’s cultural ecology. However, for culture to flourish truly and sustainably, it is vital that levels of private philanthropy in Aotearoa New Zealand are boosted.
Christopher Finlayson established the Cultural Philanthropy Taskforce in 2009; his brief to the Taskforce was succinct: I am keen for the Taskforce to explore whether there are new opportunities to encourage private investment in the arts in New Zealand over the next five to ten years.

Defining “Born Digital” (Note: PDF)

From the OCLC website

The purpose of this document is to define “born digital” and the various types of born-digital materials. It is intended to improve community discourse by encouraging caretakers of born-digital resources to specify what they mean when they use the term.

Digital forensics and born-digital content in cultural heritage collections (Note: PDF)

From the website of the Council on Library and Information Resources

This report introduces the field of digital forensics in the cultural heritage sector and explores some points of convergence between the interests of those charged with collecting and maintaining born-digital cultural heritage materials and those charged with collecting and maintaining legal evidence.

Turning the page: The future of eBooks (Note: PDF)

From the PricewaterhouseCoopers website

This new study examines trends and developments in the eBooks and eReaders market in the United States, United Kingdom, the Netherlands, and Germany, and discusses major challenges and key questions for the publishing industry worldwide. It also identifies market opportunities and developments for eBooks and eReaders, and makes recommendations for publishers, traditional retailers, online retailers, and intermediaries.
Given that publishers, internet bookstores, and companies that manufacture eReaders have high expectations for the digital future of the book industry, the study asks if a new generation of eReaders may, at last, achieve the long-awaited breakthrough that lures consumers away from paper and ink.

The survey of library database licensing practices

From the website of the Primary Research Group

The Primary Research Group has just released this new report. The complete report is a fee-based document but some highlights have been made available at no charge. The Complete TOC is available.
The 115-page report looks closely at how nearly 100 academic, special and public libraries in the United States, the UK, continental Europe, Canada, and Australia plan their database licensing practices. The report also covers the impact of digital repositories and open access publishing on database licensing. Data is broken out by size and type of library. Among the many issues covered:
  • database licensing volume
  • use of consortiums
  • consortium development plans
  • satisfaction levels with the coverage of podcasts, video, listservs, blogs and wikis in full text databases
  • spending levels on various types of content such as electronic journals, article databases and directories
  • perceptions of price increases for various types of subject matter
  • legal disputes between publishers and libraries
  • contract language
  • impact of mobile computing and other issues

Cloud-sourcing research collections: Managing print in the mass-digitized library environment (Note: PDF)

From the OCLC website

This report presents findings from a year-long study designed and executed by OCLC Research, the HathiTrust, New York University's Elmer Bobst Library, and the Research Collections Access & Preservation (ReCAP) consortium, with support from The Andrew W. Mellon Foundation.
The objective of the project was to examine the feasibility of outsourcing management of low-use print books held in academic libraries to shared service providers, including large-scale print and digital repositories. The study assessed the opportunity for library space saving and cost avoidance through the systematic and intentional outsourcing of local management operations for digitised books to shared service providers and progressive downsizing of local print collections in favour of negotiated access to the digitised corpus and regionally consolidated print inventory.

Thursday, December 16, 2010

Join the search terms word cloud map mashup

Do you work in a library which has an online search, OPACs with a catalogue search, or similar?

I’ve started a Google Map with links to word clouds of users’ search keywords. The map so far (http://bit.ly/dE3hrh) has just one set of search keyword clouds – it would be great to have more from around New Zealand (and beyond).

What you need:

  • Any kind of search tool or catalogue which produces a log of search keywords entered by users.

  • To be able to nominate a geographic location for the dataset.

  • Ideally – web statistics which include a list of the search keywords and (useful but not essential) their frequency. But as long as the data exists in some format (eg log files, or even just a list) it will still work.

If you’d like to contribute just email me (rebecca.cox @ Natlib) or comment here.

We collect web server log files and feed these into our web statistics software (Urchin, a version of Google Analytics which is installed and managed in-house). From there you can export data in Excel format. I’ve cleaned this up, selected 500 terms from the top and bottom of the list, and created word clouds at www.wordle.net.
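The cleanup step amounts to counting keyword frequencies and keeping the most common terms. A minimal sketch, assuming the export has been reduced to one search phrase per line (the function name is mine, not part of any of the tools mentioned):

```python
from collections import Counter

def top_terms(lines, n=500):
    """Count search phrases (case-insensitively) and return the n most common."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return counts.most_common(n)

# A tiny toy log in place of a real export.
log = ["papers past", "Papers Past", "ashburton deaths", "papers past"]
print(top_terms(log, n=2))  # [('papers past', 3), ('ashburton deaths', 1)]
```

The resulting (term, count) pairs can be pasted straight into a word-cloud tool.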

Web stats give access to a wealth of data and can help identify audiences and behaviour which are not otherwise visible.

A while back, I checked the web stats for Papers Past to see how much “brand aware” search traffic the site was getting, and discovered there’s a significant number of people who appear to be searching the site for specific content using external search engines, eg site:paperspast.natlib.govt.nz “anti-opium association” or papers past deaths ashburton 1921.

You can look deeper by segmenting web stats by a range of criteria, from the number of words visitors use in their searches, to visitor domains (eg break out all the traffic from domains ending in .ac.nz), frequency of visit, number of pages viewed per visit, and more. For more on this, see Seb Chan’s Continuous Refinement and Data Driven Dynamic Personas from Webstock this year.

Another form of web visitor stats is the heatmap, which gives a visualisation of where users are clicking on a web page (try Clickdensity or Clickheat). Here’s a heatmap showing the activity on our new homepage for the first few days after it went live.

National Library of New Zealand homepage heatmap

Tuesday, December 14, 2010

Adding Closed Captions to YouTube

We’ve recently had our first go at adding closed captions to our YouTube videos. Closed captions aid hearing-impaired users in understanding the content of our videos, and are extremely helpful for users who don’t have sound enabled on their computers. Closed captions are also required under Guideline 1.2.2 of the Web Content Accessibility Guidelines (WCAG) 2.0.
The process is actually quite straightforward and less time-consuming than I would have thought. It does help if you’re provided with a transcript of the original content though.
Closed captions are expected to describe all significant audio content, including non-speech information such as the identity of speakers and their manner of speaking, as well as music and sound effects. In this particular case it was fairly simple, as the audio was largely just a voice-over describing the content of the video.
YouTube currently supports two format options for closed captions: .SRT or .SBV.
The .SBV format is YouTube’s own and is slightly simpler than .SRT, so we have used it for this example.
An .SBV file is just a basic text file. Each caption begins with a time code in the format hours:minutes:seconds.milliseconds; the start and end times are delimited by a comma and are followed by a line break and then the text to be displayed during this time. Two line breaks indicate the end of the caption and the start of the next time code.
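To make that layout concrete, here’s a small sketch that writes cues in that shape (the function is my own, not part of any YouTube tooling, and the times are invented for illustration):

```python
def to_sbv(cues):
    """Format (start_seconds, end_seconds, text) cues as .SBV caption text."""
    def stamp(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return "%d:%02d:%06.3f" % (h, m, s)  # e.g. 0:00:02.500
    # Start and end times are comma-delimited; a blank line separates cues.
    return "\n\n".join("%s,%s\n%s" % (stamp(a), stamp(b), text)
                       for a, b, text in cues)

print(to_sbv([(0.0, 2.5, "Hi, I'm David Reeves")]))
# 0:00:00.000,0:00:02.500
# Hi, I'm David Reeves
```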
Here’s how the first twenty seconds of the closed caption file looks:
Hi, I'm David Reeves
I'm the Associate Chief Librarian at the Alexander Turnbull Library in Wellington
We've undertaken a huge project to digitise a number of our photographic collections during 2010 and 2011
while the National Library building has been undergoing some major refurbishment
we've been able to dedicate around 20 staff to this special project.
Getting the captions aligned to the right time code can be slightly tricky, and I found it was easiest to play through the video, pausing every now and then to pick the best start and finish time for each time code. It’s also important to keep line lengths reasonable, as otherwise YouTube can cut off the captioning text. I found that no more than 15 words per line worked as a rough guide. This often means that you’ll need to break up longer sentences into several shorter time codes.
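The 15-words rule of thumb is easy to apply mechanically when working from a transcript; here’s a quick sketch (my own helper, with the limit as a tunable assumption):

```python
def chunk_words(text, limit=15):
    """Split text into chunks of at most `limit` words, one per caption."""
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]

print(chunk_words("this sentence is split into chunks of three words", limit=3))
# ['this sentence is', 'split into chunks', 'of three words']
```

Each chunk then gets its own time code in the caption file.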
If you find that directly editing an .SBV text file is too much work, there are also sites out there, such as Caption Tube, which help make the captioning process easier.
Here’s how our original video looked:
And with closed captions (CC) turned on:

Have a look at the completed Pictures Online video on our YouTube channel.

Matt O'Reilly