“The ‘Magimix’ cake cookbook was fantastic, the best thing Mary Berry ever wrote. My mum relied on it but now you can't find it anywhere.”

Not the usual reference query I might handle in a Cambridge library, but instead something my wife mentioned during the latest baking programme on BBC One.

As a loyal systems librarian, I swiftly grabbed my laptop and turned to my institution's discovery platform with the sole intent of proving her wrong. After all, it describes, amongst other things, the collections of a national legal deposit library.

I was thwarted, however. No mention there of said cookbook. The latest webscale offering we are trialling? Again, nothing. So I tried COPAC, the UK Research Libraries union catalogue. Nothing there. Then the Library of Congress. No results. AbeBooks? Nothing.

Dejected, I turned to Google. A few mentions on forums (mostly avid bakers concurring with my wife) but the fourth or fifth result down was a reference to Facebook1. It turned out that the book had its own page on the popular social network (complete with two ‘likes’) but no entry in any major UK catalogue I checked.

What is surprising here is not that Google and Facebook between them found me the item I wanted, but that these were not the first places I went to. As a librarian, I'm embedded in a world of academic search services and have a personal knowledge of collections described in union catalogues. I will instinctively turn to them for bibliographic queries, but these days I suspect I am in a minority, even in academic circles.

Lorcan Dempsey has succinctly described the reasons for this behaviour pattern2. With academic users living in an information-rich environment but often in time-poor situations, it is only natural that they gravitate to the highest possible layer of aggregation, the global search engine. In this model, local discovery services and national-level aggregations tend to get skipped over, despite their rich and valuable content.

What is possibly even more surprising is how Facebook came to know about the ‘Magimix cookbook’3 in the first place. Bibliographic data used to generate the page actually originated from the Harvard University library catalogue, which was last year published openly for anyone to use and reuse as they see fit4. In this case, it has been embedded into Facebook's graph search5, a complex semantic structure showing the relationships between people, objects and events in the social network. There, my wife's approval of this work is now visible to all of her friends.

The value of opening data to the web

It was thanks to Harvard publishing their bibliographic data under a permissive licence that I was able to confirm the existence of this book. But Hollis, the Harvard Library catalogue, although a splendid resource, was not on my list of places to try, mainly because I live and work in the other Cambridge.

Nonetheless, the whole episode neatly demonstrates the value of libraries opening up the data describing their collections to the global network of the internet, as indexed by search engines. Whilst search engines lack the fidelity and functionality of academic-centric search offerings, getting the right data into them gives users at least a fighting chance of finding the right material without having to a) know about the catalogues in the first place, and b) trawl through several of them as I had done.

It is hard to argue with the idea that discovery of academic material is increasingly happening outside of the library search domain. Libraries need to acknowledge this trend and work with it. Getting our collection data out there under a liberal licence is one way to help make this happen, either via dumps of data or by allowing web crawlers to access online catalogue pages. The model has worked wonders for e-commerce, which sees search engine optimization as a key means to drive business growth.

The Magimix cookbook example worked well in search engine results as it is a niche case. Library data describing more popular material would have to sit alongside results from Amazon and other popular sites.

Exposing such niche cases offers incredible opportunities for libraries. Replace obscure 1970s cookbooks with rare or early printed books, scientific data sets, rare material or unique collections of manuscripts and you move into the world of genuine academic use cases. Libraries could finally fulfil the promise of becoming ‘a very long tail of scholarly and cultural materials’6, putting the pieces in place to link up the right researchers with the right material in emerging global data structures.

Problems with institutional-level publishing

Since Cambridge University Library opened up its data in 2011 through the Open Bibliography7 and COMET8 projects, the UK higher education (HE) and cultural heritage sectors have taken great strides in publishing large amounts of open bibliographic data. The web pages for the Jisc-funded Discovery programme offer an insight into the advantages of creating an ecosystem of open data. Ben Showers' article in the July 2012 issue of this publication provides a great summary of this work to date9.

It is also encouraging to see such data being reused and repurposed to new ends. The British Library's Open British National Bibliography data set has reportedly seen over two million requests for data. Clearly, we are only beginning to see the advantages this will bring.

However, there are some caveats. Very few institutions currently publishing data do so routinely on their own. Cambridge and others have mostly relied on external funding to do so. As a single institution, publishing data in house looks like a potentially complex activity that may be seen to fail if no one immediately makes use of it. Standards and technologies are changing, and the intriguing potential of unknown medium- to long-term benefits10 does not necessarily fit in alongside more obvious institutional demands, such as enhancing the student experience.

To a budget-conscious library director, it could be seen as an expensive luxury or technical distraction. This is not necessarily the case. There are lightweight approaches to getting data on your collections harvested and published on the web.

Most promising of these is the schema.org initiative11, which allows structured descriptive information to be inserted into web pages in a way that can be intelligently crawled by search engines. Doing so allows them to understand the author and publisher of a work as separate data elements, not just as text on a web page.
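To make the idea of separate data elements concrete, here is a minimal sketch in Python that builds a schema.org description of a book and wraps it for embedding in a catalogue page. It assumes JSON-LD, one of the syntaxes in which schema.org data can be expressed. The function names and the title, author and publisher values are hypothetical placeholders, not taken from any real catalogue record.

```python
import json


def book_jsonld(title, author, publisher):
    """Build a schema.org Book description as a JSON-LD dictionary.

    The author and publisher are separate typed elements, so a crawler
    can tell them apart rather than reading them as undifferentiated text.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author},
        "publisher": {"@type": "Organization", "name": publisher},
    }


def as_script_tag(record):
    """Wrap a JSON-LD record in a script element for a web page."""
    payload = json.dumps(record, indent=2)
    # Escape "</" so a value containing "</script>" cannot end the element early.
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Placeholder values for illustration only.
record = book_jsonld("Example Cake Cookbook", "Example Author", "Example Publisher")
print(as_script_tag(record))
```

A catalogue page template could include the resulting element alongside the human-readable record, leaving the visible page unchanged.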

OCLC is leading the way in exploring how libraries can use this technology, looking at getting library descriptive data and holdings information read and understood by search engines. Imagine being able to look for a book in Google, and have it automatically show you a map with the nearest library holding the work along with information on how many copies are currently checked in. Google has changed small things in my life, like finding a local cinema that is showing the film I want to see. Imagine if we could do this for our collections.

Aggregate to disseminate

Regardless of the mechanism, publishing on your own is also not that likely to get you noticed by large consumers of data such as search engines and social networks. They tend to value larger sources.

The bigger, more reputable and more linked-to you are, the more likely search engines are to notice you and promote you in their results. With this in mind, does it not make sense to look at existing aggregations as the natural platform to start publishing our data? Such aggregations arguably carry sufficient volume and authority to get noticed and harvested by big players.

Sidestep the OPAC

Another reason for using aggregators to publish data lies in the very nature of our institutional-level library systems infrastructure. In-house catalogues (OPACs) are not generally suitable for widespread crawling by search engines. A heavy level of page-crawling over an OPAC by search engines risks toppling the whole library management system that OPACs form part of. Commercial discovery platforms currently lack the flexibility and functionality to act as data publishing platforms12.

To get web pages crawled by search engines as OCLC is currently doing, we need better, more flexible infrastructure with more control, which is often found at the aggregation level.

Library service platforms – level the playing field with open aggregations

Not only is the way we expose and share data changing, but the ways in which we as librarians create data are changing as well. The new generation of the LMS, library service platforms13, promises to transform data creation with streamlined workflows built around global data sets. These products have models of aggregation based around the customer base. The more customers that use a product, the bigger the underlying shared data set can grow.

This situation has the potential to create a network effect14, whereby libraries feel drawn to use the service provider with its hands on the largest and best aggregation of data. The more libraries that join the leading supplier, the harder it becomes for anyone else to challenge them. This is bad for the marketplace and ultimately bad for libraries as customers.

In the UK, the KB+ project15 is partly addressing a parallel problem related to e-journal holdings information and licensing terms. It is aiming to create a national-level store of licence and holdings data that is owned and managed by a community. System vendors can also take data from the store and contribute data to it. This allows libraries as customers to migrate from one electronic resource management system (ERM) to another with confidence that data will be uniform in quality across the marketplace.

A national-level aggregation of bibliographic data could potentially fulfil a similar role, underpinning the new generation of library service platforms as a storehouse of well-curated bibliographic data.

Social search

Discovery is not just about knowing what material is out there, but why it is useful to you. The assessment of a work has long been a vital use case for library catalogues and the data inside them. Recommendations and other social interactions also play a pivotal role in assessment, particularly those from a peer or trusted source, but also anonymously, based on aggregated activity data. They have always taken place in an academic social context, but increasingly this discourse can and does happen online. Joining up online recommendations with descriptive information seems like a sure fit for libraries. Jisc has already funded a number of initiatives in this space16.

If libraries were to usefully enter this area en masse, then we might also consider a neutral house to store and aggregate recommendation and usage information. As with the bibliographic data it might link to, such usage information is too valuable to be left in the hands of a single discovery platform supplier. A national-level aggregation might be valuable in this respect.

Conclusion – building the business case

Aggregations of data are a wonderful resource for libraries and remain so in the age of the internet. With the above use cases, and potentially more, it is not hard to see how current aggregations could gain new leases of life as data publishing platforms. Many of the original use cases for aggregation, including acting as a collective shop front for UK research libraries, are still just as valid now. With this suggested direction of change, aggregations have the potential to reach far greater audiences by pushing data directly to the social networks and search engines that engage our users daily. Along the way, they can solve a few problems facing libraries as well.

There are challenges. Many aggregations currently depend on the sale of bibliographic data to fund their efforts, so a change in business model would be required. One option is to move to selling services based around open data, rather than simply raw data itself. Rufus Pollock, co-founder of the Open Knowledge Foundation, is particularly fond of a phrase: ‘Data is a platform not a commodity: you build on it rather than sell it. And that's why it should be open’17.

The financial and organizational barriers to actually making this change are likely to be complex and difficult to navigate. But the right combination of open data and the ability to reach new library users by getting collection data into their favourite online environments is arguably too tasty a mix to ignore.