Establishing a shared research data service for UK universities

John Kaye; Rachel Bruce; Dom Fripp

Introduction and background

Throughout this article we use ‘research data’ as defined by the Engineering and Physical Sciences Research Council (EPSRC) research data definition: ‘Research data is defined as recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings; although the majority of such data is created in digital format, all research data is included irrespective of the format in which it is created.’

For a number of years, managing research data has been on the agenda of research funders and organizations, and Jisc has worked with universities and funders to seek to address related needs. Towards the end of the Jisc managing research data programme, the Data Pool project at the University of Southampton summarized the situation as follows:

‘… it is clear that the issues surrounding research data management are becoming more complex rather than less. We now understand much more about the range of data to be managed, its size and sophistication and the expectations of researchers to manage workflows and share data. We also know that at institutional level the requirements of government and funders are placing potentially significant financial costs on institutions which they are finding challenging to discharge in the present financial climate.’

This characterizes the environment in which the research data shared service (RDSS) emerged as a priority development for Jisc. Through the consultation process surrounding the Research at Risk co-design challenge – led by Jisc in partnership with Research Libraries UK (RLUK), the Russell Universities Group IT Directors forum (RUGIT), The Society of College, National and University Libraries (SCONUL) and the Universities and Colleges Information Systems Association (UCISA) – the following high-level goals that higher education institutions (HEIs) would need a shared service to meet were identified:

research data policy compliance, such as meeting the EPSRC mandate
increased sector efficiencies, such as shared procurement, shared use of systems, data reuse opportunities and interoperability opportunities
improved integrity of research through making data available to reproduce research
integrated end-to-end research data management (RDM) system, addressing the digital preservation gap, enhancing user experience and usability
acceleration of RDM in institutions
provision of a solution that can ingest data and text research outputs, therefore supporting institutions to meet open access/Research Excellence Framework (REF) requirements.

Initially, the need to address the inefficient approach to RDM drove the project and, through consultation with the sector, a picture emerged of a fragmented RDM landscape. This was made up of a number of existing commercial and open source services that fulfil requirements for parts of the services needed to meet universities’ goals, but there was no one system that fulfilled them all. The procurement and development of all the necessary components for a complete end-to-end RDM solution can be resource intensive for institutions to piece together. This market and the varying approaches of institutions can lead to systems with gaps in functionality or institutions with no RDM systems at all. Widespread gaps in functionality were noted in terms of interoperability between research systems and a lack of preservation actions and systems beyond backing up data. A Jisc-procured, -managed and -hosted system would relieve institutions of this procurement, management and development burden and would create financial efficiencies in the sector, as well as enabling best practice in RDM and meeting the funder mandates.

At the same time as our initial work in late 2014 and early 2015, a national policy discussion was under way about developing a Concordat on Open Research Data between funders and universities. This was discussed by university leaders via Universities UK, and a practical solution to support the aspirations of the Concordat was sought. The Jisc RDSS project was therefore seen as an important development to meet this need.

In response to the top-level priorities from institutions, Jisc defined a scope for the initial RDSS project. We would focus on a system that would allow the ingest, publication, long-term storage and preservation of finalized data objects for publication or archiving, and that would create links to existing services in the ‘data creation’ and ‘managing active data’ parts of the life cycle (see Figure 1). It would therefore not include provision for active ‘live’ data storage systems for objects that are being created and worked on by researchers within their own workflows.

Figure 1. Jisc research data management life cycle

Requirements gathering

In order to produce a system that was relevant to as much of the higher education sector as possible, a comprehensive requirements gathering process was undertaken in the second half of 2015. This process consisted of three main components:

institutional survey around research systems
desk research around existing requirements
requirements gathering workshops.

The institutional survey around research systems was designed to look at the current use of research systems and current state of RDM readiness, current and future data storage needs and what the sector wanted Jisc to provide in the RDM space. The responses showed a range of maturity, with some universities already having plans in place, but a large proportion of respondents sought Jisc action and it was clear that research data needs were set to increase fairly dramatically over the next three years.

Desk research looked at existing requirements for RDM systems that had been put together by institutions within the previous Jisc ‘managing research data’ programme of work and those requirements emerging from Jisc’s ‘research data spring’ initiative. Projects that were initially drawn upon included:

University of Leeds Research Data Management Pilot Roadmap
Manchester Metropolitan University Draft Requirements Specification
ADMIRe Project
Kaptur
SWORD
Filling the Digital Preservation Gap
A consortial approach to integrated RDMS
Data Vault
Jisc learner analytics operational requirements.

These requirements were combined with survey findings and aggregated into an overall requirements document that was consulted on and prioritized by the community offline and in consultation workshops. We did not want to create new requirements from scratch, but rather to use best practice from the UK and around the world.

The consultation began with a small workshop to test the water with conceptual ideas, to explore potential architectures (such as the one produced by the University of Sheffield) and to get expert input from institutions that were already tackling RDM systems. A larger event, involving 70 representatives from UK HEIs, was then organized and provided feedback on priorities and requirements. Participants also shared knowledge and experiences of current gaps in provision in RDM services and lessons learned from relevant service implementation at an institutional level, and discussed aspirations and concerns about shared services for RDM.

There was consensus that Jisc should:

produce a new system that can be offered as a managed service, relieving burden from institutional IT and procurement staff
collectively develop improved interoperability between the service systems and existing institutional and external systems
produce an end-to-end system and also enable ‘pick and mix’ options to meet all university needs
procure RDM services and consultancy to support pilot institutions’ RDM requirements and implementation.

A smaller workshop was also held with potential system suppliers, who provided feedback and advice on the proposals and requirements generated by the community. The end result of this stage of requirements gathering and analysis was the set of operational requirements for the RDSS, published as part of the Official Journal of the European Union (OJEU) tender for the service.

Building a team – pilot institutions

Since the steer from the sector was that any solution needed to cater for a range of institutional needs, Jisc sought a varied group of pilot HEIs to partner with in developing the RDSS. In order to create a ‘balanced portfolio’, universities that had expressed an interest in joining the pilot were evaluated against criteria taking into account the size and type of institution (e.g. research intensive, small and specialist, teaching led), availability of varying types of data, their degree of RDM readiness (e.g. from greenfield sites to more established) and current use of institutional systems. As a result of this selection process, Jisc is collaborating with the following pilot institutions to develop the service:

Cardiff University
CREST (Consortium for Research Excellence, Support and Training)
Imperial College London
Plymouth University
Middlesex University
Royal College of Music
St George’s Hospital Medical School
University of Cambridge
Lancaster University
University of Lincoln
University of St Andrews
University of Surrey
University of York.

A range of goals from the pilots has been prioritized; a selective summary includes:

easy-to-use and cost-effective archiving, ingest, preservation, repository, reporting and discovery solution that can handle sensitive data
robust data storage that has growth ability for active and archive data
standard metadata profile – international for interoperability
integration with all main Current Research Information Systems (CRIS)
meets REF and funder deposit requirements (supports deposit of REF data output types).

In-depth technical and user requirements gathering has taken place with each pilot institution and these have been aggregated and prioritized across the board to define a development path.

Building a team – suppliers

While there would be some bespoke development required and there was added value to be developed across the end-to-end system, there was a decision to build from what already existed. So Jisc used an OJEU procurement process to create a supplier framework from which to select system suppliers and developers to create the service. The framework was divided into eight lots as follows:

Lot 1 – Research Data Repository Suppliers
Lot 2 – Repository Interfaces Suppliers
Lot 3 – Research Data Exchange Interface Suppliers
Lot 4 – Research Information and Administration Systems Integrations Suppliers
Lot 5 – Research Data Preservation Platforms Suppliers
Lot 6 – Research Data Preservation Tools Development Suppliers
Lot 7 – Research Data Reporting Suppliers
Lot 8 – User Experience Enhancement Suppliers.

More information on the detail of the functions of these lots can be found in the RDSS Operational Requirements document and a list of successful suppliers on the lots can be found on the Jisc RDM blog.

Meeting researchers’ needs – Data Asset Framework

RDSS will allow researchers to deposit data for publication, discovery, safe storage, long-term archiving and preservation. However, this raises a whole range of different questions, including:

What forms of data do researchers have?
How much data are we talking about?
Where do they store their data currently?
Who else needs access to it?
How long does the data need to be kept?
What motivates researchers to share their data – or to keep it closed?

Over the course of 2016 Jisc worked with Research Consulting and the pilot institutions to try to find some answers.

To carry out this work it was decided that the Data Asset Framework (DAF) provided a useful methodology and had been tried and tested by institutions around the world. The DAF was developed in 2009 to help organizations identify, locate, describe and assess how they are managing their research data assets. It uses surveys and interviews to gather the necessary information, and we chose to follow a similar approach. Even better, six of the RDSS pilot institutions had already run a DAF survey within the last couple of years, which provided us with some ready-made data. However, when we came to analyse this data, we found that every institution had tweaked the survey questions to some extent, leaving us with a set of individually valid results that could not be meaningfully aggregated. What is more, the research data landscape has evolved rapidly since 2009. Funder policies on data sharing are much more demanding, the use of current research information systems is more widespread, and new services such as ORCID and DataCite have emerged. As a result, a lot of the information we needed to know was not covered by the original Data Asset Framework question set.

The obvious solution was to develop a new version of the survey: ‘The 2016 DAF survey’. Using the 2009 version as a starting point, Jisc staff and the RDSS pilot institutions developed a revised question set between April and June 2016. The new survey was then run by six RDSS pilots in July and August of that year. The data included 1,185 unique responses. For a summary of the headline results, see the summary slideshow.

The survey data allows us to draw some broader conclusions on the current state of RDM in the UK:

The RDSS can fill an important gap – 75% of researchers look first to their institution to preserve their data – but we know a lot of institutions cannot fully meet this need at present. This is where the RDSS can help.
Access to institutional support for RDM remains low – only 16% of respondents are currently accessing university RDM support services. This is a twofold challenge: institutions not only need to make appropriate support services available, but also make researchers aware that they exist.
We are pushing at an open door – 68% of respondents either already share data, or expect to do so in the future. Most of them do so because they believe that research is a public good which should be open to all. We just need to make data-sharing easier.
We still have a long way to go – only 40% of respondents currently have an RDM plan, and only 18% follow established metadata standards or guidelines. Delivering change will take time.

A full report on the survey results is available, as well as the supporting anonymized raw data set.

In addition to informing the requirements for the RDSS via this work, we have developed resources that could be reused. A new DAF toolkit has been published that outlines the steps involved in running a DAF survey, and makes recommendations on how to approach them.

Meeting researchers’ needs – metadata

As RDSS aims to create a multidisciplinary data service, we need to make sure that our metadata and data models and associated processes can support the use cases of a wide range of researchers from extremely diverse research domains, so we needed to test our ideas with researchers. To achieve this, Jisc worked with Clax to hold nine focus groups with pilot institutions’ researchers. A full report from the focus groups is available alongside a comprehensive set of researchers’ use cases; some of the emerging themes are laid out below:

The focus groups expressed concern about a number of areas with regard to metadata. Some can be addressed by training and support; many can be addressed by suppliers working with institutions and RDSS. A few require new technologies or culture change.
Early creation and collection of metadata was often mentioned. This can be achieved through the use of dynamic data management plans so that metadata is collected from the planning stage and updated throughout the data collection and analysis process.
Systems should preserve the form and content of the deposited data while allowing updating of the metadata to link to related data sets, subsequent publications and other materials which may have been created after the data was deposited. They should also allow updating of keywords and descriptive materials to reflect changes in the discipline. The facility to allow metadata to include links to other digital object identifiers (DOIs) and URLs – where a DOI does not exist – is essential.
It is often assumed that the collection of metadata will involve researchers in arduous and time-consuming form filling at data deposit time. This is undesirable and unlikely to produce good metadata. Instead, automating tools, collection processes and equipment, and integrating metadata collection into researchers’ workflows throughout the research, will ideally allow a push-button submission of the data to the repository with metadata already attached.
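The requirement that deposited data stay fixed while its metadata remains open to later linking can be illustrated with a minimal Python sketch. This is not the RDSS implementation: the class names, field names and the example DOI are hypothetical, and the relation label simply borrows a term from DataCite's relation vocabulary.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class DepositedData:
    """The deposited content: fixed at deposit time (frozen dataclass)."""
    filename: str
    sha256: str


@dataclass
class Deposit:
    """Hypothetical deposit: immutable data plus updatable metadata."""
    data: DepositedData
    keywords: List[str] = field(default_factory=list)
    # (relation, target) pairs; target is a DOI, or a URL where no DOI exists.
    related: List[Tuple[str, str]] = field(default_factory=list)

    def link(self, relation: str, target: str) -> None:
        """Record a link to a related object, ignoring exact duplicates."""
        pair = (relation, target)
        if pair not in self.related:
            self.related.append(pair)


def deposit(filename: str, content: bytes) -> Deposit:
    """Create a deposit whose checksum records the content as submitted."""
    digest = hashlib.sha256(content).hexdigest()
    return Deposit(DepositedData(filename, digest))


d = deposit("readings.csv", b"depth,temp\n10,12.5\n")
original_checksum = d.data.sha256

# Later, a publication appears and the discipline's vocabulary shifts:
d.link("IsSupplementTo", "https://doi.org/10.1234/example")  # placeholder DOI
d.link("IsSupplementTo", "https://doi.org/10.1234/example")  # duplicate, ignored
d.keywords.append("oceanography")
```

The metadata can grow after deposit, while any attempt to reassign `d.data.filename` or `d.data.sha256` raises an error because the data part is frozen.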

Ambitions for the service

The requirements gathering and user needs work led to the ambition for the RDSS being defined, as detailed in Figure 2. This diagram shows the core RDSS platforms of repository, preservation and reporting systems in the centre, backed up by Jisc-provided data storage. It also shows input from researchers via a web user interface and other tools to manage large and sensitive data. Interoperability is key to the service and a number of integrations have been identified, such as integrations with CRIS systems, data management planning tools and publications repositories. RDSS will not function in isolation: links will be made to the scholarly communications infrastructure through the use of permanent identifiers, such as DOIs and ORCID iDs, and aggregation services. RDSS will also utilize metrics and altmetrics infrastructure and, where possible, will communicate with funder systems.

Figure 2. RDSS conceptual diagram – see reference for larger version

RDSS will also join up with other services, policy, practice and standards that are being developed through the Jisc Research at Risk portfolio, such as being interoperable with and providing metadata to the UK Research Data Discovery Service, integrating the metrics and usage statistics work developed in IRUS Data UK and integrating ORCID into the service. The pilot’s requirements were already informed by Jisc’s funder policy guidance, and the project will look to harness the innovation from successful research data spring projects.

Technical architecture and development approach

Much of this article has been about laying the foundations for delivery of RDSS. Jisc entered technical production for RDSS alpha in November 2016, after a technical architecture approach was agreed with the pilots, suppliers and Expert Advisory Group in October 2016. Some of the key elements of the approach were:

An agreement that an event-driven service-oriented architecture (SOA 2.0) was necessary, owing to the need to scale in terms of complexity and to have a more fault-tolerant architecture. SOA 2.0 provides the highest level of decoupling between services, and engineering professionals often describe such systems with the phrase ‘dumb pipes with smart end points’. Rather than orchestrating business processes centrally, the services listen for events or notifications from other components and respond appropriately. SOA 2.0 is characterized by its distributed nature.
This architecture means that we require a suite of microservices and adaptors to transform data from external systems to our canonical data model.
The pattern for system integration is publish-subscribe.
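To make the ‘dumb pipes with smart end points’ idea concrete, the following is a minimal in-process Python sketch of publish-subscribe with an adaptor. It is an illustration under stated assumptions only: the broker class, the topic names, the adaptor and the record fields are all hypothetical, and a production service would use a real message broker rather than direct function calls.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class MessageBroker:
    """A 'dumb pipe': it only routes events to subscribers by topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # No business logic here: every subscriber simply receives the event.
        for handler in list(self._subscribers[topic]):
            handler(event)


def repository_adaptor(external_record: Dict[str, str]) -> Event:
    """Map a hypothetical external record onto a hypothetical canonical shape."""
    return {
        "title": external_record["dc:title"],
        "creator": external_record["dc:creator"],
        "identifier": external_record.get("dc:identifier", ""),
    }


broker = MessageBroker()
preserved: List[str] = []
reported: List[str] = []


def preservation_service(event: Event) -> None:
    """A 'smart end point': reacts to a deposit, then announces its own event."""
    preserved.append(event["title"])
    broker.publish("dataset.preserved", event)


def reporting_service(event: Event) -> None:
    """Another end point: it only listens for preservation events."""
    reported.append(event["title"])


broker.subscribe("dataset.deposited", preservation_service)
broker.subscribe("dataset.preserved", reporting_service)

record = {"dc:title": "Ocean temperature readings", "dc:creator": "A. Researcher"}
broker.publish("dataset.deposited", repository_adaptor(record))
```

No component calls another directly or orchestrates the sequence; the reporting step happens only because the preservation service chose to publish its own event.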

A conceptual diagram of the architecture can be viewed here.

This work also laid out our expectations of how we work with suppliers, from development through testing to rolling out a production service.

RDSS data model

For RDSS to run effectively as a whole it requires an underlying data model, not just to allow researchers to fill in forms to ingest data and provide the minimum metadata for a DataCite DOI, but also to allow systems to create and push events and messages to each other to achieve the goal of an end-to-end system. The data model also allows for the enrichment and auto-generation of metadata by allowing for links to institutional systems and external, scholarly communications systems.
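As an illustration of what ‘minimum metadata for a DataCite DOI’ might mean in practice, the Python sketch below checks a record against the properties that the DataCite Metadata Schema marks as mandatory (identifier, creator, title, publisher, publication year and resource type). The class and field names are our own assumptions for this example and are not taken from the RDSS data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical record; fields loosely follow DataCite's mandatory set."""
    identifier: str = ""
    creators: List[str] = field(default_factory=list)
    title: str = ""
    publisher: str = ""
    publication_year: int = 0
    resource_type: str = ""
    # Optional enrichment, e.g. pulled automatically from institutional systems.
    orcid_ids: List[str] = field(default_factory=list)


REQUIRED_FIELDS = [
    "identifier", "creators", "title",
    "publisher", "publication_year", "resource_type",
]


def missing_for_doi(record: DatasetRecord) -> List[str]:
    """Return the names of mandatory fields that are still empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


draft = DatasetRecord(
    title="Ocean temperature readings",
    creators=["A. Researcher"],
    publication_year=2017,
)
gaps = missing_for_doi(draft)
print(gaps)
```

A check of this kind could run each time a system enriches the record, so that a researcher is asked only for the fields that automated sources could not supply.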

A data model for RDSS alpha has been produced for consultation. The current draft has been constructed to be interoperable with as many as possible of the products and services that have been identified in the supplier lots or elsewhere. For example, the data model is aligned to the metadata requirements of scholarly communications services.

The result of this work is a dense and complex data model, involving many entities, relations and vocabularies. The role of the consultation is to fine-tune this data model from a practical perspective and ensure that it is interoperable with existing infrastructure. This means that properties can be removed if not used within the system being described. A reduced data model (and vocabularies) can then be identified as core, with additional structure added as required.
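One way to picture the reduction step is the short Python sketch below, which keeps only those properties of a full model that carry a value in at least one sample record. The property names and sample records are invented for illustration; the actual consultation is a matter of community judgement and not an automated pruning exercise.

```python
from typing import Dict, Iterable, List, Set

EMPTY_VALUES = (None, "", [], {})


def used_properties(records: Iterable[Dict[str, object]]) -> Set[str]:
    """Collect every property that holds a non-empty value in any record."""
    used: Set[str] = set()
    for record in records:
        used.update(
            name for name, value in record.items()
            if value not in EMPTY_VALUES
        )
    return used


def reduce_model(full_model: List[str],
                 records: Iterable[Dict[str, object]]) -> List[str]:
    """Keep the model's properties, in order, that are actually used."""
    used = used_properties(records)
    return [name for name in full_model if name in used]


# Hypothetical full model and sample records from connected systems.
full_model = ["title", "creator", "rights", "funder", "instrument"]
sample_records = [
    {"title": "Dataset A", "creator": "X", "funder": ""},
    {"title": "Dataset B", "creator": "Y", "rights": "CC BY"},
]
core = reduce_model(full_model, sample_records)
print(core)
```

Properties dropped from the core in this way are not lost: they can be added back as additional structure when a system that needs them is described.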

Progress and challenges

Jisc is now in alpha development and is setting up the foundations for the service. This includes putting in place the technical architecture and on-boarding all of the repository and preservation systems suppliers in a test environment. This test environment also includes widely used research systems, such as EPrints and DSpace as well as test data, so that our suppliers and developers can work together to deliver interoperability early to our pilot institutions. A User Experience Lead has been appointed to provide principles and a governance framework across the project. Discovery, requirements gathering and testing will continue with the pilot institutions, along with community input from the sector through workshops.

Other work currently under way is around cost reporting and business modelling, so that Jisc can produce a financially attractive offer to the HE sector for the production service. Alongside that, market research is also taking place to discover the needs of institutions outside the pilot group and whether a Jisc-hosted RDM system is attractive to them.

Alpha development is due to finish in summer 2017. Beta development will look at scalability of the system, more integrations and challenging issues such as managing large data sets, managing sensitive data sets and preserving a diverse set of digital objects that fall outside the remit of the traditional digital preservation community. Beta development is due to finish in April 2018; however, a business case for a production service will be presented to the Jisc board by December 2017.

Throughout the alpha and beta phases, the RDSS project has been given challenging requirements from our pilot institutions, including:

defining a ‘minimum viable product’ with a multitude of systems, priorities and expectations
fitting with existing institution and researcher workflows – for example, fitting RDSS into an institutional policy with the CRIS as the front door for researchers
making preservation work for research data, when the development of systems and tools has been led by the cultural heritage system
managing large data – data too large to be uploaded over the web (greater than 5GB) – including the challenges of big data
managing sensitive data including commercial, personally identifiable information and medical data.

Further details about the project can be found in regular blog updates and in material uploaded to the Research Data Network website.

[B1] EPSRC Scope and Benefits: https://www.epsrc.ac.uk/about/standards/researchdata/scope/ (accessed 26 January 2017).

[B2] Brown, M L and White, W (2013). A partnership approach to research data management. In: Pryor, G, Jones, S and Whyte, A eds. Delivering Research Data Management Services: Fundamentals of Good Practice. London: Facet. http://eprints.soton.ac.uk/id/eprint/356247 (accessed 26 January 2017).

[B3] Jisc research data shared service: https://www.jisc.ac.uk/rd/projects/research-data-shared-service (accessed 26 January 2017).

[B4] Jisc Research at Risk: https://www.jisc.ac.uk/rd/projects/research-at-risk (accessed 22 February 2017).

[B5] EPSRC policy framework on research data: https://www.epsrc.ac.uk/about/standards/researchdata/ (accessed 26 January 2017).

[B6] Research Excellence Framework 2014: http://www.ref.ac.uk/ (accessed 26 January 2017).

[B7] Research Councils UK. Concordat on Open Research Data launched: http://www.rcuk.ac.uk/media/news/160728/ (accessed 26 January 2017).

[B8] Universities UK: http://www.universitiesuk.ac.uk/ (accessed 26 January 2017).

[B9] Jisc Managing Research Data programme: https://www.jisc.ac.uk/rd/projects/managing-research-data (accessed 26 January 2017).

[B10] Jisc research data spring: https://www.jisc.ac.uk/rd/projects/research-data-spring (accessed 26 January 2017).

[B11] University of Leeds Research Data Management Pilot Roadmap. Project Outputs: https://library.leeds.ac.uk/roadmap-project-outputs (accessed 26 January 2017).

[B12] Parsons, T and Berry, M (2012). Research Data Management Technical Requirements: A report to ADMIRe and IS stakeholders, Nottingham, The University of Nottingham: https://admire.jiscinvolve.org/wp/files/2013/05/ADMIRe-RDM-Technical-Requirements-Report.pdf (accessed 26 January 2017).

[B13] Garret, L, Silva, C and Gramstadt, M-T (2011). Kaptur technical analysis report, University for the Creative Arts: https://vads.ac.uk/kaptur/outputs/Kaptur_technical_analysis.pdf (accessed 26 January 2017).

[B14] Jones, R (2012). Sword Data Deposit Scenarios July 3 2012 SWORD: http://swordapp.org/2012/07/data-deposit-scenarios/ (accessed 26 January 2017).

[B15] Filling the Digital Preservation Gap: https://www.york.ac.uk/borthwick/projects/archivematica/ (accessed 26 January 2017).

[B16] Miller, A (2015). Project Report: A consortial approach to integrated RDMS In: figshare. https://dx.doi.org/10.6084/m9.figshare.1480451.v1 (accessed 26 January 2017).

[B17] Data Vault: http://libraryblogs.is.ed.ac.uk/jiscdatavault/ (accessed 26 January 2017).

[B18] Effective learning analytics: https://www.jisc.ac.uk/rd/projects/effective-learning-analytics (accessed 26 January 2017).

[B19] Kaye, J. Jisc RDM Shared Service Pilot Initial Statement of Requirements: https://researchdata.jiscinvolve.org/wp/files/2015/11/Draft-RDMSS-Requirements-Specification-V1.0.docx (accessed 26 January 2017).

[B20] Duca, D (2015). What makes up the ‘ideal’ research data management system? July 9 2015 Jisc Shared Services Workshop: https://researchdata.jiscinvolve.org/wp/2015/07/30/makes-ideal-research-data-management-system/ (accessed 26 January 2017).

[B21] Lewis, J A (2014). Research Data Management Technical Infrastructure: A Review of Options for Development at the University of Sheffield In: figshare. https://dx.doi.org/10.6084/m9.figshare.1202230.v9 (accessed 26 January 2017).

[B22] RDM Shared Services November Workshops: https://researchdata.jiscinvolve.org/wp/2015/11/23/rdm-shared-service-workshops/ (accessed 26 January 2017).

[B23] Kaye, J, Stokes, P and Bruce, R. Jisc Research Data Shared Service Operational Requirements. Zenodo: http://doi.org/10.5281/zenodo.48261 (accessed 26 January 2017).

[B24] Research Data Management Shared Service – Call for Formal Expressions of Interest https://researchdata.jiscinvolve.org/wp/2015/11/06/research-data-management-shared-service-call-for-formal-expressions-of-interest/ (accessed 26 January 2017).

[B25] Kaye, J, Stokes, P and Bruce, R. Jisc Research Data Shared Service Operational Requirements. Zenodo: http://doi.org/10.5281/zenodo.48261 (accessed 26 January 2017).

[B26] Kaye, J. Research Data Shared Service – OR2016: https://researchdata.jiscinvolve.org/wp/2016/06/14/jisc-research-data-shared-service-or2016/ (accessed 26 January 2017).

[B27] Research Consulting: http://www.research-consulting.com/ (accessed 26 January 2017).

[B28] Data Asset Framework (DAF): http://www.data-audit.eu/index.html (accessed 26 January 2017).

[B29] ORCID: https://orcid.org/ (accessed 14 January 2017).

[B30] DataCite: https://www.datacite.org/ (accessed 26 January 2017).

[B31] DAF, ref. 28.

[B32] Research Consulting: 2016 DAF Survey Results: https://researchdata.jiscinvolve.org/wp/files/2016/11/2016-DAF-survey-results-for-blog.pptx (accessed 26 January 2017).

[B33] Johnson, R, Parsons, T, Chiarelli, A and Kaye, J (2016). Jisc Research Data Assessment Support – Findings of the 2016 data assessment framework (DAF) surveys. Zenodo. DOI: https://doi.org/10.5281/zenodo.177856 (accessed 26 January 2017).

[B34] Johnson, R, Chiarelli, A and Parsons, T (2016). Data asset framework (DAF) survey results In: figshare. http://dx.doi.org/10.6084/m9.figshare.3796305.v4 (accessed 26 January 2017).

[B35] Clax Ltd. http://www.clax.co.uk/ (accessed 26 January 2017).

[B36] Ferguson, N (2016). Report for the proposed Research Data Shared Service on focus groups held between May and October 2016 and the metadata issues and requirements identified. Zenodo. DOI: https://doi.org/10.5281/zenodo.193018 (accessed 14 January 2017).

[B37] Ferguson, N (2016). Jisc Research Data Shared Service metadata focus group use cases [data set]. Zenodo. DOI: https://doi.org/10.5281/zenodo.193011 (accessed 26 January 2017).

[B38] Kaye, J and Bruce, R. Research data shared service, Poster to Open Repositories 2016: https://researchdata.jiscinvolve.org/wp/files/2016/06/RDSS_POSTER_JUNE2016_FINAL.pdf (accessed 26 January 2017).

[B39] UK Research Data Discovery Service (Alpha): http://ckan.data.alpha.jisc.ac.uk/dataset (accessed 26 January 2017).

[B40] IRUS for Data [data set]: https://www.jisc.ac.uk/rd/projects/research-data-metrics-for-usage (accessed 26 January 2017).

[B41] UK ORCID consortium: https://www.jisc.ac.uk/orcid (accessed 26 January 2017).

[B42] Meeting the requirements of the EPSRC research data policy: https://www.jisc.ac.uk/guides/meeting-the-requirements-of-the-EPSRC-research-data-policy (accessed 26 January 2017).

[B43] RDSS Conceptual Technical Architecture: https://www.lucidchart.com/documents/edit/6398e8e5-51fb-46ff-8e08-7f97e8861265 (accessed 16 January 2017).

[B44] RDSS Canonical Data Model: https://github.com/JiscRDSS/rdss-canonical-data-model (accessed 26 January 2017).

[B45] Research Data Network: http://researchdata.network (accessed 26 January 2017).

[B46] Jisc RDM Blog: https://researchdata.jiscinvolve.org/wp/ (accessed 26 January 2017).

[B47] Research Data Network, ref. 45.

Insights

Case Studies
