Introduction and background
Throughout this article we use ‘research data’ as defined by the Engineering and Physical Sciences Research Council (EPSRC) research data definition: ‘Research data is defined as recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings; although the majority of such data is created in digital format, all research data is included irrespective of the format in which it is created.’1
For a number of years, managing research data has been on the agenda of research funders and organizations, and Jisc has worked with universities and funders to seek to address related needs. Towards the end of the Jisc managing research data programme, the Data Pool project at the University of Southampton summarized the situation as follows:
‘… it is clear that the issues surrounding research data management are becoming more complex rather than less. We now understand much more about the range of data to be managed, its size and sophistication and the expectations of researchers to manage workflows and share data. We also know that at institutional level the requirements of government and funders are placing potentially significant financial costs on institutions which they are finding challenging to discharge in the present financial climate.’2
This characterizes the environment in which the research data shared service (RDSS)3 emerged as a priority development for Jisc. Through the consultation process surrounding the Research at Risk4 co-design challenge – led by Jisc in partnership with Research Libraries UK (RLUK), the Russell Universities Group IT Directors forum (RUGIT), The Society of College, National and University Libraries (SCONUL) and the Universities and Colleges Information Systems Association (UCISA) – the following high-level goals that higher education institutions (HEIs) would need a shared service to meet were identified:
- research data policy compliance, such as meeting the EPSRC mandate5
- increased sector efficiencies, such as shared procurement, shared use of systems, data reuse opportunities and interoperability opportunities
- improved integrity of research through making data available to reproduce research
- integrated end-to-end research data management (RDM) system, addressing the digital preservation gap, enhancing user experience and usability
- acceleration of RDM in institutions
- provision of a solution that can ingest data and text research outputs, therefore supporting institutions to meet open access/Research Excellence Framework (REF)6 requirements.
Initially, the need to address the inefficient approach to RDM drove the project and, through consultation with the sector, a picture emerged of a fragmented RDM landscape. This was made up of a number of existing commercial and open source services that fulfil requirements for parts of the services needed to meet universities’ goals, but there was no one system that fulfilled them all. The procurement and development of all the necessary components for a complete end-to-end RDM solution can be resource intensive for institutions to piece together. This market and the varying approaches of institutions can lead to systems with gaps in functionality or institutions with no RDM systems at all. Widespread gaps in functionality were noted in terms of interoperability between research systems and a lack of preservation actions and systems beyond backing up data. A Jisc-procured, -managed and -hosted system would relieve institutions of this procurement, management and development burden and would create financial efficiencies in the sector, as well as enabling best practice in RDM and meeting the funder mandates.
At the same time as our initial work in late 2014 and early 2015, there was a national policy discussion under way about developing a Concordat between funders and universities on Open Research Data.7 This was discussed by university leaders via Universities UK8 and a practical solution to support the aspirations of the Concordat was sought. The Jisc RDSS project was therefore seen as an important development to meet this need.
In response to the top-level priorities from institutions, Jisc defined a scope for the initial RDSS project. We would focus on a system that would allow the ingest, publication, long-term storage and preservation of finalized data objects for publication or archiving and would create links to existing services in the ‘data creation’ and ‘managing active data’ parts of the lifecycle (see Figure 1). It would therefore not include provision for active ‘live’ data storage systems for objects that are being created and worked on by researchers within their own workflows.
In order to produce a system that was relevant to as much of the higher education sector as possible, a comprehensive requirements gathering process was undertaken in the second half of 2015. This process consisted of three main components:
- institutional survey around research systems
- desk research around existing requirements
- requirements gathering workshops.
The institutional survey around research systems was designed to look at the current use of research systems and current state of RDM readiness, current and future data storage needs and what the sector wanted Jisc to provide in the RDM space. The responses showed a range of maturity, with some universities already having plans in place, but a large proportion of respondents sought Jisc action and it was clear that research data needs were set to increase fairly dramatically over the next three years.
Desk research looked at existing requirements for RDM systems that had been put together by institutions within the previous Jisc ‘managing research data’9 programme of work and those requirements emerging from Jisc’s ‘research data spring’10 initiative. Projects that were initially drawn upon included:
- University of Leeds Research Data Management Pilot Roadmap11
- Manchester Metropolitan University Draft Requirements Specification
- ADMIRe Project12
- Filling the Data Preservation Gap15
- A consortial approach to integrated RDMS16
- Data Vault17
- Jisc learner analytics operational requirements.18
These requirements were combined with survey findings and aggregated into an overall requirements document19 that was consulted on and prioritized by the community offline and in consultation workshops. We did not want to create new requirements from scratch, but to build on best practice from the UK and around the world.
The consultation workshops took the form of a small workshop20 to test the water with conceptual ideas, to explore potential architectures (such as the one produced by the University of Sheffield)21 and to get expert input from institutions that were already tackling RDM systems. A larger event,22 involving 70 representatives from UK HEIs, was then organized and provided feedback on priorities and requirements. Participants also shared knowledge and experiences on current gaps in provision in RDM services, lessons learned from relevant service implementation at an institutional level and discussed aspirations and concerns about shared services for RDM.
There was consensus that Jisc should:
- produce a new system that can be offered as a managed service, relieving burden from institutional IT and procurement staff
- collectively develop improved interoperability between the service systems and existing institutional and external systems
- produce an end-to-end system and also enable ‘pick and mix’ options to meet all university needs
- procure RDM services and consultancy to support pilot institutions’ RDM requirements and implementation.
A smaller workshop was also held with potential system suppliers, who provided feedback and advice on the proposals and requirements generated by the community. The end result of this stage of requirements gathering and analysis was the operational requirements for the RDSS that were published as part of the Official Journal of the European Union (OJEU) tender for the service.23
Building a team – pilot institutions
Since the steer from the sector was that any solution needed to cater for a range of institutional needs, Jisc sought a varied group of pilot HEIs to partner with to develop the RDSS. In order to create a ‘balanced portfolio’, universities that had expressed an interest in joining the pilot were evaluated against criteria taking into account the size and type of institution (e.g. research intensive, small and specialist, teaching led), availability of varying types of data, their degree of RDM readiness (e.g. from greenfield sites to more established) and current use of institutional systems.24 As a result of this selection process, Jisc is collaborating with the following pilot institutions to develop the service:
- Cardiff University
- CREST (Consortium for Research Excellence, Support and Training)
- Imperial College London
- Plymouth University
- Middlesex University
- Royal College of Music
- St George’s Hospital Medical School
- University of Cambridge
- Lancaster University
- University of Lincoln
- University of St Andrews
- University of Surrey
- University of York.
A range of goals from the pilots have been prioritized and a selective summary includes:
- easy-to-use and cost-effective archiving, ingest, preservation, repository, reporting and discovery solution that can handle sensitive data
- robust data storage that has growth ability for active and archive data
- standard metadata profile – international for interoperability
- integration with all main Current Research Information Systems (CRIS)
- meets REF and funder deposit requirements (supports deposit of REF data output types).
In-depth technical and user requirements gathering has taken place with each pilot institution and these have been aggregated and prioritized across the board to define a development path.
Building a team – suppliers
While there would be some bespoke development required and there was added value to be developed across the end-to-end system, there was a decision to build from what already existed. So Jisc used an OJEU procurement process to create a supplier framework from which to select system suppliers and developers to create the service. The framework was divided into eight lots as follows:
- Lot 1 – Research Data Repository Suppliers
- Lot 2 – Repository Interfaces Suppliers
- Lot 3 – Research Data Exchange Interface Suppliers
- Lot 4 – Research Information and Administration Systems Integrations Suppliers
- Lot 5 – Research Data Preservation Platforms Suppliers
- Lot 6 – Research Data Preservation Tools Development Suppliers
- Lot 7 – Research Data Reporting Suppliers
- Lot 8 – User Experience Enhancement Suppliers.
More information on the detail of the functions of these lots can be found in the RDSS Operational Requirements document25 and a list of successful suppliers on the lots can be found on the Jisc RDM blog.26
Meeting researchers’ needs – Data Asset Framework
RDSS will allow researchers to deposit data for publication, discovery, safe storage, long-term archiving and preservation. However, this raises a whole range of different questions, including:
- What forms of data do researchers have?
- How much data are we talking about?
- Where do they store their data currently?
- Who else needs access to it?
- How long does the data need to be kept?
- What motivates researchers to share their data – or to keep it closed?
Over the course of 2016 Jisc worked with Research Consulting27 and the pilot institutions to try to find some answers.
To carry out this work it was decided that the Data Asset Framework (DAF)28 provided a useful methodology and had been tried and tested by institutions around the world. The DAF was developed in 2009 to help organizations identify, locate, describe and assess how they are managing their research data assets. It uses surveys and interviews to gather the necessary information, and we chose to follow a similar approach. Even better, six of the RDSS pilot institutions had already run a DAF survey within the last couple of years, which provided us with some ready-made data. However, when we came to analyse this data, we found that every institution had tweaked the survey questions to some extent, leaving us with a set of individually valid results that could not be meaningfully aggregated. What is more, the research data landscape has evolved rapidly since 2009. Funder policies on data sharing are much more demanding, the use of current research information systems is more widespread, and new services such as ORCID29 and DataCite30 have emerged. As a result, a lot of the information we needed to know was not covered by the original Data Asset Framework31 question set.
The obvious solution was to develop a new version of the survey: ‘The 2016 DAF survey’. Using the 2009 version as a starting point, Jisc staff and the RDSS pilot institutions developed a revised question set between April and June 2016. The new survey was then run by six RDSS pilots in July and August of that year. The data included 1,185 unique responses. For a summary of the headline results, see the summary slideshow.32
The survey data allows us to draw some broader conclusions on the current state of RDM in the UK:
- The RDSS can fill an important gap – 75% of researchers look first to their institution to preserve their data – but we know a lot of institutions cannot fully meet this need at present. This is where the RDSS can help.
- Access to institutional support for RDM remains low – only 16% of respondents are currently accessing university RDM support services. This is a twofold challenge: institutions not only need to make appropriate support services available, but also make researchers aware that they exist.
- We are pushing at an open door – 68% of respondents either already share data, or expect to do so in the future. Most of them do so because they believe that research is a public good which should be open to all. We just need to make data-sharing easier.
- We still have a long way to go – only 40% of respondents currently have an RDM plan, and only 18% follow established metadata standards or guidelines. Delivering change will take time.
In addition to informing the requirements for the RDSS via this work, we have developed resources that could be reused. A new DAF toolkit has been published that outlines the steps involved in running a DAF survey, and makes recommendations on how to approach them.
Meeting researchers’ needs – metadata
As RDSS aims to create a multidisciplinary data service, we need to make sure that our metadata and data models and associated processes can support the use cases of a wide range of researchers from extremely diverse research domains, so we needed to test our ideas with researchers. To achieve this, Jisc worked with Clax35 to hold nine focus groups with pilot institutions’ researchers. A full report from the focus groups is available36 alongside a comprehensive set of researchers’ use cases;37 some of the emerging themes are laid out below:
- The focus groups expressed concern about a number of areas with regard to metadata. Some can be addressed by training and support; many can be addressed by suppliers working with institutions and RDSS. A few require new technologies or culture change.
- Early creation and collection of metadata was often mentioned. This can be achieved through the use of dynamic data management plans so that metadata is collected from the planning stage and updated throughout the data collection and analysis process.
- Systems should preserve the form and content of the deposited data while allowing updating of the metadata to link to related data sets, subsequent publications and other materials which may have been created after the data was deposited. They should also allow updating of keywords and descriptive materials to reflect changes in the discipline. The facility to allow metadata to include links to other digital object identifiers (DOIs) and URLs – where a DOI does not exist – is essential.
- It is often assumed that the collection of metadata will involve researchers in arduous and time-consuming form filling at data deposit time. This is undesirable and unlikely to produce good metadata. Instead, automation of tools, collection processes, equipment and metadata collection integrated into researchers’ workflow throughout the research will, ideally, allow a push-button submission of the data, with metadata already attached, to the repository.
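The requirement that deposited data stays fixed while its metadata remains updatable can be illustrated with a minimal sketch. This is not the RDSS implementation: the `Deposit` class, its field names and the rule of treating any string starting with `10.` as a DOI are all assumptions made purely for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Deposit:
    """Hypothetical deposit: content is fixed, metadata can evolve."""
    content: bytes
    keywords: List[str] = field(default_factory=list)
    related: List[Tuple[str, str]] = field(default_factory=list)
    checksum: str = field(default="", init=False)

    def __post_init__(self) -> None:
        # Record a checksum at deposit time so later changes are detectable.
        self.checksum = hashlib.sha256(self.content).hexdigest()

    def link(self, target: str) -> None:
        # Assumption: DOIs start with "10."; anything else is stored as a URL.
        kind = "DOI" if target.startswith("10.") else "URL"
        self.related.append((kind, target))

    def is_intact(self) -> bool:
        # True while the stored content still matches the deposit checksum.
        return hashlib.sha256(self.content).hexdigest() == self.checksum


deposit = Deposit(b"a,b\n1,2\n")
deposit.link("10.1234/example")            # a later publication with a DOI
deposit.link("https://example.org/paper")  # a related item without a DOI
deposit.keywords.append("survey")          # keywords updated after deposit
```

In this sketch, metadata updates never touch `content`, so `is_intact()` keeps returning true however often links and keywords change.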
Ambitions for the service
The requirements gathering and user needs work led to the ambition for the RDSS being defined, as detailed in Figure 2. This diagram shows the core RDSS platforms of repository, preservation and reporting systems in the centre, backed up by Jisc-provided data storage. This includes input from the researchers via a web user interface and other tools to manage large and sensitive data. Interoperability is key to the service and a number of integrations have been identified, such as integrations with CRIS systems, data management planning tools and publications repositories. RDSS will not function in isolation: links will be made to the scholarly communications infrastructure through the use of permanent identifiers, such as DOIs and ORCID iDs, and aggregation services. RDSS will also use metrics and altmetrics infrastructure and, where possible, will communicate with funder systems.
RDSS will also join up with other services, policy, practice and standards that are being developed through the Jisc Research at Risk portfolio, such as being interoperable with and providing metadata to the UK Research Data Discovery Service,39 integrating the metrics and usage statistics work developed in IRUS Data UK40 and integrating ORCID41 into the service. The pilot’s requirements were already informed by Jisc’s funder policy guidance42 and the project will look to harness the innovation from successful research data spring projects.
Technical architecture and development approach
Much of this article has been about laying the foundations for delivery for RDSS. However, Jisc entered technical production for RDSS alpha in November 2016. A technical architecture approach was agreed with the pilots, suppliers and Expert Advisory Group in October 2016. Some of the key elements to the approach were:
- An agreement that an event-driven service-oriented architecture (SOA 2.0) was necessary due to the need to scale in terms of complexity and to have a more fault-tolerant architecture. SOA 2.0 provides the highest level of decoupling between services. Engineers often describe SOA 2.0 systems with the phrase ‘dumb pipes with smart end points’. Rather than orchestrate business processes centrally, the services listen for events or notifications from other components and respond appropriately. SOA 2.0 is characterized by its distributed nature.
- This architecture means that we require a suite of microservices and adaptors to transform data from external systems to our canonical data model.
- The pattern for system integration is publish-subscribe.
A conceptual diagram of the architecture can be viewed here.43
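The combination of publish-subscribe, adaptors and a canonical data model can be illustrated with a small in-process sketch. It is not the RDSS codebase: the topic names, the `dc_title` and `dc_creator` source fields and the two example services are hypothetical, and a real deployment would use a networked message broker rather than a Python object.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """A notification on the bus; the payload uses the canonical shape."""
    topic: str
    payload: dict


class MessageBus:
    """A 'dumb pipe': it only routes events to a topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        for handler in list(self._subscribers[event.topic]):
            handler(event)


def repository_adaptor(external_record: dict) -> dict:
    """Map a hypothetical external record onto the canonical model."""
    return {
        "title": external_record["dc_title"],
        "creators": list(external_record.get("dc_creator", [])),
    }


class PreservationService:
    """A 'smart end point': reacts to deposits and announces its own result."""

    def __init__(self, bus: MessageBus) -> None:
        self.preserved: List[str] = []
        self._bus = bus
        bus.subscribe("dataset.deposited", self.on_deposit)

    def on_deposit(self, event: Event) -> None:
        self.preserved.append(event.payload["title"])
        self._bus.publish(Event("dataset.preserved", event.payload))


class ReportingService:
    """Listens to several topics without knowing who published them."""

    def __init__(self, bus: MessageBus) -> None:
        self.log: List[str] = []
        bus.subscribe("dataset.deposited", self.record)
        bus.subscribe("dataset.preserved", self.record)

    def record(self, event: Event) -> None:
        self.log.append(event.topic)


def demo() -> Tuple[List[str], List[str]]:
    bus = MessageBus()
    preservation = PreservationService(bus)
    reporting = ReportingService(bus)
    record = {"dc_title": "Survey results", "dc_creator": ["A. Researcher"]}
    bus.publish(Event("dataset.deposited", repository_adaptor(record)))
    return preservation.preserved, sorted(reporting.log)
```

No component calls another directly: the repository side only publishes, and the preservation and reporting services decide for themselves which events to act on. That is the decoupling the event-driven approach is meant to provide.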
This work also laid out our expectations of how we work with suppliers, from development through testing to rolling out a production service.
RDSS data model
For RDSS to run effectively as a whole it requires an underlying data model, not just to allow researchers to fill in forms to ingest data and provide the minimum metadata for a DataCite DOI, but also to allow systems to create and push events and messages to each other to achieve the goal of an end-to-end system. The data model also allows for the enrichment and auto-generation of metadata by allowing for links to institutional systems and external, scholarly communications systems.
A data model for RDSS alpha has been produced for consultation.44 This current draft of the data model has been constructed under the requirement of being interoperable with as many as possible of the products and services that have been identified in the supplier lots or elsewhere. For example, the data model is aligned to the metadata requirements of scholarly communications services.
The result of this work is a dense and complex data model, involving many entities, relations and vocabularies. The role of the consultation is to fine-tune this data model from a practical perspective and ensure that it is interoperable with existing infrastructure. This means that properties can be removed if not used within the system being described. A reduced data model (and vocabularies) can then be identified as core, with additional structure added as required.
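As a rough illustration of a reduced core with optional extensions, the sketch below models a dataset record with only the descriptive properties that the DataCite Metadata Schema (version 4) makes mandatory alongside the identifier itself: creator, title, publisher, publication year and resource type. The class, the simplified field names and the `extensions` dictionary are hypothetical and are not taken from the RDSS data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Descriptive properties DataCite requires in addition to the DOI itself.
# The snake_case names are simplified stand-ins for the schema's own names.
REQUIRED_FOR_DOI = (
    "creators",
    "titles",
    "publisher",
    "publication_year",
    "resource_type",
)


@dataclass
class CoreDataset:
    """Hypothetical reduced core record with room for extra structure."""
    creators: List[str] = field(default_factory=list)
    titles: List[str] = field(default_factory=list)
    publisher: str = ""
    publication_year: int = 0
    resource_type: str = ""
    # Anything outside the core lives here and can be added as required.
    extensions: Dict[str, object] = field(default_factory=dict)


def missing_for_doi(dataset: CoreDataset) -> List[str]:
    """List the required properties that are still empty."""
    return [name for name in REQUIRED_FOR_DOI if not getattr(dataset, name)]


draft = CoreDataset(titles=["Survey results"], publisher="Example University")
gaps = missing_for_doi(draft)  # properties still needed before minting a DOI
```

Keeping the core small and pushing everything else into an extension area is one way to let individual systems ignore properties they do not use, as the consultation envisages.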
Progress and challenges
Jisc is now in alpha development and is setting up the foundations for the service. This includes putting in place the technical architecture and on-boarding all of the repository and preservation systems suppliers in a test environment. This test environment also includes widely used research systems, such as EPrints and DSpace, as well as test data, so that our suppliers and developers can work together to deliver interoperability early to our pilot institutions. A User Experience Lead has been appointed to provide principles and a governance framework across the project. Discovery, requirements gathering and testing will continue with the pilot institutions, along with community input from the sector through workshops.45
Other work currently under way is around cost reporting and business modelling, so that Jisc can produce a financially attractive offer to the HE sector for the production service. Alongside that, market research is also taking place to discover the needs of institutions outside the pilot group and whether a Jisc-hosted RDM system is attractive to them.
Alpha development is due to finish in summer 2017. Beta development will look at scalability of the system, more integrations and challenging issues such as managing large data sets, managing sensitive data sets and preserving a diverse set of digital objects that fall outside the remit of the traditional digital preservation community. Beta development is due to finish in April 2018; however, a business case for a production service will be presented to the Jisc board by December 2017.
Throughout the alpha and beta phases, the RDSS project has been given challenging requirements from our pilot institutions, including:
- defining a ‘minimum viable product’ with a multitude of systems, priorities and expectations
- fitting with existing institution and researcher workflows – for example, fitting RDSS into an institutional policy with the CRIS as the front door for researchers
- making preservation work for research data, when the development of systems and tools has been led by the cultural heritage sector
- managing large data, that is, data too large to be uploaded over the web (greater than 5GB), including the challenges of big data
- managing sensitive data including commercial, personally identifiable information and medical data.