Exploring open access coverage of Wikipedia-cited research across the White Rose Universities

The popular online encyclopaedia Wikipedia is an important and influential platform that assists with the communication of science to a global audience. Using data obtained from Altmetric.com and Unpaywall, we looked at research from the White Rose Universities (Sheffield, Leeds and York) that is cited on Wikipedia. Of that research, we explored what percentage of citations were available open access (OA) and the location of those citations to ascertain whether they were hosted by publishers or within OA repositories. This article explores the importance of access to OA research within such an important and leading platform as Wikipedia and how well it supports effective scientific communication across society.


Introduction
The purpose of the work undertaken was to investigate how much of the research published by the Universities of Leeds, Sheffield and York is cited in Wikipedia and what proportion of those citations are linked to an open access (OA) version. We propose that making the cited academic literature point to OA versions as standard helps support the foundations of existing and future Wikipedia entries. Increasing the number of OA citations within Wikipedia not only assists the online encyclopaedia's goal of access to transparent and evidence-based knowledge but also removes any barriers to access to research, which ultimately is good for academics. We also explored to what extent this is being achieved using a sample of three UK universities, with the further intention of exploring which areas of the three institutions had the most citations across the various disciplines. In addition, we considered which were OA and whether access to them was available via the universities' or a third-party OA repository or via a publisher's website.
We chose the White Rose Universities of Leeds, Sheffield and York because of their shared OA repository, their long history of collaboration, their research focus and their membership of the Russell Group of universities. White Rose Research Online (WRRO) is a cross-institutional OA repository that hosts research outputs from the universities of Leeds, Sheffield and York. 1 The purpose of the repository is to provide a long-term home for research outputs from the three universities and preserve research for posterity. The overall aims of White Rose Research Online are to:
1. Provide a long-term home for research outputs from the three Universities, preserving research for posterity.
2. Provide OA to full-text research wherever possible.
3. Provide a reliable source of information about the universities' research.
4. Make research easier to find, bringing scholarly works to new audiences inside and outside academia.
We looked at the inconsistencies within Wikipedia and its open model and the implications of those for research dissemination. Not all research is truly open for the rest of society to access, but given Wikipedia's global reach and importance, it has become fundamental that the research underpinning each entry is as open and accessible as possible.

Brief background to Wikipedia
Wikipedia has become increasingly important to the academic community as a platform for engaging with society on topics relating to their own fields of research. Its open edit model means that researchers can cite their own or other relevant articles in any of Wikipedia's millions of pages. However, there are no formal checks as to whether such citations link to an OA or paywalled version of a research article; in many cases the research article is hosted on a paywalled publisher's website. Given Wikipedia's transparent model of publishing and editing, it seems rather counterproductive to its purpose to link only to versions of articles hosted on a paywalled publisher's website. However, Wikipedia does promote the use of the OABOT tool, 2 which facilitates making links to the OA versions of publications. The OABOT Wikipedia entry states: 'Our community does not prohibit or even discourage citing paywalled sources, but there is also absolutely no prohibition on surfacing OA versions alongside those citations, as long as the link does not violate any copyrights.'
Wikipedia's three principal core content policies (neutral point of view, verifiability and no original research) mean that it does not publish original thought. All material in Wikipedia must be attributable to a reliable, published source. 3 The final core policy is the most important, as Wikipedia is built upon published knowledge that is already created and hosted elsewhere. Wikipedia remains an objective platform for the sharing of knowledge rather than opinion and conjecture. Thus, it becomes increasingly important that any cited evidence within a Wikipedia entry is auditable and open for all to read.

Wikipedia and academia
Wikipedia has progressed since its early years and its reception in the academic community has warmed. One of the first news features on Wikipedia was in Nature, suggesting that editing the platform could be an influential way of improving a researcher's visibility and communicating their work to the academic community. [...] of the scientific literature, it helps shape it.' 5 A subsequent piece of interactive research encouraged final-year medical students to contribute to Wikipedia articles in return for academic credit. 6 More recent research, in 2019, looked at disease-related articles on Wikipedia and found that higher-quality articles were more likely to cite a Cochrane Review from the Cochrane Library than lower-quality articles on the encyclopaedia. 7 The authors used Wikipedia's definition of 'higher-quality articles' as those that have inline citations from reliable sources.
Another piece of research found that a journal's accessibility (OA policy), as well as its academic status (journal impact factor), strongly increased the probability of it being referenced on Wikipedia. 8 Librarians have long been actively involved in editing Wikipedia, especially given their greater understanding of information literacy and OA. One such initiative took place at Washington State University, where they hosted a public Wikipedia edit-a-thon as part of OA Week in 2014. 9

The benefits of having academic work cited in Wikipedia
Research that explored the Web of Science database to identify and examine trends in the use of Wikipedia citations in scholarly peer-reviewed publications between 2002 and 2015 found that Wikipedia citations increased over that period for both non-OA and OA research articles. 10 Citations allow Wikipedia editors to make their contributions verifiable by supporting them with trustworthy sources and enable readers to locate further information on topics of interest. 11 It has thus been concluded that citations in Wikipedia can be considered an indication of the transfer of scholarly output to a wider audience. 12 There is also evidence that readers do follow links to the peer-reviewed sources that are cited in Wikipedia, with data from Crossref demonstrating that in 2015/2016 it was the sixth highest referrer of Digital Object Identifier (DOI) resolutions. 13 Research on wind power found a possible citation advantage of Wikipedia. 14 It transpired that research on this topic within the Web of Science, and cited on Wikipedia, obtained proportionally far more citations than articles not mentioned. However, there is no evidence of a causal link between a Wikipedia mention and increased citations, as the cited articles might simply be the better-quality ones, with the result that they get more citations in both Wikipedia and other sources. Another piece of research found that subjects the authors considered 'controversial', such as evolution and global warming, received more edits than 'non-controversial' topics such as the standard model in physics. 15

The benefits of Wikipedia's citation of open access versions of research
There is very little previous research that explores how much research cited in Wikipedia is linked to an OA source. Some work has been carried out in this area but only for the library and information science field, which reported it at 31.2%, with this percentage increasing for more recent literature. 16 The benefits of having research cited on such a prominent platform as Wikipedia are somewhat negated when the source is not universally accessible and is behind a publisher's paywall. To some extent, this problem was brought to wider attention after the World Health Organization (WHO) and the Wikimedia Foundation collaborated on a project to expand the public's access to the latest and most reliable information about Covid-19. Earlier research also highlighted the merits of academics engaging with Wikipedia and that by working in a free, open environment, scholars can increase their potential readership exponentially. The researchers also concluded that authors could assure themselves that access is granted to individuals who might not have the opportunity to use print journals or expensive databases, thus fulfilling their role as keepers and disseminators of knowledge. 18 Work by Teplitskiy et al. indicated that, for OA research articles, Wikipedia is an increasingly useful means of disseminating science. Taking into account the field and impact factor, they found the odds of an OA journal being referenced on the English Wikipedia are 47% higher than those of paywalled journals and concluded that this significantly amplified the diffusion of OA science, through an intermediary like Wikipedia, to a broader audience. 19

Methods and data collection
A data request to Altmetric.com was submitted on 16 April 2019 for entries that included authors from any of the three White Rose Universities who are cited at least once in a Wikipedia entry. Data presented by Altmetric.com were tabulated with discipline data extracted from university systems. Wikipedia page entries and embedded citations were collected by Altmetric.com using unique identifiers within the research such as a DOI, PubMed ID or ISBN; this also included the date the research was cited within a Wikipedia entry. Further bibliographic data were captured that included publication title and date. Each individual Altmetric.com page corresponding to each Wikipedia citation was also obtained.
These entries are not unique, with some pieces of research having multiple Wikipedia citations. It is important to note how Altmetric.com captures multiple citations of the same article across several Wikipedia entries. A single Wikipedia entry can cite the same research article several times, but this does not alter the altmetric score for that piece of research. Regardless of how many Wikipedia citations a piece of research receives, it only counts as one to prevent academics from gaming the system and increasing their altmetric score.
Exploring the Altmetric.com data, we found that several Wikipedia entries were edited by the same editors. The origin of these editors is unknown; we can assume that they are either academics or professionals working in that particular field or citizen scientists with a vested interest in it. Further research in this area would be useful to discover the identity of the most productive editors and what patterns of editing they exhibit. Are they exclusively citing the same article or small group of articles across a variety of Wikipedia entries, and is there a pattern that shows the same author names are appearing? The latter may offer some insight into whether these entries are self-citations by the journal article authors.
We used the data to explore the number of Wikipedia citations by discipline for each of the three institutions. The data Altmetric.com supplies are only as good as the institutional and bibliometric journal data it harvests. As a result, certain fields were incomplete, and we anticipate that, based on a previous study by some of the authors of this article, a percentage of the data in relation to institutional affiliation and date of publication will be inaccurate. 20

Using Unpaywall to check for open access compliance
DOIs of all articles appearing in the Altmetric.com data that included a Wikipedia citation were subsequently run against the Unpaywall API (application programming interface) on the same day as they were received (16 April 2019). The sources Unpaywall draws on include publishers of articles made immediately available under the 'gold' model of OA and institutional repositories like WRRO that make research articles openly available under the 'green' model. The green version is typically the author's accepted manuscript (AAM), before it has been finally typeset by the publisher, and is often subject to an embargo period. Unpaywall provides a number of different services, including a browser plug-in that, if a user encounters a paywall, will link to an OA version where one is available. For the purposes of this study, the primary field of interest is designated as 'is_oa', which enables us to ascertain the proportion of articles that are available OA (is_oa = TRUE) compared to those that are not (is_oa = FALSE). It is important to note also that any repository record that is under embargo at the time of data collection will return is_oa = FALSE. Whether the OA version is gold (under a Creative Commons licence) or green (with a more restrictive or no specified licence) is also significant, as Wikipedia citations to gold articles will necessarily be OA with no further intervention, whereas Wikipedia citations to articles in subscription journals may only be accessible directly from that citation if it includes the repository link. The repository link will need to be added manually, since the DOI used to automatically generate a Wikipedia citation links to a closed access publisher's version. In some cases, a published version may be freely available on the publisher's platform but without an open licence present. While these outputs might not conform to some definitions of OA, Unpaywall works on an inclusive definition ('OA articles are free to read online, either on the publisher website or in an OA repository'), giving these more ambiguous outputs the label 'bronze' OA. 21
The final validated data were tabulated, descriptive statistics were produced and the implications of the data were discussed.
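The lookup described in this section can be sketched in a few lines. The sketch below is illustrative only and is not the script used in the study: it assumes the public Unpaywall v2 endpoint and the documented response fields is_oa, best_oa_location, host_type and license, while the function names and the sample records are hypothetical. It reduces one Unpaywall record to the distinctions that matter here: whether an article is open, where the open copy is hosted and whether a Wikipedia editor would need to add a repository link by hand.

```python
from urllib.parse import quote

# Base URL of the public Unpaywall v2 API (a contact email is a required parameter).
UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the request URL for a single DOI."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?email={quote(email)}"


def classify(record: dict) -> dict:
    """Reduce one Unpaywall JSON record to the fields discussed in the text.

    `record` is the decoded JSON for one DOI. Only is_oa and
    best_oa_location (host_type, license) are read.
    """
    is_oa = bool(record.get("is_oa"))
    best = record.get("best_oa_location") or {}
    host = best.get("host_type") if is_oa else None
    return {
        "is_oa": is_oa,
        "host": host,  # "publisher" or "repository"
        "licence": best.get("license") if is_oa else None,
        # A repository-hosted (green) copy is not reached by following the DOI,
        # so the repository link has to be added to the citation manually.
        "needs_repository_link": host == "repository",
    }


# Hypothetical records, shaped like Unpaywall responses, for illustration only.
publisher_record = {"is_oa": True,
                    "best_oa_location": {"host_type": "publisher", "license": "cc-by"}}
repository_record = {"is_oa": True,
                     "best_oa_location": {"host_type": "repository", "license": None}}
closed_record = {"is_oa": False, "best_oa_location": None}

for rec in (publisher_record, repository_record, closed_record):
    print(classify(rec))
```

An embargoed repository record would look like closed_record here, which is why such records are counted as not OA at the time of collection.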

Sample validation
The tools used to collect and analyse data, Unpaywall and Altmetric.com itself, are largely automated while relying on data that has been added to Wikipedia manually. Therefore, it was decided to manually check the accuracy of the data for 100 Wikipedia citations from each of the three institutional datasets, with an online random number generator used to select the random sample of 100 citations from each dataset.
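The sampling step can be reproduced along the following lines. This is a minimal sketch, not the procedure actually used: it swaps the online random number generator for Python's standard library generator, and the record identifiers are placeholder integers sized to match the citation counts reported for each institution.

```python
import random


def draw_sample(records, k=100, seed=None):
    """Return k records chosen at random without replacement.

    A seed makes the draw repeatable, which helps if the validation
    needs to be audited later.
    """
    records = list(records)
    if len(records) <= k:
        return records
    return random.Random(seed).sample(records, k)


# Placeholder datasets: integers stand in for Wikipedia citation records.
datasets = {
    "Sheffield": range(2523),
    "Leeds": range(2406),
    "York": range(1525),
}

# The seed value is arbitrary and chosen only for this illustration.
samples = {name: draw_sample(recs, k=100, seed=2019) for name, recs in datasets.items()}

for name, sample in samples.items():
    print(name, len(sample))
```

Sampling without replacement ensures that no citation is checked twice within an institution's sample.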
Selected records were manually checked to ensure data accuracy, with each record checked: firstly, to confirm that the attribution of the output to the University of Leeds, York or Sheffield was correct; and secondly, to confirm that the OA status given by Unpaywall is correct.
Attribution of outputs was checked by comparison with the output itself as available online from the publisher. It was confirmed whether one or more of the authors listed on the output had recorded their institutional affiliation as the university covered by the dataset. Of the 300 Wikipedia citations checked, the affiliation could not be confirmed for seven of the outputs as the researchers could not access the output. Two attributions were not correctly identified by Altmetric.com; in both cases, the article listed the institutions where authors had gained their qualifications, and it appears that these were being picked up as affiliations. Of the 293 sample citations where an affiliation could be validated, 291 (99.3%) had been correctly attributed.
The OA status for cited articles was checked by accessing, where possible, the output through the publisher's platform and recording the licence conditions. A web browser in private mode was used so that institutional or individual access agreements could not affect the outcome. Where the output was not openly available through the publisher's platform, Google Scholar was used to identify versions of the output available through an OA repository. These versions were then checked for public availability of the output.
In the sample of 300 Wikipedia mentions picked up by Altmetric.com, 24 (8.0%) did not include a DOI and so the OA status was not available through Unpaywall. Across the full sample of 300, 257 OA statuses (85.7%) were confirmed to be correct on validation. Of the 276 citations for which an OA status was identified (discounting the 24 citations which returned no status), Unpaywall identified the correct status for 93.1%. This is similar to the precision of 96.6% found in a study by the developers of Unpaywall. 22 Where discrepancies were found between the Unpaywall data sample and the manually checked OA statuses, the majority (12 out of 19) were articles not identified as open in Unpaywall but which, on manual checking, were found to be freely available as bronze OA.
That these outputs would be revealed by manual checking but not through Unpaywall could be explained by the ambiguity in the status of these records. Bronze OA is not easily identifiable through machine-readable metadata, and it is not known how consistent Unpaywall is in picking up these outputs. Another, perhaps more likely, explanation is the time difference between the data being collected and its validation. In the intervening period, publishers may have made outputs freely available online, either permanently or for a fixed period. The Covid-19 pandemic, which struck in the middle of this study, appears to have expanded this phenomenon, with publishers temporarily removing access restrictions on pertinent research outputs. As noted above, Unpaywall uses a broad definition of OA, but it is important for research designated as 'open' to remain so in perpetuity and carry an appropriate, irrevocable licence such as Creative Commons. This is a strong argument for articles designated as bronze being excluded from OA data.
The time difference probably also accounts for four outputs that were not found to be open in the Unpaywall data but were subsequently identified as openly available through a repository by checking manually. Repository content is regularly deposited or released from embargoes, so these discrepancies are to be expected.
Outputs for which OA was found in the Unpaywall data but not through manual checking were less common, with only three in the sample of 300 records, as shown in Table 1. For these outputs, data is given on the OA source found by Unpaywall, making it easier to understand the difference in results. For one of these three outputs, Unpaywall identified a version of an article that had been uploaded to a departmental webpage; this would not have been picked up by manual validation as it would not have been identified as an OA repository. Again, this points to the ambiguity that can exist in OA status as a result of judgements about what should be considered a legitimate source of OA content. Academic networking sites such as ResearchGate or Academia.edu provide another example of this ambiguity. These sites have repository-like functionality and may be indexed in Google Scholar, but they are not actively curated, and it is questionable whether they should be considered as legitimate sources of OA content.
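The validation figures reported in this section can be re-derived from the counts given in the text. The short check below uses only those counts; the variable names are ours, and the breakdown of the 19 discrepancies into three groups is read from the preceding paragraphs rather than from Table 1 itself.

```python
# Counts quoted in the sample validation text.
sample_size = 300            # 100 citations from each of the three institutions
affiliation_unchecked = 7    # output could not be accessed
affiliation_wrong = 2        # qualifications picked up as affiliations
no_doi = 24                  # no OA status available through Unpaywall
status_confirmed = 257       # Unpaywall status agreed with the manual check
missed_bronze = 12           # open on publisher site, not flagged by Unpaywall
missed_repository = 4        # open in a repository, not flagged by Unpaywall
false_open = 3               # flagged open by Unpaywall, not found manually


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


affiliation_checked = sample_size - affiliation_unchecked        # 293
affiliation_correct = affiliation_checked - affiliation_wrong    # 291
with_status = sample_size - no_doi                               # 276
discrepancies = with_status - status_confirmed                   # 19

print(pct(affiliation_correct, affiliation_checked))  # 99.3
print(pct(no_doi, sample_size))                       # 8.0
print(pct(status_confirmed, sample_size))             # 85.7
print(pct(status_confirmed, with_status))             # 93.1
print(discrepancies == missed_bronze + missed_repository + false_open)  # True
```

The two accuracy figures differ only in their denominator: 85.7% is measured against the whole sample, 93.1% against the citations for which Unpaywall returned a status.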
It is noticeable that no errors, positive or negative, were found in the data for gold OA articles made available under an OA licence. This attests to the more permanent and unambiguous status of these outputs.
Results across the three institutions were similar, with a little over half of all citations available OA and York performing marginally better than Sheffield and Leeds.
Of those outputs that were OA within Wikipedia, we found a very similar pattern across the three institutions when we explored where they were hosted, as highlighted in Table 3.
Around one-third of these were found to have an OA version hosted on the publisher's platform.
The percentages presented in Table 4 do not add up to 100% because each percentage is rounded to one decimal place.
The information available about what licences the Wikipedia-cited research was published under was limited; see Table 5.
The percentages presented in Table 5 do not add up to 100% because each percentage is rounded to one decimal place.

Discussion
The three institutions performed very similarly in terms of how much of their Wikipedia-cited content was OA. York did best with 56%, compared to Sheffield with 54% and Leeds with 52% of their citations freely available to read from Wikipedia citations. This was a positive sign and an indication of how OA is gaining popularity, but it also highlighted that there is some way to go before all Wikipedia citations are fully available.
A current limitation on that becoming possible is how much research is available OA via publisher websites or OA repositories, which is itself still well below 100%. The date of publication will also have an effect, as we might expect more recent articles to be OA with older published content behind a paywall or only available in print format. This is an area for further study. There is conflicting data as to how much UK research is open access, with Research England 27 citing over 80% in their 2018 report, whereas the Curtin Open Knowledge Initiative uses data from public sources around the world to show that even though OA adoption was climbing, in 2018 it was only just above 70%. 28
For those that were available OA, Unpaywall includes various data to establish whether an article is available from a publisher's website (likely to be gold, though it may also be bronze OA) or from a repository and likely to be green. The results were very similar across the three institutions, with at least one-third of hosts being publishers. The remaining citations were either hosted in an OA repository or not available openly. Less than a fifth of all the cited OA outputs linked to a repository version. It should be noted that an article made OA via the green route may be available from more than one repository, not only WRRO, where co-authors are based at other universities, for example, and have deposited their manuscript in their own repository. When there are multiple locations, Unpaywall determines the best OA location, based on five ascending rules, to decide which is the most current, authoritative version. 29
As we suspected, the vast majority of content we explored was published as journal articles. This was no surprise given that journal articles are the standardized format for disseminating quality research and provide virtually all citations within that medium. It should follow that citations, in relation to academic outputs, would follow that trend given the journal article's dominance in scholarly communications.
We explored what licences the outputs had been published under but, without manual checking of all the publications, it is impossible to get an accurate number. Among outputs with a Creative Commons licence, the most frequent was CC BY, which is the most permissive of the licences and the one most widely used within academic publishing. Only a small percentage were evident across the three institutions, with York performing the best with 15.5%. The Creative Commons NonCommercial and NonCommercial-NoDerivatives licences had a notably small percentage. Ideally, this study would have explored what disciplines they had been assigned to.
This research has identified that, at the time of data collection, approximately 53% of Wikipedia citations from the three White Rose Universities were available OA, whether gold, green or bronze. Nearly half of cited research was therefore inaccessible without subscription access. However, only that research available via gold or bronze would be immediately accessible to a user following a link from Wikipedia. Research articles available under the green route would need further intervention from a Wikipedia editor to link the repository version. Based on a random sample, green OA accounted for 17% of records, which is likely to be an under-estimate as some of the closed access records reported by Unpaywall may well be in a repository under embargo. Furthermore, there is no guarantee that bronze records without a defined licence will remain accessible in perpetuity. Taking all of this together, we can conclude that fewer than half of research articles cited on Wikipedia are currently linked to openly accessible records. Given Wikipedia's unique role in the information ecosystem as a bridge between informal discussion and scholarly publication, 30 this is of concern. For example, during the Covid-19 pandemic, the WHO partnered with the Wikimedia Foundation to expand public access to current and reliable information about the virus, 31 while at the start of the Covid-19 pandemic many publishers temporarily made all research on the virus freely available. 32 Much of this research is likely to revert to closed access at some unspecified point in the future.
The new requirements under Plan S, 33 which came into effect in January 2021 and aim to ensure full and immediate OA, should go some way to improving the situation. One result of Plan S is likely to be that more research will be available OA from publishers' websites under the gold route, which will not require further intervention from Wikipedia editors to link to a repository version. Another condition of Plan S is that AAMs deposited into repositories via the green route are not restricted by embargo and carry a CC BY licence. However, commercial publishers are resistant to this aspect of Plan S. In any case, there will still be a role for universities and their libraries to ensure Wikipedia is properly cited and that cited research is as widely accessible as possible.
One solution that has gained some traction in recent years is the hosting of Wikipedia edit-a-thons within universities. One of the three institutions involved in this research, the University of Leeds, has hosted its own Wikipedia edit-a-thons. 34 These sessions involve academics and librarians coming together at the same place and time to edit academic entries in their field of research with guidance from Wiki.
There are some limitations to this research that need to be considered. It only considers research from three specific universities, and the pattern may be very different at other types of institutions. Given the close relationship of York, Sheffield and Leeds and their shared repository, there may also be significant duplication of citation that has not been addressed. There are also limitations associated with data collection, precisely how Altmetric.com and Unpaywall work, for example, with potential misattributed affiliation or incorrect results from Unpaywall. This problem was highlighted by the work of Tattersall and Carroll, who especially noted an issue with incorrect institutional affiliations. 35 The inherently changeable nature of Wikipedia means that this is a snapshot at a specific point in time; results may be very different if data were collected today.

Table 1. Results of the manual validation of the sample of Unpaywall results

Results
In total, there were 6,454 citations of White Rose Universities' research on Wikipedia in the period 1922 to April 2019. Research from the University of Sheffield had 2,523 Wikipedia citations, which was marginally more than Leeds, with 2,406 citations. The University of York had 1,525 Wikipedia citations, as highlighted in Table 2. The total number of items in each university's Altmetric.com databases were captured, excluding datasets and clinical trial records, as these received no Wikipedia citations and represent a very small percentage of the total items produced across the White Rose institutions. We included articles, books, chapters and news stories, although there was only one record of the latter, which originated in Sheffield and was a nature column piece.
Biological Sciences and Medical and Health Sciences overwhelmingly had the highest number of Wikipedia citations for each institution, as noted in Table 2. Whilst several disciplines were comparable across the institutions, some did much better than others. For example, Physical Sciences research from the University of Sheffield received considerably more Wikipedia citations than work in this field from Leeds or York. The University of Leeds Earth Sciences and Chemical Sciences research received much higher numbers of citations than the same categories from Sheffield or York. Despite fewer citations overall across the disciplines, York had more citations in History and Archaeology compared to Sheffield and Leeds. There were 642 Wikipedia citations that were not attributed to any discipline in the sample.
The validation of Unpaywall data highlights some of the challenges in determining and classifying OA status. Outputs made permanently open under an open licence may lend themselves to a conclusive analysis, but outside of this, there can be considerable ambiguity about how open an output is, and these statuses can change over time. Overall, there were no results that raised concerns about the general reliability of the Unpaywall OA data.

Table 2. Wikipedia citations of White Rose Universities by discipline

Table 3.
Unpaywall records (de-duplicated)
The remaining two-thirds did not have an OA version available through the publisher platform but did have a version available through an OA repository or had no OA host stated. These results were fairly consistent across the three institutions, with York returning the highest proportion of outputs openly available on the publisher's platform and Leeds the highest proportion of outputs openly available through a repository.

Table 4 presents no surprises, showing that journal outputs make up the largest proportion of outputs, given that the journal article is by far the dominant format to disseminate knowledge within academia. Reference entries were identified as the second most popular genre, making up no more than 3.3% of the overall total of outputs. We were unable to capture a universal description as to what a reference entry is, as it varies according to the specific publisher, and there is no consistent taxonomy used. It may refer to encyclopaedia outputs, journal articles and forms of grey literature.

Table 5.
Research published under a Creative Commons licence was most notable, and the majority of the licensed works, with 532 published outputs, had a CC BY licence. There were 55 examples of CC BY-NC licensed research across the White Rose institutions and 93 items licensed under CC BY-NC-ND. Elsevier-specific open access licences accounted for 64 research outputs, and 58 items had open access implied. Most research outputs had no licence stated, accounting for 3,327 items. This high number is probably due to, historically, green OA records in a repository not having a licence. Five of the cited outputs were identified as being 'public domain' (PD), although it is unclear how this determination was made. In this context, public domain appears to reflect a lack of clear copyright attribution and, for practical and analysis purposes, should probably be treated the same as 'No licence stated'.
The oldest publication that was available open access and cited in a Wikipedia entry was from 1910, 23 whilst the oldest paywalled research article was published in 1922. 24 It is noteworthy that publication data that is tracked in Altmetric.com appears to go back as far as 1666. 25
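The licence tallies quoted above can be turned into shares, which also illustrates why percentages rounded to one decimal place need not sum to exactly 100%. The sketch below assumes, purely for illustration, that the seven tallies quoted in the text are the complete set of categories; Table 5 itself is not reproduced here, so the resulting shares are shares of the quoted tallies only and should not be read as the published table values.

```python
# Licence tallies as quoted in the text (category labels are ours).
licence_counts = {
    "CC BY": 532,
    "CC BY-NC": 55,
    "CC BY-NC-ND": 93,
    "Elsevier-specific OA licence": 64,
    "Open access implied": 58,
    "Public domain": 5,
    "No licence stated": 3327,
}

# Assumption: these seven categories are exhaustive.
total = sum(licence_counts.values())

# Share of each category, rounded to one decimal place as in the tables.
shares = {name: round(100 * n / total, 1) for name, n in licence_counts.items()}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:30s} {share:5.1f}%")

# Each share is rounded independently, so the rounded values can drift
# slightly away from 100% even though the underlying counts are complete.
print("Sum of rounded shares:", round(sum(shares.values()), 1))
```

Treating 'public domain' the same as 'No licence stated', as the text suggests, would simply merge those two counts before the shares are computed.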