An AI toolkit for libraries

Now that artificial intelligence (AI) tools are being widely used across academic publishing, how can we make informed assessments of these utilities? There is a need for a set of skills for evaluating new tools and measuring existing ones, which should enable anyone commissioning or managing AI utilities to understand what questions to ask, what parameters to measure and which pitfalls to avoid when introducing a new utility. The skills required are not technical. Potential problems include bias in the corpus, a poor training set and poor use of metrics for evaluation. This article gives a quick overview of some of the areas where AI tools are being used and how they work. It then provides a checklist for assessment. The goal is not to discredit AI, but to make effective use of it.


Introduction
A colleague walks up to you excitedly. 'I've just discovered a really cool AI app to speed up the submissions process for my articles: it does what we currently do by hand twice as quickly, and all you have to do is press a few buttons! Check it out!' How should you respond? The interface certainly looks well designed and appealing. There is not much information on the site, but the developers seem to have thought of everything. Do you feel qualified to give an opinion on this tool?
The aim of this article is to outline a framework for evaluating artificial intelligence- (AI-) based tools, without the need to have or to acquire detailed technical knowledge of how they were developed, or any requirement to understand computing languages such as Python, or indeed any advanced maths. Nonetheless, the criteria for evaluation described here are crucial for the successful use of AI. The article argues that human users may in some cases be better placed to evaluate the capabilities of a tool than the original developers, who quite possibly were not in a position to appreciate the context in which it would be used.
AI tools can potentially reduce the time taken to discover or to submit academic content; they also have the potential to improve the quality of published articles by running more detailed and more accurate checks in advance of publication. However, it is not the intention of this article to provide arguments for or against the use of AI tools compared with human evaluation. Instead, the aim is to outline how such tools can be assessed in a real-world setting. There are already many tools making use of AI in our daily lives, but we do not always realize that AI is involved. For many of these tools, for example some components of the Google search engine, their introduction took place without debate, and we have subsequently become accustomed to their strengths and weaknesses.
Like many new technologies, AI has been viewed from widely different perspectives during its long lifetime, dating back over 75 years to the 1950s, 1 from wild optimism to being written off. Unfortunately, both attitudes are wide of the mark, and neither extreme is helpful for a balanced appraisal of AI tools. Repeatedly during AI's lifetime, many highly regarded thinkers have described AI either with glowing optimism or expected it to bring about disaster. In the light of these warnings, should we be implementing AI tools or calling for more investigation? Should we abandon AI, or use it intelligently? The suggested solution here is to concentrate on a small subset of available AI tools and to follow a clear methodology for assessing them.

What AI tools consist of
The narrow, or limited, AI described here is based around a few components. Present-day AI text tools for academic purposes typically comprise:
• The 'corpus' is the body of content that you wish to analyse, for example, all scientific research articles published in the last 20 years. The corpus contains some information or characteristic that you wish to extract. A corpus need not only be text: there are corpora for facial recognition, for example, as well as the often-cited collections of images of cats and dogs.
• The 'training set' is a subset of the corpus which has been tagged in some way to identify the characteristic you are looking for. Thus, a training set for cats and dogs might be 100 images tagged by a human as one or the other. Another example is the Modified National Institute of Standards and Technology (MNIST) database of handwritten numbers, 10 which shows many examples of the range of styles used when humans write numbers by hand (see Figure 1).
• The 'test set' is the collection of documents used to trial the algorithm, to see how successfully it carries out the operation.
• The 'algorithm' is simply the tool that looks at each item in the corpus and enables a decision to be made. An algorithm may be (and frequently is) as simple as matching a pattern: for example, given ten handwritten examples of the numbers 0 to 9, the machine is asked to find the closest match between the training set and the test set. A cookery recipe is an algorithm, as is a way of sorting documents, by date or by subject. Much of computing is based around identifying the most effective algorithm to solve a specific problem, for example, how to sort a collection of numbers into numerical order.
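These components can be illustrated with a minimal, purely illustrative sketch: the four-pixel 'images', their labels and the closest-match rule below are invented assumptions standing in for a real corpus and a real algorithm.

```python
# A minimal sketch of the corpus / training set / test set / algorithm split.
# The "images" here are hypothetical four-pixel grayscale vectors.

def distance(a, b):
    """Sum of absolute pixel differences between two images."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_match(item, training_set):
    """The 'algorithm': return the label of the closest training example."""
    best = min(training_set, key=lambda ex: distance(item, ex["pixels"]))
    return best["label"]

# Training set: a human-tagged subset of the corpus.
training_set = [
    {"pixels": [0, 9, 9, 0], "label": "0"},
    {"pixels": [0, 9, 0, 9], "label": "1"},
]

# Test set: an untagged item used to trial the algorithm.
test_item = [1, 8, 9, 1]
print(nearest_match(test_item, training_set))  # → 0
```

The point of the sketch is that the algorithm itself is trivial; everything depends on what the training set (and, behind it, the corpus) contains.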
Surveys suggest that the public has a low awareness that algorithms are being used. 11 Worse, there is a common misconception that, when algorithms are used, they are the cause of any defects of the tool. One of the myths about present-day AI is that it is entirely about algorithms: if only algorithms were revealed in public, the argument goes, then all the mysteries of AI would be solved. 12 Reports in the media have tended to reinforce this misconception, accusing the algorithm of creating the problem; a 2019 study by the European Parliament on algorithmic accountability does not mention the term corpus. 13 Newspaper reports about using an algorithm to determine the results of public examinations suggested the algorithm was the cause of the issue, rather than the (undocumented) way it had been implemented: 'We all remember the A-levels fiasco, when an algorithm decided what the results should be … the poorest students received worse marks'. 14 More exactly, the success or failure of AI is based as much on the corpus as on the algorithm. If the corpus used has an imbalance of gender, ethnic group or geographical origin, then the algorithm will simply replicate that bias.
To summarize, artificial general intelligence raises many issues that, to be honest, are of little relevance to most present-day AI, even though they will keep leading-edge researchers busy for years. Narrow AI tools can, if implemented sensibly, greatly enhance our ability to carry out many of the tasks in the academic workflow. How these tools are selected and implemented is all-important. How can these tools, and the corpora they are based on, be evaluated? The role of the library is crucial in providing guidance on real-world selection, implementation and, finally, appraisal and metrics.
'One of the myths about present-day AI is that it is entirely about algorithms'

What is present-day AI?
For many years, AI researchers have been obsessed with creating AGI, artificial general intelligence: one algorithm that could answer all the questions in the universe. The idea behind a universal algorithm is, in the words of AI researcher Pedro Domingos, 'If it exists, the Master Algorithm can derive all knowledge in the world – past, present, and future – from data. Inventing it would be one of the greatest advances in the history of science.' 15 The idea of a universal general intelligence was widespread in the 1960s. It fell out of favour for several years, but traces of it are still evident today in research departments. According to Wikipedia, 16 there were 72 active AGI projects running in 2020, which indicates that many researchers continue to look for a unified solution via the use of AI, rather than making use of limited tools in specific contexts, which is what this article concentrates on.
For the purpose of this article, the master algorithm will be ignored and the focus will be a smaller set of tools, typically employed for just one purpose. Formally, the tools described here make use of what is called 'supervised' or 'semi-supervised' machine learning. 17 'Supervised' means there is some human involvement in setting up the tool, usually in determining what the correct answers should be. 'Machine learning' (ML) means the use of a computer to follow a pattern, whether or not the pattern is identified by a human. 'Natural language processing' (NLP) means the identification of patterns in spoken or written text.

Do we know that AI is being used?
This is a more fundamental question than might be imagined. There are many examples of AI tools in use without any mention that AI is involved, although, increasingly, the impact of the AI tool might be too subtle to notice. For example, in a Google blog post about BERT, 18 an ML technique for NLP, the benefit shown was simply the ability to link a preposition with a noun. Whereas earlier search tools tended to ignore prepositions and just focus on nouns, this more sophisticated tool was able to handle a question about a traveller from Brazil to the USA: it identifies a meaningful connection between the 'to' and the 'USA'.
In social media and product literature, the term AI is frequently used as a buzzword to give the impression that a tool is more sophisticated than it really is. In practice, the kind of small-scale AI described above is very closely linked to 'string matching' or other well-established simple techniques. String matching means the use of a machine to identify instances of a sequence of characters in a text. 19 Eslami claims that once users are shown they are interacting with an algorithm, rather than with a human, they are reassured; there certainly appears to be widespread suspicion of an algorithm making decisions for a human. Not revealing that there is no human involved makes things much worse: the users feel cheated, because they were not told. Google search is an example where we as everyday users acknowledge that a perfect search experience is not possible, given the size and limitations of the corpus, and we tolerate the imperfections because we are not aware of any better alternative. As two researchers put it, 'College students AND professors might not know that library databases exist, but they sure know Google'. 20
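String matching of this kind is easy to sketch. The example sentence below is invented; the point is only that literal matching has no notion of meaning, which is exactly the gap that tools such as BERT try to close.

```python
# String matching, which underlies much practical 'AI', finds literal
# substrings only: it cannot connect 'USA' with 'United States'.
text = "Travellers from Brazil to the USA need a visa."

print("Brazil" in text)         # → True
print("United States" in text)  # → False: no semantic link to 'USA'
```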

Can we combine the brain with technology?
Machines cannot think, but humans can. One way to assess AI tools is to determine what they are or are not good at. Some human activities lend themselves to automation more than others. Sorting a list into alphabetical or numerical order, for example, is an activity that a spreadsheet can do very easily, but that humans do slowly and with a high error rate, partly because humans have a limited attention span and find it irritating to sort more than a few records. Does that mean the human brain is inadequate? Hardly, but it does imply that human brains do not represent the ideal that all AI research is aimed at emulating. Similarly, humans have very poor information retention skills: we think ourselves clever if we can remember ten phone numbers. Miller's Law, 21 formulated by a Harvard psychologist, suggests that the number of objects a typical human can hold in memory is just seven.

What can we meaningfully ask of AI?
Questions we ask of AI tools may have different criteria from scientific research questions. The corpus-based approach using a training set, as described above, relies on inductive reasoning. This is the kind of thinking that states 'the sun rose yesterday, the sun rose today, so the chances are the sun will rise tomorrow'. Now, philosophers will argue that inductive reasoning is not scientific: just because the sun rose yesterday does not mean the sun will rise tomorrow, and we would like some external proof to enable us to sleep more peacefully. Inductive reasoning is well described by Eric Larson. 22 However, for the purpose of AI tools as described here, inductive reasoning may be adequate, indeed ideal. The goal is to use existing evidence to predict a likely inference. Typically, we look to provide good-quality results that are better than a human could achieve without the tool: 'better' here means results at least as good as a human's but delivered faster, or better results with no loss of time compared to a manual process, or both. Hence, for the purpose of AI in this context, you can ask an algorithm if the sun will rise tomorrow, and the machine will give you a workable answer for practical purposes.
To state that narrow AI tools make use of inductive reasoning may seem obvious, yet it is frequently ignored when humans assess the results of a machine-based process. For self-driving cars, an error rate of one in a thousand might mean abandoning the whole project. For spam checking and spell-checking, a much higher error rate may be good enough to use the tool.

Are we AI literate?
Long and Magerko 23 define AI literacy as 'a set of competencies that enable individuals to critically evaluate AI technologies, communicate and collaborate effectively with AI, and use AI as a tool'. Here is an essential role for the information professional. Millions of people use Google every day, but there is a difference between unthinking use and critical awareness. Long and Magerko further define over 30 relevant factors, of which just the first five are skills I believe to be essential to the assessment and recommendation of AI tools:
1. Distinguish between tools that do and do not use AI.
2. Analyse differences between human and machine intelligence.
3. Identify various technologies that use AI.
4. Distinguish between general and narrow AI.
5. Identify problem types that AI excels at and problems that are more challenging for AI.
To be specific, the skills outlined here do not, I believe, require the ability to code. Given the increasing use of AI tools, it is becoming more difficult to distinguish tools based on human judgement from those based on machine judgement (skill 1). Perhaps this skill will eventually be subsumed into skill 5, the ability to identify problem types that lend themselves to an AI-based solution.

AI use in academic contexts
This section looks at some areas where AI tools are currently in use in the scholarly workflow.
'the skills outlined here do not … require the ability to code'

Spell-check
The spell-check tool provided with many common word processors is an example of a widely used and generally accepted algorithm, or collection of algorithms. Users acknowledge (and frequently complain) that spell checkers do not detect all errors that would be detected by a human.

Figure 2. A typical spell checker displaying the limitations of a context-free tool 24

Users of spell checkers have learned to live with their biggest drawback: most spell checkers accept any word that corresponds to a term in the dictionary, even if it is the wrong word in context. As shown in Figure 2, the spell checker had no difficulty finding the misspelling of 'quick', but any English speaker would know that the last word should be 'dog', not 'god'. Nonetheless, a spell checker can comprehensively and consistently identify transposed letters in words. The limitations are known and tolerated.
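The context-free behaviour described above can be sketched in a few lines. The toy dictionary below is an invented assumption; real spell checkers use far larger word lists, but the limitation is the same.

```python
# A minimal context-free spell check: it flags 'quik' but happily accepts
# 'god', the wrong word in this context (the limitation shown in Figure 2).
dictionary = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "god"}

def misspellings(sentence):
    """Return every word not found in the dictionary, ignoring context."""
    return [w for w in sentence.lower().split() if w not in dictionary]

print(misspellings("the quik brown fox jumps over the lazy god"))
# → ['quik']  -- 'god' passes the check because it is a dictionary word
```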

Spam check
Checking for spam e-mails is one of the most widespread uses of AI. Spam checks use a mixture of word- and phrase-checking to identify a likely spam message. A variety of checks are run, including:
• Is this an unfamiliar sender ID?
• Does the e-mail include terms such as 'offer' or 'bargain'?
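The checks listed above can be sketched as a simple rule-based scorer. The spam terms, weights and contact list below are illustrative assumptions, not any real filter's rules.

```python
import re

# A rule-based sketch of the spam checks listed above.
SPAM_TERMS = {"offer", "bargain", "winner", "free"}
CONTACTS = {"alice@example.org", "bob@example.org"}

def spam_score(sender, body):
    score = 0
    if sender not in CONTACTS:  # check 1: unfamiliar sender ID
        score += 1
    words = set(re.findall(r"[a-z]+", body.lower()))
    score += len(words & SPAM_TERMS)  # check 2: suspicious terms
    return score

print(spam_score("unknown@example.net", "Special offer: a bargain for you"))  # → 3
```

A message scoring above some threshold would be routed to the spam folder; choosing that threshold is exactly the kind of context-dependent judgement discussed later in this article.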
As with spell-checking tools, spam checkers are imperfect but widely accepted, because the alternative, of repeatedly reviewing and deleting irrelevant e-mails, would make the use of e-mail difficult if not impossible. Users tolerate the small number of false positives (e-mails wrongly identified as spam). Jenna Burrell 25 differentiates various kinds of opacity in algorithms and reveals some interesting details about the criteria used to detect spam, but does not mention the corpus dimension of spam checking: an e-mail from an address not in the individual's set of e-mail contacts is more likely to be spam.

Plagiarism detection
Plagiarism detection tools, such as Turnitin, Copyleaks and others, can use string or semantic matching, or both. The most common form is simply checking for string matches. A simple plagiarism check can be run against published articles by searching for a string, such as a full article title, in Google: the system typically finds a match (if one exists) in less than a second (see Figure 3). In recent years, plagiarism checks have become more sophisticated. While most common search engines can find strings of characters very efficiently, it is more challenging to find semantic matches. If the plagiarist uses the same ideas but changes a few of the words for common synonyms, the plagiarism is (currently) far less likely to be detected. This limitation does not prevent plagiarism tools being widely used by many academic publishers.
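The weakness of string-based checking against synonym substitution can be sketched with shared word n-grams. The sentences below are invented, and this is a toy version of the technique, not any vendor's implementation.

```python
# String-based plagiarism checking sketched as shared word trigrams.
def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

original = "the cell cultures were incubated for three days"
copied   = "the cell cultures were incubated for three days at room temperature"
reworded = "the cell specimens were kept warm for three days"

print(len(ngrams(original) & ngrams(copied)))    # → 6: verbatim copying is obvious
print(len(ngrams(original) & ngrams(reworded)))  # → 1: synonyms largely evade the check
```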

Discovery
One of the longest-established uses of AI is in content discovery. This can range from the simple recommender tool ('if you like X, you will like Y') to much more sophisticated recommenders that identify concepts in an article and match those concepts to other articles. Figure 4 shows an example from the Cambridge University Press content collection, linking book chapters to other book chapters and to articles.
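The simplest end of this spectrum can be sketched as ranking candidate articles by word overlap with the one being read. The titles and the Jaccard-overlap measure below are illustrative assumptions; production recommenders use far richer concept matching.

```python
# A sketch of the simplest recommender: rank other items by word overlap.
def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)  # Jaccard overlap of title words

reading = "machine learning for citation analysis"
candidates = [
    "deep learning for image recognition",
    "citation analysis for the sciences",
    "a history of the printing press",
]
ranked = sorted(candidates, key=lambda c: similarity(reading, c), reverse=True)
print(ranked[0])  # → citation analysis for the sciences
```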

Impact
Citations are one way of assessing the relative worth of an article: if it has been widely cited, it must be significant. Citation indexes for academic journals were introduced in 1955 by Eugene Garfield. 27 But, of course, citations are contentious as indicators of quality. An article might be cited because the writer thinks the source article is incorrect. Citations for certain types of articles, such as review articles, are always higher than for research articles.
One reason why citations became widely adopted as a metric for article quality is that they can be counted. A human judgement ('is this article significant?') is thereby represented, however imperfectly, by an arithmetic tool. However, it is now recognized that a simple count of citations is unsatisfactory, and several modifications of the tool have been proposed, for example, the Hirsch or H-index. 28 AI tools have enabled a more sophisticated analysis of citations. Tools are available, for example from scite.ai, 29 Scholarcy 30 (see Figure 5) and Semantic Scholar, 31 that not only identify citations, but show whether the citation supports or refutes a statement. 32

Figure 5. Categorizing citations into supporting or differing from earlier research 33

Currently, few researchers will be aware of the wealth of tools available to them in this area.
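The H-index mentioned above is a simple arithmetic refinement of citation counting: a researcher has index h if h of their papers each have at least h citations. A short sketch, with invented citation counts:

```python
# Compute the Hirsch (H-) index from a list of per-paper citation counts.
def h_index(citations):
    counts = sorted(citations, reverse=True)
    # With counts in descending order, position i qualifies while counts[i] >= i+1.
    return sum(1 for i, c in enumerate(counts, start=1) if c >= i)

print(h_index([10, 8, 5, 4, 3]))  # → 4: four papers with at least 4 citations each
print(h_index([25, 8, 5, 3, 3]))  # → 3
```

The second example shows why the H-index was proposed: one heavily cited paper does not by itself raise the index, unlike a raw citation count.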

The need for analytics
Any intervention in the academic workflow can only be assessed for robustness, efficacy and accuracy if its impact is evaluated. This is as true for AI tools as for any other attempt to improve the process. Accordingly, both libraries and publishers have a responsibility to identify if, and how, AI tools are used, and to attempt to identify the impact of those interventions. However, many libraries do not carry out such studies; a 2021 survey of library analytics practice 34 identified the greatest barriers to data analysis (interpreted broadly to include bibliometrics, studies of user behaviour and suchlike) by libraries. Unfortunately, all these justifications for inaction are ultimately self-defeating. If a poor-quality AI utility is adopted by the library, it will take more, rather than less, time to manage its use and quite possibly lead to misunderstanding and corresponding negative feedback. Introducing a tool without capturing the data to assess its success cannot be a sensible procedure.
'AI tools have enabled a more sophisticated analysis of citations'

Automating metadata
Machines have difficulty with ambiguity, while humans are tolerant of inconsistencies and small errors. However, increased use of AI has partially resolved this distinction. Today, keying 'shakspeare' into Google results in the tool automatically suggesting the closest match from its index (see Figure 6). The system may be correct, or it may on occasion be wrong, but we tolerate its errors because most of the time it works and saves us effort. Humans have learned to live with the imperfect world of AI.
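Google's behaviour cannot be reproduced exactly, but the 'closest match from its index' idea can be sketched with standard-library fuzzy matching. The toy index below is an invented assumption.

```python
import difflib

# Approximate a 'did you mean' suggestion: pick the closest term in the index.
index = ["shakespeare", "shaw", "sheridan", "marlowe"]
suggestion = difflib.get_close_matches("shakspeare", index, n=1)
print(suggestion)  # → ['shakespeare']
```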

Potential misuses of AI
Once developed, AI tools can be extended to domains where their validity is greatly reduced. There are several examples of this overextension of AI tools, with predictably irrelevant or meaningless results. A ranking website, Academic Influence, provides metrics for ranking departments, scholars and faculties with the slogan 'better rankings for a better education'. The site also contains a list of 'the most influential people for the years 4000 B.C. to 2022', with Aristotle in first place, ahead of Plato, Marx and Kant. Shakespeare is listed in sixth place, with his plays and sonnets listed under his 'academic contributions' (see Figure 8). 35 In this case, a tool developed for the comparison of current scholarship has been extended back several hundred years, to a time before scholarly communications existed. Yet the site claims that its metrics are built using 'sanity checks': 'we make sure the rankings make sense by performing "sanity checks" against other independent information sources such as periodicals, journals, and global media outlets'. Trying to establish the greatest thinkers in world history via an algorithm designed for researchers and academics is unlikely to produce trustworthy results.

The need for a sanity check
There is undoubtedly a need for sanity checking of AI-based tools, and humans are necessary to carry this out. Algorithms in narrow AI have no knowledge of the real world: a system that can differentiate images of cats from images of dogs could still not define what a cat or a dog is, nor could such a system identify any other animal species. It is nonetheless tempting to apply an algorithm to a corpus well beyond its viable scope. However clever the algorithm, a human needs to check that the results correspond with a common-sense view. What is meant by common sense here is quite specific, and quite limited. For example, consider an algorithm that applies subject tags to an academic article, providing a probability ranking for the subjects physics, chemistry and politics on a scale between 0 and 1 (see Table 1). Clearly, article 1 looks to be more about physics than anything else, while article 2 seems to be obviously about chemistry. However, the algorithm has found traces of politics content in both articles. The determination of a suitable threshold, below which the user should state 'this article is not about politics', needs to be made by common sense, or by using a subject-matter expert to identify what the threshold should be. In this case, the machine delivers a result, but in all probability the result for politics can be discounted.
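Applying such a threshold is straightforward once it has been chosen. The scores and the 0.2 cut-off below are invented for illustration; the point is that the cut-off comes from human judgement, not from the machine.

```python
# Apply a common-sense threshold to predictive subject scores.
THRESHOLD = 0.2  # to be set by a subject-matter expert, not by the machine

scores = {"physics": 0.85, "chemistry": 0.30, "politics": 0.05}
tags = [subject for subject, p in scores.items() if p >= THRESHOLD]
print(tags)  # → ['physics', 'chemistry']: the trace of politics is discounted
```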

Content | physics | chemistry | politics
One of the most widely used metrics for text-based AI is the F1 score, 36 which is the harmonic mean of recall and precision, usually shown on a scale between 0 (no accuracy) and 1 (perfect accuracy). However, the F1 score has limitations which are easy to recognize using common sense. Harikrishnan 37 gives an example of a pregnancy test of 100 women which identifies five as pregnant when they are not (false positives) and ten as not pregnant when they are (false negatives). A machine-based algorithm that produced these figures would have an F1 score of 0.8, which in other contexts might represent an acceptable score, but would certainly not be adequate for a pregnancy test.
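The 0.8 figure can be checked directly. The article gives only the error counts, so the split assumed below (40 of the 100 women pregnant, hence 30 true positives) is one consistent reading of the example, not a figure from the source.

```python
# F1 is the harmonic mean of precision and recall, computed from the
# confusion counts: true positives (tp), false positives (fp), false negatives (fn).
def f1(tp, fp, fn):
    precision = tp / (tp + fp)  # of those flagged pregnant, how many were?
    recall = tp / (tp + fn)     # of those pregnant, how many were flagged?
    return 2 * precision * recall / (precision + recall)

# Assumed reading of the pregnancy-test example: 30 true positives,
# 5 false positives, 10 false negatives.
print(round(f1(tp=30, fp=5, fn=10), 2))  # → 0.8
```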
This is where the information professional has a key role. Some idea of context, of what is or is not required in the situation, is vital for ensuring that any tool delivers relevant results: are the results relevant in context?
An example of the F1 score in use for subject tagging is described by Goh. 38 This study compares humans with a machine used to classify articles by subject. The machine outperforms the postgraduates by delivering a significantly more accurate set of tags. Even more impressively, it took one postgraduate two hours to classify 247 abstracts, compared to five seconds for the machine to complete the same exercise.
A comparison of the subject tagging between the machine and the humans shows immediately that the machine consistently delivers better results than the human taggers, but the most significant inference of the study is implied rather than explicitly stated. For the purpose of subject tagging, humans were found in this controlled study to have an average F1 score in the region of 0.5 or less, while the machine result was considered usable with an F1 score of around 0.7. While this figure at first glance seems poor (given that a perfect score would be 1), the implication is that the F1 score should be interpreted in context, not as an absolute measure.
In other words, when comparing machine-based with human results, we should be considering relative, rather than absolute, measurements. If a machine delivers a better result than is achievable by hand, it makes sense to adopt the machine solution immediately.
Some sanity checks can be built into the tool itself. For example, a tool to identify peer reviewers could helpfully provide an indication when an article is submitted that is outside the corpus used to identify reviewers. A tool to identify the most relevant journal for an article could return 'no suitable match found' if an article outside the subject areas of the corpus is submitted to it.

The corpus and bias
Another key role that information professionals can play in the evaluation of AI tools is raising awareness of potential and actual bias. Any corpus contains bias, and that bias is typically independent of any AI tools: all real-world data is inevitably biased. Even seemingly neutral collections reveal unconscious bias. For example, it might be assumed that PubMed, a collection of millions of scholarly articles published on biomedical topics over the last 50 years, would comprise a statistically valid corpus, yet there are more male than female authors in PubMed. Is this surprising? A study of gender disparity in medical articles 39 found this was the case over the last 20 years. Another article 40 shows a revealing graph of male and female authorship of articles in science journals since 1955: while the proportion of female authorship is growing, males continue to author the majority of science articles.
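Auditing a corpus for this kind of imbalance can be sketched in a few lines, assuming the article metadata carries (or can be enriched with) an author-gender field; the three-record corpus below is invented for illustration.

```python
from collections import Counter

# A sketch of a corpus audit for authorship imbalance.
corpus = [
    {"title": "Article A", "author_gender": "male"},
    {"title": "Article B", "author_gender": "male"},
    {"title": "Article C", "author_gender": "female"},
]
counts = Counter(rec["author_gender"] for rec in corpus)
share_female = counts["female"] / len(corpus)
print(f"female authorship: {share_female:.0%}")  # → female authorship: 33%
```

Measuring the imbalance is the easy part; deciding whether and how to correct for it is exactly the judgement the surrounding text assigns to information professionals.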
Of course, once bias is recognized, it is possible to take steps to work around it, but the frequent lack of awareness of bias means that there is an important role for information professionals when recommending these tools.
'If a machine delivers a better result … it makes sense to adopt the machine solution immediately'

Credibility
What role does the information professional play in all this? Most of the criteria described above could be checked directly by the user, that is, the researcher, but most researchers have neither the time nor the knowledge to make measured comparisons of different tools. Without a solid analytical framework, humans tend to rely on instinct, which could be described as an internal assessment mechanism: they instinctively trust (or do not trust) a familiar methodology, or tools they have used before.
The role of the information professional in all this is to provide credibility: external validation that enables users to trust a tool and to deploy it with confidence. Researchers will, for the most part, look for an external validation of a tool that they can trust, and information professionals provide that credibility, based on their detailed evaluation.

The AI toolkit: a framework for evaluation of AI tools
Here, in summary, is a toolkit for information professionals appraising any AI tool. Although it makes use of Long and Magerko's idea of AI literacy, the requirements here are much more specific.

Goal
1. What is a realistic goal? Expecting perfection from an AI utility is unrealistic: AI tools based on a training set cannot achieve 100% accuracy. Nonetheless, the accuracy they provide should be considerably greater than using humans for the same task.
11. What metrics will be used to evaluate the tool? The F1 score, if used, must be interpreted in context.

Sanity check
12. Sanity check/common sense: have the developers built in 'common-sense' limitations to prevent the algorithm being applied too widely? Am I asking a meaningful question? Is this a feasible exercise?
13. Does the tool provide feedback when a question is out of scope?
14. Based on the checks above, is the tool fit for purpose?
'The role for the information professional in all this is providing credibility'

Figure 1. Example of the MNIST database of handwritten numbers

Figure 3. Google search for an article title

Figure 4. Example of a recommender system in action

Figure 8. Entry for William Shakespeare from 'the 50 most influential scholars of all time'

Corpus
2. Is the corpus large enough? Is the training set large enough?
3. What are the start and end dates for the data in the corpus? Does this matter?
4. Who chose the corpus, when was it chosen and for what purpose? Details of the corpus used, like the data for a research article, should be publicly stated and accessible.
5. What is the corpus bias?
6. Is the tool likely to raise diversity, equality and/or inclusion issues?
7. Is personal data captured and reused?

Algorithm
8. Have the developers provided a single-sentence summary of the methodology behind the algorithm?

Evaluation and Metrics
9. Have I measured the current process before introducing any change, for example, time taken, number of errors?
10. Who is to evaluate: end users or subject-matter experts, or both? Internal or external?

Table 1. Typical predictive scores for subject tagging