Semi-Automated Methods for BIBFRAME Work Entity Description

Abstract

This paper reports an investigation of machine learning methods for the semi-automated creation of BIBFRAME Work entity descriptions within the RDF linked data editor Sinopia (https://sinopia.io). The automated subject indexing software Annif was configured with the Library of Congress Subject Headings (LCSH) vocabulary from the Linked Data Service at https://id.loc.gov/. The training corpus comprised 9.3 million titles and LCSH linked data references from the IvyPlus POD project (https://pod.stanford.edu/) and from Share-VDE (https://wiki.share-vde.org). Semi-automated processes were explored to support and extend, not replace, professional expertise.


Introduction
Describing library resources with the BIBFRAME vocabulary and its core entities of Work, Instance, and Item is a resource-intensive process. What had been a one-record description in MARC can now be a collection of three and sometimes more entity descriptions in BIBFRAME. Cataloging with BIBFRAME in linked data RDF editors involves careful selection of, and reference to, external authority entities. Creating external authoritative links is essential to producing an accurate context for the Work entity description.
The BIBFRAME vocabulary postdates the general work set definition promulgated by Svenonius in The Intellectual Foundation of Information Organization. In the text, Svenonius argues that "… a work is a set or family of documents in which each document embodies essentially the same information or shares essentially the same intellectual or artistic content" and, "in structuring a database, work sets are used to perform two essential functions: to organize displays and to provide nodes for linking related bibliographic entities." 1 This definition underscores the value of creating and describing Work entities while foregrounding the general Work description as a set of attributes. 2 Further, Harper and Tillett 3 echoed Svenonius' assertion of the Work description's utility for the user interface and the "display" of bibliographic data. The display of these relationships is a crucial task that MARC, and the systems that generated the interface to MARC, were perhaps not well suited to accomplish. 4 This issue of display underscores one of the problems that linked data vocabularies help to address: displaying bibliographic relationships and their interconnections explicitly, in a way that is easier to understand. These displays serve the user tasks articulated in the Library Reference Model (LRM) delineation: find; identify; select; obtain; explore. 5 Subjects are a key category of authority control. As Harper and Tillett noted, "When we apply authority control, we are reminded how it brings precision to searches, how the syndetic structure of references enables navigation and provides explanations for variations and inconsistencies, how the controlled forms of names, titles, and subjects help collocate works in displays…" 6 The BIBFRAME Work is defined as follows: "…The highest level of abstraction, a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects)." 7
Description of the bibliographic Work entity is thus both a critical task and one that may involve many linked data references, which can be difficult to construct in a streamlined manner.
The machine learning process described herein focuses on supporting the semi-automated creation of the BIBFRAME Work description. As a potential integration target, semi-automation contrasts with completely automated cataloging and is a very specific use of machine learning. Because general-purpose AI (artificial intelligence) does not exist, 8 libraries have no AI that could catalog on its own or otherwise replace cataloger choice. Instead, this investigation is concerned with the tasks ahead for linked data catalogers and seeks automation that supports and extends, not replaces, professional expertise. Specifically, this case study details approaches for developing machine learning operations in libraries that support a "human in the loop," or in this case a professional "cataloger in the loop," through semi-automated subject suggestions within an RDF linked data editor. Researchers have begun to show that for some professions, the wisest AI implementation scenario is augmenting existing expertise with semi-automated support, a form of "human compatible" AI. 9 The paper progresses next to a background survey of linked data RDF editors. Thereafter, the machine learning operations methods that libraries require are described. The results section then reports a machine-generated evaluation of the predicted subjects together with a treatment of the designed human and algorithm collaborations that Annif can support. The paper concludes with a discussion of the future implementation of machine learning operations in library settings.

Background: RDF linked data editors
Several metadata editors and software tools are available to catalogers who are creating BIBFRAME entity descriptions. The BIBFRAME editor, bfe, developed at the Library of Congress, 10 is a standalone editor. In practice, bfe is not yet networked with a cloud service backend outside of the Library of Congress. A library-hosted back end may be possible following the recent redevelopment of the bfe. 11 The rebuild of the bfe interface, which accompanied significant code redevelopment, was influential on Sinopia's user interface patterns.
The Sinopia Linked Data RDF Editor was greatly expanded and built upon after it was forked from the starter code base of the BIBFRAME editor, bfe. In their current state, bfe and Sinopia differ significantly in architecture and codebase. The Sinopia Linked Data RDF Editor provides cloud-based storage and a shared environment for entity description. 12 Recent enhancements have made it possible to generate a minimal MARC21 record based upon a BIBFRAME Instance entity and the Work entity with which it is associated.
A third, web-based tool, Metadata Maker, can be configured to create BIBFRAME descriptions. Although web based, Metadata Maker is also a standalone editor that does not integrate with a cloud-based data store. 13 Metadata Maker is often considered a useful tool for those who require a simple interface for basic bibliographic descriptions.
MarcEdit is not an editor in the same way as the previous three software tools; it is a utility that is not typically used to create records but is commonly used for batch processing, among other routine uses. 14 MarcEdit does provide BIBFRAME creation tools that use the Library of Congress's transformation code to convert MARC records to BIBFRAME resources. 15 MarcEdit is valuable for batch matching of subject term entries against the linked data service at https://id.loc.gov/ and represents a critical first step in the process of enriching traditional MARC description. An Application Programming Interface (API) provided by OCLC, the Classify API, is available on the web and accessible through MarcEdit for call number assignment in MARC records. 16 These attributes may be of value in crafting parts of BIBFRAME descriptions, yet the API is curiously not yet part of any RDF editor, though programmatically accessing OCLC data while describing any of the BIBFRAME entities could further streamline linked data creation. Currently, none of the RDF editors described above offer semi-automated methods to assign attributes. There are external APIs in bfe and Sinopia that reference the linked data vocabulary terms and open URIs that are the hallmark of BIBFRAME entity description. The Questioning Authority service in the Sinopia Editor provides this external indexing service. 17 Within the bfe editor, an https://id.loc.gov/ API provides query autosuggestion of terms based upon a cataloger's entry of a known label. These services are detailed on the https://id.loc.gov/ Technical Center page. 18 External APIs are instrumental to completing a BIBFRAME description.
In the machine learning based proposal here, attributes are suggested based upon input from other parts of the entity description; for example, from the set of title fields (Title Label, Preferred Title for Work; Variant Title for Work) in the BIBFRAME Work description to the Subject of the BIBFRAME work, shown in Table 1.
A set perspective views bibliographic entities as being described by attribute sets and introduces the potential for additional autosuggestions to streamline BIBFRAME entity description tasks. For example, entities described by set membership, such as an Author/Agent set, may support auto-population of Subjects and/or Publisher data; Publisher Names may auto-populate Publisher Place; 19 and Author/Agent sets may drive Subject auto-suggestions in BIBFRAME entity descriptions, and perhaps related Works, if set properties can be sent to a prediction server configured for such a use. These possibilities are intended to streamline the creation of linked data descriptions. Grounded in set theoretical understandings of a bibliographic entity, 20 such automation, relying on a cataloger's expertise, can help to produce semi-automated linked data descriptions in more depth and with more modeling of a bibliographic set than traditional metadata editors allow.

Methods
Annif 21 is an open-source machine learning software package used here to generate subject suggestions in linked data. An Annif tutorial available on the web is designed to help users understand the key aspects of conducting a machine learning project using an existing vocabulary. 22 In this study, the vocabulary of interest was the LCSH. The Annif software is configured to accept a Simple Knowledge Organization System (SKOS) vocabulary encoded in a TTL (Turtle syntax) file. According to Summers and others' early implementation work that described representation and conversion of MARC metadata in SKOS, "… SKOS was designed as a general tool for knowledge organization systems (thesauri, classification schemes, subject heading lists, taxonomies, folksonomies) it lacks specialized features to represent some of the details found in LCSH/MARC." 23 The authors also noted initial problems in using skos:Concept to represent several typologies, including geographical, topical, and genre/form concepts. Currently, the files available on the Library of Congress download pages do not intermix these typologies but instead delineate them in separate files, which obviates this concern in the interim. At the time of this experiment, LCSH did not exist in the TTL syntax but could be downloaded as JSON RDF/SKOS. To undertake this experiment with Annif, the LCSH was converted into TTL using an RDF syntax library. 24 The output of the SKOS LCSH TTL conversion was made openly available on GitHub. 25 Thereafter, a metadata set of 1.3 million records from Penn Libraries, containing title and linked data subject associations, was used in a test training run against the LCSH vocabulary. This initial training corpus was used first to evaluate the viability of using Annif. The Penn subject-title pairs comprise titles, each with an associated linked data reference to a subject.
Share-VDE enrichment processes were the mechanism by which much of the Penn Libraries linked data references were added to their metadata. 26 The schema and sample data are shown in Table 2.
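Title-subject pairs like those in Table 2 must be reshaped into a corpus format Annif can ingest; one such format is a tab-separated file in which each line pairs a short text with one or more subject URIs in angle brackets. A minimal stdlib sketch (the input schema here is illustrative, not the actual Table 2 layout):

```python
from typing import Iterable, TextIO


def write_annif_tsv(pairs: Iterable[tuple[str, list[str]]], out: TextIO) -> int:
    """Write (title, [subject URIs]) pairs as an Annif-style short-text TSV.

    Each output line has the form: <title text> TAB <uri1> <uri2> ...
    Records missing a title or any subject URI are skipped, since they
    carry no usable training signal. Returns the number of lines written.
    """
    n = 0
    for title, uris in pairs:
        if not title or not uris:
            continue
        uri_field = " ".join(f"<{u}>" for u in uris)
        out.write(f"{title}\t{uri_field}\n")
        n += 1
    return n
```

A corpus file produced this way can then be passed to Annif's training command for the configured project.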
Catalogers can construct descriptions for subjects in fields other than the 650 MARC field. For future study, multiple subject fields in the MARC 65X range may be used as training data to explore the utility of vocabularies other than the LCSH linked data in the Annif machine learning system. It is necessary to pursue multiple vocabulary targets as not all constructed LCSH terminology may be represented in data retrieved within the LCSH linked data download.
The basic Annif training operations begin with configuring a new Annif project: each project has a configuration file that sets the project name, project vocabulary, and project algorithm(s). In this pilot, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm was used. 27 Using multiple algorithms in a single project, known as ensemble machine learning, can yield more accurate results, depending on the task and the algorithms combined. Annif makes programmatic use of the loaded SKOS LCSH vocabulary and matches subject pairs against existing vocabulary terms. Training data in the corpus may therefore be excluded from the machine learning model Annif produces in cases where subjects are not found in the vocabulary. It is likely that new preferred labels will emerge from contemporary cataloging (e.g., recent data), while the LCSH vocabulary is also updated over time; new LCSH downloads appear to be available from https://id.loc.gov/ at least monthly.
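The configuration step described above can be made concrete with a hypothetical Annif project file. The section name, vocabulary identifier, analyzer, and limit below are illustrative assumptions, not the exact settings used in this study:

```ini
# projects.cfg -- one section per Annif project (illustrative values)
[lcsh-tfidf]
name=LCSH TF-IDF
language=en
backend=tfidf
vocab=lcsh
analyzer=snowball(english)
limit=10
```

A project declared this way is then trained and queried with the Annif command-line tools (e.g., `annif train lcsh-tfidf` against a corpus file, and `annif suggest lcsh-tfidf` for interactive testing).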
A production machine learning concern for this project is the need to analyze training data drift in title-subject pairs and potential vocabulary drift in LCSH labels. 28 Production machine learning operations must address issues of vocabulary and data drift for the system to remain useful and be used. Data drift occurs when "the input data has changed," and "… the trained model is not relevant for this new data." 29 Data drift in a machine learning model is a problem of entropy, one in which the trained model's usefulness declines over time because of a changed world.
After the experimental vocabulary is loaded and trained, it is possible to test subject suggestions by providing an example string of text, as shown in Figure 1.
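The principle behind the TF-IDF backend's suggestions can be illustrated with a small, self-contained sketch. This is an illustration of the technique, not Annif's internal implementation: each subject accumulates a text profile from its training titles, and a new title is scored against every profile by cosine similarity of TF-IDF vectors.

```python
import math
from collections import Counter


def build_model(profiles: dict[str, str]):
    """profiles maps a subject URI to the concatenation of its training titles.
    Returns per-subject TF-IDF vectors and the shared IDF table."""
    tokenized = {s: text.lower().split() for s, text in profiles.items()}
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))  # document frequency: count each token once per profile
    n = len(tokenized)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}  # smoothed IDF
    vectors = {}
    for subj, toks in tokenized.items():
        tf = Counter(toks)
        vectors[subj] = {t: (c / len(toks)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def suggest(title: str, vectors, idf, limit: int = 3):
    """Rank subjects by similarity to the title; unknown tokens get zero weight."""
    toks = title.lower().split()
    tf = Counter(toks)
    q = {t: (c / len(toks)) * idf.get(t, 0.0) for t, c in tf.items()}
    ranked = sorted(((s, cosine(q, v)) for s, v in vectors.items()),
                    key=lambda p: p[1], reverse=True)
    return ranked[:limit]
```

The sketch also makes the limitation discussed later in this paper visible: a title sharing no tokens (and hence no semantics, under this bag-of-words view) with any subject profile scores zero everywhere.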
The machine learning training was expanded using the same approach. A larger corpus of 9.3 million (9,304,455) title and subject associations from the IvyPlus Platform for Open Data (POD), along with Share-VDE (SVDE) data, was used for training. POD is a data aggregation project that involves member institutions of the IvyPlus Library Confederation and in November 2021 contained over one hundred million MARC records, fifty-five million of which are unique. 30 A sample of the POD data lake from March 2021 found that approximately 25% of unique records in the IvyPlus libraries have linked data subject associations. The SVDE data were sourced from a Program for Cooperative Cataloging (PCC) data pool project. 31 The PCC data pool is metadata that Share-VDE received from OCLC on behalf of the PCC libraries and subsequently enriched with linked data references. The outputs of SVDE data enrichment, transformation, and clustering generate a large-scale BIBFRAME discovery network. 32 A delineation of trends in genre across the training data is shown in Table 3. The genre distribution of a corpus may influence the nature of subject data. The effectiveness of prediction may be influenced by how closely the genre types represented in the training data correspond meaningfully to the item being described.
Metrics and the future integration of machine learning outputs from both the Penn training corpus and the combined IvyPlus POD and SVDE corpus are shown in the next section. In this project, a formal validation set was not split from the corpus, though future experimentation and expansion of the Annif software may make it possible to implement a "train, test, and split" method. 33 Thomas pointed out several implications for wise validation: "A key property of the validation and test sets is that they must be representative of the new data you will see in the future" and, "You also need to think about what ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with." 34 Kubeflow is an open-source tool to manage the machine learning operations of a fully implemented AI system. The Kubeflow system is described as the "…cloud-native platform for machine learning operations-pipelines, training and deployment." 35 Utilizing platforms that can orchestrate the many steps comprising machine learning operations will help move implementing libraries from pilot experimentation to production implementation. TensorFlow Extended (TFX) also helps manage a machine learning data pipeline and includes software for data validation, transformation, model analysis, and serving. 36 In the scope of this paper, data validation may incorporate checking and analyzing data, e.g., methods to generate schema, detect training-serving skew, and visualize the training data to facilitate interpretation.
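When a validation split does become feasible, Thomas's guidance quoted above suggests a chronological rather than random partition, so that the held-out set resembles the newer data the model will face in production. A minimal stdlib sketch, assuming each record carries a sortable date key (an illustrative schema, not the corpus's actual one):

```python
def chronological_split(records: list[dict], holdout_fraction: float = 0.1):
    """Split records into train/validation sets by date, holding out the newest slice.

    A chronological split follows the advice that validation data should be
    representative of future inputs (i.e., newer records), rather than a
    random shuffle that mixes eras. Each record is assumed to carry a
    sortable 'date' key.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * (1 - holdout_fraction))
    return ordered[:cut], ordered[cut:]
```

The same partition could equally be drawn along genre lines to probe the training-serving skew concern raised above.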

Results
To understand systematically the quality and efficacy of the machine learning outputs Annif generates, the training models are evaluated first with a pre-labeled test collection. By using pre-labeled tests, the software determines systematically how the machine learning based headings compare to human-assigned headings. A separate SVDE-enriched MARC metadata set from PCC member libraries provides a target against which to test the training. 37 The PCC test data are distinct from the training corpus: they represent newer data than those used in training against the LCSH vocabulary. The Annif software is pre-configured with an evaluation command and requires an evaluation dataset.
The evaluation compared human-assigned subjects for a title to Annif-supplied subjects. How Annif performs compared to a human-derived subject is reported in a series of output metrics. The Normalized Discounted Cumulative Gain (NDCG) scores were generated through Scikit-learn model evaluation. 38 The NDCG score is a ranking measure often used in machine learning based prediction systems as a relevancy ranking of the result. 39 For context, consider the classical measures in information retrieval, such as precision and recall, in search evaluation. NDCG scoring emerged from refinements and exploration around improving relevance metrics: the score represents a continuum, or "graded relevance," metric rather than the binary precision and recall metrics. 40 These graded relevance metrics have a 0-1 range, in which the NDCG in Table 4 may be read as a 40% prediction accuracy, while the IvyPlus corpus approaches a 49% NDCG accuracy, shown in Table 5. Prediction accuracy in the upper nineties is typically expected in industries that rely on completely automated machine learning systems, and scores approaching that threshold are typically the result of teams of data scientists working to improve training data and ensure labels are consistent.
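The NDCG calculation behind scores like those in Tables 4 and 5 can be reproduced in a few lines. This sketch uses binary relevance (a suggested subject either matches a human-assigned heading or it does not) with the standard base-2 logarithmic discount; it illustrates the metric rather than reproducing Scikit-learn's implementation:

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: the gain at rank i (1-based) is divided
    by log2(i + 1), so hits lower in the ranking count for less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(suggested: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance against a set of human-assigned subjects.

    The realized DCG is normalized by the ideal DCG (all relevant subjects
    ranked first), yielding a score in the 0-1 range.
    """
    gains = [1.0 if s in relevant else 0.0 for s in suggested[:k]]
    ideal = [1.0] * min(len(relevant), k)
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0
```

Under this reading, a corpus-level score of 0.40 means the ranked suggestions realize, on average, 40% of the gain a perfect ranking of the human-assigned headings would achieve.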
Metrics such as these provide only a partial understanding of the outputs of machine learning. Others have previously noted the problems with AI metrics. 41 Thomas and Uminsky's paper, "Reliance on Metrics is a Fundamental Challenge for AI," details the fundamental problems of optimizing metrics and notes, with compelling examples, that metrics can be, and are, manipulated, and that they serve as proxies for what is being measured in the real world. Thus, a system-derived evaluation score of the type presented here is not the same as true usefulness for the task at hand: the efficient implementation of linked data description. 42 Beyond evaluation metrics, a more critical consideration of the resulting subject suggestion functionality must turn to the integration of the Annif API into a linked data editor, and to a resulting editor feature that allows a "cataloger in the loop" configuration of the linked data cataloging software, such that catalogers still assign subjects based upon their evaluation of the suggested labels.
In this case, it was desirable to understand how the Annif API could be accessed from the Sinopia Linked Data RDF Editor and to ensure that the cataloger can still choose to include or discard Annif's suggestions if they are not deemed useful. As an example, the patterns used for the Questioning Authority (QA) server API in the Sinopia linked data editor could be re-used.
These patterns entail hosting the Annif API 43 in a separate Docker container from which Sinopia can retrieve data. When the Sinopia user enters a title, a request is made to the server where the Annif API Docker container is running. The flow of data from Annif into the Sinopia editor is modeled in Figure 2, with the pilot example of the Sinopia interface shown as a high-fidelity mockup in Figure 3. Note that, to differentiate the auto-suggested elements, the interface presents the data elements retrieved from Annif in a different color scheme, which also draws attention to the need to remove them if the cataloger does not deem the Annif suggestions relevant.
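The request flow modeled in Figure 2 can be sketched with the Python standard library. The endpoint path and response shape follow Annif's REST API, which exposes a suggest endpoint per project returning uri/label/score results, but the host, port, project identifier, and exact field names below are assumptions to verify against the deployed container:

```python
import json
import urllib.parse
import urllib.request


def fetch_suggestions(title: str, base_url: str = "http://localhost:5000",
                      project: str = "lcsh-tfidf") -> dict:
    """POST a title to an Annif suggest endpoint and return the decoded JSON.

    base_url and project are illustrative; both depend on the deployment.
    """
    url = f"{base_url}/v1/projects/{project}/suggest"
    data = urllib.parse.urlencode({"text": title}).encode()
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        return json.load(resp)


def parse_suggestions(payload: dict) -> list[tuple[str, str, float]]:
    """Flatten an Annif-style response into (uri, label, score) triples,
    sorted by descending score, ready for display in the editor."""
    results = payload.get("results", [])
    triples = [(r["uri"], r["label"], float(r["score"])) for r in results]
    return sorted(triples, key=lambda t: t[2], reverse=True)
```

Keeping the parsing separate from the network call mirrors the integration concern here: the editor only ever sees ranked (uri, label, score) triples, regardless of how the prediction server is hosted.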
A further interface feature of Sinopia may also include "model card" descriptions in the Help documentation. The model card is an explanatory section that details the machine learning model on which the data in the project were trained, providing additional transparency. 44 By opening the model and the dataset used in the project, the catalogers in the loop can quickly ascertain how and why a given autosuggestion was produced. Similarly, datasets on which Annif may be trained could include a datasheet with provenance information that tells any user about the dataset's composition and where it might be inspected. 45 This process may also be openly transferred to other systems, a hallmark of transparency that can help to alleviate any potential harmful effects of the model's implementation. 46

Discussion
A March 2021 article in the MIT Technology Review underscored the importance of high-quality labeled data for machine learning. In the article, Andrew Ng, co-founder of Coursera and founder of Deeplearning.AI, explained how to improve AI through machine learning operations with a focus on data quality: "We need to shift in mindset from big data to good data. If you have a million images, go ahead, use it-that's great. But there are lots of problems that can use much smaller data sets that are clearly labeled and carefully curated." 47 This discussion introduces a workflow concern in data engineering for libraries that employ semi-automated subject suggestions in RDF editors and beyond. Librarians have long been concerned with metadata quality. Extending this concern to metadata used in machine learning would likely usher in new metadata quality pipelines grounded in elements of data engineering, with a novel focus on machine learning operations (MLOps) for libraries. Ng is a proponent of what he deems the need for MLOps in AI-focused organizations. 48 MLOps teams in libraries will need to be concerned with keeping the model trained on monthly downloads of the LCSH vocabulary. Without regular re-training, the service will degrade over time as fewer matches can be made between the vocabulary and the training data. The speed at which the model degrades depends on the rate of change in the source vocabulary and the extent to which newer training data reflect the updated vocabulary and match cataloger expectations for subject assignment. Overall, machine learning operations are cyclical and require ongoing maintenance by professionals who can adjust training data and models as the needs of the machine learning problem change over time.

Use of Annif in RDF editors
The initial research method focused on the first portion of the title: the text string found in the 245 $a field. This title capture was selected initially to ensure a minimal amount of "noise" in the training data. Yet more research is required to ascertain whether the full 245 field (subfields $a and $b) may yield more accurate subject suggestions, as might the inclusion of summary notes, review data, or abstracts found in the 520 field. This line of research applies also to subjects: additional sustained research will be needed to incorporate more than the 650 $0 fields for the subject vocabulary and perhaps to load additional SKOS vocabularies in combination with the initial LCSH linked data. Combinations of the above machine learning scenarios will need to be evaluated, and further study is needed into the most valuable ensemble of algorithms for subject suggestions in a linked data editor.
In preparing the test set for Annif, it became clear that some types of titles work better for subject autosuggestion than others. Some title strings will not map well semantically to LCSH, while other titles are named more semantically (e.g., a book about civil rights that is titled "civil rights" versus one that is also about civil rights but is not named semantically). Both titles are, of course, valid entries in title labels for Works, but they do not work equally well in Annif. It may be possible to set a threshold in an editor's business logic such that only the more relevant results are displayed. The implementation would likely see greater adoption and use if system users can both understand how the subjects were suggested and find the autosuggestions helpful.
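The thresholding idea can be expressed as a small piece of editor business logic: suppress low-confidence suggestions entirely rather than show the cataloger noise. This is a sketch; the cutoff value is a tunable assumption, not a recommendation from this study:

```python
def filter_by_threshold(suggestions: list[tuple[str, float]],
                        threshold: float = 0.3) -> list[tuple[str, float]]:
    """Keep only (label, score) suggestions at or above the confidence threshold.

    Returning an empty list is deliberate: for weakly semantic titles it is
    better for the editor to show nothing than to surface irrelevant headings
    the cataloger must then dismiss.
    """
    return [(label, score) for label, score in suggestions if score >= threshold]
```

In practice the threshold would be calibrated against cataloger feedback, since a cutoff that looks reasonable on evaluation metrics may still admit suggestions practitioners judge unhelpful.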
A more expansive framing of implementation tasks will likely require sufficient cataloger consultation. It would be unwise for such an automated tool to simply appear in any RDF linked data editor. Rather, a focused consultation with the community it hopes to support would be in the best interests of machine learning implementation, particularly if the data feminism principle of embracing pluralism echoes in and influences the consultation. Specifically, "Data Feminism insists that the most complete knowledge comes from synthesizing multiple perspectives, with priority given to local, Indigenous, and experiential ways of knowing." 49 To date, this work has been presented to catalogers in two separate forums focused specifically on using the RDF editor Sinopia. At an online virtual meeting of the Sinopia User Group in Spring 2021, a demonstration was presented, followed by a group discussion of Annif API uses with LCSH. Since then, the catalogers who are evaluating Sinopia at the University of Pennsylvania library have discussed approaches to incorporating the Annif API in their work and have suggested the inclusion of programmatic querying of OCLC data to support streamlined entry of already existing labels. There will be a fuller discussion in both groups before the Annif API is implemented in the cloud-based Sinopia editor. A software ticket in the Sinopia Editor GitHub repository can be followed to track progress toward implementation. 50

Conclusion
Ethical considerations of semi-automated subject description are a necessary inquiry, and conversation with practitioners will serve to inform the development of ethical practice. If automation is to be useful for the communities it seeks to support, it must be ushered in with profound appreciation for, and in collaboration with, the professionals the automation would support. To that end, future work will include a focus on user evaluation of Annif machine learning outputs, gathering practitioner scoring as one metric in a multimodal assessment of the Annif suggestions. Thereafter, scoring over various combinations of algorithm ensembles can provide a pathway to ascertain the most useful types of semi-automated machine learning for streamlining library description with linked data vocabularies.