Bibliographic Data Science and the History of the Book (c. 1500–1800)

Abstract
National bibliographies have been identified as a crucial resource for historical research on the publishing landscape, but using them requires addressing challenges of data quality, completeness, and interpretation. We call this approach bibliographic data science. In this article, we briefly assess the development of book formats and the vernacularization process in early modern Europe. The work undertaken paves the way for more extensive integration of library catalogs to map the history of the book.


Introduction
Library catalogs are essential tools in information science, and their utilization has been greatly advanced by digitalization. 1 The need to manage and organize the ever-increasing body of digital information has motivated the development of new concepts and technologies, such as Linked Data, which was first introduced some 20 years ago and has been on the agenda of most National Libraries since then. During the last decade, the concept of Linked Open Data (LOD) emerged to emphasize the importance of open licensing of the data resources. Metadata collections of published material that different libraries hold are particularly suitable for interlinking and enriching with different semantic layers. 2 LOD represents a crucial step in taking full advantage of digital resources through the integration of web sources and open, reusable metadata and its enrichment. 3 National bibliographies have been traditionally used as a tool for information retrieval. This article demonstrates our quantitative approach to book history, where bibliographic collections are considered as research material, rather than a mere retrieval tool. A key feature in this work is that whereas the analysis of full texts has drawn considerable attention in digital humanities, in our analysis metadata collections form the primary target. 4 This article relates closely to the data management efforts in National Libraries as it claims that it is extremely important that we can rely on data quality and completeness in order to make robust statistical claims. Thus, even when it is certain that no cumulative integrated catalog of bibliographic data will be perfect and free from errors, we argue that a metadata collection can often be sufficiently representative of important trends in the history of the book and knowledge production. This hypothesis comes with substantial research potential but it is yet to be systematically explored and tested. 
Related earlier studies include work in analytical bibliography and book history that has produced highly interesting interpretations by charting long-term developments in the history of books, 5 or has at least discussed the different opportunities associated with national bibliographies. 6 The use of bibliographic metadata as a research object has, however, proven challenging, as obtaining valid conclusions depends critically not only on an overall understanding of the historical context but also on technical issues of data quality and completeness. Consequently, research cases that build on quantitative analysis of these data collections have remained few.
We have started to develop novel ways of addressing these needs by algorithmically harmonizing and integrating different sources of bibliographic metadata maintained by the research libraries. We call this approach bibliographic data science (BDS). It is specifically targeted at enabling the use of bibliographic metadata as a research object, deriving from the more generic paradigms of open science and data science. 7 We propose that large-scale, automated harmonization efforts can enhance the overall reliability and commensurability between independently maintained metadata collections, thus complementing LOD and other technologies that primarily focus on data management and distribution. Hence, bibliographic data science aims to fill an important gap in the field as it is commonly observed that bibliographic metadata has high amounts of inaccurate entries, collection biases, and missing information. We aim to show how many of these issues can be overcome, so that large-scale quantitative analysis of bibliographic metadata becomes more reliable, by turning to two historical research cases: the rise of the octavo format in printing in Europe and the breakthrough of vernacular languages in public discourse.
Our analysis covers the overall publishing landscape in the period c.1500-1800 based on four large bibliographies. Thus, our analysis allows us to assess publishing activity beyond what is accessible by the use of individual national bibliographies alone, as we have recently suggested. 8 We have extensively harmonized selected metadata fields of the Finnish and Swedish National Bibliographies (FNB and SNB, respectively), the English Short-Title Catalogue (ESTC), and the Heritage of the Printed Book database (HPBD). Altogether, these four bibliographies cover over 6 million entries of print products in Europe and elsewhere, and 2.64 million harmonized entries from the investigated period (1500-1800), ranging from the 16,365 entries in the FNB to 2.1 million entries in HPBD, which is a compilation of 45 smaller, mostly national, bibliographies 9 (Table 1).
Bibliographic data science shifts emphasis from data quantity and management toward data quality and statistical analysis, and has potential for wider implementation in related studies and on other bibliographic metadata collections of which there is certainly no shortage in the Galleries, Libraries, Archives, and Museums (GLAM) sector. Our work indicates that whereas national bibliographies have essentially been about mapping the local canon of publishing, integrating data across borders should be managed in a systematic way that can take into account specific local circumstances. Although print culture has obviously been tied to the nation and its culture, it reflects broader Europe-wide cultural processes that deserve to be analyzed. Integrating data across the borders set by national bibliographies helps us to get at the wider processes and trends and, eventually, to overcome the national view in analyzing the past.

Bibliographic data science
It is important to recognize that supporting quantitative, data-intensive research is not the original or intended goal of analytical bibliography. The primary motivation for cataloging has been to preserve as much information about the original document and its physical creation as possible. This includes potential errors caused by the printer. 10 If, for instance, a place name is misspelled on the title page, it is relevant for cataloging purposes to preserve that misspelling as well. For anyone wishing to pursue a quantitative approach to bibliographic metadata, this is a crucial point to understand and respect. Moreover, the contents of bibliographic metadata collections are the products of at least three multi-layered historical processes. First, the digitization of traditional card catalogs may have meant the exclusion of material that was regarded as less important or covered elsewhere. Second, the compilation of early national bibliographies has in general been based on existing bibliographies that were originally collected for other purposes. 11 Naturally, the national bibliographies have not been able to include everything published, although the effort toward completeness has been remarkable in many cases. Third, the records reflect different historical practices of printing, publishing, and cataloging (with respect to variant states of editions, for example).
In 18th-century Sweden, for instance, printing laws and decrees formed a crucial part of political discourse and this was of great economic value to the book industry, 12 whereas in Britain this was the case to a much lesser degree. Such practices are noticeable in the bibliographic metadata collections, but they tell us more about the printing industry, not necessarily about other social and political phenomena, such as language relations, that we might want to study through the data. Any historically interested study using national bibliographies must therefore be attentive to these historical layers contained in the data in order to propose reasonable interpretations of quantitative data analysis. Our work builds on traditional bibliographic research, and we are using established definitions of bibliographic concepts where possible. 13
Available bibliographic metadata is thus seldom readily amenable to quantitative analysis. Biases, inaccuracies, and gaps hinder productive research use of bibliographic metadata collections. Varying standards and languages pose challenges for data integration. Our use of the term bibliographic data science implies that bibliographic data is viewed as quantitative research material, and systematic efforts on our part are carried out to facilitate this by ensuring data reliability and completeness. In this work, we focus on a few selected fields, namely publication time and place, language, and physical dimensions. Our data harmonization follows similar principles and largely identical algorithms across all metadata collections. We have removed spelling errors, disambiguated and standardized terms, augmented missing values, and developed custom algorithms that can convert the raw MARC notation to numerical page count estimates, for instance. 14
We have also added derived fields, such as print area, which quantifies the paper consumption in sheets for a unique copy of a document; the combined print area across different documents in a given time period can be used to quantify the breadth of printing activity. Moreover, we have used external data sources on authors, publishers, and places to enrich and verify bibliographic information. Automation, scalability, and quality control are critical, as the data collections may contain information on millions of documents. Hence, we have incorporated best practices and tools from data science, such as software libraries, unit tests, tidy data and reproducible workflows. Bibliographic data science is based on an iterative process where improved understanding often leads to enhancements in data harmonization and validation that can be incorporated in the automated processing steps.
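The conversion of raw physical-extent statements into page count estimates can be illustrated with a minimal sketch. It is not the bibliographica implementation: the function names, the simplified extent syntax (comma-separated arabic, roman, and bracketed counts followed by "p."), and the example strings are assumptions made for illustration, and real MARC extent statements are far more varied.

```python
import re

# Values of lowercase roman numeral characters.
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Convert a lowercase roman numeral such as 'xii' to an integer."""
    total = 0
    prev = 0
    # Read right to left; a smaller value before a larger one is subtracted.
    for ch in reversed(text):
        value = ROMAN[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def estimate_pages(extent):
    """Estimate the total page count from a simplified extent string.

    Sums comma-separated arabic numbers, roman numerals, and bracketed
    (unnumbered) counts, e.g. 'xii, 345, [3] p.'.
    Returns None when no page sequence can be parsed.
    """
    cleaned = extent.lower().replace("[", "").replace("]", "")
    # Drop page designators such as 'p.', 'pp.' or 'pages'.
    cleaned = re.sub(r"\bp(ages|p)?\b\.?", "", cleaned)
    total = 0
    found = False
    for part in cleaned.split(","):
        token = part.strip()
        if re.fullmatch(r"\d+", token):
            total += int(token)
            found = True
        elif re.fullmatch(r"[ivxlcdm]+", token):
            total += roman_to_int(token)
            found = True
    return total if found else None


print(estimate_pages("xii, 345, [3] p."))
```

Entries that cannot be parsed return None rather than a guess, so that missing values stay visible for later augmentation or manual checking.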
Ideally, such harmonization and validation efforts are fully transparent both in terms of data and source code. 15 The cumulative research process has equipped us with a vast body of methods that support research use of bibliographic metadata collections. We are sharing our algorithms for bibliographic data science through the bibliographica R package. 16 In contrast to code availability, many of the most comprehensive bibliographic metadata collections are not yet generally available as open data, however, and they may be difficult to obtain even for research purposes. The lack of open data availability forms a major bottleneck for transparent and collaborative development of bibliographic data science. This might be gradually changing, however. The National Library of Finland, for instance, recently made available the complete MARC entries of the FNB 17 under an open data license allowing modification, reuse, and sharing of derivative versions. As we demonstrate with the FNB collection, open data availability enables the sharing of a reproducible workflow from raw data to harmonization and analysis. We share the harmonized version prepared and used in this study; it is openly available and linked from Helsinki Computational History Group website, 18 and can be further verified, investigated, and enriched by others. The harmonized data sets can be further integrated and converted into Linked Open Data and other popular formats in order to utilize the vast pool of existing software tools. As a next step, we are planning to incorporate our validated harmonization algorithms in the Linked Open Data Release of the FNB. Combining large-scale harmonization with existing data management infrastructures could open up new doors for research on national bibliographies.
The HPBD catalog is a compilation and incorporates parts of the other catalogs. 19 In summary, the HPBD contains 19,400 records from the FNB (before 1827) and c. 56,000 records from the SNB (1600-1800). Hence, the HPBD potentially covers the complete FNB and SNB (Table 1). However, this is not likely to introduce major bias in the current analysis, as the smaller FNB and SNB catalogs form a negligible fraction (2%) of the HPBD. The British Library ESTC collection is not mentioned by name in the HPBD, but the HPBD is stated to include 55,400 records from the Incunable Short-Title Catalogue and from Books printed in the German-speaking countries and of German books printed in other countries (1601-1700) from the British Library. This suggests that the overlap between the HPBD and the ESTC in the investigated period is at most 0.5%, since in the ESTC we have identified 1054 incunables and 321 German books from German-speaking regions (Germany, Austria, Switzerland) printed before 1701.
Data harmonization and management is only the starting point for analysis, albeit an important one. In addition to improving the overall data quality and hence the overall value of LOD and other data retrieval infrastructures, the harmonization enables statistical analysis of the complete metadata collection with scientific programming environments such as R 20 or Python 21 , which provide advanced tools for modern data analysis and statistical inference. Whereas large portions of data analysis can be automated, efficient, and reliable research use requires collaboration between traditionally distinct disciplines, such as history, informatics, linguistics, and data science. Finding the right combination of expertise may be challenging.

Language and format of early modern publications
The hand-press period is particularly fruitful for quantitative research on books because there were remarkably few changes in printing technology from 1450 to approximately the 1830s. It has been famously claimed that Gutenberg himself would have been able to operate a printing press in late 18th-century London, since it would have been so similar to the one found in mid-15th-century Mainz. As revolutionary as the movable-type printing press was for early modern culture and economy in general, it is fortunate for our aspirations to understand the development of early modern publishing that there were no game-changing technological innovations for the next 400 years or so after Gutenberg's time. 22 In our research on different bibliographic metadata collections we have come to realize that the relatively stable nature of printing opens up different avenues for cross-European research. For example, we can estimate the long-term development of book formats in some detail across Europe, which in turn is significant for understanding the relevance of printing for the changes in public communication. This is why, for this article, we have developed two Europe-wide bibliographic metadata cases to analyze the rise of the octavo format and the process of vernacularization in the early modern period. This also tests the metadata collections at their different levels of data harmonization and their respective levels of historical representativity. Both research cases represent large-scale Europe-wide transformations that took place predominantly during the hand-press era, but inspecting them through several metadata collections, and zooming in and out in the material, shows intriguing variety in the publication profiles of European cities.
The cases also make it possible to discuss how the employed methodology, varying levels of data harmonization, and gaps in data affect the analyses, thus paving the way for new research and guidelines for future data integration in this field.

The rise of octavo in the Enlightenment period
The general trend in the metadata collections that we have studied is that the octavo format supersedes other printing formats during the 18th century. 23 This can be measured by looking at a simple title count of documents published in different formats, or by studying the paper consumed in different documents. We have chosen the latter to better account for books of different sizes and lengths. In this article, we use print area, which quantifies the amount of sheets used for unique copies of titles. Earlier we have also studied total paper consumption, which additionally takes the possibly variable print run estimates into account.
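As a rough illustration of the print area measure, the number of sheets needed for one copy can be approximated from the page count and the format, since a sheet folded into a folio, quarto, octavo, or duodecimo gathering yields 4, 8, 16, or 24 pages respectively. The sketch below is a minimal example under that assumption; the function names and the toy records are hypothetical and do not reproduce the bibliographica implementation.

```python
from collections import defaultdict

# Pages obtained from one sheet (both sides) in common hand-press formats.
PAGES_PER_SHEET = {"folio": 4, "quarto": 8, "octavo": 16, "duodecimo": 24}


def sheets_per_copy(pages, book_format):
    """Approximate the sheets of paper used for a single copy."""
    return pages / PAGES_PER_SHEET[book_format]


def print_area_share(records):
    """Return {decade: {format: share of that decade's total sheets}}.

    Each record is a (year, format, pages) tuple for one title.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for year, book_format, pages in records:
        decade = (year // 10) * 10
        totals[decade][book_format] += sheets_per_copy(pages, book_format)
    shares = {}
    for decade, by_format in totals.items():
        decade_total = sum(by_format.values())
        shares[decade] = {f: s / decade_total for f, s in by_format.items()}
    return shares


# Toy records, invented for illustration only.
records = [
    (1701, "folio", 400),      # 100 sheets
    (1705, "octavo", 320),     # 20 sheets
    (1791, "octavo", 480),     # 30 sheets
    (1795, "duodecimo", 240),  # 10 sheets
]
shares = print_area_share(records)
print(shares[1790]["octavo"])
```

Measuring shares in sheets rather than in title counts means that a long folio volume weighs more than a short octavo pamphlet.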
When we examine the publishing trends of book formats in the HPBD, we notice that at a general European level the rise of the octavo format is particularly strong during the 18th century. This is further supported by the ESTC and SNB (Figure 1), where octavo is not only the fastest-growing format but also holds the largest share of the print area by the end of the 18th century. If we look at particular places in the HPBD, a striking feature is the octavo share in the German cities of Frankfurt, Leipzig, Halle, and Berlin (Supplementary Figure 1). The manner in which folio drops and octavo rises on German soil during the 18th century suggests that the octavo format was the rising star of the Enlightenment.
Among this type of general Europe-wide trends, there are of course local differences, and for example in Turku (Supplementary Figure 1), and Finland that was part of Sweden at the time, the rise of octavo comes later than in Sweden in general. This was due to the fact that the main part of the documents printed in Finland were official documents, pamphlets, and theses. If we look at the share of the different formats in Turku, another way of saying this would be that printing in Turku only takes off in the later 18th century whereas in Stockholm the hand press printing industry seems to have reached a different level of maturity earlier (Supplementary Figure 1). The simplest explanation for the success of the octavo format is that it was particularly suited for smaller books that could be carried around and read practically anywhere, whereas the quarto (and folio) were more commonly used in governmental and academic documents; pamphlets and in larger books alike. 24 We have analyzed the relevance of the rise of octavo with respect to book printing in the case of "history" publishing earlier. 25 Of course, larger formats in book printing carried certain prestige also in the 18th century, even when reading started to be partly removed from stately mansion libraries and the price of the book turned out to be a decisive factor for dissemination of ideas. 26 When considering quarto and octavo publications, it is quite telling that David Hume (1711-1776) wanted his History of England to be printed in quarto-sized fine-paper six-volume set in late 1760s (as it had appeared earlier), but the editions that were actually published after 1767 until Hume's death (including the 1778 posthumous edition) are octavo editions in eight volumes. 
The octavo editions might have lacked the exclusivity and finesse of heavier tomes with large margins that connoisseurs preferred for aesthetic reasons, but it was particularly the cheaper and smaller formats, octavo and duodecimo, that changed the nature and relevance of printing and reading in the later part of the 18th century.
We have included one union catalog, ESTC, in this study. It is evident that ESTC is not complete in the sense that it would include all the recorded documents in different libraries. Going through some of the records of larger repositories, such as the National Library of Scotland, quickly reveals that their collections include at least dozens of documents not yet recorded in the cumulative ESTC. This, however, is not a problem for our analysis because we are mainly focused on general trends that do not require all the possible records in order to be reliable. 27 One particularly interesting feature of the ESTC is the high proportion of duodecimo documents (Figure 1). At the end of the 18th century, duodecimo in the ESTC reaches the same level as the fast-declining folio. Compared to the proportions of gatherings in the HPBD, for example, this is a highly noticeable feature. In the SNB, folio is at the same level as duodecimo, but the share of the total volume is much lower than in the ESTC. This can be largely explained by differences in the printing costs and because the market responds to the demand of cheaper reads. 28 If we analyze different cities based on book format proportions, we realise that it was especially places in North America (such as Boston and Philadelphia), Ireland, and Scotland (Dublin, Edinburgh, and specifically Glasgow) where the duodecimo format has the highest share of the print area (Supplementary Figure 1). Interestingly, in London, folio seems to keep its relatively high share even in the latter part of the 18th century. Also in the traditional university towns of Oxford and Cambridge duodecimo does not rise to the two most common formats in the later 18th century, which is noteworthy in the Anglo-American context.
The most complex data set that we used in this article is the HPBD. 29 This is not an integrated metadata unit (such as the ESTC, for example), but rather a collection of various bibliographic collections with varying amounts of data and with issues such as duplicates. All analyses of the HPBD therefore need to be carried out with additional caution, although we have validated our key observations by ensuring that similar trends are found in the other metadata collections that we used. Thus, we can rely on the general trends that are apparent in the HPBD. However, the more specific the analysis becomes, the more careful we need to be. One general feature of the HPBD when it comes to the question of format, along with the sharp rise of the octavo noted earlier, is the relatively large proportion of folio books (Figure 1). It is worth noting that in the HPBD the folio format keeps a fairly large share of the total print area of published documents until the mid-18th century. We may notice a similar trend in the ESTC in Figure 1, whereas in the SNB folio seems to have been on a sharper drop for a longer time.
The dominant document format in the 17th century, together with folio, was quarto throughout Europe. There is an unusual peak during the civil war era in the ESTC caused by the Thomason Tracts. 30 This means that, because the cataloging rules of the ESTC include different variants and the bookseller George Thomason was able to gather so many of these among the civil war pamphlets, they cause a noticeable statistical peak. This needs to be noted, but it does not change the overall general trend. 31 The quarto format was, as noted earlier, the common document format for pamphlets and other shorter pieces. When we look at the HPBD (Figure 1), we see that quarto's share is fairly constant throughout the early modern period. In the ESTC, however, its share has been declining since the second half of the 17th century. This is because of the more rapid increase of other formats: in the ESTC the quarto format does not decline in absolute numbers but, like all other book formats, rises in absolute numbers in the 18th century.
It is also interesting to notice that there seems to be a correlation between document language and format. A comparison of documents published in English, Latin, and other languages in London (Supplementary Figure 3) suggests that duodecimo in particular was the preferred format for books printed in languages other than English and Latin, whereas octavo was used proportionally more in Latin books than in others. The small share of folio documents in Latin is especially interesting. The quarto share of Latin documents in London is also noteworthy.

Vernacularization in Europe, 1500-1800
Vernacularization refers to a historical transformation in local language relations. Multilingual systems in which one language (in Europe often Latin) was reserved for learned communication, while local vernacular languages were used in everyday communication, started to erode, and local languages gained increased prominence. They were made into vehicles for discussing politics, science, and culture. This process happened at different speeds in different parts of Europe. Judging from today's teleological perspective, vernaculars such as English and French gained prominence already in the 1600s, whereas for the German and Swedish languages this development happened in the 18th and 19th centuries. For many smaller languages in Europe, such as Finnish or Czech, this development happened in conjunction with nation building in the latter half of the 19th century. Ultimately, vernacularization is an open-ended process. For many potentially vernacularizable languages the transformation never took place, and a similar process could also take place in the future, as language relations are in constant flux. The dominance of English today in many parts of Europe marks, in a sense, a reversed transformation. Linguists and historians have paid attention to vernacularization as a process from various perspectives, 32 but this article takes a novel approach by investigating metadata collections that contain millions of titles and related bibliographic information and thus provide a previously unexplored source for tracing how the process of vernacularization materialized in concrete publications.
While language relations differ considerably all over Europe, there is one measure that paints a picture of vernacularization as a general trend in European publishing: the share of publications in Latin. All of our four metadata collections show an indisputable declining trend in the share of publications in Latin in the period 1500-1800, but there are noticeable differences to the timing and proportions of the transformation, which are partly explained by historical trajectories mirrored in the data but also by the composition of the data itself. The HPBD (Figure 2) provides the geographically broadest overview of the decline of Latin, but as a data set it includes most gaps and uncertainties. Nevertheless, in the HPBD the decline of Latin in the 18th century is most rapid and it happens later than for the ESTC and SNB (Figure 2). This may be a result of the composition of the database with many metadata collections being predominantly focused on the 18th century. The earlier decline of Latin in Britain corresponds with our previous knowledge of the early establishment of English as the main language of high-level communication. Well-known symbols for using English such as Shakespeare and the Royal Society anticipate this, 33 but once the comparison based on national bibliographies can be brought to a more reliable level, we can provide a statistically more accurate picture of this. The available data does nevertheless suggest that the decline of Latin in Britain is more drastic than it has been previously anticipated.
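The share of Latin publications per decade, the measure used above, is straightforward to compute once publication year and language have been harmonized. The following sketch assumes records reduced to (year, language) pairs with a harmonized language label and counts titles rather than print area; the function name and the toy data are illustrative only.

```python
from collections import Counter


def latin_share_by_decade(records):
    """Return {decade: fraction of titles published in Latin}.

    Each record is a (year, language) pair; decades without any
    titles are simply absent from the result.
    """
    titles = Counter()
    latin = Counter()
    for year, language in records:
        decade = (year // 10) * 10
        titles[decade] += 1
        if language == "Latin":
            latin[decade] += 1
    return {d: latin[d] / titles[d] for d in sorted(titles)}


# Toy records, invented for illustration only.
sample = [
    (1602, "Latin"), (1605, "Latin"), (1609, "English"),
    (1701, "English"), (1702, "Latin"), (1703, "English"), (1704, "English"),
]
latin_shares = latin_share_by_decade(sample)
print(latin_shares)
```

Because the measure is a within-decade proportion, it can be compared across catalogs of very different sizes, although gaps in a collection's coverage still affect how far the comparison can be trusted.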
The SNB and FNB allow us to zoom in and look at the Swedish case more closely and compare the different properties of the bibliographies. While the SNB portrays the general trend for the Swedish realm, it is also clear that Stockholm as a publication center dominates the picture (Figure 3). The FNB, which consists mostly of publications from Turku (Åbo), one of the four university towns in the realm (excluding Tartu), shows that the distinct publication profiles of university towns are sometimes hidden under the national average. In Turku, too, we find a concrete decline in the share of Latin publications, but the decline came clearly later, although the Academy in Turku has been described as one of the most utility-oriented universities in the Swedish realm and thus also the most prone to use Swedish in academic texts. 34 One special feature of the FNB has to do with the different roles of Swedish and Finnish as languages. While Swedish became a stronger candidate for academic publications, Finnish emerged as a written language especially in shorter religious and economic texts. Vernacularization was in this case not a process involving two languages, but three.
Keeping in mind the uncertainties relating to the HPBD, an inspection of smaller university towns suggests that this is a wider trend. The university town, the capital, and the commercial centers had different linguistic publication profiles, and vernacularization as a process happened in different phases. An analysis of languages used in publications from Cambridge, Oxford, Leiden, and Göttingen (Supplementary Figure 2) shows how Latin lingered on, but also that in these cases, as in Turku, the local languages gained a much more prominent position by the end of the 18th century. Compared to the foremost publishing centers in Europe, Paris and London, the development happens very late. Interestingly, the metadata collections tell us about national trends, such as an early decline of Latin or competing vernaculars, but when viewed in comparison we can also see patterns that cross national boundaries, such as different types of publishing milieus in commercial towns, university towns, or capital cities. All of Europe had a cultural debt to sources from Antiquity, but this debt materialized differently in the places that were almost self-sufficient culturally (like Paris and London) and in the university towns that embodied learning by attaching themselves to Latin traditions. 35
Since both vernacularization and the rise of octavo seem to be inherently related to a modernization of public discourse, reading, and writing, a final question is whether the change in the popularity of formats in the 16th and particularly 18th centuries is related to the shifts in language in the same period. It seems that there is no simple answer to this. Quite naturally, in all of the studied metadata collections, the vernacular languages obtain a growing share of published books in the octavo format (Supplementary Figure 3). A closer look at cities with different publication profiles shows that the matter was more complicated.
In the ESTC, the share of octavo is for most cities higher among Latin books than among English books (Supplementary Figure 4). In the SNB, too, both Latin and Swedish books tend to gravitate toward smaller formats at the end of the 18th century, but the HPBD's records for German cities point to the octavo format being used more often in German-language books than in Latin books (Supplementary Figure 4). While there is no clear correlation between language and format, the analysis of format nonetheless helps to qualify earlier research. Henrik Horstbøll has shown that the octavo format was particularly popular in Denmark for small histories that stood for leisurely reading, 36 but a much bigger sample makes clear that the octavo format became more popular in other genres as well, including books published in Latin in university towns. Additional data and content analysis will in the future allow us to look more closely at how genre, language, and format relate to one another, and to what extent the rise of smaller formats and different languages reflects the emergence of new genres.

Discussion
This article has sought to demonstrate that something as seemingly trivial as document sizes and the language of titles can have a crucial role when considering the emergence of the public sphere in early modern Europe. The relationship between reading habits and broadly circulated written documents in the Enlightenment period can be looked at differently when we learn more about the relevance of octavo-sized books and the rise of local written languages in Europe during the 18th century. For a better and more reliable understanding of these processes we have developed and used tools of bibliographic data science.
Our work is part of the emerging trend towards the utilization of large digital data resources in publishing history. 37 Many of the problems relating to scalable data processing and interpretation in that work are similar to the ones we have encountered in the context of bibliographic metadata collections.
We have investigated four different types of bibliographic metadata collections (FNB, SNB, ESTC, and HPBD). As similar datasets, national bibliographies are not only about mapping the national traditions of publishing, but can also be studied comparatively and ultimately be integrated across borders to help overcome a national perspective in analyzing the past.
The power of a large-scale statistical approach is that broad patterns in knowledge production are often overwhelmingly clear, despite occasional inaccuracies and collection biases in individual data sets, as we have shown. Even the HPBD can be used to assess some general trends in publishing history, although it does not match the other bibliographic metadata collections that we prepared for this study in data reliability and level of harmonization. Unlike for the other collections, we did not customize the harmonization process for the HPBD, and its results should hence be considered preliminary. Nevertheless, the correspondence of the observed patterns between this and the other collections demonstrates the scalability of our approach. This is exemplified by our key observations on vernacularization and the rise of the octavo, which are supported by similar trends across the four bibliographic metadata collections that we have assessed. For a more detailed comparison across European cities, further harmonization and augmentation of bibliographic metadata collections are needed. 38 Integration of collections demands further work in reliably detecting duplicates, different editions, and translations across catalogs. Our present work provides a starting point and the initial guidelines for more extensive analysis and integration.
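As a rough illustration of what cross-catalog duplicate detection involves, the following sketch groups entries by a normalized title and publication year and flags groups that span more than one catalog. This is a deliberately naive baseline under stated assumptions, not the method used in this study: the field names, catalog labels, and sample entries are hypothetical, and exact matching on a normalized key cannot separate different editions or recognize translations.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title):
    """Normalize a title: strip accents, lowercase, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower()).split()
    return " ".join(words)


def candidate_duplicates(entries):
    """Return groups sharing a (normalized title, year) key across catalogs."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(title_key(entry["title"]), entry["year"])].append(entry)
    return [
        group
        for group in groups.values()
        if len({entry["catalog"] for entry in group}) > 1
    ]


# Hypothetical entries from two invented catalogs, "A" and "B".
entries = [
    {"catalog": "A", "title": "Dissertatio de Aëre.", "year": 1750},
    {"catalog": "B", "title": "dissertatio  de aere", "year": 1750},
    {"catalog": "A", "title": "Dissertatio de aere", "year": 1760},
]

matches = candidate_duplicates(entries)
```

Groups found this way are only candidates for manual or model-based review; the 1760 entry above is kept apart because a shared title with a different year may be a later edition rather than a duplicate.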
Bibliographic data science derives from the already established field of data science. It associates this general paradigm specifically with quantitative analysis of bibliographic data collections and related information sources. While having a specific scope, BDS is opening up pragmatically oriented and substantial new research opportunities in this area, as we have aimed to demonstrate.
Our future work envisions continued harmonization and data integration for the HPBD and expanding the study to cover public communication more broadly. As we have extracted and harmonized publisher information from imprints from ESTC and FNB, 39 it is possible to connect that data to full-text collections such as the ECCO. Our vision also includes studying how the materiality of printing is related to developments in newspapers. 40 We consider the material developments within the printing industry as crucial ingredients in the emergence of new types of public communication that transformed Europe in the 18th century.
Our current harmonization strategies are based on manually implemented rules for data processing. Future developments can take further advantage of established machine learning techniques in order to reduce the need for human input and improve the overall scalability of data harmonization. Modeling the emergence of the publishing landscape across Europe could also borrow spatio-temporal analysis methods from ecology and related fields. When combined with proper quality control, such quantitative, data-driven approaches have potential for wider implementation in related studies in the digital humanities. Moreover, digitalization has provided new opportunities for open sharing of research data and analysis methods. Taking full advantage of these developments can support collaborative and cumulative research use of bibliographic collections.
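To make the idea of manually implemented harmonization rules concrete, the sketch below maps a few raw spellings of book formats to a single label. The rule table and the function name `harmonize_format` are hypothetical and chosen only for illustration; the actual rules used for the collections in this study are not reproduced here.

```python
import re

# Hypothetical rule table: each pattern lists raw spellings for one format.
FORMAT_RULES = [
    (re.compile(r"^(8vo|8°|8:o|octavo|in-8)$"), "octavo"),
    (re.compile(r"^(4to|4°|4:o|quarto|in-4)$"), "quarto"),
    (re.compile(r"^(2fo|2°|2:o|folio|fol)$"), "folio"),
]


def harmonize_format(raw):
    """Map a raw format string to a harmonized label, or None if no rule fits."""
    cleaned = raw.strip().lower().rstrip(".")
    for pattern, label in FORMAT_RULES:
        if pattern.match(cleaned):
            return label
    return None  # left unresolved for manual inspection


for value in [" 8vo. ", "4°", "Fol.", "12mo"]:
    print(repr(value), "->", harmonize_format(value))
```

Returning `None` for unmatched values, rather than guessing, makes it possible to count and inspect the entries that the rules do not yet cover, which is one simple form of the quality control discussed above.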

Conclusion
We have conceptualized a new approach, bibliographic data science, to expand the research potential of bibliographic cataloging and classification. Whereas national bibliographies can provide comprehensive quantitative insights into the overall historical dynamics of the evolving publishing landscape across time and geography, we have encountered specific and largely overlooked challenges in using bibliographic metadata collections for historical research. Biases, gaps, and inaccuracies in data collection may considerably hinder productive research use of the bibliographies, and drawing valid conclusions critically depends on efficient and reliable harmonization and augmentation of the raw entries. Here, we have overcome some of these challenges with specifically tailored open data analytical ecosystems that facilitate robust statistical research use of bibliographic collections. This approach has potential for wider implementation in related studies and bibliographies, and provides guidelines for more extensive integration of national collections, thus moving towards a more precise view of print culture beyond the confines of national bibliographies.

Supplementary Material
The source code and harmonized version of the Finnish national bibliography Fennica (FNB) used in this study, based on the original open MARC records published by the National Library of Finland, are available through the Helsinki Computational History Group (COMHIS) website. 41