Time Period Categorization in Fiction: A Comparative Analysis of Machine Learning Techniques

Abstract
This study investigates the automatic categorization of time period metadata in fiction, a critical but often overlooked aspect of cataloging. Using a comparative analysis approach, the performance of three machine learning techniques, namely Latent Dirichlet Allocation (LDA), Sentence-BERT (SBERT), and Term Frequency-Inverse Document Frequency (TF-IDF), was assessed by examining their precision, recall, F1 scores, and confusion matrix results. LDA identifies underlying topics within the text, TF-IDF measures word importance, and SBERT measures sentence semantic similarity. Based on F1-score analysis and confusion matrix outcomes, TF-IDF and LDA effectively categorize text data by time period, while SBERT performed poorly across all time period categories.


Introduction
The increased production and publication of literary works, especially fiction, combined with the rise in computerized environments for information retrieval, have underscored the need for improved cataloging strategies to manage the expanding volume of these works effectively.1 The need for such organization is particularly evident in fiction, a form of literature whose popularity has made content analysis a significant topic in knowledge organization.2 Specifically, a noteworthy issue is the lack of categorization by time, which is a significant feature in understanding and organizing literature.3 In this context, time (henceforth called time periods) does not pertain to the publication date but rather to the temporal setting of the narrative, such as the specific historical periods or events in which the story unfolds. Studies affirm that time is an integral component of subject cataloging, with a study by Bates highlighting that 16% of all searches involved time-related terms.4 However, categorizing fiction has proven challenging due to its multimodality and its interpretational nature.5
Multimodality means that fiction often includes multiple modes of expression or elements, such as narrative style, plot, characters, dialogue, themes, and more. These elements can vary significantly from one work of fiction to another. Furthermore, fiction is subjective and open to interpretation. Different readers may interpret the same work of fiction differently based on their perspectives, experiences, and cultural backgrounds. This subjectivity makes it challenging to create rigid categories for fiction. The scarcity of time period metadata in extensive catalogs like Libris in Sweden is particularly apparent in fiction literature. This scarcity is likely due to the complex and time-consuming process of deciding what time period metadata to incorporate, which involves consideration of text characteristics and, if not carefully conducted, can lead to miscategorization. This study examines three machine learning methods as a means of feature engineering for automatic time period categorization. The application of machine learning methods may help address time period cataloging limitations for fictional literature. Using Swedish historical fiction literature data from the Swedish Literature Bank,6 the study evaluates which feature engineering technique yields the highest F1 score for predicted time periods, investigating Latent Dirichlet Allocation (LDA), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Term Frequency-Inverse Document Frequency (TF-IDF). By leveraging these techniques, this study aims to contribute to the development of time period categorization strategies, emphasizing the importance of time as a significant feature in the organization of literary works.

Time period categories
Time periods are used to understand history and the world around us by organizing time into chunks and naming them.7 There are well-known historical periods such as the Medieval period (1050 to 1520), the Renaissance period (fourteenth to seventeenth century), and World War I (1914-1918).8 Periods can encompass wars, revolutions, inventions, and much more. Most often, a period gets its name after the event has already occurred, and since there are few rules about what constitutes a period, almost anything can be labeled as one.9 However, a period often marks significant changes affecting many people, and it needs an identifiable name so that these changes can be discussed.10 So, even though events happen constantly, only a few will be named and remembered. Some events may be merged, separated, or forgotten along the way. Historical time periods are not always definitive, and researchers' views may differ. Scientists find it challenging to agree on when periods begin and end and on how best to describe and interpret the events and periods.11 The disagreement arises because scientists have diverse preoccupations and come from different nations and societies where events such as wars, plagues, and revolutions are experienced differently.12 However, well-established periods convey a category that is generally agreed upon.13 This study focuses on Swedish time periods and therefore consults a Swedish time period division created by Staffan Bergsten, a literary historian, and Lars Elleström, a Professor of Comparative Literature.14 They suggested dividing the Swedish literary periods as follows:
• 1200-1526 Medieval period
• 1526-1718 Era of Great Power
• 1718-1772 Age of Liberty
• 1772-1809 Gustavian period
On many points, Bergsten and Elleström's division of time periods corresponds with the time period divisions of the Metadata Agency.15 The Metadata Agency provides information on metadata standards, practices, and cataloging instructions in Sweden. Libraries widely use these standards to input standardized information into Libris, a joint catalog of Swedish libraries featuring around 7 million titles.16 Libris serves many users, from librarians and catalogers to publishers and patrons. Bergsten and Elleström formulated their divisions from a historical perspective, while Libris and the Metadata Agency shaped theirs based on the data stored in the Libris system. Like much else in Libris, time periods are created and refined by librarians and catalogers as they catalog material such as e-books, physical books, and audiobooks.
The Metadata Agency presents its own list of time periods ranging from 1520 to 1809, divided as follows:
• 1520-1611 Vasa period
• 1611-1718 Era of Great Power
• 1654-1718 Karolinska period
• 1718-1772 Age of Liberty
• 1772-1809 Gustavian period
A brief explanation of each period is given below, beginning with the Medieval period, which is present in Bergsten and Elleström's division of time periods but not in the Metadata Agency's.
The Medieval period, also called the Middle Ages, is divided into three parts: the Early Medieval period, the High Medieval period, and the Late Medieval period. In Sweden, the medieval period ranged from around 1200 to around 1526.17 This period is considered to have occurred later in the Nordic countries than in the rest of Europe because it took longer for Christian culture and the ideas that characterize the medieval period to take root.18
The period from the accession of Gustav Vasa to power in 1521 to the accession of Gustav II Adolf in 1611 is commonly referred to in Swedish historiography as the Vasa period. Although Gustav Vasa was on the throne from 1521 to 1560, the Vasa period also encompasses the reigns of Gustav Vasa's sons and his grandchild until 1611.
The Era of Great Power category covers two critical developments: the Reformation and the Era of Great Power itself, spanning 1526-1718. The Reformation was marked by the publication of the New Testament in Swedish, which laid the foundation for a new Swedish written language. The Era of Great Power is the period when Sweden was at its most powerful, many wars were fought, and the Swedish territory was at its largest.19
The Karolinska period is a part of the Era of Great Power, occurring between 1654 and 1718. The name derives from the tenures of Karl X Gustav, Karl XI, and Karl XII as Sweden's regents. Sweden was involved in several major wars against Denmark and Russia during this era.20
There are several ways of defining the era that follows the Era of Great Power. One could either consider the political character of the time, when Sweden was ruled primarily by the parliament and the king had less power than before, or one could look at the flourishing of science and early industrialism that characterizes this period. The Gustavian era (1772-1809) followed the Age of Liberty (1718-1772) and ended the liberty that Sweden had gained. In 1772, Gustav III executed a coup and assumed the power of the parliament.21
The time periods presented by the Metadata Agency are broad categories; nevertheless, they are sufficient for categorizing literature. Compared with Bergsten and Elleström's division, the time period missing from the Metadata Agency's standards is the Medieval period. In this study, I use data set in the Medieval period; therefore, the Medieval period is added as a category to the Metadata Agency's list, as can be seen in Figure 1.

Feature engineering techniques
In this study, LDA, TF-IDF, and SBERT are used as machine learning techniques to create features from text data. Random forest, Support Vector Machine (SVM), logistic regression, and neural networks are then used as the machine learning algorithms for text classification. Machine learning, which can be part of an AI operation, comprises techniques that can automate manual labor and provide faster decisions to humans. Machine learning uses data and algorithms to mimic how humans learn, gradually improving its accuracy.22 There are three divisions of learning algorithms: supervised, unsupervised, and reinforcement learning. Supervised learning means that the algorithm is presented with inputs and known outputs. That is, the machine receives both the texts and the correct category for each text and is then trained to learn, for example, to categorize texts. Unsupervised learning, on the other hand, means that the algorithm finds patterns without requiring example inputs with known outputs. In other words, the model has no hints on how to categorize each data example and must find its own rules for doing so.23 Hence, supervised learning is used when there are known outputs, and unsupervised learning when the outputs are unknown. The last group of machine learning algorithms is reinforcement learning, in which the system learns by receiving rewards for correct actions. This process helps the system understand and improve its actions over time.24 Commonly, reinforcement learning utilizes artificial neural networks, enabling software agents to determine the most appropriate actions in a virtual setting to achieve their objectives. However, it is not used in my study, as it applies when an agent acts within an environment to reach a goal through trial and error.
This research uses both supervised and unsupervised learning since there are known and unknown variables. Unsupervised learning was used to create the text features, meaning that the machine, without human supervision, derived the texts' characteristics with LDA, TF-IDF, and SBERT.25 For the classification part, supervised learning was used, meaning that known categories are needed to verify whether a text has been categorized correctly; here, SVM, logistic regression, random forest, and neural networks were used.
As mentioned previously, machine learning automatically identifies patterns in data. Supervised machine learning is often employed in predictive analytics since this approach learns the relationship between specific input features and an output from historical data examples.26 Once the model is established, it can be used to predict outcomes for new data. Figure 2 shows that the training data needs to contain descriptive feature data, such as fiction text features, and the correct time period category for each text before machine learning algorithms can be applied to make predictions.27 In machine learning, a feature is an individual measurable property or characteristic of a phenomenon being observed.28 It is the primary data used to predict the output using an algorithm. Therefore, choosing informative, discriminating, and independent features is a crucial step for effective classification algorithms. Feature engineering is a step in the machine learning workflow that involves creating, modifying, or selecting features. In the case of this article, the aim of feature engineering is to represent the textual data as numerical values. Each feature engineering technique can represent the data in a unique way, capturing different nuances and patterns. This diversity can be especially beneficial in complex datasets such as fiction texts, where multiple underlying patterns exist. Not all feature engineering techniques will be equally effective for a given dataset; therefore, experimenting with various techniques can help identify which one yields the best performance for the predictive model. When analyzing text data, features can be extracted and represented in various ways. In this article, three techniques are used: TF-IDF, SBERT, and LDA.

Sentence closeness with SBERT
In natural language processing, words are often represented as numerical vectors in a high-dimensional space through a method called word embedding. Sentence-BERT, commonly referred to as SBERT, is an adaptation of the well-known BERT model designed to generate embeddings for entire sentences. SBERT uses a Siamese network, a neural network architecture used primarily for tasks that involve determining the similarity between two inputs.29 Each twin in the Siamese network is provided with one of the two sentences, processes it independently, and produces an embedding for that sentence. Both twins have the same architecture and share the same weights, so they process their respective sentences in an equivalent manner. After the sentence embeddings are produced, the similarity between these embeddings can be measured. The goal during training is to increase the similarity of embeddings for sentences with similar meanings and decrease it for those with differing meanings. When SBERT processes a sentence, it outputs a fixed-length vector, which serves as a representation or embedding of the entire sentence's semantic content. The underlying mechanism involves the conversion of the input sentence into a series of token embeddings, leveraging the foundational capabilities of the pre-trained BERT model, as described by Reimers et al.30 The twin or Siamese structure is crucial for training SBERT and teaching it to produce meaningful sentence embeddings, but only a single twin from the trained model is used to convert new texts or sentences into embeddings.
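As an illustration of how the similarity between two sentence embeddings can be measured, the following minimal sketch computes cosine similarity. Cosine similarity is assumed here as the measure, since the text does not name one, and the short vectors are hypothetical stand-ins for real SBERT embeddings, which have several hundred dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional embeddings (illustrative values only).
emb_a = [0.2, 0.7, 0.1, 0.5]
emb_b = [0.25, 0.65, 0.05, 0.55]  # imagined paraphrase of sentence A
emb_c = [0.9, -0.1, 0.6, -0.3]    # imagined unrelated sentence

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

Under this measure, a well-trained model would be expected to give the paraphrase pair a higher score than the unrelated pair.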

Word significance with TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical model that measures the significance of a word in a document or collection of documents by calculating the frequency of the word in the document and then weighting it according to how common or uncommon the word is across all the documents. The term frequency (TF) is the number of occurrences of a specific word in a given document. In contrast, the inverse document frequency (IDF) measures how common or uncommon the word is across all documents in a collection. The IDF score is calculated as the logarithm of the ratio of the total number of documents in the collection to the number of documents that contain the word. This means that words with a high document frequency will receive a lower IDF score, while words with a low document frequency will receive a higher IDF score.31 The use of document frequency in TF-IDF is crucial because it permits the weighting of words based on their rarity across the entire collection of documents, as opposed to their frequency in a single document.
Once TF-IDF values are calculated, each document can be represented as a vector whose dimensions correspond to the terms in the entire corpus. If there are N unique terms across all documents, each document is represented as an N-dimensional vector in which each dimension contains the TF-IDF score for one term. This vectorization ensures that every document, regardless of its original length, is represented in a consistent format suitable for machine learning models. These vectors capture both the frequency of terms in individual documents and the distinctiveness of terms across the entire dataset. This duality helps to balance the importance of terms in a way that is often effective for tasks like text categorization.
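The weighting and vectorization described above can be sketched in a few lines. This is a minimal illustration that assumes raw counts for TF and an unsmoothed logarithm for IDF; library implementations typically add smoothing and normalization, and the three toy documents are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors

# Toy corpus (invented): three already tokenized documents.
docs = [
    ["king", "war", "king", "castle"],
    ["king", "parliament", "science"],
    ["king", "war", "plague"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus, "king" occurs in every document and therefore receives a weight of zero, whereas "castle", which occurs in only one document, receives the highest weight in the first vector.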

Topic similarity with LDA
Latent Dirichlet Allocation (LDA) is a popular generative probabilistic model of the topic modeling family that aims to determine the set of topics present in documents. The fundamental assumption is that documents are mixtures of topics, and topics are mixtures of words.32 LDA does not understand the content as humans do; rather, it identifies patterns based on co-occurrence. LDA is typically not used as a feature engineering technique; however, it can effectively extract underlying topics from texts.33 If certain topics or themes are strongly associated with specific time periods in the corpus (e.g., industrialization topics linked to the Industrial Revolution), LDA can identify and extract these topics, aiding in the categorization.
After preprocessing, which is the first step, the second step is to decide how many topics are believed to be present in the corpus. Several methods can assist with this decision, one of which is the use of coherence scores.34 Another approach to determining the optimal number of topics is to test various numbers of topics and evaluate their performance. If the categorization improves as the number of topics increases, the number is increased further until the performance plateaus or declines, and the topic count that yields the best performance is ultimately selected.
Given a set number of topics to find, LDA iteratively assigns the words in each document to one of the topics based on the probability and distribution of the topics in the document and of the words in the topics.35 This is done until a stable state is reached where the topic assignments do not change much. For each topic, LDA calculates which words are most likely to appear. It also calculates, for each document, a distribution over topics, derived from the proportion of words assigned to each topic. This distribution represents how much of the document can be described by each topic. To obtain the feature vectors, once LDA has been run on a corpus and has assigned the words to topics, each document in the corpus can be described as a distribution over these topics. So, if there are 20 topics, each document will be represented by a vector of 20 numbers (each between 0 and 1, and summing to 1), where each number represents the proportion of the document that is described by that particular topic. In the context of time periods, if there is a consistent thematic change across time periods in the fiction texts, LDA can pick up on these changes. For instance, texts from a particular time period might heavily feature topics related to "war" and "politics", while texts from another might focus on specific names of kings and queens. The resulting topic distributions are then used to categorize the texts by their time periods.
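To make the final step concrete, the sketch below turns the word-to-topic assignments of a single document into a topic-proportion feature vector. It shows only the counting step under simplifying assumptions: the assignments are hypothetical, and actual LDA implementations also apply Dirichlet smoothing when estimating these proportions.

```python
from collections import Counter

def topic_distribution(word_topic_assignments, n_topics):
    """Proportion of a document's words assigned to each topic."""
    counts = Counter(word_topic_assignments)
    total = len(word_topic_assignments)
    return [counts[k] / total for k in range(n_topics)]

# Hypothetical topic ids (0 to 3) assigned to the ten words of one document.
assignments = [0, 0, 2, 1, 0, 2, 2, 0, 3, 0]
features = topic_distribution(assignments, n_topics=4)
```

The resulting vector has one entry per topic, each between 0 and 1, and the entries sum to 1, which is the form of feature vector described above.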
In conclusion, these machine learning methods, sentence closeness with SBERT, word significance with TF-IDF, and topic similarity with LDA, will be employed as feature engineering techniques to create features from raw data before performing classification. The subsequent sections will provide detailed explanations of how each of these methods is utilized.

Materials and method
In this study, Swedish historical fiction texts were categorized using three machine learning techniques. The aim is to address the need for improved cataloging strategies in response to the increased production and publication of literary works, particularly fiction, and the challenges associated with categorizing them by time periods. The study employs machine learning techniques to overcome these challenges and aims to explore and compare various feature engineering techniques and machine learning algorithms to improve time period categorization for fictional literature. The goal is to contribute to the development of efficient time period categorization strategies in the context of literary works, focusing on temporal settings embedded within narratives. The research employs machine learning techniques to create features and categorize texts using supervised and unsupervised learning methods. In supervised learning, the algorithm is provided with inputs and known outputs to learn how to categorize texts, while unsupervised learning identifies patterns without known outputs.36 Both techniques are utilized in the research since there are known and unknown variables. Unsupervised learning is used to create text features without human supervision, while supervised learning is used to verify whether the predicted time period categories are correct or incorrect.
The experiments are structured into two distinct phases. In the first phase, machine learning is not employed, and the focus lies on data preparation. This involves manually categorizing the data into time periods and subsequently segmenting the data into chunks, as described in the sections "Sample" and "Assignment of Time Periods to Novels". The second phase incorporates machine learning techniques and consists of four steps: preprocessing, feature engineering, classification, and evaluation. The experimental process is outlined in the "Machine learning workflow" section and encompasses the following steps:
1. Preprocessing the texts.
2. Applying LDA, TF-IDF, and SBERT to extract features from the texts.
3. Categorizing the texts into time periods with classification algorithms: SVM, logistic regression, random forest, and neural network.
4. Evaluating the results through metrics such as recall, precision, F1-score, and confusion matrix analysis.
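The evaluation metrics named in the last step can all be derived from a confusion matrix. The following is a minimal sketch under the convention that rows are true classes and columns are predicted classes; the two-class matrix holds made-up counts for illustration and is not a result of this study.

```python
def per_class_scores(confusion, labels):
    """Precision, recall, and F1 per class from a square confusion matrix.

    confusion[i][j] counts texts whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

# Made-up counts for two classes, for illustration only.
labels = ["A", "B"]
confusion = [[8, 2],
             [4, 6]]
scores = per_class_scores(confusion, labels)
```

Precision answers how many of the texts predicted as a class truly belong to it, recall answers how many of the texts belonging to a class were found, and F1 is their harmonic mean.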

Sample
This study utilized historical fiction novels from the Swedish Literature Bank, a nonprofit initiative offering free access to digital versions of Swedish literature with no copyright. Initially, 48 novels were identified and retrieved in full text by submitting the phrase "historical novels" to the search interface. Historical fiction novels were selected because they are based on historical events, which makes them easier to categorize into time periods than purely fictional novels. Since the novels lacked time period data, they were manually categorized, yielding a dataset of 35 novels, a small dataset for machine learning purposes.37 Although few, the novels were extensive, ranging from 38,000 to 186,000 words. To handle this length, the novels were divided into smaller chunks, producing 400 individual texts of around 3,500 words each, with 100 texts per time period. This segmentation increased the dataset size and rendered the texts more manageable for machine learning processes. Herein, "novels" refers to whole, uncut novels, and "texts" or "data" to the segmented versions of the novels. This distinction is crucial, as feature engineering and categorization were applied to the smaller text segments, not the whole novels.

Assignment of Time Periods to Novels
Both the Metadata Agency and Bergsten and Elleström provide suitable time period categories for literary categorization. However, the time periods of Bergsten and Elleström and of the Metadata Agency were slightly adjusted to align with the data used in this study. A considerable number of novels were set in the Medieval period, warranting its inclusion as a category. The Vasa period was excluded because only one novel was set in that period. Since it is not possible to create subcategories in this study, the Karolinska period was merged with the Era of Great Power. Moreover, some periods started and ended in the same year, leading to ambiguity. To address this, the end of one period was set a year before the start of the next. The revised time period categories can be seen in Figure 3.
Once the date ranges for the time periods were established, the novels were examined for numerical dates and then assigned their corresponding time periods. To do this, a script was developed to scan each novel for any mentioned numerical years, which are crucial for determining the temporal setting of the story. The years extracted from each novel were manually examined, and the novel was then assigned a time period, as can be seen in Figure 3. Novels without any years mentioned in the text were excluded from the dataset. If a novel cited numerous different years (30 or more) and none of the years occurred more often than the others, that novel was also excluded to avoid presumptive categorization. The remaining novels had a clear date range and could therefore be included in the dataset. As an additional measure, the titles of the novels were compared to the extracted years to improve the accuracy of the time period categorization, since the titles of historical fiction novels often contain significant details like names of kings, queens, wars, or years. When the categorization was done, each time period category contained 100 texts. The steps of extracting years from novels are illustrated in Figure 4.
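A script of the kind described above might look like the following sketch. It is not the script used in the study: the period boundaries are an assumption derived from the stated rule that one period ends the year before the next begins (the revised categories themselves appear only in Figure 3), and the function names and the sample sentence are hypothetical. In the study, the extracted years were examined manually, so the suggested period here would only support that manual step.

```python
import re
from collections import Counter

# Assumed boundaries (see lead-in); the authoritative ranges are in Figure 3.
PERIODS = [
    (1200, 1525, "Medieval period"),
    (1526, 1717, "Era of Great Power"),
    (1718, 1771, "Age of Liberty"),
    (1772, 1809, "Gustavian period"),
]

# Four-digit years from 1200 to 1809, matched as whole numbers.
YEAR_PATTERN = re.compile(r"\b(1[2-7]\d\d|180\d)\b")

def extract_years(text):
    """All years between 1200 and 1809 mentioned in the text, in order."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]

def period_for_year(year):
    """Name of the period containing the year, or None if outside all ranges."""
    for start, end, name in PERIODS:
        if start <= year <= end:
            return name
    return None

def suggest_period(text):
    """Most frequent period among the mentioned years, or None if no year is found."""
    periods = [period_for_year(year) for year in extract_years(text)]
    counts = Counter(p for p in periods if p is not None)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical sample sentence.
sample = "In 1632 the king fell in battle; the army had landed in 1630."
suggestion = suggest_period(sample)
```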

Machine learning workflow
The experiments performed in this study follow a generic workflow. The workflow specifies which steps are to be performed and in which order; however, it does not specify which algorithms to use. In the following sections, the workflow of this study is explained in four steps (see Figure 5). Lastly, a machine learning workflow specifically developed for this study is presented with details on which algorithms are involved in the process (see Figure 6).

Text pre-processing
The necessity for preprocessing varies depending on the classification task, from no preprocessing in automatic document dating,38 to extensive preprocessing in temporal extraction.39 Preprocessing also depends on the kind of algorithm used. SBERT is a pretrained language model that makes use of punctuation and stop words and hence does not need preprocessing.40 TF-IDF, by contrast, operates on the word level: the term frequency is a weight that depends on the distribution of each word. TF-IDF thus expresses the importance of words in documents, and the texts therefore need to be preprocessed.41 LDA also needs preprocessing since its purpose is to capture semantics; if there is noise in the text, the model will be confused.42
The preprocessing step in this study included lowercasing the texts, called case folding. This is done so that words with the same meaning get the same representation, which reduces the vocabulary size and allows for better generalization; for example, "King" and "king" have the same meaning, but when they are not converted to the same case, they are represented as two different words. The texts need to be split into smaller parts to be vectorized (see the next step in the process, called feature engineering); therefore, the texts are tokenized (split) into words. Not all words and punctuation need to be present; for example, words like "I", "and", "she", and "you", and special characters like ".", ":", and "?" are filler words and characters. They do not contribute to the interpretation of the text when it comes to LDA and TF-IDF, and therefore the stop words and punctuation were removed.43 Also, the most frequent words, appearing in more than 90% of the documents, and the least frequent words, appearing in less than 10% of the documents, were removed.
The texts were initially sliced into chunks of around 7,000-9,000 words, but the performance of the models was very poor; therefore, different word counts were tested. Slicing the texts into smaller chunks of around 3,500 words gave better results in the clustering and final evaluations of the models.
Output from this step: Texts are cleaned of unwanted words and special characters. The novels are also chunked into shorter texts.
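The preprocessing and chunking steps described above can be sketched as follows. The stop word list is a tiny illustrative stand-in, since the list used in the study is not given, and the order in which chunking and cleaning were applied is an assumption of this sketch.

```python
import re

# Tiny illustrative stop word list; the list used in the study is not given.
STOP_WORDS = {"i", "and", "she", "you", "the", "a"}

def preprocess(text):
    """Case-fold, tokenize into words, and drop punctuation and stop words."""
    tokens = re.findall(r"\w+", text.lower())  # \w+ skips punctuation
    return [token for token in tokens if token not in STOP_WORDS]

def chunk_words(words, chunk_size=3500):
    """Split a word list into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

tokens = preprocess("The King rode out. And she asked: is the king away?")

# A stand-in "novel" of 8,000 words, represented by integers for brevity.
chunks = chunk_words(list(range(8000)), chunk_size=3500)
```

After case folding, "King" and "king" map to the same token, and the 8,000-word stand-in is split into two full chunks and one shorter remainder.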

Feature engineering
Once the texts are preprocessed, they are transformed into a sequence of words. Since machine learning algorithms cannot use words in this raw format, the words need to be transformed into numerical values. Each text has to be represented as a list of numbers called a feature vector. Three different techniques were used in this study: LDA, SBERT, and TF-IDF.
LDA. Each document is represented by a vector of topic probabilities ranging from 0 (minimum probability that the topic resides in the given document) to 1 (maximum probability that the topic resides in the given document). Different numbers of topics were tested to discover the optimal number. Fewer than 5 topics gave poor results, where the algorithm wrongly classified a text about 50% of the time, equaling a wild guess. There was a clear improvement when the number of topics was increased: 10 topics yielded a higher predictive performance than 5, and 15 higher than 10. This pattern continued up to 20 topics. Beyond 20 topics there was no further improvement, and therefore 20 topics were chosen. From 20 to 25 topics, the share of correct predictions dropped from 79% to 78%, as the model either formed empty topics or forcibly split existing topics into new ones.

SBERT. In contrast to the experiments with LDA and TF-IDF, the data was not preprocessed because SBERT needs punctuation to distinguish between sentences. Each text in the corpus is broken down into individual sentences, and each sentence is then assigned an embedding vector by means of SBERT. These sentence embeddings are then aggregated to get a single representation for each whole text. The aggregation was done by averaging the sentence embeddings.44
TF-IDF. Since TF-IDF assesses the uniqueness of words in distinguishing documents, it is essential to remove words that lack such discriminative information. In this experiment, words that appear in more than 90% of the documents were removed since they carry little meaning and do not give information to distinguish one document from another. On the other hand, words that are too rare (appearing in less than 10% of the documents) or unique were also removed. Since the texts are only OCR-processed and most often not proofread, misspelled words occur now and then, and such a word, like a person's name that appears in only a single document, is unlikely to help identify relevant documents.
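The frequency-based filtering described for TF-IDF can be illustrated with the sketch below, which keeps only terms whose share of documents lies between the two thresholds. The function name and the toy corpus, including the misspelled token, are hypothetical.

```python
def prune_by_document_frequency(docs, min_share=0.10, max_share=0.90):
    """Keep terms whose share of documents lies within [min_share, max_share]."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    keep = {t for t, n in df.items() if min_share <= n / n_docs <= max_share}
    return [[t for t in doc if t in keep] for doc in docs]

# Toy corpus of 20 tokenized documents: "och" occurs in all of them,
# "kung" and "bonde" in half each, and the typo "kunng" in one only.
docs = [["och", "kung"] if i % 2 == 0 else ["och", "bonde"] for i in range(20)]
docs[0] = docs[0] + ["kunng"]

pruned = prune_by_document_frequency(docs)
```

In this toy corpus, the ubiquitous "och" and the one-off "kunng" are both removed, while the two terms that actually separate the documents remain.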
Output from this step: Texts are converted into so-called feature vectors, which are numerical representations. By examining the text from various perspectives, the three distinct algorithms have each produced their own representation.
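For the SBERT features, the averaging of sentence embeddings into one vector per text can be sketched as below. The three-dimensional vectors are hypothetical placeholders for the embeddings a trained model would produce.

```python
def mean_pool(sentence_embeddings):
    """Average equal-length sentence vectors into one text-level vector."""
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    return [sum(vec[d] for vec in sentence_embeddings) / n for d in range(dim)]

# Hypothetical 3-dimensional embeddings for the three sentences of one text.
sentence_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
text_vector = mean_pool(sentence_embeddings)
```

Whatever the number of sentences in a text, the pooled vector has the same length as a single sentence embedding, so all texts end up with feature vectors of equal size.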
Model training. This study uses four classification algorithms: logistic regression, SVM, neural networks, and random forest. Previous studies show that logistic regression performs well for text classification. It is used when the dependent variable is categorical and is, in its basic form, a model for binary classification problems.45 Since this study has four time period categories, it is a classification problem involving multiple classes, called a multiclass classification problem. The approach used to solve the multiclass classification problem is "one vs. all", meaning that if category A is picked, the remaining categories (B, C, D) act as one category, resulting in (A) vs. (B, C, D).
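The "one vs. all" reduction can be illustrated by showing how a list of multiclass labels is turned into one binary target list per category. This sketch covers only the relabeling; in practice, one binary classifier is trained per target list, and a new text is assigned to the category whose classifier is most confident. The labels A to D are placeholders.

```python
def one_vs_rest_targets(labels):
    """Map each class to a binary target list: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

# Placeholder labels for six texts in four categories.
labels = ["A", "B", "C", "D", "A", "C"]
targets = one_vs_rest_targets(labels)
```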
A Support Vector Machine (SVM) is a machine learning method used to classify data points into different categories. Data points are scattered in space, and the goal is to find the best hyperplane or boundary that separates these data points into different groups. SVM finds this boundary in such a way that it maximizes the gap between the closest data points from different groups. This "best" boundary is determined by a small set of key data points called support vectors. 46 In these experiments, the optimal parameter values were determined through exploration, with a selected cost parameter value of 1.7 and a gamma parameter value of 0.01. These specific values were identified as providing the best performance within the context of the experiments.
Neural networks, exemplified by the Multi-layer Perceptron (MLP) algorithm with backpropagation, can handle both linear and non-linear relationships within data, making them versatile for a wide range of tasks. This study employs Sklearn's MLP algorithm to learn from input datasets. 47 In these experiments, 100 neurons gave the best outcomes.
Random forest is an ensemble learning method that combines multiple decision trees. In this approach, each tree relies on a random subset of features, ensuring diversity in the predictions. As the number of trees in the forest grows, the overall predictive accuracy converges to a stable limit. A random forest's effectiveness depends on the performance of the individual trees and the correlation among them. 48 For these experiments, 10 trees were used, since this yielded the best results.
All four classification algorithms (logistic regression, SVM, neural networks, and random forest) are supervised algorithms. Therefore, to enable supervised learning, each novel has been pre-labeled with its respective time period (see Assignment of Time Periods to Novels). K-fold cross-validation, set to 10 folds, was used to train and test each model.
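How 10-fold cross-validation partitions the data can be sketched as below. This is a plain illustration of the splitting logic only. Whether the original experiments shuffled the texts, stratified the folds by time period, or kept pieces of the same novel together is not stated, so the shuffling with a fixed seed and the function name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

# Hypothetical corpus of 25 texts split into 10 folds.
splits = list(k_fold_indices(n_samples=25, k=10))
print(len(splits))                         # 10
print([len(test) for _, test in splits])   # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each text is used for testing exactly once and for training in the other nine rounds, so every reported score is based on predictions for texts the model did not see during training in that round.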
Output from this step: Classification models that predict the probability of a text belonging to a certain category.
Model evaluation. The evaluation and analysis of the models involve several steps. Firstly, precision, recall, and F1 scores are computed on the text level, considering each time period and machine learning technique. Secondly, a confusion matrix is employed to identify instances of misclassification between time periods, providing insights into the model's performance. Finally, the texts within a novel are aggregated to assess the extent to which an entire novel may be incorrectly classified into a time period class.
The three feature engineering techniques, LDA, TF-IDF, and SBERT, were evaluated separately and compared through the F1 score. The F1 score is the harmonic mean of precision and recall and a common measure of how successful a classifier is. The highest possible value of an F1 score is 1, indicating perfect precision and recall, and the lowest possible value is 0, which occurs if either the precision or the recall is zero; see Equation 1. 49 For a problem with many classes or categories, an overall F1 score for all the categories is not computed. Instead, one-vs.-all scoring is used to determine the F1 score for each category, where one text belongs to only one category. With this method, the performance of each category is assessed individually, as can be seen in Figure 10. Precision is the proportion of positive identifications that were actually correct. 50 TP refers to true positives, the instances correctly identified as positive by the model. FP refers to false positives, the instances incorrectly identified as positive by the model; see Equation 1:1. A good classifier should ideally have a precision of 1. Recall, also known as sensitivity or true positive rate, is the proportion of actual positives that were correctly identified by the model and should ideally also be 1; see Equation 1:2. 51 FN refers to false negatives, the positive instances that were incorrectly identified as negative by the model. The F1 score combines precision and recall into a single measure of a model's performance; see Equation 1:3. For a high F1 score, precision and recall should therefore both come as close to 1 as possible, which means that false positives and false negatives should come as close to 0 as possible.
Equation 1. Precision, Recall and F1 score
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
To gain deeper insights into the model's performance, a confusion matrix was employed. This matrix allows one to visualize the distribution of true positives, true negatives, false positives, and false negatives, offering a more detailed understanding of the model's classification performance across different categories. A confusion matrix is generated separately for each feature engineering technique and classification algorithm, resulting in a total of 12 matrices. However, only the most noteworthy ones, which reveal interesting results, are included in this presentation. The remaining diagrams are accessible on the GitHub 52 page for further examination.
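The relationship between the confusion matrix and the one-vs.-all precision, recall, and F1 score per category can be illustrated with a small sketch. The labels and predictions below are invented toy data, the shortened class names are for readability only, and the function names are hypothetical; the arithmetic follows the standard definitions of the three measures.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_scores(matrix, classes):
    """One-vs.-all precision, recall and F1 for each class."""
    scores = {}
    for i, c in enumerate(classes):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted c, true class differs
        fn = sum(matrix[i]) - tp                 # true c, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

classes = ["Medieval", "Great Power", "Liberty", "Gustavian"]
# Hypothetical true and predicted labels for eight texts.
y_true = ["Medieval", "Medieval", "Great Power", "Great Power",
          "Liberty", "Gustavian", "Gustavian", "Gustavian"]
y_pred = ["Medieval", "Medieval", "Gustavian", "Great Power",
          "Liberty", "Great Power", "Gustavian", "Gustavian"]

matrix = confusion_matrix(y_true, y_pred, classes)
scores = per_class_scores(matrix, classes)
for row in matrix:
    print(row)
print(scores["Great Power"])  # (0.5, 0.5, 0.5)
```

In this toy example the only off-diagonal counts sit between "Great Power" and "Gustavian", which is the kind of pairwise mix-up a confusion matrix makes visible and a single F1 score does not.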
Further analysis is conducted to determine how often a novel is correctly categorized based on its constituent text chunks (referred to as Pieces in the figures and as "texts" in this study). To do so, I tallied the texts within each novel that were incorrectly classified: a value of 0 was assigned for a correct categorization and 1 for an incorrect one. The number of misclassified texts divided by the total number of texts in a novel gives the misclassification rate for that novel. Each row in the table corresponds to a novel, with the "Pieces" column indicating the number of pieces in that novel. The SVM, logistic regression (lr), neural networks (nn), and random forest (rm) columns display the number of misclassified pieces divided by the total number of pieces in the novel. For instance, in Figure 14, where SBERT was employed as the feature engineering method, SVM, logistic regression (lr), and neural networks (nn) misclassified half of the pieces of the novel in the first row (0.5), while random forest (rm) misclassified all of its pieces (1.0). The higher the value, the more texts or pieces of the novel are misclassified.
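The per-novel misclassification rate described above amounts to a grouped average of 0/1 error indicators. A minimal sketch, with invented novel identifiers and labels and a hypothetical function name, could look like this:

```python
from collections import defaultdict

def misclassification_rates(novel_ids, y_true, y_pred):
    """Share of misclassified pieces per novel (0.0 = all correct, 1.0 = all wrong)."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for novel, t, p in zip(novel_ids, y_true, y_pred):
        total[novel] += 1
        wrong[novel] += int(t != p)  # 0 for a correct piece, 1 for an incorrect one
    return {novel: wrong[novel] / total[novel] for novel in total}

# Hypothetical data: novel 1 has four pieces, novel 2 has two.
novel_ids = [1, 1, 1, 1, 2, 2]
y_true = ["Medieval"] * 4 + ["Gustavian"] * 2
y_pred = ["Medieval", "Liberty", "Medieval", "Liberty", "Gustavian", "Gustavian"]

rates = misclassification_rates(novel_ids, y_true, y_pred)
print(rates)  # {1: 0.5, 2: 0.0}
```

Running this once per classification algorithm would give one column of the heatmap table per algorithm, with one row per novel.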

Machine learning workflow of this study
The generic machine learning workflow (Figure 5) shown at the beginning of the Method section describes which steps are included in the process, while Figure 6 below shows which algorithms are used in each step. The texts were cleaned of unwanted words and special characters before the feature engineering step (the SBERT data remained uncleaned). Each feature engineering technique (LDA, SBERT, TF-IDF) was applied separately to the data to obtain sets of feature vectors. Then logistic regression was used to predict which time period category a text belonged to. Lastly, these predictions were evaluated with the F1 score: each feature engineering technique was evaluated separately for each time period category.

Results
The results are presented separately for each feature engineering technique, showing how each performed in terms of precision, recall, F1 score, and confusion matrix for each time period category. The results are divided into three sections. The first, "The performance metrics on text level," shows the performance of each feature engineering technique across four classification algorithms. Moreover, the average F1 score over all time period classes is calculated.
Section "Time period analysis with confusion matrix on text level" shows a detailed summary of the performance of a classification algorithm.
Section "Performance metrics on novel level" shows how each novel is classified based on its texts. Three heatmaps are provided, one for each feature engineering technique. Each heatmap is segmented into four distinct time periods, namely Medieval, Era of Great Power, Gustavian Period, and Age of Liberty. Within each of these time periods, experiments were conducted using four different classification algorithms: SVM, random forest, neural networks, and logistic regression. On the color scale, deeper shades of green indicate more accurate predictions, while shifts toward yellow and orange indicate less accurate ones.

The performance metrics on text level
The first figure in the results (Figure 7) shows how the topic-based machine learning technique LDA performed across the four time period categories. Of the four categories, Age of Liberty had the highest precision, recall, and F1 scores, followed by the Medieval period. However, both Era of Great Power and Gustavian period had a lower predictive performance of 0.74, which corresponds to correct predictions 74% of the time. Figure 8 shows that TF-IDF has high precision, recall, and F1 scores across all time period categories. The Medieval period and Age of Liberty have the highest scores, followed by Era of Great Power and Gustavian period.
In contrast to TF-IDF, Figure 9 shows that SBERT has low scores on all metrics. The Medieval period and Age of Liberty have scores close to 0.5, which means that only about 50% of the texts were categorized correctly. Era of Great Power and Gustavian period have even lower F1 scores, which means that more texts were categorized incorrectly than correctly.

Comparison between techniques with F1-score
Figure 10 compares only the F1 scores of the three techniques.
The data shows that TF-IDF performed well across all four time periods, with scores ranging between 0.92 and 0.99. LDA also performed well, with scores ranging between 0.74 and 0.92. SBERT had the lowest scores overall, ranging from 0.39 to 0.521. In terms of the specific time periods, the Age of Liberty has the highest F1 scores across all three techniques, while the Gustavian Period has the lowest.

Confusion matrix
In this section, the results are presented through a selection of confusion matrices chosen for the insights they provide into the data. While a total of 12 confusion matrices were generated, only a subset is included here. The matrices featured here highlight specific instances where two categories become mixed, revealing potential patterns and areas for improvement in the model's performance. Figure 11 depicts the confusion matrix for LDA and neural networks. Regardless of the choice of classification algorithm, a notable pattern emerges: the time period classes "Era of Great Power" and "Gustavian period" overlap to a degree, leading to misclassification between these two categories. The same pattern can be observed for TF-IDF; see Figure 12. For SBERT, it is difficult to discern any pattern in the confusion matrices because prediction accuracy is consistently low, and most classes exhibit significant mixing and misclassification.

Classification on novel level
The values in this heatmap table show to what extent portions of the novels were accurately categorized (values close to 0, depicted in green) or misclassified (values close to 1, shown in orange or red). These results provide insights into the performance of each classification algorithm in predicting the time periods of novels, helping to identify how much of a novel is misclassified. As shown in Figure 13, representing TF-IDF, a larger number of novels were correctly categorized compared to Figure 15, which represents LDA. However, while TF-IDF performed strongly in conjunction with SVM, logistic regression, and neural networks, LDA paired with random forest outperformed TF-IDF paired with random forest. Figure 14 displays the SBERT results, most of which fall within the yellow, orange, or red range, indicating that a significant portion of the texts within novels were misclassified.

Discussion
The study results suggest that TF-IDF and LDA are promising feature engineering techniques for analyzing text data across different time periods, while SBERT may be less effective in this context. Both LDA and TF-IDF performed consistently well across all four time periods, with scores ranging from 0.74 to 0.95, while SBERT had lower scores across all four periods, at or below about 0.5. Several factors may contribute to the differences in performance between these techniques. TF-IDF demonstrated excellent performance when paired with SVM, neural networks, and logistic regression, whereas it underperformed with random forest. In contrast, with LDA, all four classification algorithms (SVM, random forest, neural networks, and logistic regression) performed similarly, although the F1 scores were not as high with LDA as with TF-IDF. One explanation for the positive outcomes could be that LDA and TF-IDF, well-established techniques in natural language processing, are particularly well suited to analyzing texts spanning a diverse range of time periods. Additionally, LDA and TF-IDF rely on statistical methods that can identify patterns and trends in text data, which may be particularly useful for identifying similarities and differences across periods. SBERT, however, displayed poor performance, possibly because the texts are written in old Swedish, while SBERT is trained on newer Swedish. SBERT is a more recent technique that employs neural network models to generate vector representations of sentences, and using too many sentences may result in lower-quality embeddings, which affects the categorization. Another factor contributing to the suboptimal SBERT results could be the novel chunking methodology, as novels were segmented based on word count rather than, for instance, chapters.
As Saarti discussed, fiction content analysis is challenging due to its multimodality and interpretational nature. 53 Nonetheless, as the results of this study suggest, machine learning techniques can be used to effectively find and analyze patterns and relationships in large amounts of data and categorize features of historical fiction texts. The categorization of fiction has traditionally relied on formal and external aspects such as genre, literary form, author, place, and language. However, as pointed out by Almeida and Gnoli, and as corroborated by the results of this study, a more effective approach to categorizing fiction is to consider the actual content of the novels. 54 By doing so, it becomes possible to capture the thematic essence of fiction novels and provide a measure of predictive performance in categorization.
Additionally, as machine learning techniques become more available, areas of fiction categorization that were previously explored manually are now explored with machine learning, such as identifying whether a text is fiction or nonfiction, as done by Manger, 55 or, more commonly, categorizing based on genre 56 or publication dates. 57 To improve the categorization of fiction, it is necessary not only to reassess traditional categories like genre and publication date with machine learning but also to consider less traditional approaches, such as time period categorization. Time as such is not a new phenomenon when organizing literature in libraries; still, only a small fraction of literature has been categorized into time periods. The findings of this study have several implications for future research and practical applications. First, researchers and practitioners interested in analyzing text data across different time periods may benefit from using LDA or TF-IDF as feature engineering techniques, as these techniques have shown consistent performance across all time periods. Second, more experiments should be done to optimize SBERT for time period categorization, for example by varying the number of input sentences or chunking the texts in different ways. This study had some limitations, including the small sample size and insufficient data to cover the Vasa period. Future research could explore the performance of these techniques on larger or more diverse datasets with more time periods and investigate different preprocessing and categorization algorithms. Moreover, this study explored one-to-one categorization, meaning that one text could belong to only one time period; future research could explore multiple possible categories for each novel.
The analysis of the confusion matrices provides valuable insights into the classification results and suggests potential reasons for the mixing of two time periods, particularly the "Era of Great Power" and the "Gustavian period." The confusion matrices consistently show an overlap between these two time period classes, regardless of the feature engineering technique (LDA, TF-IDF, or SBERT). This recurring misclassification suggests that there may be similarities or shared characteristics between novels from these two historical eras, making it challenging for the model to differentiate them accurately. Further feature analysis or refinement may be needed to better distinguish between these two categories. The observation that TF-IDF shows the same pattern as LDA of mixing between the "Era of Great Power" and "Gustavian period" indicates that this issue is not specific to one algorithm but might be related to the data. It raises the question of whether some common terms or themes span these two periods, potentially leading to confusion during classification.
The heatmaps (Figures 13-15) visually represent the extent to which a novel is misclassified. In these heatmaps, a value of 0 indicates that 0% of the novel is misclassified, while a value of 1 signifies that 100% of it is misclassified. When catalogs create metadata, a common guideline for representing the time periods of a literary work is that at least 20% of the work's content should be dedicated to describing one or more time periods. Notably, for most novels in the heatmaps, less than 50% of the pieces were misclassified, indicating that even if the entire novel is not correctly classified, valuable hints can be gained regarding the likely time period it represents. This level of granularity allows for a more nuanced understanding of the thematic elements within a novel, potentially aiding in the categorization process despite occasional misclassifications.
These heatmaps also reveal that dividing novels into smaller pieces and classifying them individually works effectively, as the classification is mostly consistent across different classifiers when the pieces are reassembled into a whole novel. This consistency in predicting both correct and incorrect classes is more pronounced with TF-IDF and LDA than with SBERT, which does not exhibit such a clear pattern.
A closer inspection of one specific novel, Novel 19, is instructive. The novel belongs to the class "Era of Great Power" but was wrongly classified as "Gustavian period" by all techniques. This misclassification occurred because the novel's theme revolves around sea-faring, ships, captains, war, and Russia. During the Gustavian period, a war against Russia in 1788 involved sea-faring activities near the Finnish border. Novel 19's storyline, which centers on a war against Russia and shares various aspects with the Gustavian period, led the machine learning techniques to associate it more closely with that time period. This example highlights the nuanced challenges these models face in capturing and differentiating similar themes and historical contexts within literary works, sometimes resulting in misclassifications that need further investigation and refinement.

Conclusion
This study aimed to evaluate and compare three feature engineering techniques to discover which one excels at engineering features for categorizing historical fiction novels into their correct time period: Medieval Period, Era of Great Power, Age of Liberty, or Gustavian Period. This was done by experimenting with the following machine learning techniques: LDA, TF-IDF, and SBERT. Overall, the results suggest that TF-IDF and LDA are promising feature engineering techniques for categorizing text data across different time periods, while SBERT produced poor results for all four time periods. Examining confusion matrices has provided valuable insights into the classification results, particularly revealing persistent challenges in distinguishing between the "Era of Great Power" and the "Gustavian Period." Regardless of the feature engineering technique used (LDA, TF-IDF, or SBERT), recurrent misclassifications suggest potential similarities or shared characteristics between novels from these two historical eras. This necessitates further feature analysis or refinement to enhance differentiation. Furthermore, the effectiveness of dividing novels into smaller pieces for individual classification is highlighted. This approach demonstrates consistent predictions when pieces are reassembled into a whole novel, particularly with TF-IDF and LDA, and offers valuable insights into a novel's likely time period, aiding categorization despite occasional misclassifications.

Figure 1. Time periods with date ranges.

Figure 2. Illustration of training data composition with descriptive features and target time period categories for preceding machine learning application and predictions.

Figure 4. Phase 1: The process of extracting years from novels without machine learning.

Figure 6. Specific machine learning workflow for this study.

Figure 7. Performance of LDA with SVM, random forest, neural networks, and logistic regression across different time period categories.

Figure 8. Performance of TF-IDF with SVM, random forest, neural networks, and logistic regression across different time period categories.

Figure 9. Performance of SBERT with SVM, random forest, neural networks, and logistic regression across different time period categories.

Figure 10. Comparison of F1-score for each feature engineering technique (TF-IDF, LDA, SBERT) and machine learning algorithms (SVM, random forest, neural networks, and logistic regression) across time periods.