Detecting susceptible communities and individuals in hospital contact networks: a model based on social network analysis

In the real world, the spread of various epidemics can be modelled using the SEIR disease transmission model. Early detection and protection of susceptible individuals is an effective way to control the spread of infectious viruses. This paper presents a study that detects susceptible communities in a hospital contact network, which encompasses patients, nurses, doctors and managers. The goal of our work is to identify susceptible communities containing patients and healthcare workers, and to analyse the independent contact networks of the various roles to determine the high-influence nodes within the hospital contact network. If these high-influence nodes are part of a susceptible community, they should be the focus of observation to prevent the spread of the virus. The proposed model combines social network analysis and machine learning methods with the disease transmission model. This study employs the classic overlapping community detection method CPM for community detection and the PageRank algorithm to rank node influence. The experimental results, obtained from real-world hospital contact networks over a 4-day period, demonstrate the effectiveness of the model.


Introduction
In recent years, as the quality of human life has improved, more and more people have become concerned about healthcare. Healthcare refers to the maintenance and improvement of physical and mental well-being and the prevention of illness, disease and injury. Many researchers have studied this issue.
At the same time, artificial intelligence (AI) technology is transforming the healthcare industry. AI involves utilising computer algorithms and models to perform tasks that normally require human intelligence, such as learning, reasoning and problem solving. The main advantage of AI is its ability to analyse massive data at a faster and more accurate rate than humans, and it is this "data processing" and "deep analysis capability" that is further advancing the development of healthcare.
AI-based healthcare refers to the use of AI technologies in the healthcare industry. The goal of integrating AI into healthcare is to enhance the efficiency and effectiveness of healthcare delivery and improve patient outcomes. AI can be used to analyse large amounts of medical data to identify patterns and make predictions, assist in medical diagnosis and treatment planning, monitor patient health and alert healthcare providers to potential issues, and automate routine tasks to improve efficiency. AI-based data analysis can help healthcare providers and researchers to identify patterns and insights within the data that may be difficult or impossible to detect using traditional statistical methods.

AI-based data analysis applications in healthcare
Practically, AI-based data analysis applications in healthcare include medical imaging analysis, virtual health assistants, personalised medicine, drug discovery and predictive analytics. At present, the significance of data analysis methods in the healthcare industry can be viewed from multiple perspectives. The following aspects are particularly noteworthy:
• Medical imaging analysis - Data analysis algorithms can be used to analyse medical images, such as X-rays (Savadjiev et al., 2019), MRIs and CT scans (Shomirov & Zhang, 2021), to identify patterns and anomalies that may be difficult for human radiologists to detect (Chen & Sung, 2021; Ker et al., 2017). This can help with earlier and more accurate diagnosis of conditions such as cancer and heart disease.
• Predictive analytics and disease diagnosis - Machine learning algorithms can be trained to predict health outcomes (for example, in dermatology and radiology) based on patient data, such as vital signs, medical history and lifestyle factors (Manne & Kantheti, 2021). These predictions can be used to identify patients at high risk of developing certain conditions and enable early interventions to prevent or manage the disease. Applying machine learning and deep learning algorithms to large volumes of medical data can help medical professionals predict and diagnose disease faster and more accurately (Lauritsen et al., 2020).
• Patient management and care - AI algorithms can enable personalised patient management and care, which involves analysing vast amounts of medical data to predict disease progression, support timely intervention and treatment, and improve patient survival rates and quality of life (Jadczyk et al., 2021; Tătaru et al., 2021).
• Healthcare resource allocation and utilisation - Analysing medical data can enable healthcare institutions to better understand the distribution and utilisation of medical resources, thereby enabling the timely adjustment of resource allocation and utilisation to improve the efficiency and quality of medical services. For example, the use of artificial intelligence can help alleviate the regional differences in China's medical resources and address the problem of unbalanced resource utilisation (Deng et al., 2023; Kong et al., 2019).
In addition, we can combine social network analysis with healthcare and apply traditional algorithms or machine learning methods to the analysis of healthcare data, so as to improve human well-being.

Social network analysis and data analysis in healthcare
Social network analysis (SNA) plays a crucial role in studying disease propagation models. The spread of infectious diseases is a social behaviour, and SNA can help us understand the relationship between social network structure and transmission behaviour, thereby improving the prediction and control of disease transmission.
Specifically, SNA can help us understand the following aspects:
• Transmission path and risk - SNA can help us identify the transmission path and risk of diseases in social networks, thus enabling medical institutions to predict and control the spread of diseases.
• Disease transmission speed and scale - SNA can help us predict the transmission speed and scale of diseases in social networks, thus enabling early preparation of epidemic prevention and control.
• Potential transmission barriers and intervention strategies - SNA can help us identify potential transmission barriers and intervention strategies to reduce the risk and impact of disease transmission.
In recent years, the early detection of infectious diseases has become a significant research focus in computational epidemiology and social network/complex network analysis, particularly due to the outbreaks of Coronavirus disease 2019 (COVID-19) and influenza. Infectious diseases often spread rapidly, widely and unpredictably, causing enormous losses of life and property. Typically, infectious disease models incorporate population behaviour and interpersonal relationships in contact networks.
Our research aims to detect the susceptible population by applying social network analysis to contact networks and combining it with epidemic models, which is a meaningful and useful way to reduce the damage caused by infectious disease. In this work, we focus on designing a model that detects communities in the hospital contact network and mines key nodes within these communities using social network analysis methods.
To summarise, the major contributions of this work are as follows:
• Susceptibility detection model - We propose a susceptibility detection model to characterise the susceptibility of communities in the patient contact network. We also consider the four different roles contained in the contact network: patients, doctors, nurses and hospital managers. If a community contains no patient, it is not a susceptible community. Therefore, our model can not only detect a community but also determine whether it is a susceptible community, so as to decide whether to implement epidemic prevention and control in that community. In particular, we utilise the traditional Clique Percolation Method (CPM) to detect overlapping communities.
• Detecting key nodes in susceptible communities - We use a machine learning method to calculate and rank the importance of nodes of each role (patients, doctors, nurses and managers) within the network of their respective role, and determine whether they belong to the susceptible communities detected above, to obtain the key nodes that need to be closely monitored for infection.
• Results and visual analysis based on real-world contact network data - We conduct extensive experiments on a real-world hospital contact network. We analyse the contact network data for each of the 4 days separately to detect highly cohesive susceptible communities, analyse the respective contact networks of patients and healthcare workers, and visualise our results.
The rest of this work is structured as follows. In Section 2, we explain the SEIR (Susceptible-Exposed-Infected-Recovered) model and review existing research on social network analysis for epidemic spreading models. Section 3 gives the fundamental definitions related to communities in contact networks, introduces the basic community detection and key node mining algorithms, and elaborates the model and algorithm of this paper. In Section 4, we apply our model to real datasets, test its effectiveness, and present and analyse the experimental results. Section 5 summarises the work and proposes future work.

Related work
In this section, we review the state of the art in the SEIR model and social network analysis in epidemics.

SEIR model in epidemics
The SEIR model is a widely used mathematical model for studying the spread and control of infectious diseases. Numerous studies have focused on the application of SEIR models in epidemiology, with the aim of improving our understanding of disease transmission dynamics and developing effective prevention and control strategies. Some notable works include the analysis of the SEIR model in the context of SARS, COVID and MERS outbreaks, the study of transmission patterns and parameters for various infectious diseases using SEIR models, and the development of optimisation algorithms for parameter estimation and model fitting.
The SEIR model incorporates four states: "S" is susceptible, "E" indicates exposed, "IA" is infectious-asymptomatic and "R" means recovered. According to existing studies, an individual in the susceptible state can transition to the infectious-symptomatic state at rate β per unit time, while an individual in the exposed state transitions to the infectious state with infection probability γ per unit time (Huang et al., 2020). Thus, developing effective methods for detecting susceptible individuals is crucial for preventing and controlling the spread of epidemics. Figure 1 provides an illustration of the SEIR model in the contact network.
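To make the compartment transitions concrete, the following is a minimal sketch of a deterministic SEIR simulation using forward-Euler steps. It follows the common textbook formulation rather than the exact parameterisation of Huang et al. (2020); the function name, the rate names (`transmission_rate`, `incubation_rate`, `recovery_rate`) and all numerical values are illustrative assumptions.

```python
def simulate_seir(s0, e0, i0, r0, transmission_rate, incubation_rate,
                  recovery_rate, days, dt=0.1):
    """Textbook SEIR dynamics integrated with forward-Euler steps.

    All rate names and values are hypothetical; they are not taken
    from the paper or from Huang et al. (2020).
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r                      # closed population
    history = [(s, e, i, r)]               # one snapshot per day
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = transmission_rate * s * i / n * dt   # S -> E
            new_infectious = incubation_rate * e * dt          # E -> I
            new_recovered = recovery_rate * i * dt             # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical ward of 75 people with a single infectious individual.
history = simulate_seir(74, 0, 1, 0, transmission_rate=0.5,
                        incubation_rate=0.2, recovery_rate=0.1, days=30)
```

Each entry of `history` holds the sizes of the four compartments at the end of a day, so the susceptible count can be followed over time while the total population stays constant.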
Figure 1 depicts the virus carriers (patients) as grey nodes and their direct contacts, who are highly susceptible to the infection, as purple nodes. It is worth noting that although the green nodes do not have direct contact with the virus carriers, they are still highly susceptible to the virus because they have contact with three purple nodes, i.e. direct contacts. Consequently, they belong to the susceptible group. Similarly, the yellow node, which has contact with a green node, belongs to the second-level susceptible population.
Funk et al. (2018) developed a stochastic SEIR model for real-time forecasting of infectious disease and applied it to dengue fever in Thailand. Cauchemez et al. (2008) estimated the impact of school closure on the transmission of seasonal influenza using the SEIR model; their results show that school closure can reduce the incidence of influenza by up to 40%. Y. Yang et al. (2021) modelled prevention and control strategies for COVID-19 by detecting γ-quasi-cliques in contact networks. Ganyani et al. (2020) used a modified SEIR model to estimate the generation interval of COVID-19, which is the time in a transmission chain, and found that the generation interval for COVID-19 is shorter than that for SARS and MERS. Heesterbeek et al. (2015) reviewed the use of the SEIR model in studying the transmission dynamics of infectious diseases under different factors, such as age groups and locations.

Social network analysis
Social network analysis (SNA) is an analytical method for the detailed study of human communication, interaction and other complex behavioural patterns (Y. Yang, Peng, et al., 2022). It is used to study individuals/organisations and their connection structures. Social network data is usually presented in the form of a social graph (shown in Figure 1), where nodes represent individuals and edges represent relationships between individuals, such as personal friendships or contact records with other individuals. According to our previous work (Y. Yang, Hao, et al., 2022), there are four kinds of cliques in the evolution of a dynamic social network: unchanged cliques, changed cliques, added cliques and vanished cliques. Similarly, for a contact network, the corresponding susceptible communities also change over time.
Before the emergence of COVID-19, several studies investigated the transmission of various viruses through contact networks, including Acquired Immune Deficiency Syndrome (AIDS) (Bearman et al., 2004), sexually transmitted diseases (STDs) (C. Wang et al., 2020) and plague (Chang et al., 2020). Kuperman and Abramson (2001) indicate that some disease contact networks can be described as small-world networks, which have two characteristics: (1) dense local clusters composed of local connections and (2) multiple distinct local clusters in different communities or areas connected by long-distance links (Watts & Strogatz, 1998). Creswick and Westbrook (2010) applied social network analysis to the informative interactions of doctors, nurses, health professionals and clerks in the social network of the Australian Metropolitan Teaching Hospital. This research suggests that social network analysis can be used to examine complex medication advice-seeking interactions among hospital ward staff, providing useful quantitative baseline data for comparing the impact of interventions on interactions. Vanhems et al. (2013) utilised a modified SEIR compartmental mathematical model for predicting the epidemic dynamics of COVID-19, taking into account the presence of pathogens in the environment and various interventions. The findings of this study demonstrate that quarantining individuals who have had close contact with infected individuals and isolating confirmed cases are effective in controlling the spread of COVID-19.
After the outbreak of the COVID-19 epidemic, Z. Yang et al. (2020) used the 2020 COVID-19 epidemic data, SARS data and recent population migration data in China to train an AI model to obtain the epidemic curve of COVID-19, so as to predict the peak period and scale of the epidemic in the near future, confirming the conclusion that it was necessary to extend the epidemic control period. McCall (2020) proposed that protecting healthcare workers is one of the keys to preventing outbreaks of disease transmission, but this work did not give specific plans and measures for protecting healthcare workers. Vaishya et al. (2020) proposed several applications of AI in the COVID-19 pandemic, including contact tracing and monitoring, drug and vaccine development, and patient case prediction; however, their work only outlines how AI might be applied to disease transmission and prevention, without giving specific methods or models.
To address these shortcomings, our work proposes a complete model that uses AI methods to analyse contact networks, track contacts and detect susceptible communities, so as to prevent and control the epidemic.

Model for the detection of the susceptible in hospital contact network
The problem statement of our work is characterised in the first subsection of this section, and fundamentals of community detection and key nodes ranking are elaborated in the second subsection.

Problem statement
Usually, a hospital contact network includes several roles, such as doctors, nurses, managers and patients. A patient may carry the virus of an infectious disease (such as COVID-19). The nurses, doctors and managers who have contact with the patient may become virus carriers (exposed, susceptible) through multiple or prolonged contacts with the patient, and they may then carry the virus to other healthcare workers who do not have direct contact with patients (other nurses, doctors or managers in the office areas).
Therefore, the purpose of this paper is to design a model that can detect healthcare workers who have frequent or long contact with patients and who also have frequent contact with other workers in their office areas. For these healthcare workers, we suggest conducting stricter epidemic prevention tests to reduce the possibility of virus transmission.
Accordingly, the model is divided into the following three steps:
• Detecting contact networks where patients and healthcare workers coexist and mining susceptible communities
• Calculating the importance of healthcare workers with different roles within their respective scope of work separately, ranking their importance, and obtaining a Top-N list of the healthcare workers who are most likely to contact others
• Determining whether the healthcare workers in the Top-N list belong to the susceptible population (i.e. exist in a susceptible community).
Figure 2 gives an overview of our model and explains our goal in detail. First, we detect the communities in the contact network. Then we rank the nodes in the sub-contact networks of the different roles (Nurse, Doctor, Patient and Manager); the Top-N nodes are shown as yellow nodes in each subgraph. Finally, we determine which nodes from the Top-N node lists belong to the detected communities.

Detection of the susceptible in hospital contact network
To understand our method well, at the beginning of this section, we will briefly introduce some basics of social network analysis methods that we will apply in our model.
(1) Clique Percolation Method (CPM), References: Adamcsek et al. (2006), Zhang et al. (2007), and Kumpula et al. (2008). The CPM is used to discover overlapping communities. A clique is a collection of nodes in which any two nodes are connected, i.e. a complete subgraph. The nodes within a community are strongly interconnected and have a high edge density, making it likely for cliques to form. As a result, edges within communities are likely to form large complete subgraphs, while edges between communities are unlikely to do so. Community detection can therefore be performed by identifying cliques in the network. A k-clique refers to a complete subgraph with k nodes in the network. When two k-cliques share k-1 nodes, they are considered to be connected. The collection of all connected k-cliques constitutes a k-clique community. The CPM algorithm identifies k-cliques and then combines them to form larger clusters based on their overlapping nodes. The CPM algorithm includes several steps:
• Identify all k-cliques in the network
• Create a graph where each k-clique is a node and there is an edge between two nodes if the corresponding cliques share k-1 nodes
• Identify the connected components of this graph, where each connected component represents a cluster of k-cliques that share overlapping nodes.
The parameter k determines the size of the cliques and the level of granularity of the resulting clusters. A larger k imposes a stricter cohesion requirement and yields smaller, denser communities, while a smaller k yields larger, looser communities. The value of k is chosen by the user; we usually choose k in the interval [4, 7].
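The three CPM steps above can be sketched in a few lines of plain Python. This is a brute-force illustration suitable only for small graphs, not the implementation used in the experiments; the helper names `build_adjacency` and `k_clique_communities` and the toy edge list are assumptions made for the example.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def k_clique_communities(adjacency, k):
    """Brute-force clique percolation (exponential cost, small graphs only)."""
    nodes = sorted(adjacency)
    # Step 1: enumerate every k-clique (every pair of members is linked).
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adjacency[u] for u, v in combinations(c, 2))]
    # Steps 2-3: union-find over cliques; two cliques are joined when they
    # share k-1 nodes, so each root is one connected component.
    parent = list(range(len(cliques)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) == k - 1:
            parent[find(a)] = find(b)
    communities = {}
    for index, clique in enumerate(cliques):
        communities.setdefault(find(index), set()).update(clique)
    return list(communities.values())


# Toy graph: a 4-node complete subgraph and a triangle that share node 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5)]
communities = k_clique_communities(build_adjacency(edges), k=3)
```

With k = 3, the toy graph yields two communities, {0, 1, 2, 3} and {3, 4, 5}; node 3 belongs to both, which is the overlapping behaviour that motivates the use of CPM.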
(2) PageRank in undirected graphs, References: Page et al. (1999), Avrachenkov et al. (2015), Grolmusz (2015) and M. Wang et al. (2017). The PageRank algorithm, also known as the web page ranking algorithm, is a technique used by search engines that exploits the mutual hyperlinks between web pages (nodes) to reflect the relevance and importance of those pages (nodes).
The main idea is:
• A web page with numerous links from other web pages is considered to be more significant and therefore has a higher PageRank value.
• The PageRank value of a page that is linked to by a page with a high PageRank value will also increase.
The algorithm works by assigning a score to each page based on the number and quality of incoming links it receives. The score is determined iteratively, with each iteration representing a round of votes for the page. The initial score for each page is assumed to be 1/N, where N is the total number of pages in the graph. During each iteration, the score for each page is updated based on the scores of the pages that link to it: the score of each linking page is divided by that page's total number of outgoing links, and the resulting shares are summed.
The process continues iteratively until convergence, which occurs when the score for each page no longer changes significantly. The final score for each page represents its PageRank, which can be interpreted as the probability that a random walk starting from a random page in the graph will eventually end up at that page.
PageRank is also a useful tool for ranking pages in undirected graphs because it accounts for both the number and quality of links to a page. Pages with high PageRank scores are considered more important, and are more likely to be highly ranked in search results or other applications where ranking is important. In a social network, we replace the concept of a "web page" with a "node" and calculate the importance of that node in the network. The formula is as follows:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
where u and v represent nodes in the graph, $B_u$ represents the set of nodes connected to node u, and $L_v$ represents the number of neighbour nodes of node v.
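The iterative update described above can be sketched as a simple power iteration on an undirected graph. The damping factor of 0.85 is a conventional default and an assumption here, since the text does not state one; setting `damping=1.0` gives the plain neighbour-share update described in the text. The function names and the node labels are hypothetical.

```python
def pagerank_undirected(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on an undirected adjacency map.

    adjacency: dict node -> set of neighbours. The damping value is an
    assumed default; isolated nodes only receive the teleport share.
    """
    n = len(adjacency)
    scores = {u: 1.0 / n for u in adjacency}          # initial score 1/N
    for _ in range(max_iter):
        new_scores = {}
        for u in adjacency:
            # Each neighbour v passes on an equal share of its own score.
            incoming = sum(scores[v] / len(adjacency[v]) for v in adjacency[u])
            new_scores[u] = (1.0 - damping) / n + damping * incoming
        change = sum(abs(new_scores[u] - scores[u]) for u in adjacency)
        scores = new_scores
        if change < tol:                              # converged
            break
    return scores


def top_n(scores, n):
    """The n highest-scoring nodes, best first (ties broken by label)."""
    return sorted(scores, key=lambda u: (-scores[u], u))[:n]


# Hypothetical sub-network: one nurse in contact with three patients.
adjacency = {
    "NUR1": {"PAT1", "PAT2", "PAT3"},
    "PAT1": {"NUR1"},
    "PAT2": {"NUR1"},
    "PAT3": {"NUR1"},
}
scores = pagerank_undirected(adjacency)
ranking = top_n(scores, 1)
```

In this toy network the nurse is connected to every other node and therefore receives the highest score, so it is the single entry of the Top-1 list.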
In the domain of social network analysis, several methods have been developed to rank the importance of nodes, including the HITS (Hyperlink-Induced Topic Search) algorithm, degree centrality, closeness centrality, betweenness centrality, eigenvector centrality and the PageRank algorithm. Out of these methods, PageRank is considered more suitable for large-scale networks due to its better scalability. Originally developed to evaluate the weight of web pages, the algorithm takes into account the propagation power of nodes, which is directly proportional to their influence on the entire network. Therefore, nodes with stronger propagation ability are deemed to be more important. Consequently, we choose to employ the PageRank algorithm to rank the importance of nodes in the contact network in this work.
(3) Our Method. To detect the susceptible communities and the high-influence individuals in the contact network, we use the CPM method to mine overlapping communities and keep those that include both patients and healthcare workers. We then apply the PageRank method to the contact network of each of the four roles (Nurse, Patient, Doctor and Manager) separately to obtain the top-N node lists. Finally, we find the nodes that belong both to a top-N node list and to a community. The process of our method is divided into three steps:
• Detect susceptible communities by the CPM method on the undirected contact graph
• Rank the healthcare workers who frequently contact patients by the PageRank method on the sub-contact network of each role, respectively
• Determine whether the healthcare workers who frequently contact patients are included in the susceptible communities obtained in step 1 and list them.
Algorithm 1 (Epidemic Prevention Model for Hospital Contact Network) describes our method in pseudocode.
In Algorithm 1, we apply the CPM algorithm and the PageRank algorithm, and then traverse and check the communities and the top-N nodes obtained by these two algorithms. Therefore, the time complexity of this algorithm is composed of three parts. Generally, the worst case of the CPM algorithm is $O(k^3 \times n^3)$, where k is the given parameter (the size of the cliques being searched for) and n is the number of nodes in the network. The time complexity of the PageRank algorithm is $O(n + e)$, where n is the number of nodes and e is the number of edges in the network. For the third part, since we traverse each community obtained above and each node in the community, the time complexity is $O(c \times C)$, where C is the number of communities obtained by the CPM algorithm and c is the number of nodes in each community. Therefore, the worst-case time complexity of Algorithm 1 is the sum of these three parts, $O(k^3 \times n^3) + O(n + e) + O(c \times C)$.

Dataset
The dataset is a real-world contact network among patients and between patients and healthcare workers (Nurse, Doctor and Manager) in a hospital ward in Lyon, France, collected over 4 days (Vanhems et al., 2013). The study included 46 healthcare workers and 29 patients, and contacts were recorded once every 20 seconds. In our work, to identify people in frequent contact, we take every 3 records (60 seconds) as one counting unit of contact.
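As an illustration of this counting rule, the sketch below groups 20-second records into 60-second units. It reflects one plausible reading of the rule, namely that the number of records for a pair is divided by three and pairs with fewer than three records are dropped; the record layout `(timestamp, id_a, id_b)`, the function name and the sample records are hypothetical.

```python
from collections import Counter

RECORDS_PER_UNIT = 3  # three 20-second records make one 60-second unit


def contact_units(records):
    """Map each unordered pair to its number of complete 60-second units.

    records: iterable of (timestamp, id_a, id_b) tuples, one per
    20-second contact record (an assumed layout, not the dataset's own).
    """
    counts = Counter(frozenset((a, b)) for _, a, b in records if a != b)
    return {pair: n // RECORDS_PER_UNIT
            for pair, n in counts.items() if n >= RECORDS_PER_UNIT}


# Hypothetical records: seven for one pair, two for another.
records = [(20 * t, "NUR1", "PAT1") for t in range(7)]
records += [(20 * t, "MED1", "PAT1") for t in range(2)]
units = contact_units(records)
```

Here the nurse-patient pair accumulates two full units, while the doctor-patient pair is discarded because its two records do not fill a single 60-second unit.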
Table 1 shows the data of this contact network for the 4 roles over the 4 days. The table shows the number of nodes and edges, the average degree and maximum degree of the network, and the network density of each role network. Compared with the network of all roles (Dataset All), the density of the patient network is the lowest, indicating that patients have little contact with each other. In contrast, the density of the nurse network is the highest, indicating that nurses have a lot of contact with each other.
Table 2 shows the contact network of all roles for each single day. This table shows the number of nodes and edges, and the average degree and maximum degree of the network for each day. The contact network of Table 2 is shown in Figure 3, where each subfigure represents a single day.
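The statistics reported in Table 1 can be computed from an edge list as sketched below, assuming the usual definitions for an undirected simple graph: average degree 2m/n and density 2m/(n(n-1)) for n nodes and m edges. The function name and the toy edge list are hypothetical, and the paper's own tooling may define these quantities differently.

```python
def network_summary(edges):
    """Basic statistics of an undirected simple graph given as an edge list.

    Assumes at least two nodes; duplicate edges are counted once.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    n = len(adjacency)
    m = sum(degrees) // 2                  # each edge is counted twice
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2 * m / n,
        "max_degree": max(degrees),
        "density": 2 * m / (n * (n - 1)),
    }


# Hypothetical network: three nurses in mutual contact plus one patient.
summary = network_summary([("NUR1", "NUR2"), ("NUR2", "NUR3"),
                           ("NUR1", "NUR3"), ("NUR3", "PAT1")])
```

A density close to 1 means that almost every pair of nodes is in contact, which is the sense in which the nurse network is described as the densest.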
In addition, to better illustrate the different people with different roles included in the daily contact network, we provide a hospital contact network graph within 4 days containing different role distributions in Figure 4, where each sub-graph represents a single day.

Results and analysis
(1) As mentioned in Section 3, the first step is the detection of k-clique communities by the CPM method. The resulting k-clique communities for different values of k are shown in Table 3. To evaluate the quality of community detection for different values of the parameter k, the EQ value is reported in Table 3 (Nicosia et al., 2009). The EQ (Enhanced Modularity-based Quality) metric is a measure of the quality of overlapping community detection results. It can be used to compare the results of different community detection algorithms and to select the best community partition.
In Table 3, for the dataset Day1, the highest value of EQ is obtained when k = 4. Thus we keep k = 4 as the parameter and save the resulting community partition of Day1. In the same way, we examine the EQ values for Day2 to Day4 and select the appropriate community division parameters. The grey row in Table 3 indicates the value of k with the highest EQ value for each single day.
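The selection of k by the highest EQ value can be sketched as follows. The scoring function implements one widely used extension of modularity to overlapping communities, in which each node's contribution is divided by the number of communities it belongs to; this is an assumption and may differ from the formulation of Nicosia et al. (2009) used for Table 3. The function names and the toy partitions are hypothetical.

```python
def overlapping_modularity(adjacency, communities):
    """Modularity extended to overlapping covers (assumed definition).

    adjacency: dict node -> set of neighbours (undirected).
    communities: list of node sets, possibly overlapping.
    """
    m = sum(len(neighbours) for neighbours in adjacency.values()) / 2.0
    membership = {u: 0 for u in adjacency}     # communities per node
    for community in communities:
        for u in community:
            membership[u] += 1
    total = 0.0
    for community in communities:
        for u in community:
            for v in community:
                linked = 1.0 if v in adjacency[u] else 0.0
                expected = len(adjacency[u]) * len(adjacency[v]) / (2.0 * m)
                total += (linked - expected) / (membership[u] * membership[v])
    return total / (2.0 * m)


def best_k(adjacency, partitions_by_k):
    """Return the k whose partition has the highest score."""
    return max(partitions_by_k,
               key=lambda k: overlapping_modularity(adjacency, partitions_by_k[k]))


# Toy graph: two triangles joined by the edge (2, 3).
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
             3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
partitions_by_k = {2: [{0, 1, 2, 3, 4, 5}], 3: [{0, 1, 2}, {3, 4, 5}]}
chosen_k = best_k(adjacency, partitions_by_k)
```

For a cover without overlaps this score reduces to ordinary modularity, so the two-triangle partition scores 5/14 while the single all-node community scores 0, and k = 3 is selected.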

Conclusion
Overall, since several infectious diseases can be modelled with the SEIR model, an effective way to stop the spread of a virus is to detect and protect susceptible individuals and communities as early as possible. Based on this premise, this paper uses the CPM community detection algorithm on the hospital contact network to obtain susceptible communities, and then analyses the independent contact network of each role with PageRank to obtain high-influence nodes. In the experimental part, the experiments are conducted on real-world hospital contact data, and the most suitable parameter k for the k-clique communities and parameter n for the Top-n node lists are found. The results show that the model can effectively solve our problem.
In future work, we intend to use dynamic network community detection algorithms to monitor timestamped contact networks in real time, analyse and evolve disease transmission models, and find susceptible communities and key individuals.

Figure 1. A simulation of SEIR model in a contact network.

Figure 2. Overview of our model.
14: Set C ← CPM(G, k)
15: NurseList ← PageRank(G_n, n)
16: PatientList ← PageRank(G_p, n)
17: DoctorList ← PageRank(G_d, n)
18: ManagerList ← PageRank(G_m, n)
19: for each c in C:
20: for each node in c:
21: if node ∈ NurseList ∪ PatientList ∪ DoctorList ∪ ManagerList:
22: RS ← RS ∪ {node}
23: end
In Algorithm 1, lines 1-6 give the input data of the algorithm, including the complete contact graph G, the subgraphs G_n, G_p, G_d and G_m of the different roles, as well as the parameters k and n. Lines 7-12 initialise the community set C, the Top-n node lists of the four roles, and the result set RS. In lines 13-18, the CPM algorithm and the PageRank algorithm are used to obtain the communities of the contact network graph G and the importance-ranked node lists of the different role subgraphs. The essential idea of our model is presented in lines 19-22: the algorithm traverses each community c within the community set C, and each node within community c that is contained in NurseList, PatientList, DoctorList or ManagerList is put into the result set RS.
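The traversal in lines 19-22 can be sketched in Python as follows. The communities and Top-n lists are passed in as plain data and stand in for the outputs of CPM and PageRank; the function name `collect_key_nodes` and all node labels are hypothetical.

```python
def collect_key_nodes(communities, top_n_lists):
    """Nodes that appear both in some community and in some Top-n list.

    communities: list of node sets (stand-in for the CPM output).
    top_n_lists: dict role -> list of top-ranked nodes (stand-in for
    the per-role PageRank output).
    """
    ranked = set().union(*top_n_lists.values())
    result = []                                  # the result set RS
    for community in communities:                # line 19
        for node in sorted(community):           # line 20
            if node in ranked and node not in result:   # line 21
                result.append(node)              # line 22
    return result


# Hypothetical outputs of the two earlier steps.
communities = [{"PAT1", "NUR1", "MED1"}, {"NUR1", "NUR2", "ADM1"}]
top_n_lists = {"nurse": ["NUR1"], "patient": ["PAT1"],
               "doctor": ["MED2"], "manager": ["ADM1"]}
key_nodes = collect_key_nodes(communities, top_n_lists)
```

The doctor MED2 is highly ranked in its own role network but belongs to no community, so it is not reported, whereas NUR1 is reported once even though it appears in both communities.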

Figure 3. Contact Networks within 4 days in the hospital: (a) Contact network on Day 1; (b) Contact network on Day 2; (c) Contact network on Day 3 and (d) Contact network on Day 4.

Figure 4. The contact network of different role distributions in the hospital within 4 days: (a) Contact network of Role Distribution on Day 1; (b) Contact network of Role Distribution on Day 2; (c) Contact network of Role Distribution on Day 3 and (d) Contact network of Role Distribution on Day 4.
(3) For step 3 of our model, we determine the nodes that are in both a susceptible community and a Top-n node list, for each single day and each single community. Each subfigure in Figure 6 represents a susceptible community, where subfigures (a)-(b) represent the two communities of day1, subfigures (c)-(d) correspond to the communities of day2, and subfigures (e)-(f) and (g)-(h) represent the susceptible communities of day3 and day4, respectively. In each subfigure, the label on a node gives the role and number of the person: PAT represents a patient, NUR a nurse, MED a doctor and ADM a manager. In addition, an orange node represents a key node (i.e. a node belonging to the Top-n node list obtained by PageRank).
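Using the role prefixes above, the per-community check of step 3 can be sketched as follows. A community is treated as susceptible when it contains at least one patient and at least one healthcare worker, following the criterion stated for the model; the function names and the sample communities are hypothetical.

```python
HEALTHCARE_PREFIXES = ("NUR", "MED", "ADM")   # nurse, doctor, manager


def is_susceptible(community):
    """True if the community mixes patients with healthcare workers."""
    has_patient = any(node.startswith("PAT") for node in community)
    has_worker = any(node.startswith(HEALTHCARE_PREFIXES) for node in community)
    return has_patient and has_worker


def key_nodes_by_community(communities, top_n_nodes):
    """For each susceptible community, list its members in the Top-n list."""
    report = []
    for community in communities:
        if is_susceptible(community):
            report.append(sorted(set(community) & set(top_n_nodes)))
    return report


# Hypothetical communities for one day and a merged Top-n list.
communities = [{"PAT1", "PAT2", "NUR1", "MED1"},   # susceptible
               {"NUR2", "NUR3", "ADM1"}]           # no patient
report = key_nodes_by_community(communities, ["NUR1", "NUR2", "PAT1"])
```

Only the first community qualifies, and its key nodes NUR1 and PAT1 correspond to the nodes that would be highlighted in orange in the figure.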

Table 1. Datasets I. Contact networks in the hospital within multiple roles.

Table 2. Datasets II. Contact networks in the hospital within 4 days.

Table 3. The results of k-cliques community within 4 days.

Table 4. The node list of k-Cliques Community.

Table 5. Top-N nodelist of each role.