Single-cell RNA seq analysis of erythroid cells reveals a specific sub-population of stress erythroid progenitors

ABSTRACT Background : Erythroid cells play important roles in hemostasis and disease. However, there is still significant knowledge gap regarding stress erythropoiesis. Methods : Two single-cell RNAseq datasets of erythroid cells on GEO with accession numbers GSE149938 and GSE184916 were obtained. The datasets from two sources, bone marrow and peripheral blood were analyzed using Seurat v4.1.1, and other tools in R. QC metrics were performed, data were normalized and scaled. Principal components that capture the variation of the data were determined. In clustering the cells, KNN graph was constructed and Louvain algorithm was applied to optimize the standard modularity function. Clusters were defined via differential expression of features. Results We identified 9 different cell types, with a particular cluster representing the stress erythroids. The clusters showed differentially expressed genes as observed from the gene signature plot. The stress erythroid cluster differentially expressed some genes including ALAS2, HEMGN, and GUK1. Conclusion The erythroid population was found to be heterogeneous, with a distinct sub-cell type constituting the stress erythroids; this may have important implications for our knowledge of steady-state and stress erythropoiesis, and the markers found in this cluster may prove useful for future research into the dynamics of stress erythroid progenitor cell differentiation.


Background
Erythrocytes, a type of blood cell, are involved in a wide variety of processes, including gaseous exchange, pH and redox homeostasis maintenance, controlling vascular tone, and blood coagulation [1][2][3].
Erythrocytes are produced by a process called erythropoiesis, which occurs mostly in the bone marrow and ends in the bloodstream.Transforming multipotent hematopoietic stem cells into adult red blood cells (RBCs), which are highly functional specialized cells, requires a complex maturation process including many morphological changes [4].
Steady-state erythropoiesis produces many erythrocytes, with adults making about 2.5 × 10 6 erythrocytes every second [5].Inadequate or inhibited erythrocyte synthesis results in increased erythroid production.The steady-state erythropoiesis that normally compensates for erythrocyte loss is suppressed when inflammation from infection or tissue injury is present.These conditions favor the BMP4-dependent stress erythropoiesis pathway [6].Stressinduced BMP4-dependent erythropoiesis differs from steady-state.
Stress erythropoiesis is the process by which the body increases the production of RBCs in response to various stressors, such as anemia, hypoxia, or inflammation [6].Important for this process are cells called reticulocytes.According to studies, stress erythropoiesis causes the bone marrow to produce and release more reticulocytes [7].It is believed that this is the body's technique for rapidly increasing the amount of fully developed red blood cells in circulation in response to a stressor, thereby meeting the increased oxygen demand that results from the stress.
Numerous studies have looked into the involvement of erythropoietin (Epo) and other signaling pathways in regulating stress erythropoiesis at the molecular and cellular levels.For example, research has shown that the transcription factor hypoxia-inducible factor 1 (HIF-1) is a key regulator of stress erythropoiesis, as it controls the expression of erythropoietin in response to hypoxia [8].Also, IL-1 β promote the proliferation of stress erythroid progenitors (SEPs) [7].As the SEPs proliferate, increased expression of Epo by the kidney promotes the transition of SEPs to stress BFU-E.Epo also induces the expression of erythroferrone (Erfe), which antagonizes hepcidin and releases iron for the differentiation of stress BFU-E [9].As a result of these occurrences, stress erythropoiesis is rapidly induced, leading to the production of stress BFU-E during the subsequent 6 days and a steady increase in the percentage of reticulocytes over the subsequent 14 days of recovery [7].These stress erythroid progenitors are derived from shortterm hematopoietic stem cells (ST-HSCs) in the bone marrow [33].These cells have the potential to generate all cell lineages, however, upon homing to the spleen, signals in the splenic micro-environment commit these cells to the erythroid lineage.Stress erythroid progenitors are present in healthy individuals, but following lineage restriction, the SEPs proliferate, but do not differentiate, generating a transient amplifying population of progenitor cells [34].
In this study, we analyzed single-cell RNA-seq data of erythroid cells from bone marrow and peripheral blood in healthy adults, to determine the cellular heterogeneity of erythroid cells.The datasets used were those generated by Xie et al., [10] and Jain et al., [11] and deposited in NCBI GEO of accession numbers GSE149938 and GSE184916 respectively.

Data description
In this study, two transcriptomic data of blood cells from the bone marrow and peripheral blood deposited at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database were harnessed for data integration of erythroid cells, clustering analysis, and marker identification.The workflow is as shown in Figure 1.The considered dataset is available at the NCBI GEO database.The GEO accessions of the scRNA seq dataset are GSE149938 and GSE184916.The blood samples used by Jain et al, to obtain their scRNA seq data, GSE184916, were collected from three adult healthy donors, which have been stored in the blood bank conditions and assayed at day 1 and day 15.We used the data assayed at day 1.Xie et al., profiled the transcriptomes of 7551 human blood cells representing 32 immunophenotypic cell types, including hematopoietic stem cells, progenitors and mature blood cells derived from 21 healthy donors.We used the erythrocytes dataset obtained from the bone marrow, GSE149938.

Data integration
The two scRNA seq datasets were integrated using Seurat v4.1.1 for a more comprehensive understanding of cell behavior and function in reticulocytes.This method identifies anchors between pairs of datasets and use them to either harmonize information or transfer information from one dataset to another.The Seurat package (version 4.1.1)implemented in R (version 3.6.0)was used to remove batch effects, reduce dimension and cluster cells based on the UMI count.The two datasets were first split into two Seurat objects, after which the datasets were normalized and variable features were identified for each dataset independently, using the NormalizeData() and the FindVariableFeatures() in Seurat.The features that were repeatedly variable across the two datasets were selected for integration.The integration was then performed by first identifying anchors using the FindIntegrationAnchors() function, then integrating the two datasets together with IntegrateData() function with the default of 2000 features.

Clustering analysis
After doing some simple filtering for minimum gene and cell observance frequency cut-offs, data was initially normalized on the log scale in Seurat v4.1.1.First quality control was done to remove low quality cells (cells with low gene counts and high mitochondrial genes).The features were viewed using a violin plot.Low gene count was marked at <200 genes, while high mitochondrial level was marked at >5%.The data was then normalized in order to be able to compare the gene expression across multiple cells.
Regression techniques were used to remove further technical artifacts and minimize noise.Following quality control procedures, we determined principal components using our dataset's most variably expressed genes [12].Using techniques similar to those used by Macosko et al., [13], significant principal components were identified, which were then used for downstream analysis.These significant principal components were then carried forward to perform cell clustering and to improve visualization.Seurat's FindClusters () function [12] was used to cluster cells, at a resolution of 0.6, into an ideal number of clusters for discovery of sources of heterogeneity, and t-distributed stochastic neighbor embedding (t-SNE) was used to visualize the cells, which reduced the information contained in the chosen significant principal components to two dimensions [14].

Marker identification
'FindAllMarkers' from the Seurat package was used to detect cell type/cluster-specific signature genes, while 'FindMarkers' was applied to identify the DEGs between any two given groups.AUC scores of signature genes obtained by pROC represented the specificity of each gene for each cluster (ranging from 0 to 1; the larger the score, the more unique to each cluster).The filtered criteria for signature genes and DEGs was fold-change ≥ 1.5 or ≤0.67 and adjusted P-value ≤ 0.05.The marker with the lowest p-value was considered most significant, and named as a top marker for a particular cluster or cell type.

Limitations to the study
When compared to other nucleated cells, the number of transcripts per RBC is far lower.Since there are fewer transcripts, the principal component structure is simpler, and the resulting clustering pattern is lee resistant to random influences.Also, the datasets used for this analysis were gotten from RBC mRNA, which are not the most abundant RNA species in RBCs.As such, the heterogeneity of the erythroid cells may not have been fully captured.Instead, micro-RNAs are most abundant in RBCs and could provide a better dataset for RBCs heterogeneity analysis.

Results and discussion
Erythroid cells sub-populations as described by the differentially expressed genes Single-cell RNA-seq data of erythroid cells generated by Xie et al., [10] and Jain et al., [11] and deposited in NCBI GEO of accession numbers GSE149938 and GSE184916 respectively were obtained.The datasets were from two different sources, the bone marrow and the peripheral blood.The datasets were analyzed using Seurat v4.1.1,and other tools in R.
We integrated the two single-cell datasets followed by scaling of the dataset, linear dimensionality reduction and visualization by t-distributed stochastic neighbor embedding (t-SNE) plot.From the t-SNE analysis of the integrated datasets, we identified 26 different clusters (Figure 2 Next, the top marker genes in each of the clusters were identified and a heatmap plotted (Figure 3(A)) to show these genes and their expression pattern in the different cell types.A gene signatures plot (Figure 3(B)) was also plotted to illustrate which genes are differentially expressed in each of the cluster or the cell types.Furthermore, t-SNE plots were used to visualize the differentially expressed genes in different clusters (Figure 3(C)).The different cell types showed a number of differentially expressed markers.For example, only one cluster showed a high expression of the HBG2 gene, which was corresponding to the cell type called the F-cells.The HBG2 is one of two genes responsible for the synthesis of the γ chains present in fetal hemoglobin (HbF) [15].HbF is the main haemoglobin in human zygotes within the second and third trimesters of intra-uterine life, and its expression continues until 4-6 months after birth before its synthesis is down-regulated to about 1%-2%, when β-haemoglobin becomes the predominant haemoglobin [16].Higher levels of HbF associated with sickle cell disease is linked with milder crises [17], which has led to the many studies on trying to induce a sustained increase in HbF production.
Two clusters corresponding to the HSCs and the Erythroblasts expressed HBA1 and HBA2, while they differentially expressed (LYZ and S100A9) and HBB respectively.The HBA1 and HBA2 genes code for the alpha globin chains in hemoglobin [18], with a defect in the genes leading to alpha thalassemia.
Another cluster corresponding to the D-cells differentially expressed BLVRB and HBD, while another cluster corresponding to the MEPs differentially expressed the ITGA2B.The BLVRB gene encodes for the protein biliverdine reductase, which catalyses the final stage of heme metabolism.The HBD gene codes for the hemoglobin delta subunit of hemoglobin, and is normally expressed in adults, though in small amounts.A mutation in the HBD gene results in Delta-thalassemia [19].ITGA2B gene, also known as CD41, encodes integrin alpha chain 2b.Integrins are integral membrane proteins with alpha and beta chains.Alpha chain 2b is post-translationally cleaved into disulfide-linked light and heavy chains that connect with beta 3 to generate a platelet-expressed fibrinogen receptor that is essential to coagulation.Mutations disrupting this function cause thrombasthenia [20].
A cluster corresponding to the early reticulocytes showed the most number of differentially expressed genes with varying percentages of expression, including SLC25A37, HBM, SERF2, RPL41, RPL30 and YBX1.On the other hand the cluster corresponding to the late reticulocytes differentially expressed FKBP8.Differentiating erythroid cells express the SLC25A37 gene, which codes for the Mitoferrin-1 protein in humans [21].As a key component of mitochondrial iron homeostasis, this protein shuttles ferrous iron from the mitochondrial intermembrane space to the mitochondrial matrix, where it is used in the production of heme groups and Fe-S clusters, which are essential for erythropoiesis [22].Deficiencies in this protein have been linked to erythropoietic protoporphyria [23].

Role of the stress erythroid differentially expressed markers in stress erythropoiesis
Given the importance of stress erythropoiesis in the amelioration of the symptoms of sickle cell anemia, we tried to understand the functions of some of the markers that were differentially expressed in the stress erythroid cluster.Stress erythroid progenitors initiate the process of stress erythropoiesis, which rapidly increases the production of reticulocytes and subsequently erythrocytes to compensate for the lost erythrocytes due to hemolysis of the sickle RBCs.The stress erythroid cluster of the erythroid population differentially expressed a number of markers (Figure 3(B)), which were expressed by only a small number of cells in the cluster.The marker ALAS2 was expressed by only about 2% of the cells in this cluster.This gene encodes the mitochondria erythroid-specific 5'-aminolevulinate synthase (ALAS2) protein, also known as erythroid ALA synthase, which is an enzyme that catalyses the first step (a rate limiting step) in heme biosynthesis in erythroid cells [24].This gene is expressed in fetal liver and adult bone marrow, providing sufficient heme for hemoglobin biosynthesis.
Another marker that was expressed (at a p-value of 0.5) by a similar percentage of cells in this cluster was the HEMGN gene, which is homologous to human erythroid differentiation-associated gene (EDAG).An et al. [25] identified HEMGN as a transcriptional modulator of hematopoietic development.GATA1 and HOXB4 control HEMGN's own transcription [26].Although HEMGN is primarily expressed in hematopoietic cells that are actively developing blood cells, it is downregulated as blood cells differentiate.HEMGN recruits histone acetyltransferase p300 to acetylate GATA1 in order to favourably regulate erythroid differentiation of human CD34 + cells [27].HEMGN has recently been demonstrated to hasten human CD34 + cells' entry into the cell cycle, increase their proliferative ability, and increase their capacity for repopulation [28].Additionally, HEMGN partly replicates the role of HOXB4 in encouraging the expansion of mouse myeloid progenitor cells ex vivo [26].A number of stresses, including differentiation induction, proliferation stimulation, irradiation, and hypoxia exposure, interestingly cause HEMGN expression to increase [27,29,30].This suggests that HEMGN may regulate how well HSPCs function under stress conditions, leading to the stress-induced induction of stress erythropoiesis.
Another gene differentially expressed by a small percentage of the cells in the stress erythroids cluster is the GUK1 gene, which encodes the enzyme guanylate kinase in humans [31].Erythropoietin modulates this enzyme by stimulating its phosphorylation.This enzyme is necessary for stress erythropoiesis because it generates GDP for nucleotide synthesis, a metabolic route that is considered to be a limiting factor under these conditions [32].

Conclusion
The erythroid population showed heterogeneity, with a specific sub-cell type forming the stress erythroids.This could be significant for understanding steadystate and stress erythropoiesis, and the markers that were identified in this cluster could be helpful in the examination of the dynamics of stress erythroid progenitor cells differentiation during stress.Sickle cell disease is characterized by severe loss of erythrocytes which leads to induction of stress erythropoiesis in order to compensate for the loss.A study into the characterization of the stress erythroid population will give a more robust understanding of the treatment strategies for sickle cell disease.

Figure 1 .
Figure 1.Workflow showing the processes followed in each stage of the analysis of the scRNA seq datasets.

Figure 3 .
Figure 3. Heterogeneity observed in the erythroid cell population: (A) Heatmap of the integrated scRNAseq dataset showing the expression patterns of the top marker genes in the 9 cell types.In the plot, high values are in yellow while how values are in purple.(B) Gene signature plot showing the average expression of differentially expressed features in the different cell types.This is showing both the magnitude of expression of the marker genes and the proportion of cells expressing the gene.(C) t-SNE plots displaying transcription activities for the differentially expressed genes in the different clusters of the erythroid population.Each t-SNE plot shows a particular gene marker that is differentially expressed in a specific cell type.(D) Violin plot showing the DEGs of the stress erythroid population in the different clusters.