A 12-gene panel in estimating hormone-treatment responses of castration-resistant prostate cancer patients generated using a combined analysis of bulk and single-cell sequencing data

Abstract Background Castration-resistant prostate cancer (CRPC) represents one type of advanced prostate cancer (PCa) with a median survival time of 1–2 years. Currently, there is a lack of reliable gene panels in predicting hormone treatment (HT) responses due to limited knowledge of CRPC-specific tumor-microenvironment (TME) characteristics. Methods In this study, we first screened for up-regulated genes in CRPC samples using bulk-sequencing data retrieved from TCGA online database, and further investigated the expression status of these genes in four sets of downloaded single-cell RNA sequencing (scRNAseq) data: GSE117403 containing 16 normal human prostate samples; GSE141445 containing 13 PCa samples; GSE176031 containing 11 PCa samples and GSE137829 containing 6 CRPC samples. Results We identified a series of CRPC-specific TME characteristics including an enriched number of PEG10+ neuroendocrine cells, elevated expression of PPIB/CCDC74A/GAPDH/AR genes in tumor cells, increased expression of FAP/TGFB1 in cancer-associated fibroblasts (CAFs), suppressed immune environment featured by enhanced M2 macrophage polarization, T cell exhaustion and increased number of regulatory B cells. We further established a 12-gene panel using these characteristics and showed that this panel could separate CRPC samples from PCa samples (AUC of 0.78), and CRPC patients with higher panel scores tended to have treatment failure or progression (R = −0.47, p = 0.019). Conclusions Based on these unique TME characteristics of CRPC, we established a prediction tool for estimating the duration of HT responses in PCa treatment. Our results suggest mechanisms by which prostate cancer becomes castrate resistant. Further study of PEG10 (and/or others) to evaluate therapeutic efficacy should be considered.


Introduction
Prostate cancer (Pca) remains one of the most common malignant tumors that endanger men's health.in 2022, there are 268,490 new Pca patients in the United states (1 st ) which account for 27% of the newly diagnosed cancer cases in men [1].hormone therapy (ht) or androgen deprivation therapy (aDt) is the first line of treatment for Pca, however, after ht, around 10-20% of the cases would progress to a more aggressive stage: castration-resistant prostate cancer (cRPc) [2,3].treatment options for cRPc are largely limited, and the median survival time for men with metastatic cRPc is 1-2 years [4], hence it is necessary to establish a reliable prediction tool for cRPc progression while supporting the clinical decisions of switching therapy methods before delaying to cRPc stage.
Understanding the cRPc-specific tumormicroenvironment (tMe) features is the key to developing an accurate prediction tool.Recently, with the help of single-cell RNa sequencing (scRNaseq) technology, one can elucidate the specific tMe characteristics and intercellular communications in cRPc samples at a cellular level.Dong et al. [5] revealed a luminalneuroendocrine transdifferentiation pattern in 6 human cRPc samples; Wang et al. [6] discovered a set of new features related to the divergence of neuroendocrine prostate cancer (NePc, an advanced type of cRPc) from the original androgen-dependent Pca; cheng et al. [7] identified a small set of cRPc-like cells in hormone-naïve early Pca samples and proved the relevance of these cells with biochemical recurrence.however, none of the current cRPc-related scRNaseq studies have involved systematic comparison of the differences between normal prostate samples, Pca samples and cRPc samples, not to mention designing prediction algorithms based on these differences.
in this study, we re-investigated tMe features using 4 sets of downloaded scRNaseq data involving normal prostate samples, Pca samples, and cRPc samples, combined with bulk-sequencing data of cRPc samples from tcGa database and discovered a series of cRPc-specific features.Moreover, we established a 12-gene panel based on these features and showed that this panel had a good performance in predicting ht responses during Pca treatments.

Patient information
two patients diagnosed with primary Pca and one patient diagnosed with cRPc were involved in this study.their frozen tumor tissue sections were obtained from Pathology department of shenzhen People's hospital.this study was performed in accordance with the guidelines of the 1975 Declaration of helsinki.

Preparation of TCGA PRAD cohort
Gene expression data (counts and FPKM: fragments per kilobase of exon per million mapped fragments) and sample phenotype data of the prostate adenocarcinoma samples (tcGa-PRaD) was retrieved from Ucsc-Xena database (https://xenabrowser.net/).Normal samples were determined as 'solid tissue Normal' in 'sample_type.sample'term; tumor samples were determined as the rest in 'sample_type.sample';samples with recurrence were determined as tumor samples with 'Yes' in 'new_tumor_event_after_initial_treatment' term; samples without recurrence were determined as tumor samples with 'NO' in 'new_tumor_event_after_ initial_treatment' term; ht_Yes samples were determined as recurrent samples with 'hormone therapy' in 'therapy_type' (onset of castration resistance status after initial hormone treatment), and ht_NO samples were determined as non-recurrent samples with 'hormone therapy' in 'therapy_type' .

Identification of differentially expressed genes
to screen for differentially expressed genes, we applied Deseq2 (R package) to raw-count reads of ht_Yes and ht_NO samples.log2Fc of 1 and p.adjust value of 0.05 were used as significance cutoff.

scRNAseq analysis of four datasets
count matrix files were downloaded from GeO database (www.ncbi.nlm.nih.gov/geo/)under their accession numbers (Gse117403 containing expression profiles of 105,208 cells from 16 normal human prostate samples [8]; Gse141445 containing expression profiles of 31,402 cells from 13 Pca patients [9]; Gse176031 containing expression profiles of 17,144 cells from 11 Pca patients [10]; Gse137829 containing expression profiles of 20,174 cells from 6 cRPc patients5].For each dataset, a seuratObject was first constructed for each sample/patient using R package (Version 3.1.2)[11], followed by a doublet removal step using DoubletFinder R package [12] (10% double formation rate).samples from each dataset were integrated using 'Findintegrationanchors' and 'integrateData' .For the integrated object, a 'scaleData' step was first applied, followed by a 'RunPca' step, 'FindNeighbors' step, and a 'Findclusters' step.the clusters were visualized using UMaP (Uniform Manifold approximation and Projection) method.

Similarity analysis using Pearson correlation method
Data normalization among different datasets was performed using log2(cPM + 1), and mean expression values for all genes were calculated for each subgroup.Pearson correlation value between two subgroups was calculated using R cor.test() function and the correlation results between all subgroup combinations were visualized using R 'pheatmap' package.

TCGA and CPGEA module score calculation
lists of tcGa and cPGea module genes were retrieved from Wang et al. [6].For each involved gene, a zscore transformation was performed to log2(cPM + 1) normalized matrix, and module score for each cell was calculated as sums of zscore values of all involved genes.Module scores between cells from different subgroups were compared using a student t-test and a p.value of 0.05 was used as significance cutoff.

Identification of subgroup-featured genes
subgroup-featured genes were determined as genes with the highest expression values across all subgroups.Gene expression values were first normalized using log2(cPM + 1), and for a given gene a, if its expression values were significantly higher in cells from subgroup B (dataset c) compared to the values from other subgroups (significance was determined using R limma-trend package), and gene a was expressed in over 20% of the cells from subgroup B, gene a was considered as one of the B subgroup-featured genes.KeGG enrichment analysis of subgroup-featured genes was performed using 'cluster-Profiler' R package (p.adjust< 0.05), and top enriched KeGG terms were selected and displayed.

Patient information
the cPRc patient involved in the validation process of this study was enrolled in shenzhen People's hospital.this cRPc patient had a laparoscopic radical prostatectomy in the year 2018, followed by three rounds of endocrine therapy (Bicalutamide: 50 mg; Goserelin: 10.8 mg), and the progression happened in the year 2022.this is a luminal cRPc based on the diagnosis.

Immunostaining
tumor tissue sections from two Pca patients and one cRPc patient were obtained from the pathology department of shenzhen People's hospital.Prior to antibody incubation, all sections were washed in 3% h 2 O 2 at room temperature for 10 min.the following two primary antibodies were used in this study: anti-ePcaM (1:500, abcam, ab223582) and anti-PeG10 (1:500, proteintech, 14412-1-aP).after being rinsed with 1X PBs at room temperature for 3 times, the sections were first incubated with primary antibodies for 2h at room temperature, followed by 3 times of 1X PBs rinsing, 1h of incubation with goat anti-rabbit h&l cy5 secondary antibody (abcam ab6564), and 3 more times of 1X PBs rinsing.the following stains (green: alexa Fluor tM 488 tyramide, invitrogen B40922; red: alexa Fluor tM 555 tyramide, invitrogen B40923) were used in the final staining step.

Ligand-Receptor (L-R) intensity calculation
lR intensities between different subgroups in Gse137829 dataset were calculated as follows: 1) location information of all proteins was retrieved from thPa (the human Protein atlas, www.proteinatlas.org)website, in which secreted proteins were determined as 'secreted' in 'Predicted location' term, and membrane proteins were determined as 'Plasma membrane' in 'Predicted location' term; 2) Protein-Protein interaction information was retrieved from BioGRiD (www.thebiogrid.org)database [13]; 3) intensity between subgroup a (ligand) and subgroup B (Receptor) was added up by the intensity of each lR pair calculated by multiply of mean expression value of ligand (secreted protein from subgroup a) and mean expression value of receptor (membrane protein from subgroup B).P value was performed using method described in Ji et al. [14].a randomized process was performed to all cells for 1000 times, and the intensity was compared to these randomized intensities.the ranks of target intensities were used as statistical values.

Receiver operating characteristic (ROC) analysis
ROc analysis was performed using R package 'ROcR' and 'pROc' , and the aUc (area under curves) values were calculated using auc() function.

Identification of differentially expressed genes in TCGA CRPC samples
to screen for differentially expressed genes in cRPc patients, we first retrieved gene expression data of prostate adenocarcinoma samples from the cancer Genome atlas (tcGa-PRaD).two sets of samples were selected: PRaD samples with recurrence after hormone treatment (ht-Yes, 14 samples) and PRaD samples without recurrence after hormone treatment (ht-NO, 53 samples).Detailed sample information is listed in supplementary table 1. a further transcriptome-wide screening of differentially expressed genes was performed using Deseq2 [15], and the results are illustrated in Figure 1a. 10 up-regulated genes (log2Fc > 1 & P.adjust < 0.05) and 27 down-regulated genes (log2Fc < −1 & P.adjust < 0.05) are identified (listed in supplementary table 4), and their relative expression values across these ht-Yes/ht-NO samples are listed in Figure 1B.

Construction of single-cell transcriptome atlas
to investigate cPRc-specific tMe features from single-cell levels, we retrieved four scRNaseq datasets from GeO database: Gse117403 containing expression profiles of 105,208 cells from 16 normal human prostate samples [8]; Gse141445 containing expression profiles of 31,402 cells from 13 Pca patients [9]; Gse176031 containing expression profiles of 17,144 cells from 11 Pca patients [10]; Gse137829 containing expression profiles of 20,174 cells from 6 cRPc patients [5].
Due to the disturbance of batch effects across these four datasets (10x Genomics platform used in Gse117403/Gse141445/Gse137829; seq-Well platform used in Gse176031), we performed scRNaseq analysis separately to these four datasets instead of the integrating them together.to compare gene expression levels across these four datasets, a log2(cPM + 1) normalization was first performed to all involved cells, followed by a transcriptome-level comparison using limma-trend method [16,17].all the expression-level analysis was performed using this method.
We performed infercNV analysis to all epithelial cells from these four datasets using expression profiles of t cells (from Gse141445, Gse176031 and Gse137829 datasets), and the results are listed in supplementary Figure 1. compared to epithelial cells in normal prostate samples (Gse117403), more copy number variance features were found in two tumor samples (Gse141445, Gse176031) and cRPc samples (Gse137829).
similarity among subgroups from each dataset was evaluated using Pearson correlation analysis (detailed in Materials and Methods).the results are summarized in Figure 2D.Basal and luminal subgroups from Gse176031 dataset (tumor) are clustered in a separate branch, suggesting that cells from this dataset are more different than cells from other datasets.this might be possibly due to different platforms during library construction; luminal and cycling cells from Gse141445 dataset (tumor) are clustered together with luminal cells from Gse137829 dataset (cRPc), suggesting the intrinsic similarity among these subgroups.there is a huge increase in luminal cell ratio from Pca and cRPc samples compared to this from normal samples (Figure 2e), suggesting that luminal cells are the main cancerous cells in these tumor/cRPc samples.
to investigate the tumor-associated features of these subgroups, we applied two Pca-related gene panels [6] to these datasets and the results are shown in Figure 3a.tcGa module is derived from tcGa-PRaD dataset which includes 199 upregulated genes; cPGea module is derived from cPGea (chinese Prostate cancer Genome and epigenome atlas) database which includes 248 upregulated genes (listed in supplementary table 2).luminal subgroups have relatively higher tcGa/cPGea module scores compared to those in basal/neuroendocrine/others subgroups; tumor/cRPc subgroups have relatively higher module scores compared to those in normal subgroups.these results further support the robustness of the analysis process.
subgroup-featured genes (genes with the highest expression values among all the subgroups as calculated using limma-trend, detailed in Materials and Methods) are further summarized (illustrated in Figure 3B) and their KeGG enrichment results are listed in Figure 3c.interestingly, featured genes in neuroendocrine cells from cRPc samples are enriched in KeGG pathways including GnRh (Gonadotropin-releasing hormone) secretion and GnRh signaling pathways.Moreover, these enrichments are not seen in neuroendocrine cells from normal samples, indicating more GnRh production might be related to cRPc progression.

Generation of a 6-gene panel using CRPC-specific epithelial features
the expression status of the top 10 upregulated genes in ht-Yes samples is also examined among epithelial subgroups across all 4 datasets.Neuroendocrine subgroup from cRPc samples has the highest expression scores of these genes, followed by luminal subgroup from cRPc samples, as illustrated in Figure 4a.specifically, both PEG10 (Paternally expressed 10) and CCDC74A (coiled-coil domain containing 74 a) have higher expression values in neuroendocrine/luminal subgroups from cRPc samples compared to those from other subgroups (Figure 4B). to further verify the expression status of PEG10 in cRPc epithelial cells, we labeled Pca tissues (Patient 1 and Patient 2) and cRPc tissue (Patient 3) with anti-ePcaM (green) and anti-PeG10 (red) antibodies, and the immunostaining results are shown in Figure 4c.More PEG10-expressing epithelial cells (green + red) are examined in cRPc tissue compared to these in Pca tissues, which further supports our scRNaseq results.
secreted ligand is one of the most direct methods that target cells use to reshape their microenvironments.to investigate the communication status among cRPc epithelial subgroups, we calculated ligand-Receptor (l-R) intensity between different subgroups, and the results are summarized in Figure 5a. the lR intensity between neuroendocrine cells (secreted ligands) and luminal cells (Membrane Receptors) is the strongest among all subgroup combinations, suggesting the unusually frequent communications between these two subgroups.top-ranked l-R interactions are listed in Figure 5B.among them, PPIB (Peptidylprolyl isomerase B, secreted from neuroendocrine cells) -GAPDH (Glyceraldehyde-3-phosphate dehydrogenase, receptors from luminal cells) ranks 1 st among all the l-R interactions, indicating that this interaction might play an important role in cRPc progression.
as demonstrated previously, PEG10/CCDC74A/PPIB/ GAPDH have higher expression values in neuroendocrine and luminal subgroups from cRPc samples (Figure 5c).Moreover, we also find that AR (androgen receptor), a transcription factor that could be regulated by GAPDH [18], has the highest expression values in luminal cells from cRPc samples (Figure 5c).this PPIB-GAPDH-AR combination might be a key regulator in cRPc progression.together with EPCAM (epithelial marker), we established a 6-gene panel containing PEG10/CCDC74A/PPIB/GAPDH/AR/EPCAM, and further examined the performance of this panel in the tcGa cohort, as shown in Figure 5D.among tcGa PRaD samples, this 6-gene panel has the highest expression scores in ht samples with recurrence; specifically, the expression score is significantly higher in tumor samples compared to these in normal samples (pvalue: 3.54e-08); significantly higher in PRaD samples with recurrence compared to these in PRaD samples without recurrence (pvalue: 0.028); significantly higher in ht samples with recurrence compared to these in ht samples without recurrence (pvalue: 0.043).Regarding ROc analysis, this 6-gene panel has an auc (area under curve) score of 0.69 in separating tumor samples from normal samples, 0.56 in separating PRaD samples with recurrence from PRaD samples without recurrence, 0.74 in separating ht samples with recurrence from ht samples without recurrence (Figure 5e).specifically, we further explored the relationship between expression scores of this 6-gene panel and recurrence time.among tumor samples with recurrence, samples with shorter (time < = median) recurrence time have higher panel scores compared to samples with longer (time > median) recurrence time (short mean/median: 0.32/0.082;long mean/median: 0.12/-0.37;pvalue: 0.52); samples with higher panel scores (score > = median) have shorter recurrence time compared to these with lower (score < median) panel scores (higher mean/median: 754.57/582 days; lower mean/median: 966.71/766 days; pvalue: 0.19).Pearson correlation analysis between 6-gene panel scores and recurrence time reveals a weak (not significant) negative correlation (-0.24, p = 0.064), as shown in Figure 5F.among ht samples with recurrence, samples with shorter (time < = median) recurrence time have significantly higher panel scores compared to those with longer (time > median) recurrence time (short mean/ median: 0.90/0.83;long mean/median: −0.17/-0.53;pvalue: 0.015); samples with higher panel scores (score > = median) have shorter recurrence time compared to these with lower (score < median) panel scores (higher mean/median: 866.67/679 days; lower mean/median: 1314.08/1376days; pvalue: 0.14).Pearson correlation analysis between 6-gene panel scores and recurrence time reveals a significant negative correlation (-0.42, p = 0.038), as shown in Figure 5G.all these results suggest that this 6-gene panel has a good performance in this ht-tcGa-PRaD cohort.
t cells were clustered into regulatory t cells (marked by IL2RA/FOXP3/IKZF2/CD4), effector t cells (marked by GZMK/GZMA/GZMB), memory t cells (marked by CCR7/ TCF7/LEF1), and natural killer (NK) cells (marked by CD3D -/NKG7/GNLY) (Figure 7a).among t cells from Gse141445 dataset, there are 266 regulatory t cells (8.50%), 572 memory t cells (18.27%), 2,006 effector t cells (64.07%) and 287 NK cells (9.17%); among t cells from Gse176031 dataset, there are 387 regulatory t cells (13.77%) and 2,424 effector t cells from Gse176031 dataset; among t cells from Gse137829 dataset, there are 220 regulatory t cells (8.97%), 1,153 memory t cells (47.00%) and 1,080 effector t cells (44.03%), respectively, as shown in Figure 7B. the relative expression status of involved markers in defining each subgroup is shown in Figure 7c. in t cells from cRPc samples, we find relatively higher expression of immune checkpoint proteins including TIGIT, LAG3 and PDCD1 compared to those from Pca samples (Figure 7D). the expression of tGFB1 is also higher in t cells from cRPc samples compared to these from Pca samples.all these suggest a more severe tcell exhaustion process in cRPc samples, leading to further immune suppression of the tumor microenvironment.
B cells were clustered into naïve B cells (marked by CD79A/CD79B/MS4A1) and plasma B cells (marked by CD79A/DERL3) (Figure 7e).among B cells from Gse141445 dataset, all cells are defined as naïve B cells; among B cells from Gse176031, there are 109 naïve B cells (57.07%) and 82 plasma B cells (42.93%); among B cells from Gse137829, there are 283 naïve B cells (68.86%) and 128 plasma B cells (31.14%), respectively, as shown in Figure 7F. the relative expression status of involved markers in defining each subgroup is shown in Figure 7G.Regulatory B cells are one type of immune-suppressive B cells marked by high expression of TGFB1 and IL10. in cRPc samples, we find higher expression of TGFB1 in naïve B cells compared to these in Pca samples, suggesting more regulatory B cells in cRPc samples, which might also contribute to an immune-suppressive microenvironment.

Generation of an updated 12-gene panel using CRPC-specific TME features
We added 6 more genes identified from cRPc-specific tMe features (FAP/IL10/TGFB1/TIGIT/LAG3/PDCD1) to the original 6-gene panel and further explored the performance of this panel in tcGa cohort.among tcGa PRaD samples, this 12-gene panel has the highest expression scores in ht tumor samples with recurrence, which is consistent with the previous 6-gene panel; specifically, the expression score is significantly higher in tumor samples compared to these in normal samples (pvalue: 3.54e-08); significantly higher in Pca samples with recurrence compared to these in Pca samples without recurrence (pvalue: 7.1e-04); significantly higher in ht samples with recurrence compared to these in ht samples without recurrence (pvalue: 6.6e-03), as shown in Figure 7i.compared to the 6-gene panel, this 12-gene panel has slightly better discriminating ability in separating ht samples with recurrence from those without recurrence (auc: 0.78 vs. 0.74), as shown in Figure 7J.
Regarding correlations with recurrence time, compared to the 6-gene panel, there is a slightly better negative correlation between panel scores and recurrence times in Pca samples with recurrence (Figure 7K, R: −0.27 vs. −0.24)and ht samples with recurrence (Figure 7l, R: −0.47 vs. −0.42),suggesting a slightly better performance of this 12-gene panel in estimating the outcomes of ht Pca patients.

Discussion
in this study, through a combined analysis of bulk-sequencing data and 4 sets of scRNaseq data, we successfully identified a series of cRPc-specific features such as increased number of PEG10 + neuroendocrine cells, enhanced expression of PPiB/GaPDh/aR gene set in tumor cells, elevated expression of TGFB1 in tMes suppressed immune environments such as M2 macrophage polarization, t cell exhaustion, as well as increased number of regulatory B cells.Moreover, we established a 12-gene panel using these cRPc-specific features screened in this study, and found that this panel had a good performance in estimating the duration of ht responses in Pca patients (auc = 0.78).
the involvement of placental gene PEG10 in neuroendocrine prostate cancer (NePc, one type of highly aggressive variant of cRPc) have been reported in many studies.Kim et al. [20] showed that PEG10 is associated with neuroendocrine differentiation after aR axis-directed therapy, and akamatsu et al. [21] reported that PEG10 could promote the progression of NePc through stimulating cell-cycle progression via RB1 and TP53 loss and regulating snail expression via tGF-beta signaling.in this study, we found that PEG10 is up-regulated in both ht Pca samples with recurrence and neuroendocrine cells from cRPc samples, suggesting that PeG10 is a promising target in cRPc exploration.
aR-signaling axis is crucial in all stages of prostate cancer, and the involvement of AR in cRPc progression has been observed for a long time.Visakorpi et al. [22] observed that over 30% of cRPc patients have aR gene amplification; chen et al. [23] reported that increased expression of aR gene was sufficient to accelerate the cRPc process.Generally, it is considered that aR is negatively correlated with neuroendocrine differentiation in prostate cancer [24][25][26], however, in recent years, several studies have found co-expression of aR and Ne markers in prostate cancer patients [25,27,28], and su et al. [29] have found high expression of both aR and Ne markers in a small subset of prostate cancer patients, especially among cRPc cases, suggesting aR might still exert its androgen response and anti-apoptotic effects in these patients which might be related to cRPc progression.Moreover, we also find that PPIB-GAPDH pair ranked 1 st among all communication combinations between cells from neuroendocrine subgroup and luminal subgroup in cRPc samples.GAPDH could enhance the transcriptional activity of AR in prostate cancer cells [18], and PPIB could regulate cell-cycle related genes including CCDC74A [30] which further promotes cell proliferation [31].taken together, this PPIB/GAPDH/AR/CCDC74A gene set might play an important role in regulating cRPc progression.
During tumor progression, various components from tMes co-contribute to this process.caF cells represent one of the most abundant components and play important roles in tumorigenesis, progression, metastasis and recurrence in prostate cancer [19].caF cells could also contribute to the development of cRPc, as reported in eder et al. [32] and Kato et al. [33]. in our study, we found higher expression of FaP (caF marker gene) in both fibroblast cells and caF cells in cRPc samples compared to these in other samples, which is consistent with previous report [34].Moreover, we also found higher expression of tGFB1 in caF cells from cRPc samples compared to this in other samples.tGF-beta plays opposite roles in tumorigenesis in early stage of prostate cancer (inhibitor) and advanced stage of prostate cancer (promoter) [35], and an increased secretion of tGF-beta secretion in caF cells might facilitate the cRPc development process.
Regarding immune features, we discovered a series of immunosuppressed phenotypes in cRPc samples such as enhanced M2 macrophage polarization process, enhanced tcell exhaustion process, and increased number of regulatory B cells, suggesting that repressed immune system might facilitate cRPc progression.it is noteworthy that none of the immune-related cells are identified in Gse117403, which is consistent with its original publication [8].this might be due to the filtering procedure during their single-cell library construction, and hence in this study we did not involve immune cells from normal samples, which might affect the comprehensiveness of our conclusions.cRPc might arise from a small subset of cancerous cells [7], and due to the relatively small number of these cells, traditional bulk-sequencing results might not capture the minor expression changes of certain genes, which might also affect the performance of our algorithm, as most of the involved genes were identified through tcGa datasets.Moreover, compared to 6-gene panel, the 12-gene panel has only slightly better performance in separating ht samples with recurrence from those without recurrence, which might also be due to the limitation of bulk sequencing datasets.More datasets, especially scRNaseq datasets should be included to generate a more accurate algorithm.lastly, the limited cRPc patient number is also a limiting factor of this study, and in the future, more patients should be involved in the validation process.

Figure 1 .
Figure 1.combined analysis of bulk-sequencing data and scRnAseq data.(A) Volcano plot representing the identification of differentially expressed genes in HT-Yes (hormone-treated Pca patients with recurrence) samples vs. HT-no (hormone-treated Pca patients without recurrence).Up-regulated genes (p.adj < 0.05, log2fc > 1) are labeled in orange and down-regulated genes (p.adj < 0.05, log2fc < -1) are labeled in blue.(B) expression status of up-regulated and down-regulated genes across HT-Yes and HT-no samples.(c) Uniform Manifold Approximation and Projection (UMAP) plot showing the distribution patterns of subgroups in each dataset.(d) Bar plot showing the number of cells in each subgroup from each dataset.(e) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each subgroup.

Figure 2 .
Figure 2. epithelial component analysis across four datasets.(A) UMAP plot showing the distribution patterns of subgroups in each dataset.(B) Bar plot showing the number of cells in each subgroup from each dataset.(c) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each subgroup.(d) Heatmap representing the similarity of expression patterns among different subgroups.The expression pattern is defined as the mean expression values of all genes within one subgroup, and a Pearson correlation analysis is performed to compare the similarity.(e) Bar plot representing the relative ratio of each epithelial subgroup across four datasets.

Figure 3 .
Figure 3. Heterogeneity of cancerous cells across all 4 datasets.(A) expression status of TcGA module and cPGeA module across all subgroups.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(B) expression of subgroup-featured genes across different subgroups.(c) list of top enriched KeGG terms of featured genes from each subgroup.

Figure 5 .
Figure 5. Generation of a 6-gene panel using cRPc-specific epithelial features.(A) cartoon plot representing the communication intensity between different epithelial subgroups in cRPc samples.(B) list of top 20 lR pairs between different epithelial subgroups in cRPc samples.(c) expression status of PEG10, CCDC74A, PPIB, GAPDH, AR, and EPCAM across different subgroups.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(d) expression scores of 6-gene panel across different samples in TcGA PRAd cohort.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(e) Roc plots representing the performance of the 6-gene panel in separating normal samples from tumor samples, recurrent tumor samples from non-recurrent tumor samples and HT recurrent samples from HT non-recurrent samples.(f) for recurrent tumor samples, left plot: samples with shorter recurrence time (time < = median) have higher (t.test not significant) 6-gene panel scores than samples with longer recurrence time (time > median); middle plot: samples with higher 6-gene panel scores (score > = median) have shorter (t.test not significant) recurrence time than samples with lower 6-gene panel scores (score < median); right plot: pearson correlation between recurrence time and 6-gene panel scores in recurrent tumor samples.(G) for recurrent HT samples, left plot: samples with shorter recurrence time (time < = median) have higher (t.test significant, *, p < 0.05; **, p < 0.01; ***, p < 0.001) 6-gene panel scores than samples with longer recurrence time (time > median); middle plot: samples with higher 6-gene panel scores (score > = median) have shorter (t.test not significant) recurrence time than samples with lower 6-gene panel scores (score < median); right plot: pearson correlation between recurrence time and 6-gene panel scores in recurrent tumor samples.

Figure 6 .
Figure 6.stroma and myeloid analysis across four datasets.(A) UMAP plot showing the distribution patterns of stroma subgroups in each dataset.(B) Bar plot showing the number of cells in each stroma subgroup from each dataset.(c) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each stroma subgroup.(d) Gene expression levels of FAP and TGFB1 across stroma subgroups.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(e) UMAP plot showing the distribution patterns of myeloid subgroups in each dataset.(f) Bar plot showing the number of cells in each myeloid subgroup from each dataset.(G) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each myeloid subgroup.(H) Gene expression levels of TGFB1, IL10, TNF, and IL1B across myeloid subgroups.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).

Figure 7 .
Figure 7. Tcell/Bcell analysis across four datasets and generation of a 12-gene panel.(A) UMAP plot showing the distribution patterns of Tcell subgroups in each dataset.(B) Bar plot showing the number of cells in each Tcell subgroup from each dataset.(c) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each Tcell subgroup.(d) Gene expression levels of TIGIT, LAG3, PDCD1, and TGFB1 across stroma subgroups.A student t.test is performed to each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(e) UMAP plot showing the distribution patterns of Bcell subgroups in each dataset.(f) Bar plot showing the number of cells in each Bcell subgroup from each dataset.(G) dot plot representing the relative expression level (color) and the ratio of gene-expressing cells (size) of all marker genes used in the definition of each Bcell subgroup.(H) Gene expression levels of TGFB1 across Bcell subgroups.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(i) expression scores of 12-gene panel across different samples in TcGA PRAd cohort.A student t.test is performed for each comparison (*, p < 0.05; **, p < 0.01; ***, p < 0.001).(J) Roc plots representing the performance of the 12-gene panel in separating normal samples from tumor samples, recurrent tumor samples from non-recurrent tumor samples and HT recurrent samples from HT non-recurrent samples.(K) for recurrent tumor samples, left plot: samples with shorter recurrence time (time < = median) have higher (t.test not significant) 121-gene panel scores than samples with longer recurrence time (time > median); middle plot: samples with higher 12-gene panel scores (score > = median) have shorter (t.test not significant) recurrence time than samples with lower 12-gene panel scores (score < median); right plot: pearson correlation between recurrence time and 6-gene panel scores in recurrent tumor samples.(l) for recurrent HT samples, left plot: samples with shorter recurrence time (time < = median) have higher (t.test significant, *, p < 0.05; **, p < 0.01; ***, p < 0.001) 12-gene panel scores than samples with longer recurrence time (time > median); middle plot: samples with higher 12-gene panel scores (score > = median) have shorter (t.test not significant) recurrence time than samples with lower 12-gene panel scores (score < median); right plot: pearson correlation between recurrence time and 12-gene panel scores in recurrent tumor samples.