HIV-1 provirus transcription and translation in macrophages differs from pre-integrated cDNA complexes and requires E2F transcriptional programs

ABSTRACT HIV-1 cDNA pre-integration complexes persist for weeks in macrophages and remain transcriptionally active. While previous work has focused on the transcription of HIV-1 genes; our understanding of the cellular milieu that accompanies viral production is incomplete. We have used an in vitro system to model HIV-1 infection of macrophages, and single-cell RNA sequencing (scRNA-seq) to compare the transcriptomes of uninfected cells, cells harboring pre-integration complexes (PIC), and those containing integrated provirus and making late HIV proteins. scRNA-seq can distinguish between provirus and PIC cells because their background transcriptomes vary dramatically. PIC cell transcriptomes are characterized by NFkB and AP-1 promoted transcription, while transcriptomes of cells transcribing from provirus are characterized by E2F family transcription products. We also find that the transcriptomes of PIC cells and Bystander cells (defined as cells not producing any HIV transcript and thus presumably not infected) are indistinguishable except for the presence of HIV-1 transcripts. Furthermore, the presence of pathogen alters the transcriptome of the uninfected Bystander cells, so that they are distinguishable from true control cells (cells not exposed to any pathogen). Therefore, a single cell comparison of transcriptomes from provirus and PIC cells provides a new understanding of the transcriptional changes that accompany HIV-1 integration.


Introduction
The major barrier to curing HIV-1 in patients is a small reservoir of cells that are latently infected and impervious to immune recognition and clearance [1][2][3]. The study of HIV-1 latency is complicated by the fact that latently infected cells in vivo are extremely rare. It is a drawback that many studies of latency have relied on bulk sequencing endpoints. Under these conditions, the specific parameters defining the latently infected cell are diluted in the context of a vast heterogeneous population. In addition, multiple mechanisms can result in latency reversal and therefore one latently infected cell may differ from the next [4]. Thus, averaging data from a heterogeneous cell population, such as data obtained by bulk sequencing studies, leads to confusion rather than clarification.
Following infection and reverse transcription, the pre-integration cDNA complex (PIC), in which the HIV-1 genome may take linear or circular forms, serves as a template for transcription [5][6][7]. In dividing T cells, the PIC is short-lived, and the transient transcription of genes from the PIC is considered irrelevant [6]. However, in quiescent T cells, the PIC is longer-lived and even results in sufficient transcription for virion production in response to latency reversal agents [8]. HIV-1 latency in macrophages is multi-factorial and the dynamics of PIC integration are different [9][10][11][12]. Macrophages are known for prolonged viral production while exhibiting resistance to host cell lethality, they reside in multiple tissue reservoirs where viral production appears to be influenced by the microenvironments. They also exhibit T cell-like postintegration latency, showing sensitivity to the chromatin environment, the absence of transcriptional activation, the presence of transcriptional repressors, and host antiviral processes [9][10][11][12]. Preintegration latency is the process that we address here. Macrophages have been shown to harbor transcriptionally active pre-integrated HIV-1 cDNA for months; however, this PIC mRNA is not thought to result in the production of virus [10][11][12].
The study of macrophage latency in vivo has proven difficult due to the very scarce number of cells. We have used a reporter HIV-1 infection of PMA activated THP-1 cells to model infection of macrophages. This model was selected to provide consistent induction responses in repeated experiments to provide statistical certainty for our conclusions (see Results, below).
We used single-cell RNA sequencing (scRNA-seq) techniques [13,14] to show that in activated THP-1 cells, many of the cells in a culture infected with our HIV-1 construct are producing HIV transcripts, while only a minority are producing Gag/p24 and other HIV-1 proteins. Levels of HIV-1 transcripts do not correlate with virus production, since many of the cells harboring PIC complexes have transcript loads comparable to cells transcribing from integrated provirus. This means that a high load of viral transcript is not a sufficient switch to reverse latency. Overall, within the limitations of 10X Genomics technology [15], transcripts for gagpol, tat, env, and nef are found in higher numbers of "PIC cells", compared to "Provirus cells" (those transcribing from integrated HIV-1 cDNA and producing mCherry, gag and other proteins). However, the loads of these transcripts detected in PIC and Provirus cells are similar. Transcripts for gag, vif, vpr, rev, vpu, and the marker gene mCherry are detectable in relatively equivalent percentages of cells transcribing from PIC or provirus, and at similar levels per cell. In no case did Provirus cells appear to be producing any of the detected HIV-1 transcripts at a higher prevalence than PIC cells.
Quite notably, the background transcriptomes of cells harboring HIV-1 PIC are not detectably altered by PIC transcription. Unsupervised clustering shows cells containing PIC transcripts to be distributed equally throughout the PIC/Bystander cell cluster. "Bystander" cells are defined as those cells not containing any detectable HIV-1 transcript. Thus, cells harboring HIV-1 PIC appear "oblivious" to the presence of HIV-1 gene transcription at the transcriptome level, even though some have been reported in the literature to be producing detectable levels of Nef, other HIV-1 proteins, and chemokines [9,10,[15][16][17][18][19].
Transcription Factor Targeting analysis [20][21][22] shows that NFkB and AP-1 transcription products are predominant in the transcriptomes of PIC and Bystander cells. In contrast, Provirus cluster cells, on average, contain higher total amounts of viral transcripts, and their transcriptomes predominantly exhibit E2F promoter family transcription products. These cells make detectable amounts of p24, mCherry, and Vpu proteins. This seems counter-intuitive because NFkB and AP-1 are transcription sites in the HIV-1 LTR that promote transcription from the provirus [23,24]. In addition, E2F has been reported to suppress HIV-1 transcription [25][26][27][28]. Nevertheless, in our model it is clear that when cells are making late HIV-1 gene proteins, the transcriptomes exhibit activation of E2F regulated transcripts, while NFkB and AP-1 regulated genes are relatively suppressed. Western blotting data agrees with the transcription factor analysis in that Rb phosphorylation is increased in the Provirus cluster cells, which correlates with increased E2F activation.

DHIV3 infection of activated THP-1 macrophages yields cells with transcriptomes containing DHIV3 mRNA but otherwise identical to uninfected cells
We used a VSV-G-pseudotyped DHIV3 virus that expressed mCherry to promote consistent levels of viral entry in PMA-activated THP-1 cells and to allow for easier interpretation of post-cell entry events (Fig. S-1) [29,30]. DHIV3 is replication-deficient due to the replacement of part of the nef gene with the reporter gene. Following infection, flow cytometry data clearly revealed two clusters of THP-1 cells, mCherry positive cells (usually from 2 to 20% of the total population, depending on DHIV3 titer) and mCherry negative cells (Figure 1(a,b)).
Quantifying DHIV3 infection by scRNA-seq closely agreed with the flow cytometry data in that two clusters of cells were clearly identified, based on an individual cell's background transcriptome, in percentages that agreed with flow cytometry of replicate cultures (Figure 1(c)). Data analysis is described in detail in the Methods. Briefly, raw fastq files were generated, aligned to a custom reference genome (GRCh38 augmented with mCherry and HIV genes) and per cell gene counts generated with the 10X Cell Ranger pipeline. Following basic QC and filtering as suggested by Seurat, we generated both UMAP and t-SNE clustering projections [15]. Library construction targeted 5,000 cells and routinely yielded greater than 4,000 cells following Seurat analysis and quality control. Library construction and sequencing experiments were performed with technical repeats. The technical repeats were found to be statistically identical and were therefore combined ( Fig. S-2).
Following Seurat analysis, the number of mean reads per cell was approximately 35,000, with a median of more than 2,500 genes detected per transcriptome, and greater than 15,000 different gene transcripts detected in the overall library. Control cultures yielded slightly larger numbers of cells captured, with greater than 6,000 cells each and with more than 3,000 genes detected per cell. All experiments were conducted with parallel cultures that were analyzed by flow cytometry for mCherry production. In the HIVreplicate1 experiment, flow cytometry indicated 8.6% mCherry positive cells (Figure 1(a)).  Absicca, mCherry (Texas Red) emission. Ordinate, GFP (FITC) emission. mCherry positive cells equal approximately 8.5% of the total viable cell population. C) UMAP projection of a duplicate culture (HIVreplicate1) is shown in panel C. Greater than 14,000 different cellular genes were detected in this analysis, including the 9 viral genes and mCherry message originating from DHIV3-mCherry. A semi-supervised two cluster model was adopted, the smaller "Provirus cluster" (cluster A) was 8.1% of the total cell population, approximately equivalent to the percentage of mCherry positive cells from panel B. The two semi-supervised clusters are circled in red. PIC/Bystander cluster is indicated as cluster B. The HIV activity scale presents the Seurat module score that is described in methods. Input data in this analysis included 33,819 PCA entries. Bar codes of the same cells tracked to the Provirus clusters, regardless of whether the clusters were generated using UMAP or Seurat-tSne tools (Fig. S-3). by transcriptomes generally containing higher minimum levels of HIV-1 transcripts. This "Provirus" cluster accounted for approximately 8.1% of the total cell population, and closely matched the percentage of cells identified as mCherry-positive by flow cytometry in the duplicate culture, shown above. In the second "PIC/ Bystander" semi-supervised cluster of mCherry negative cells, we found that many of the macrophages (more than 50%) were producing HIV-1 transcript (Fig. S-3 plots cells with any detected HIV-1 transcript). We now understand that these cells are "PIC" cells, defined as cells containing HIV-1 mRNA transcribed from pre-integrated cDNA complexes. Some PIC cell transcriptomes appeared to contain HIV-1 transcript loads as high as many cells detected in the Provirus cluster, but most expressed lower amounts (Figure 1(b); S-3). Remarkably, these PIC cells had no other statistically significant changes to their transcriptomes that would differentiate them from the truly uninfected "Bystander" cells (again, defined as cells in the PIC/ Bystander cluster not containing detectable HIV-1 mRNA). Thus, PIC and Bystander cells made up the "PIC/Bystander" cluster.
We used a Feature map to plot the influence of the cell cycle, number of genes detected or mitochondrial transcript number on the distribution of cells containing HIV-1 transcripts (Fig S-4 B-D) [31,32]. None of these factors had any significant influence on the distribution of PIC cells throughout the PIC/Bystander cluster. We then used unsupervised clustering and violin plots of HIV transcript load to examine the distribution of cells containing HIV-1 transcripts throughout the Provirus and PIC/Bystander clusters (Figure (2)).
Using K-nearest neighbors clustering, at K10, 2 clusters of cells (clusters 6 and 8) were identified that accounted for most of the Provirus cluster cells (372 of the 380 provirus cells determined by semi-supervised clustering). Therefore, the combined transcriptomes of cells in clusters 6 and 8 were compared to the combined transcriptomes of cells in clusters 1-5, 7, and 9-10, and used to generate the DGE analyses shown below. Unsupervised clustering using a range of designated K values from 10 to 130 is shown in Figure S-5.
Clusters 6 and 8 (Figure 2(b)) were characterized as containing fewer cells with a low HIV transcript load. Thus, cells in what we define as the Provirus cluster differ significantly from PIC cells in terms of average minimal DHIV3 transcript load. The key to their clustering, as defined in our semi-supervised model, was the fact that they vary significantly in background transcriptome. The multiple clusters generated in the unsupervised clustering are likely stochastic. We conclude this because, they were not influenced by the presence or absence of PIC cells (Figure 2(b)). PIC cells are distributed throughout all the PIC/Bystander clusters, regardless of the specified K value. So, for our purposes the two cluster, semi-supervised, model shown in Figure 1(b) was considered to accurately represent the interpretation gained from flow cytometry (Figure 1(a)), namely: either the cells were making mCherry protein, or they were not.
Inferring from our flow cytometry percentages, we hypothesized that only cells in the Provirus cluster were making fluorescing mCherry protein. To confirm that cells making mCherry were also making late viral proteins, the production of p24 capsid protein was Violin plot of HIV-1 transctipts/cell in the 10 clusters identified at K10 (Scran's buildSSNGraph using the PCA as input). PIC cells with detectable HIV-1 transcripts, were distributed throughout clusters 1-5, 7 and 9-10. Clusters 6 and 8 contained 372 of the 381 cells included in the semi-supervised Provirus cluster (circled in red). Stipulation of lower K values means that during analysis any one given cell is clustered with a smaller number of cells with similar transcriptomes. quantified using flow cytometry and monoclonal mouse IgG-AG3.0 anti-Gag p24 antibody. Cells were fixed in formaldehyde, and secondary anti-mouse Alexa Fluor antibody allowed detection during flow cytometry. When cells were analyzed using FACS Canto, only cells making mCherry were found to stain for p24 ( Fig.  S-6). This suggests that the presumed PIC cells are not making late viral proteins in detectable amounts, while only Provirus cells appear to be making late viral proteins.

The DHIV3 mRNA in "PIC" cells is due to PIC transcription
Integrase inhibitors have proven to be useful tools in the study of PIC transcription [8,16]. We used 25 nM MK-2048 [33,34] for our experiments to prove the presumed "Provirus" versus "PIC" status of our cell clusters. MK-2048 is a second-generation integrase inhibitor active against several integrase mutations [33]. When integrase inhibitor-treated cultures were analyzed by flow cytometry, mCherry producing cells were routinely reduced from an average of about 12% to less than 2% of the total ( Figure 3). The use of the viability stain in this experiment assured that this was not due to the death of the infected and drug-treated cells. We then analyzed protein preparations from parallel cultures of these cells by Western blotting. MK-2048 treatment achieved >80% inhibition of mCherry production in these experiments. For the Western analysis, we initially purified populations of viable mCherry positive and mCherry negative cells using FACS analysis. However, because of the laborious difficulty in obtaining sufficient quantities of protein from the sorted cells, we resorted to comparing proteins of whole cell cultures obtained either in the presence or in the absence of integrase inhibitor. This latter approach yielded results similar to those obtained with proteins from sorted cells (Figure 4(a,b)). Thus, even for the DHIV3-infected whole culture preparations, in the absence of integrase inhibitor, where the majority of protein (>80%) came from the mCherry negative cells, m-Cherry protein was still readily detectable. All protein preparations were compared to an equivalent amount of protein from Control cells. Control cell protein preparations were from PMA activated THP-1 cells not exposed to DHIV3.
As anticipated, we found protein preparations from DHIV3-mCherry cultures, in the absence of integrase inhibitor, to contain high levels of mCherry protein ( Figure 4(b)). In addition, these same protein preparations contained p24 and p24 precursor proteins (Figure 4 (c)). The p24 antibody we used is known to bind both p24 and Gag precursor proteins [10,3536]. As expected, p24 protein was also detectable in protein from the MK-2048 treated culture. In the literature, this observation has been attributed to residual p24 protein lingering from the initial infection [10]. We obtained support for this interpretation by taking a 48 hour time point (Figure 4(d)). The extended time point showed p24 precursor proteins only appearing in the integrase inhibitor cultures 48 hours after treatment. In contrast, p24 and Gag precursor proteins were increasingly detectable in cell proteins from 24-and 48-hour integrase competent infections. Thus, p24 detected in DHIV3 infected cells was likely due to residual p24 from the original infection. We found both p24 and precursor p24 proteins only in the protein samples from cells without integrase inhibitor.
We tried antibodies for all the other major HIV-1 proteins, although not exhaustively, to test the correlation of protein expression with the detection of the transcript. One antibody that yielded a clear result was the polyclonal antibody against Vpu. In this case, we see a result very similar to that obtained with mCherry and p24, in that protein was clearly detected only in samples from infected cell cultures not treated with integrase inhibitor (Figure 4(e)). Vpu was not detectable in protein from the integrase inhibitortreated cells. It is an interesting side note that Vpu has been associated with inhibition of NFkB promoted transcription [36,37].
Final confirmation of the PIC versus Provirus status of our semi-supervised cluster cell populations was  Figure 3. B) Lane 2, mCherry protein was readily detectable in protein from cultures containing Provirus cells, with anti-mCherry antibody used in A. Lane 3, a small amount of mCherry signal was detected in MK-2048 treated cultures. C) Lane 2, p24 and Gag precursor proteins visualized with p24 antibody used above in Fig. S-6, and HRP linked anti-mouse secondary antibody. The p24 band in lane 3 is residual from infection as reported in the literature [10]. The presence of precursor proteins in lane 2 shows p24 synthesis in cultures containing Provirus cells. D) Lanes 1 and 2, Control cell protein at 24 and 48 hrs respectively; lanes 3 and 4, protein from DHIV3 infected culture at 24 and 48 hrs respectively; lanes 5 and 6, protein from DHIV3 infected cultures treated with integrase inhibitor (as above) at 24 and 48 hrs respectively. At 24 hrs post infection, we only found both p24 and precursor Gag proteins in the protein samples from DHIV3 infected cells in the absence of integrase inhibitor. At 48 hrs post-infection, in the absence of integrase inhibitor, the amounts of detectable p24 and Gag proteins were dramatically increased from levels at 24 hrs post infection. As seen initially (Panel C), some p24 protein was detectable in integrase inhibitor-treated cultures at 24 hrs post infection, however, Gag is not detectable at this time. At 48 hrs post infection in the integrase inhibitor treated cultures, some Gag protein does become detectable, reflecting production in cells that escaped complete integrase inhibition. This is in agreement with our flow cytometry analysis that showed suppressed, but still detectable numbers of mCherry positive cells in the integrase inhibitor treated cultures ( Figure 3). The Gag precursor proteins only appear in the integrase inhibitor treated culture proteins 48 hrs after treatment. All antibodies, sources and dilutions are provided in Methods. E) Lane 2, Vpu detected with rabbit antibody, visualized using HRP linked anti-rabbit secondary. The resolution of the image is slightly compromised due to the small size of Vpu protein.
obtained using real-time polymerase chain reaction (real-time PCR) analysis [38,39]. DNA from the respective cell treatment groups described above (Control, DHIV3 infected, DHIV3 infected with integrase inhibitor) was isolated and analyzed by PCR using multiple sets of primers. For detection of integrated proviral DNA, a set of primers [38] was used to amplify from random nascent human genomic Alu sequences to an internal HIV LTR sequence. This initial amplification step was followed by a second PCR amplification step using nested primers, which would only amplify discrete DNA products that contain integrated provirus [38]. Integrated provirus was detected in much higher abundance from DNA of DHIV3infected cells in the absence of integrase inhibitor (Fig. S-7 A, B, C). Very small amounts of integrated provirus DNA were detectable in MK2048 treated DHIV3-mCherry infected cell DNA, when using higher amounts (200 ng) of starting DNA. This is in close agreement with Flow cytometry results ( Figure 3) and Western analysis ( Figure 4) indicating small amounts of proviral DHIV3-mCherry DNA in our integrase inhibitor-treated cultures.
To demonstrate unequivocally that our PIC cluster cells do indeed contain PIC HIV cDNA, we used the primers of Brussels and Sonigo [39], which are internal to the HIV-1 LTR sequence. These primers were oriented in a way so as to detect only circular 2-LTR PIC HIV-1 DNA by bridging the 2-LTR junction [38,39]. We found that we required 2 rounds of PCR amplification to obtain the predicted 2-LTR amplicon, suggesting that this particular PIC species is in low abundance in our model cells ( Fig. S-7D, E, F). We then tested this conclusion using bracketing primers (see Methods) to generate a product to contain the predicted amplicon of Brussels and Sonigo [37], and then followed with a round of amplification using the previous primers to generate a nested 2-LTR junction product. This approach also generated equal amounts of the predicted amplicon product from DHIV3-mCherry infected culture DNA, whether in the presence of integrase inhibitor or not. Thus, the circular 2-LTR PIC DNA was detectable in PIC and Provirus cells. The 2-LTR PIC cDNA PCR product was not detected in Control cell DNA.
To confirm that total PIC cDNA is in relatively high abundance in our DHIV3-mCherry infected cultures, we adapted the previous primers to amplify total DHIV3 LTR DNA. With these primers, similarly high levels of total PIC cDNA was detectable in DHIV3 infected cells, whether in the presence of integrase inhibitor or not (Fig S-7 G, H, I). This confirmed that both PIC and Provirus cells contain PIC cDNA.
In summary, the Provirus and PIC status assigned to our cell clusters were confirmed by real-time PCR. The Western blot analysis confirmed our flow cytometry observations that mCherry producing cells were also producing p24. Both approaches show that the mCherry cells are selectively suppressed by integrase inhibitor treatment and lead to the conclusion that only Provirus cluster cells are making mCherry, p24 or Vpu proteins from the DHIV3 transcripts. Conversely, transcripts detected in PIC cells, due to PIC transcription, do not lead to detectable mCherry, p24, or Vpu synthesis.

scRNA-seq analysis of integrase inhibitor-treated DHIV3 infected cultures shows suppression of the Provirus cluster
DHIV3 infected THP-1 cells were treated with 25 nM MK-2048 at the time of infection, using our established protocol, and analyzed by scRNA-seq. In this experiment, the integrase inhibitor blocked ~87.5% of mCherry production by flow cytometry analysis of a parallel culture ( Figure 3). The UMAP image of DHIV3 transcript features is shown in Figure 5(a). Transcriptome clustering using varying nearest neighbor K-values (K10, K50, K90, and K130) yielded 3 to 7 clusters, depending on the K-value applied ( Fig. S-8). Regardless of the specified K-value, we did not detect a Provirus cluster in MK-2048 cultures, as was observed with integrase competent infections ( Figure 5(c)). Again, the distribution of HIV-1 transcript containing PIC cells throughout the semi-supervised integrase inhibitor cell cluster was not affected by cell cycle, percent mitochondrial gene expression or numbers of genes detected per cell ( Fig. S-9). Thus, the scRNA seq result confirms that integrase inhibitor-treatment selectively suppresses cells in the Provirus cluster, agreeing with results obtained by Western blot and PCR analysis, and confirms that the DHIV3 mRNA detected in the PIC cluster cells is due to transcription of PIC complexes.

Hallmark and REACTOME analyses indicate mitosis-associated pathways are upregulated in Provirus cluster cells whereas viral restriction and interferon-associated pathways are upregulated in PIC cluster cells
Differential gene expression (DGE) in Provirus versus PIC/Bystander cluster transcriptomes was analyzed using GSEA with Hallmark or REACTOME gene lists (Appendix I), the pairwise T-Tests function from Scran was used to determine the statistically significant differentially expressed genes between groups. Table 1 presents the statistically significant Hallmark results. By comparing the DGE between the Provirus and PIC/Bystander cell clusters using Hallmark and REACTOME tools, it was clear that Provirus cluster cells were differentiated from PIC/ Bystander cluster cells in several key ways. In general, the transcriptomes of cells in the Provirus cluster were characterized by gene transcripts associated with cell replication, whereas the transcriptomes of cells in the PIC/Bystander cluster were characterized by pathways associated with immune-response and interferon signaling. In the Provirus cells, the analyses clearly showed upregulation of cell replication, oxidative phosphorylation, protein synthesis and E2F family targeted pathways. On the other hand, NFkB, AP1, interferon responsive, and immune response pathways are relatively upregulated in the PIC/ Bystander cluster cell transcriptomes. Intuitively, this makes sense, but it runs contrary to established literature that associates E2F transcription factors with decreased viral production [24,25]. This finding would not be obvious without the use of single-cell analysis. In the absence of single-cell analysis, the Provirus cluster's differentially expressed gene transcripts would have been swamped out by the 90% of mRNA obtained from PIC and Bystander cells. This led us to hypothesize that the cell's transcriptome background reflects the regulation of virus protein production to some extent, and that E2F transcription contributes to the favorable environment.
Using the UMAP Feature plots shown in Figure  S-10, we visualized the distribution of cells containing some of the most statistically significant differentially expressed transcripts in the Provirus cell transcriptomes compared to the PIC/Bystander cell transcriptomes. The GSEA lists of the differentially expressed genes are provided in Supplementary Appendix I. Unlike the random distribution of PIC cells throughout the PIC/Bystander cluster, the distribution of these differentially expressed genes throughout the PIC/Bystander cluster is often clustered. We interpret this indicate the lack of impact PIC transcription has on the overall background PIC cell overall transcriptome (and vice versa).

Biological repeat experiments confirm observations
To confirm the conclusions obtained from the analyses presented above, an independent biological repeat experiment was conducted. The repeat experiments captured over 4,500 cells, with an average greater than 3,000 genes per transcriptome. The control cell cultures again yielded a slightly larger number of cells captured with more than 3,000 genes transcripts per cell. The biological repeat experiments, HIVreplicate2 and WT2 (Control experiment number 2) were also conducted with parallel cultures that were analyzed by flow cytometry. Flow cytometry indicated 2.6% mCherrypositive cells in the HIVreplicate2 experiment.  Table 1. Hallmark analysis of gene pathways up or down regulated detected in Provirus versus PIC/Bystander cluster cells (respectively). GSEA Hallmark analysis (fgsea R package) of metabolic pathways negatively or positively regulated (p < 0.1) in the Provirus cluster transcriptome versus the PIC/Bystander cluster transcriptome. Pathways up-regulated in Provirus cells include E2F, Myc targets, G2-M checkpoint, spermatogenesis and oxidative phosphorylation. Down-regulated pathways identified are more numerous, but notably included TNFα signaling via NFkB, inflammatory response genes, apoptosis and interferon γ response. The pairwise T-Tests function from Scran was used to determine the significant of genes between groups. The significant DGE subsets were used for all comparisons. Several comparisons were made to confirm that transcriptomes from Provirus and PIC/Bystander clusters in the repeat experiments were identical. In the first comparison, differentially expressed genes (positive or negative) were identified between the Provirus and PIC/Bystander clusters in the respective biological repeats. The log2 fold changes from these gene sets were then compared to test if the differences between the transcriptomes of Provirus and PIC/Bystander cells in the repeat experiments was consistent ( Fig. S-13). Because there were almost 4 times as many Provirus cluster cells in the HIVreplicate1 experiment, the detected differentially expressed genes were significantly increased in the HIVreplicate1 case, compared to the HIVreplicate2 case. Nevertheless, the two gene sets were positively correlated. Correspondingly, GSEA with Hallmark and REACTOME gene sets from HIVreplicate2 identified many of the same pathways as differentially regulated as did HIVreplicate1 (Appendix I).
To obtain additional statistical certainty for our conclusion that the semi-supervised cluster transcriptomes obtained for HIVreplicate1 were repeated in HIVreplicate2, we compared log2 fold change values from the Provirus or PIC/Bystander clusters in HIV infected cells to the log2 fold change values from the Control (WT) PMA-treated THP-1 cultures. This 8-way comparison (shown in Fig.  S-14) provided statistical certainty that DGE sets from the Provirus cluster and PIC/Bystander cluster gene sets from the biological replicates were not different. The replicate Provirus and PIC/Bystander gene sets have a generally strong concordance amongst themselves and there is a modest to strong non-zero mean trend in logFC among genes that changed in at least one of the contrasts between replicates (FDR 5%). In every comparison, a significant positive correlation was obtained from the commonly detected, significantly differentially expressed genes of Provirus or PIC/Bystander clusters in the two biological repeats when compared to the Control samples. The weakest correlations were observed in comparing PIC/Bystander to Control cell DEGs, especially Control experiment 2 (probably because there is more commonality between genes detected in PIC/Bystander cells and Control cells than there is between Provirus cluster cells and Control cells); nevertheless the data between HIVreplicate1 and HIVreplicate2 were statistically concordant. Therefore, the transcriptome data obtained from the two biological repeat experiments were not different. In other words, statistically identical representative transcriptomes for Provirus and PIC/Bystander clusters were obtained in independent biological repeat experiments.
As was the case for HIVreplicate1, a clear Provirus cluster was detectable in the HIVreplicate2 (Figure 6, S-11). However, because the level of Provirus infection in HIVreplicate2 was lower than in HIVreplicate1, the frequency of DHIV3 transcript detection in the Provirus cluster cells was proportionately lower, while the absolute number of detectable PIC and Bystander cells was relatively higher. Again, consistent with the observation for HIVreplicate1, PIC cells were randomly distributed throughout the Bystander cluster.

M1 and M2 marker gene transcript expression is detectable in PIC/Bystander and Provirus cell clusters
We have used HIV-1 infection of PMA activated THP-1 cells to model infection of macrophages [40,41]. This model was selected to provide consistent induction responses in repeated experiments in order to provide statistical certainty for our conclusions. PMA activated THP-1 cells are not dividing and are, essentially, M0 macrophage cells [40][41][42], the type of macrophage most susceptible to HIV-1 infection [41]. In our protocol, THP-1 cells were incubated with 32 nM PMA for 24 hours, at which time they are morphologically differentiated, adherent, and flattened with rare mitoses. While THP-1 cells are known to de-differentiate over 72 hours when PMA is removed [42], in our protocol, following DHIV-1 addition for infection, the cells remain in 16 nM PMA, well within literature ranges of THP-1 cell activation by PMA [40][41][42][43][44][45][46].
While the PMA-activated THP-1 cells are not dividing [42], they can be further differentiated into M1 or M2 macrophage-like cells, by LPS and IFN-γ or IL-4 and IL-13, respectively [40]. This differentiation response resembles the M1 or M2 differentiation in primary macrophages when analyzed by transcriptome changes [40,[46][47][48]. However, THP-1 induction responsiveness in terms of gene expression has been reported to be lower and not completely representative of induction responses obtained with naive primary macrophage cultures [40,49,50].
To determine if our Provirus or PIC cell transcriptomes were consistent with differentiation to M1 or M2 [40,[46][47][48], and to determine if the M1 or M2 linked transcriptome changes could correlate with the Provirus versus PIC/Bystander transcriptome differences identified above, gene transcripts associated with M1 or M2 differentiation were compared between Provirus cells, PIC cells, and Control cells. Figure 7 shows the dotplot of these results. The overall expression pattern of M0, M1, or M2 marker genes in Provirus cells was not found to differ from Control cells, indicating no change in THP-1 differentiation state with viral integration. Thus, the distinction between Provirus and Control cell cluster transcriptomes was not due to the differentiation of Provirus cells toward an M1 or M2 phenotype. PIC and Bystander cells, however, were found to have increases in the level and fraction of cells expressing MAFB (M0) and IL1B, HLA-DRB1, and CD68 (M1) differentiation marker genes when compared to Control or Provirus cells. While the degree of a THP-1 cell's differentiation toward M1 or M2 phenotypes, relative to the expression of these marker transcripts, is a vague relationship, the differences described here may be associated with some of the transcriptomic differences identified between the PIC/ Bystander cluster and Control or Provirus clusters.

PIC and Provirus cells express all DHIV3 genes at equivalent levels, although higher numbers of PIC cells detectably express some transcripts
We were interested to know if early HIV-1 gene transcripts (tat, nef, or rev) predominated in PIC cell transcriptomes, versus later transcripts in the Provirus cells. We used Feature plots to determine the distribution of selected HIV gene transcripts expressed in individual cells. It was found that cells expressing the respective early or later gene transcripts were distributed throughout the image (Figure 8(a)).
We then used violin plots to compare the load of individual DHIV3 gene transcripts in Provirus cluster cells to PIC cells (Figure 8(b)). The Provirus cluster (from Figure 2) contained transcriptomes of 372 cells, the PIC/Bystander cluster contained 569 cells, thus the ratio of Provirus to PIC cells was ~0.65. This visualization was much more informative and yielded a more nuanced understanding. Some transcripts, such as gag-pol, tat, env and nef were detected in far more cells in the PIC cluster. Experiment HIVreplicate1 suggested that the level of expression of this set of genes was significantly elevated in Provirus versus PIC cells, but this observation did not repeat in experiment HIVreplicate2 (Figure 9). In contrast, gag, vif, vpr, rev, vpu, and mCherry, were clearly detectable, but in fewer numbers of Provirus and PIC cells. Cells containing these transcripts were comparably prevalent in the two groups. The level of expression of this set of genes was comparable between Provirus and PIC cell groups in both experiments. In conclusion, all DHIV-1 transcripts were easily detectable in both Provirus cells and PIC cells. Furthermore, there is a clear overlap in the amount of HIV-1 gene transcripts detected in Provirus and PIC cells.  Figure 6). The transcriptome representing these Provirus cells was compared to the combined transcriptomes of the remaining clusters to generate the violin plots in Figure 9. The patterns of transcription observed in HIVreplicate1 were confirmed in this experiment (Figure 9, S-15). The transcripts of gag, vif, vpr, vpu, and mCherry were detectable in PIC cells at frequencies and levels of expression similar to those observed in the Provirus cells. In contrast, even correcting for the Provirus/PIC ratio of 0.17, the transcripts of gag-pol, tat, env, and nef were again detectable in proportionally higher number of PIC cells (Figure 9). Again, there was overlap in the numbers of HIV-1 gene transcripts detected in Provirus and PIC cells. No cells expressing rev were detected in the Provirus cluster in this repeat, due either to the low efficiency of detecting this transcript in the 10X system, or because rev is expressed only at low levels in relatively few cells, or both.
During transcription of pro-virus, HIV-1 does not produce all transcripts in equal numbers or at the same time [51,52]. The specific processed gene transcripts produced initially in viral replication differ from transcripts produced later on. Furthermore, 10X Genomics scRNA-seq library production is known for significant numbers of dropouts, and cDNA copying of various gene transcripts during library construction varies in efficiency [53][54]. In addition, in using poly-T primers in the cDNA library construction, the 10X process introduces a 3' bias toward the detection of given sequences in a transcript [15]. Thus, it is not possible to make quantitative comparisons between the different transcripts using this approach. Nevertheless, the overarching take-away from this single cell analysis is that cells making fully spliced transcripts such as nef, tat, and rev are equally prevalent with cells making gag-pol and env transcripts (Figure 8,9). Furthermore, there appears to be two patterns of transcription. One pattern, observed with gag-pol, tat, env, and nef, is characterized by gene transcripts being more frequently detectable in PIC cells than Provirus cells. The other pattern, observed with gag, vif, vpr, rev, vpu, and mCherry, suggests relative equal frequency of transcription in Provirus and PIC cells.

Psupertime analysis indicates progression of cluster transcriptomes from Control to PIC/Bystander to Provirus
To understand the transcriptome transitions needed to move from unexposed and uninfected "Control" cells to PIC/Bystander cells, and on to Provirus cluster cells, we performed a psupertime analysis [55][56][57][58] of the respective cell cluster transcriptomes. Psupertime is a supervised pseudotime [55] technique. It explicitly uses sequential condition labels as input. Psupertime is based on penalized ordinal logistic regression that places the cells in the ordering specified by the sequence of labels. This allows for targeted characterization of processes in single-cell RNAseq data.
One thousand cells were randomly selected from each transcriptome cluster (combined Control, PIC/ Bystander, and Provirus data sets, respectively) and their transcriptomes were combined for psupertime analysis. Imposition of Cluster identity yielded the image shown in Figure 10(a). The psupertime-type analysis showed closer similarity between Control and PIC/Bystander transcriptomes than between Control and Provirus transcriptomes, and closer similarity between PIC/Bystander and Provirus transcriptomes than between Control and Provirus. The GSEA list of the DGEs that contributed to this faux progression from Control to PIC/Bystander to Provirus is presented in Appendix II.
When we examined the expression of DHIV3 transcripts through the psupertime progression, the analysis showed no obvious preference for early gene transcription in cells belonging to the PIC/Bystander versus the Provirus clusters ( Figure 10(b)).
When questioning which transcription factors were regulating the transcriptome transitions, we searched the contributory psupertime DGE transcripts for transcription factors. Many transcription factors that differed in expression in the contrasted transcriptomes were identified (Appendix II). However, this yielded a complicated picture and did not clarify which transcription factors might be most important for controlling the transcriptome transitions from Control to PIC/Bystander to Provirus clusters. However, because the activity of most transcription factors is regulated by activation of proteins already present within the cell, and not at the transcription level, we speculated that Transcription Factor Targeting analysis might be more informative as to which transcription factors were key to cluster transitions.

E2F, NF-kB and AP1 control phenotype transitions between PIC/Bystander cells and Provirus cells
We used Transcription Factor Targeting analysis to identify transcription factors that controlled DGEs in our Control, PIC/Bystander and provirus clusters. This analysis was consistent with the aforementioned Hallmark and REACTOME analyses. The E2F family of transcription factors predominate in regulating the Provirus cluster DGEs (Table 2). Twenty out of the 29 possible promoterassociated transcription factor interactions positively associated with the transition from the PIC/Bystander to the Provirus transcriptome identified with the E2F transcription factor family. Thus, E2F is associated with pathways that determine the phenotype of cells in the Provirus cluster. Conversely, 19 possible promoter-associated transcription factor interactions were negatively associated with DGEs reflecting the transition from PIC/Bystander to Provirus cells. Of these 19 possible promoter complexes, 8 different promoter interactions were identified to be associated with NFkB and AP1 transcription ( Table 2) suggesting that downregulation of NFkB and AP1 also plays a key role in shaping the Provirus cluster cell transcriptome.
AP-1 and NFkB also appear to play roles in the maintenance of the PIC/Bystander cell transcriptome as well. Correspondingly, Transcription Factor Targeting identified differences between Control and PIC/Bystander cell transcriptomes. Simple exposure of activated THP-1 cells to DHIV3 was sufficient to decrease E2F signaling in PIC/ Bystander cells, compared to Control cells, and to increase AP-1 and NFkB signaling (Appendix III). We presume this effect on the PIC/Bystander cells is through PAMP and/or interferon signaling, but we have not investigated this further. Again, these results are consistent with the results obtained with Hallmark and REACTOME analysis of the DGEs.

Western blot analysis of Retinoblastoma protein phosphorylation shows an increase in Provirus cells
To confirm a role for E2F and NFkB in regulating the transcriptomes of Provirus and PIC/Bystander clusters (respectively), we sorted Provirus (mCherry positive cells) from PIC /Bystander (mCherry negative) cells using the FACS Canto. Rb phosphorylation is associated with activation of E2F promoter family transcription. We hypothesized Provirus cluster cells would exhibit retinoblastoma (Rb) phosphorylation consistent with E2F activation [27,59]. We used anti-T821 Phospho-Rb antibody. Phosphorylation of Rb at threonine-821 (T821) blocks pocket protein binding, including E2F family proteins, and activates E2F family promoter gene transcription [59]. Isolated Provirus cells had the greatest phosphorylation of Rb (pRb), compared to Control and PIC/Bystander cells ( Figure 11). Interestingly, phosphorylation of Rb in PIC/ Bystander cells was lower than that detected in Control cells (Figure 11).
These findings agreed closely with the Transcription Factor Targeting analysis, which showed that E2F promoter family transcription was predominant in Provirus cluster cells. It also agreed with the Transcription Factor Targeting analysis in that PIC/Bystander cells exhibited reduced Rb phosphorylation, and presumably E2F driven transcription, when compared to either Provirus cluster of Control cells. We did not find a difference in NFkB deactivation in the Provirus cells as would be implied by phospo-IkB S32, (Figure 11), and hypothesize that other mechanisms must account for the relative decrease in NFkB driven transcription in Provirus cluster cells.

Cells transcribing from provirus are more likely to produce viral proteins upon second infection than PIC or Bystander cells
If Provirus cluster cells have already committed to the production of virus, represented by the switch of their background transcriptome to favor E2F transcription factor interactions, we hypothesized that they should be more efficient at producing virus upon second infection. We sequentially infected activated THP-1 cells with DHIV3-mCherry followed by an infection with DHIV3-GFP 24 hours later (DHIV3-mCherry infection at 0 hour and DHIV3-GFP infection at 24 hours). We found a higher percentage of cells positive for mCherry and GFP after 48 hours compared to GFP alone (Figure 12). At his time point, which was 24 hours after DHIV3-GFP infection, about half the mCherry positive cells were also GFP positive. Whereas less than one-quarter of the mCherry negative cells were expressing GFP protein. This trend continued out to 72 hours post DHIV3-mCherry infection, 48 hours after DHIV3-GFP addition, where about 60% of the mCherry cells were also GFP positive, compared to about 40% GFP positive in mCherry negative cells. In repeat experiments and in experiments using primary macrophage and T-cell cultures (Fig. S-16) the results were the same. Provirus cluster, mCherry positive cells, were 2X to 5X more likely to make DHIV3-GFP protein upon second infection than PIC/Bystander cluster cells. Thus, cells already committed to making virus were more likely to make virus from a second infection than PIC/Bystander cells on the first infection.

Discussion
Early research into the HIV-1 life cycle identified transcription of HIV-1 PIC cDNA in T-cells and macrophages [5][6][7][8][9][10][11][12]. The unintegrated viral DNA can take Table 2. Transcription Factor Targeting analysis of DGE contrasting PIC/Bystander and Provirus cells. TFT analysis (GSEA with the fgsea R package and the C3 collection from msig) suggests that at least 3 transcription factor families control the transition from PIC/Bystander transcriptomes to Provirus cluster transcriptomes. These are E2F, NFkB and AP1 family promoter proteins. In particular, increased E2F regulated transcription appears to correspond with the transition to production of viral proteins. several forms, linear and the 1-LTR-and 2-LTR circles, with much of the transcription thought to emanate from 1-LTR circles [7,60]. In macrophages, it has been shown that HIV-1 PIC cDNA can persist and be actively transcribed for months. However, it is generally agreed that PIC HIV-1 transcription in macrophages does not routinely produce infectious virus [12]. The use of singlecell techniques has enabled us to quantify both the numbers of cells expressing given HIV transcripts in a mixed culture, and the relative transcript loads of each HIV-1 gene in each infected cell [13][14][15]. It also allows us to put the cells containing viral transcripts into the context of their background transcriptomes. This provides the opportunity to compare cells producing late viral proteins to those producing transcript but not late viral proteins to better understand the cellular metabolic background necessary for virus production.
We show that scRNA-seq can differentiate between macrophage cells that transcribe from PIC HIV-1 cDNA and macrophage cells that transcribe from integrated HIV-1 provirus. In proving this observation, we discovered that the synthesis of many HIV-1 proteins is only detectable in provirus cells, even though HIV-1 RNA transcripts can be detected equally in cells transcribing from provirus or PIC HIV-1 cDNA. Single-cell RNA sequencing can distinguish between the two cell types because the background transcriptomes vary dramatically. PIC cell transcriptomes are characterized by NFkB and AP-1 promoted transcription, while transcriptomes of cells transcribing from provirus are characterized by E2F family transcription products. We also find that the transcriptomes of PIC cells and Bystander cells are identical, suggesting that PIC cells are "oblivious", at the transcriptome level, to the fact that they are making HIV-1 transcripts. Furthermore, the presence of pathogen alters the transcriptome of the uninfected Bystander cells, so that they are readily distinguishable from true Control cells (cells not exposed to any pathogen). Thus, to understand the transcriptional changes caused by HIV-1 infection in these cell populations, it is not appropriate to compare bulk RNA sequencing data from HIV-1/host cell cocultures to sequencing data from control cultures not exposed to pathogen. The correct comparisons to make are between cells transcribing from provirus and those transcribing PIC HIV-1 cDNA, or Bystander cells. As reported by Marsh and Wu and colleagues [10], "transcription in the absence of integration is selective and skewed towards certain viral early genes such as nef and tat, with highly diminished rev and vif". In general, our single cell analysis agrees with this conclusion, but provides a more nuanced picture. In our data, compared to cells in the same co-culture transcribing from provirus, many more PIC cells are producing tat, nef, gag-pol, and env transcripts. However, the level of a given transcript load per cell is not detectably different from Provirus cluster cells. In contrast, the prevalence of PIC cells that make rev, gag, and accessory gene transcripts constitute a smaller fraction of the PIC cells. Nevertheless, those few PIC cells that are producing rev, gag and accessory gene transcripts are doing so at levels equivalent to Provirus cells. In Provirus cells, the numbers of cells detectably making fully spliced versus un-spliced transcripts are equal. The levels of these transcripts per cell overlaps with levels of transcripts produced by PIC cells. Thus, it is difficult to generalize that high overall transcript levels are a trigger for virus production.
Because DHIV3 is replication-deficient, we measure the production of HIV-1 proteins as a surrogate marker of virion production. Only the Provirus cells appear to be making p24, mCherry or Vpu protein. Using Western blot analysis, we only see the production of p24, mCherry and Vpu in purified Provirus cell protein, or protein samples containing Provirus cells. It is not clear whether our failure to detect the HIV proteins in preparations enriched for PIC/Bystander cells is because very few cells in this cluster were making the proteins (and thus below our level of detection), or if there was some restriction mechanism preventing protein production, or both. Single-cell protein analysis technology might be able to address this point. The robust detection of mCherry, p24, and Vpu in protein samples enriched for Provirus cells confirms that late protein synthesis is an attribute of the Provirus cells. Nef protein is reported in the literature to be detectable in protein from both Provirus and PIC cluster cells [10]. Nef protein is not made in cells infected with the DHIV3-mCherry, due to the mCherry sequence replacing the 5' portion of the nef gene, and we could not detect Nef protein from any DHIV3 infected cultures.
Integrated provirus transcription is required for virus production [12]. It is regulated by promoter elements in the HIV LTR. Thierry and coworkers have recently shown that PIC cDNA and provirus are differentially responsive to NFkB promotion [60]. As others have found, they report provirus transcription is enhanced by NFkB and AP1 binding. However, they find PIC cDNA transcription to be inhibited by NFkB activation. Our data adds a layer of complexity to this picture. We can show that early and late gene transcripts are detectable in Provirus cells, at levels overlapping with the respective transcript levels in PIC cells. We also show that, in cells that have transitioned from PIC/Bystander to Provirus, the background transcriptome reflects an overall down-regulation of NFkB and AP1 transcripts, and up-regulation by E2F family promoted transcripts. E2F is not a promoter in the HIV-1 LTR, and we do not suggest that E2F regulation of provirus transcription is the key to the PIC to Provirus transition. However, we do propose that an E2F family promoter-dominated transcriptome is required for virus production.
This proposition appears counter to literature, in which E2F is thought to suppress viral transcription [25,61]. Nevertheless, the pathways upregulated by E2F are those consistent with what one would intuitively anticipate a being required for viral production, and our transcription factor analysis suggests that E2F family promotors play an outsized role in Provirus transcription. E2F is an Rb-pocket binding protein, and phosphorylation of Rb at T821 is known to activate E2F transcription. Indeed, we show that levels of phospho-T821 Rb are higher in Provirus cells. A proposed role for E2F promoter family proteins in virus production is not new and is consistent with literature citing a role for Rb phosphorylation and E2F activation in HIV-1 linked tumorigenesis [62]. Consistent with this was our observation in our THP-1 system, and in primary cultures of human macrophage and T-cells, the Provirus cells are more likely than PIC cells to make virus upon second infection. If this interpretation is correct, then the study of genome-wide changes that accompany provirus integration and amplify E2F signaling might be key to understanding the switch in transcriptome necessary for viral protein production as PIC cells transition to Provirus cells.
Several recent scRNA-seq studies using donor macrophage or T-cell preparations have revealed complicated relationships between transcriptome and HIV-1 production. In general, these studies preselect for Provirus cells before amplification and report complicated and possibly stochastic, transcriptome heterogeneity in both latent and active HIV-1 infected cells [63][64][65]. In contrast, we focused on the distinction between GFP positive cells (transcribing from provirus) compared to GFP negative cells (either Bystander cells or those transcribing from PIC HIV-1 cDNA), in cell cultures containing both infected and uninfected cells. We found that a change in transcriptome background that includes E2F transcription accompanies Provirus integration, and this also accompanies the production of late viral proteins.

Generation of DHIV3-mCherry
DHIV3-mCherry virus was generated by calcium phosphate transfection (25). In brief, HEK293FT cells were grown to 80% confluence. DHIV3-mCherry plasmid and VSV-g plasmid were mixed with calcium chloride (2.5 M) and HEPES buffered saline solution. The calcium phosphate-DNA suspension was added dropwise to the cells. Chloroquine (100 mM) was subsequently added and the HEK293FT cells were incubated overnight. The medium was replaced with fresh DMEM and incubated for an additional 48 hours. Supernatant was collected and filtered (0.45 μm). Optimal viral titers were determined by titrating the virus in THP-1 cells. A titer volume of 100 μL in 500 μL total (~5 x 106 TU/mL in THP-1s) was chosen due to its high DHIV infection and minimal effects on viability. Titer volume (up to 400 μL of 500 μL) was increased as infectivity fell off in stocks over time.

DHIV3-mCherry and THP-1 co-culture
For the THP-1s, cells were preincubated overnight in PMA (20 ng/mL) at 500,000 cells/well to generate differentiated macrophages. Medium was replaced for THP-1 cells 1 hour before the addition of DHIV3-mCherry. Cells were incubated at 37°C for 24 hours followed by preparation for analysis by flow cytometry and cell imaging.

Flow cytometry
Adherent THP-1 cells were incubated with Accutase™ for 15 minutes at 37°C. THP-1 cells were transferred to 5-mL tubes and washed with PBS. Cells were resuspended in BD Horizon™ Fixable Viability Stain 450 (0.25 μg/mL) and incubated at 4°C for 30 minutes. Cells were then fixed in 2% formaldehyde for 30 minutes at 4°C. Cells were analyzed using a FACS Canto. Percent infection by DHIV was quantified as a subset of the live population (FSC/V450/50-). Gates for infection were set according to the uninfected "mock" THP-1 cell controls. Three independent biological replicates were completed for all treatment conditions, each in triplicate wells per experiment. Population analysis was then done using FlowJoTM v10.7, to assess if infection levels and cell viability were consistent similar in all replicates. The Flow Cytometry figures are representative plots obtained from one of the replicates.

10X Genomics library construction and sequencing
Two biological replicate cultures of (HIV+) THP-1 cells (HIVreplicate1 and HIVreplicate2) and (HIV-) THP-1 cells (hereafter referred to as Control, or WT1 and WT2 for wild type) were processed through the 10X Genomics Chromium Single Cell Controller with Single Cell Gene Expression 3' Solution (v2 chemistry). Sequencing was done on an Illumina HiSeq 2500 instrument.
Cell suspensions were partitioned into an emulsion of nanoliter-sized droplets using a 10X Genomics Chromium Single Cell Controller and RNA sequencing libraries were constructed using the Chromium Single Cell 3' Reagent Kit v2 (10X Genomics Cat#PN-120237). Briefly, droplets contained individual cells, reverse transcription reagents, and a gel bead loaded with poly(dT) primers that include a 16 base cell barcode and a 10 base unique molecular index. Lysis of the cells and gel bead enables priming and reverse transcription of poly-A RNA to generate barcoded cDNA molecules. Libraries were constructed by End Repair, A-Tailing, Adapter Ligation, and PCR amplification of the cDNA molecules. Purified cDNA libraries were qualified on an Agilent Technologies 2200 TapeStation using a D1000 ScreenTape assay (Agilent Cat#5067-5582 and Cat#5067-5583). The molarity of adapter-modified molecules was defined by quantitative PCR using the Kapa Biosystems Kapa Library Quant Kit (Kapa Biosystems Cat#KK4824).

) of HIVrepeat1
Raw FASTQ files from 10x Genomics were processed by 10x Genomics' Cell Ranger software. Each library was processed with "cellranger count" pipeline with a common genomic reference made up of human and HIV genomes as well as mCherry. The human genomic reference was GRCh38 with gene annotation from Ensembl release 91, where only features with gene_biotype:protein_coding were kept. The HIV genome and annotation was acquired from NCBI genome (RefSeq ID NC_001802.1). No warnings were issued by 10x Genomics regarding sequencing, alignment, or cellbased QC metrics; however, the samples could have been sequenced deeper as reflected in the sequencing saturation statistics.
In an attempt to recover those (perhaps lower quality) GEM partitions, the raw gene-barcode matrices from "cellranger count" (located in "outs/ raw_gene_bc_matrices") was processed with the EmptyDrops algorithm (R package DropletUtils v1.2.2) to discriminate cells from background GEM partitions at a false discovery rate (FDR) of 0.001% [66]. GEM partitions with 2000 UMI counts or less were considered to be devoid of viable cells, while those with at least 10,000 UMI counts were automatically considered to be cells.
For each technical replicate, additional quality control measures were taken to filter out low-quality cells. Cellbased QC metrics were calculated with R package scater (v1.16.2) using the perCellQCMetrics function. Cells with extremely low UMI counts, extremely low gene counts, or an extremely high percentage of expression attributed to mitochondrial genes were also flagged as low quality. Extremeness in any of these three measures was determined by 3 median absolute deviations from the median with the scater is Outlier function. These cells suspected of being low quality were removed from downstream analysis with one exception in the HIV replicates; if cells exhibited above median HIV gene expression, they were not discarded as these were thought to hold potential value as examples of cells in which viral replication suppressed other gene expression (not observed). Further analysis of the HIV biological replicates showed there were remaining low quality cells as marked by unusual mitochondrial gene expression or low library size, which were removed to improve the signal-to-noise ratio. Specifically, HIV-infected cells with mitochondrial expression of 7.23% and above were removed as well as cells that have less than 3242 UMI counts. To ensure no cell type was discarded due to filtering, average gene expression was compared gene-wise between discarded and kept cells in scatter plots. There were no genes of interest that exhibited markedly different average gene expression between the discarded and kept cells, suggesting the filtering did not remove interesting subpopulations. After filtering low quality cells, technical replicates were combined into biological replicates HIVreplicate1, HIVreplicate2, wt1 and wt2.
For each biological replicate, cells were normalized [66] and scored for several important attributes. Each cell from HIV biological replicates was assigned a HIV activity score with Seurat's (v3.2.2) AddModuleScore function [58,67], where a high score indicates HIV gene expression was stronger in the cell relative to randomly selected genes of similar expression strength in the biological replicate. Cell cycle phases and scores were assigned cells with the cyclone method [56]. Cells were scored against a simulated doublet population of cells with scran's (v1.10.2) doubletCells function; those cells with extremely high doublet scores (5 median absolute deviations) were removed and remaining cells were re-normalized again with scran's quickCluster and scater's normalize methodology [57] for differences in sequencing depth between libraries. (Fig. S-3) 10x Cell Ranger raw sequencing data was processed into UMI counts using the "mkfastq", "count", bioinformatics modules. Cell Ranger de-multiplexed cDNA libraries into FASTQ files with Illumina's bcl2fastq and aligned reads to a hybrid genomic reference composed of human (Ensemble GRCh38), HIV (NCBI ID: NC_001802.1), and mCherry genomic references with STAR aligner [68]. In addition to other QC metrics [69], CellRanger filtered cell barcodes and unique molecular identifiers (UMIs) in estimation of gene-cell UMI counts using only reads that mapped uniquely within the transcriptome. We specified an "expected cell number" of 3000 per library based on reported cell recovery rates.

Data analysis for t-Sne insert
The QC metrics reported by Cell Ranger indicated that our library construction was a success; the libraries averaged 97.9% valid cell barcodes, 60.6% of reads mapping to the transcriptome, and reported a median of 2402 genes detected per cell (mean of 15,262.2 genes per library). Only in HIV-infected samples did reads map to the HIV genome. Cell Ranger also evaluated dimension reduction, clustering, and differential gene expression analysis under default parameters. For further details of the Cell Ranger data processing and analysis pipeline, see https://support.10xgenomics.com/ single-cell-gene-expression/software/pipelines/latest/ algorithms/overview.
Due to cost, this study was limited by low sequencing depth (mean of 41,433 reads per cell per library). This limitation was mitigated by removing genes with low sequencing coverage; specifically, a gene was filtered out if it did not have 1% of cells reporting at least 3 UMI; cells were filtered out if they did not have at least 200 genes with a UMI count. Cells with exceedingly high (top 2%) ribosomal and/or mitochondrial content were filtered out. To reduce mutliplets contaminating analysis, cells with the top 2.3% total UMI were removed (see 10x Genomics benchmarks).

Dimension reduction, clustering, and differential expression
Highly variable genes were identified with scran's trendVar and decomposeVar. Specifically, loess smoothing was applied to the gene variance (dependent variable) and the mean gene expression (independent variable) after having corrected for the % mitochondrial expression and cell cycle effects on the cells. Genes with average expression below the first quartile were filtered out of consideration. Gene variance was decomposed into biological and technical components, where genes with variance above the mean trend (loess fit) were assumed to possess biological variation [70]. This process was repeated for the HIV biological replicates separately followed by scran's combineVar function applied to the combined HIV replicates to identify genes estimated to have positive biological variation and controlled with a false discovery rate of 0.05.

Gene set enrichment analysis
A gene set enrichment analysis (GSEA) was conducted with R package fgsea (v1.8.0) [32]. The log2 fold change vector of strong-HIV vs. weak-HIV was evaluated for enrichment against 3 different collections of MSigDB gene sets: namely, Hallmark, REACTOME, and Transcription Factor Targets. The Benjamini-Hochberg false discovery rate (FDR) was controlled at 10%.
A DGE and GSEA analysis was conducted on HIV biological replicates in comparison to Control biological replicates. The Control (Wild Type) replicate expression data was subject to (nearly) identical quality control and identical pre-processing steps. As described above, quality control on HIV cells was subject to a greater degree of scrutiny.
To mitigate biases due to potential batch-specific variation, DGE and GSEA analyses contrasting Provirus -HIV and PIC/Bystander-HIV populations to Control cells leveraged consensus between various pairwise contrasts. For instance, in the HIV-Provirus vs. Control (WT) contrast, t-tests were evaluated in 4 distinct contrasts: (i) HIVreplicate1-Active vs. WT1; (ii) HIVreplicate1-Provirus vs. WT2; (iii) HIVreplicate2-Provirus vs. WT1; (iv) and HIVreplicate2-Provirus vs. WT2. The scran function combine Markers performed a meta-analysis across the 4 contrasts with the Simes method. The Simes meta-analysis tested whether any of the 4 contrasts manifest either a change a gene-wise expression for DGE analysis; that is to say the metaanalysis p-value encodes the evidence against the null hypothesis, which assumes the gene is not changed in any of the 4 comparisons. Similarly in the GSEA analysis, the log2 fold change statistics from the 4 comparisons were tested for enrichment of the 3 previously mentioned MSigDB gene set collections (Hallmark, REACTOME, Transcript Factor Targets). The results were also merged with the Simes meta-analysis. This same strategy for the HIV-Provirus vs. Control contrast was repeated in the HIV-PIC/Bystander vs. Control comparison.
There was interest in modeling the progression of infection from wild type to PIC/Bystander-HIV to Provirus -HIV clusters. Specifically, there was interest in identifying what genes exhibit a variable expression profile or non-constant trend when ordered from Control to PIC/Bystander-HIV to Provirus -HIV. Macnair and Claassen [71] developed a supervised psuedotime R package, called psupertime, that is tailored to this express purpose. In particular, a penalized logistic ordinal regression model was fit to the combined HIV and Control data. The input data was a subset of highly variable genes identified in the same manner as described above, but including Control data as well. The gene expression data had been normalized, log2 transformed, and followed by linear correction of effects due to percent mitochondrial expression and cell cycle phase. The model was able to clearly order cells that reflects the expected order of Control then PIC/Bystander-HIV then Provirus -HIV. The psupertime method also reports a small set of genes that strongly associate with the expected progression, which is based on the magnitude of the penalized coefficients in the logistic ordinal regression.

Western protocol
To generate protein samples for the Western blot experiments, multiple (6-12) wells of differentiated THP-1 cells, in 12 well plates, were infected with DHIV-3. In the integrase inhibitor experiments, three independent wells were combined to generate samples from each of the three treatment conditions (control, DHIV infection, DHIV infection with integrase inhibitor). Proteins from these replicate wells were pooled for Western blotting. Two distinct experimental replicates were analyzed, the initial 24 hour infection experiment and then the second 24-and 48 hour repeat infections. These repeats used different passages of THP-1 cells and different batches of infectious DHIV-1. For the flow-sorted cells, 18 wells each of control and infected THP-1 cells were sorted to accumulate sufficient protein for the Western blots. Protein from the collected, sorted cell sample groups were pooled. While the cells used in the sorted cell experiment were from 1 passage of THP-1 cells, multiple batches of infectious DHIV-3 were used. The stored THP-1 cells were pelleted by centrifugation for 5 min at 10,000 x g. Precipitates were then washed twice in cold PBS. Afterward, cells were lysed using 50 mM Tris pH 7.4, 100 mM NaCl and 2 mM EDTA, complete® protease inhibitor, 2 mM NaF, 2 mM sodium orthovanadate and 10% SDS. Protein concentrations were determined using bovine serum albumin standard and Coomassie Plus Protein Reagent from Pierce Biotechnology (Rockford, IL). The amount of protein loaded for the sorted cell experiments was 7.5 µg/ lane, for the integrase inhibitor Western gels, 15 µg/ lane and separated using NuPAGE 4 − 12% Bis-Tris gradient gels (Invitrogen Life Sciences, Carlsbad, CA) and transferred to a PVDF membrane (Millipore, Billerica, MA). These were then blocked in 2% BSA in TBST for 20 min at room temperature, incubated overnight with primary antibodies at 4°C in blocking buffer solution and with the secondary antibody for 45 minutes at room temperature. Protein was detected using chemiluminescence and blots were visualized using a Protein Simple FluoroChem M system.

Real-time PCR methods
For detection of integrated proviral DNA, DNA from control, DHIV3-mCherry infected, DHIV3-mCherry infected plus integrase inhibitor-treated THP1 cells was purified using Qiagen Blood and Tissue DNeasy kits. For the PCR experiments, we generated DNA from 2 replicate wells in a 12 well culture plate. Each RT PCR reaction was run with triplicate technical repeats. Every PCR experiment, using unique primer sets, was performed at least twice. The PCR evaluation of integrated HIV was performed using the primers and PCR conditions described by Chun et al. [38]. Briefly, the primers were: Alu->LTR 5, 5'-TCCCAGCTACTCGGGAGGCTG AGG-3' LTR->Alu 3', 5'-AGGCAAGCTTTATTGAGGCTTA AGC-3' Nested secondary PCR primers (generating a 352 bp amplicon): 5', 5'-CACACACAAGGCTACTTCCCT-3' 3', 5'-GCCACTCCCCIGTCCCGCCC-3' We used Ranger polymerase and buffer conditions (Meridian Bioscience, Thomas Scientific, Swedesboro, NJ) for the long-range PCR with Alu-LTR primer sets, and BioTaq polymerase and buffer conditions (Meridian Bioscience, Thomas Scientific, Swedesboro, NJ) for the nested PCR. The integrated HIV PCR was performed on an MJ PTC-200 with an MJR 2 × 48 and a Chromo-4 alpha unit for the long-range and nested PCR, respectively. The nested PCR was performed as a real-time assay using SYBR Green I to detect the amplicon progression curves and evaluate the melting curve.
For detection of total HIV DNA, to determine if comparable total HIV DNA was present in the samples the same samples described above, we utilized the 5' nested primer (5'-CACACACAAGGCTACTTCCCT-3') along with the LTR->Alu 3' primer (5'-AGGCAAG CTTTATTGAGGCTTAAGC-3') using PCR conditions similar to the nested PCR described above but with a 30 sec extension time for the 484 bp amplicon. This PCR was performed on a Roche LightCycler 480 instrument using BioTaq polymerase and buffer conditions with SYBR Green I detection of the 484 bp amplicon.
Initially we performed the PCR conditions used by Brussel and Sonigo [37], using BioTaq polymerase and buffer conditions with a 25 sec extension time and SYBR Green I detection, but we were unable to detect any amplicon 2LTR circle PIC product. Following the detection of the total PIC HIV data, we hypothesized that the 2-LTR content in these samples could be considerably lower at this 24 hour time point. Therefore, we reran the PCR again a second time and were able to detect an expected 231 bp amplicon. To verify this approach, we synthesized new PCR primers: RU5 forward: 5'-GCTTAAGCCTCAATAAAG CTTGCCT-3' (this is the compliment of the LTR->Alu primer described above by Chun et al. [38]).
U3 reverse: 5'-ACAAGCTGGTGTTCTCTCCT-3'. This primer set also did not generate the 2-LTR circular PIC amplicon within 50 PCR cycles but was designed to encompass the amplicon generated by Brussel and Sonigo primers above. Therefore, we used the HIV F and R1 primers as a nested set and ran a 1:20 dilution of this amplification for another 50 PCR cycles to obtain the expected 231 bp amplicon. This demonstrates the 2-LTR circular form of PIC cDNA was present at low levels in all the conditions where DHIV3-mCherry was used, but not in the control samples.

Statistical analysis
The pairwise T-Tests function from Scran was used to determine statistically significant differential expression of genes between groups. This was performed for all comparison sets. Only those genes which were significantly different were included in Hallmark, REACTOME, pseudotime, psupertime, and TFT analyses. Other default statistical standards were adopted from the various software recommendations during data analyses unless otherwise specified.

Acknowledgments
This work was supported by the University of Utah HCI Flow Cytometry Facility and Metabolomics core facilities, and the HSC High Throughput Genomics core facility. The help of Christopher J. Conley and Brian Lohman in the bio-statistical analysis is gratefully acknowledged. Mr. A.L. Lim was partially supported during this work by the grant U01TW008163 (P.I., Margo Haygood, University of Utah). This work was also supported in part by R21 AI124823 (LRB & VP) and a University of Utah seed grant (LRB).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Author summary
A single cell analysis can distinguish between HIV-1 infected macrophage cells that are transcribing pre-integrated HIV-1 cDNA and those transcribing HIV-1 provirus. Only cells transcribing HIV-1 provirus are making detectable p24, marker mCherry, and Vpu proteins, which corresponds with a change in the host cell's background transcriptome from one expressing viral restriction and immunological response genes to one that is expressing genes associated with cell replication and oxidative phosphorylation.