Statistical Inference and Mechanistic Understanding of Horizontal Gene Transfer in Microbial Communities

Paulina M. Paiz
Dec 21, 2023

Introduction

A significant portion of many bacterial genomes is acquired not through inheritance but through horizontal gene transfer (HGT), the exchange of genetic material between organisms that are not parent and offspring. HGT can speed up environmental adaptation and improve survival. A major concern in the biomedical community is that it enables the acquisition of antibiotic resistance. Understanding the mechanisms by which bacteria engage in HGT, and the analytical tools scientists use to detect it, offers a unique lens through which to view microbial evolution.

Mechanisms of Horizontal Gene Transfer

The three primary mechanisms by which microbes engage in HGT are transformation, transduction, and conjugation. Transformation occurs when bacteria take up foreign DNA from the environment, such as DNA released by nearby cells. In 1928, Griffith showed that a non-virulent strain of Streptococcus pneumoniae could be transformed into a virulent strain by exposure to heat-killed virulent bacteria, suggesting that some “transforming principle” allowed bacteria to develop pathogenicity. Transduction is the transfer of DNA by bacteriophages, which can package host DNA during viral replication and deliver it to the next cell they infect. The amount of genetic material that can be transduced is limited by the size of the viral capsid, but it is usually around 10,000 bases. In addition, phages can encode proteins that facilitate the introduction of foreign DNA into the recipient bacterial cell. Finally, conjugation happens through direct physical contact between donor and recipient cells. It is often facilitated by a conjugative plasmid, a type of mobile genetic element that carries transfer genes. These genes allow the donor cell to attach to the recipient cell and connect to its cytoplasm by forming a tube-like structure called a pilus.

Performing transformation, conjugation, and transduction in the laboratory has contributed significantly to our understanding of gene function and regulation. In a typical experiment, a distinct marker gene, such as one conferring antibiotic resistance or encoding a fluorescent protein, is introduced into a donor organism’s genome. The donor is then placed in an environment with the recipient organism, and researchers monitor the recipient population over time for the presence of the marker gene. The spread and expression of the marker gene within the recipient population can be quantified using techniques such as PCR, gel electrophoresis, or fluorescence microscopy. These readouts map the dynamics of gene transfer, including the rate and efficiency of HGT under various conditions. Controlled laboratory environments are necessary for studying HGT, but analyzing these processes in more natural and complex settings could yield deeper insights into how the genetic makeup of microbial communities changes over time. Recent technologies such as Oxford Nanopore sequencing can read longer DNA fragments, which provide more context around lateral transfers. Altogether, bacterial transformation, conjugation, and transduction inspired many of the biotechnologies used in modern genetic engineering, including Genentech’s achievements in recombinant DNA.
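
As a rough illustration of how such readouts become a transfer efficiency, the sketch below computes the fraction of recipient cells that carry the marker at each time point. The function name and the colony counts are hypothetical and not taken from any study; real experiments would also need to account for growth of cells that have already received the marker.

```python
def transfer_frequency(marker_positive, total_recipients):
    """Fraction of recipient cells that carry the marker gene."""
    if total_recipients <= 0:
        raise ValueError("total_recipients must be positive")
    if not 0 <= marker_positive <= total_recipients:
        raise ValueError("marker_positive must lie between 0 and total_recipients")
    return marker_positive / total_recipients


# Hypothetical time course: (hours, marker-positive recipients, total recipients)
time_course = [(0, 0, 1_000_000), (2, 150, 2_000_000), (4, 1_200, 4_000_000)]

frequencies = {
    hours: transfer_frequency(positive, total)
    for hours, positive, total in time_course
}

for hours, freq in frequencies.items():
    print(f"{hours} h: {freq:.2e} marker-positive cells per recipient")
```

Comparing these fractions across conditions (temperature, donor-to-recipient ratio, and so on) is one simple way to express the "rate and efficiency" mentioned above.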

Detection and Inference of Horizontal Transfer

Before the advent of high-throughput sequencing, branching processes and distinct taxonomies were standard tools for studying microbial evolution. Mathematical structures called phylogenetic trees were used to represent vertical evolutionary events. These trees are hierarchical structures that depict the relationships between different species or clades based on their shared ancestry. Their branching patterns illustrate the divergence of lineages from a common ancestor, providing a visual representation of the evolutionary history of the organisms or genes under study. However, the traditional tree model ran into trouble once lateral gene transfer was observed in the laboratory: in the presence of HGT, a simple bifurcating tree cannot represent the evolutionary relationships among organisms, because a lineage may receive genetic material from more than one source. Reticulate networks address this by allowing branches to join as well as split. These graph structures can capture complex, intertwined evolutionary relationships and increase the power to identify genetic markers associated with HGT.

Most analysis pipelines for inferring HGT are built on conventional bioinformatic algorithms such as sequence alignment, BLAST, and other comparative genomics techniques. The Basic Local Alignment Search Tool (BLAST) compares microbial sequences to sequence databases and calculates the statistical significance of matches. BLAST best-match methods are used as a second pass: they sort BLAST hits by indicators of sequence similarity, such as bit scores. This enables the delineation of different phylogenetic histories and the identification of members of gene families. If the best match comes from a distantly related organism, the gene is categorized as likely horizontally acquired. This method has become popular due to the rapid increase in available annotated genome data. However, it can be limited by factors such as gene loss events, stochastic similarity, and database errors, all of which can lead to false identifications of HGT. To address these limitations, scientists can generate statistical distributions of the expected vertical evolutionary trajectories and quantify how far the observed trajectory diverges from them. If the difference surpasses a predefined threshold, the genes are marked as putatively HGT-derived. One method, HGTector, follows this approach and overcomes the limitations of standalone BLAST alignment.
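
The best-match heuristic described above can be sketched in a few lines. This is a minimal illustration, not a published pipeline: the hit tuples, taxon names, and the `close_taxa` set are hypothetical, and the input is assumed to be already parsed from BLAST output with the query organism's own hits removed.

```python
def flag_by_best_hit(hits, close_taxa):
    """Return genes whose highest-scoring hit is outside `close_taxa`.

    hits: iterable of (gene_id, subject_taxon, bit_score) tuples.
    close_taxa: taxa considered close relatives of the query organism.
    """
    best = {}  # gene_id -> (subject_taxon, bit_score) of the top hit so far
    for gene, taxon, score in hits:
        if gene not in best or score > best[gene][1]:
            best[gene] = (taxon, score)
    return sorted(gene for gene, (taxon, _) in best.items()
                  if taxon not in close_taxa)


# Hypothetical parsed hits for two genes of an Escherichia-like query genome.
example_hits = [
    ("geneA", "Escherichia", 950.0),
    ("geneA", "Salmonella", 900.0),
    ("geneB", "Bacillus", 400.0),
    ("geneB", "Escherichia", 120.0),
]
candidates = flag_by_best_hit(example_hits, close_taxa={"Escherichia", "Salmonella"})
print(candidates)
```

Here geneB would be flagged because its strongest match comes from a distant taxon. As the text notes, gene loss or database errors can produce the same pattern, so a flag like this is a starting point rather than evidence of transfer.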

HGTector works by dividing BLAST hits into self, close, and distal groups based on phylogenetically informed, user-defined categories. Associations between the distal group and members of the self group capture directional gene flow. For each gene, summing the normalized BLAST bit scores of the hits in each group yields three weights, one per group. Across all genes, these weights form three independent statistical populations that define a “fingerprint” of the input genomes. A cutoff is then used to divide the weight distribution of each group into typical and atypical portions. Genes are predicted as putatively horizontally acquired based on rules that take all three weight distributions into account. For instance, a gene is a likely HGT candidate if it has a low or zero close weight, which indicates absence in sister groups, while its distal weight is typical, meaning that its hits from distant organisms are not underrepresented. A gene is classified as atypical in the self weight distribution if it is detected only sporadically in the self group of organisms. These additional filters allow scientists to call genes whose evolutionary history deviates from the phylogeny expected under vertical inheritance [14].
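
The weighting scheme can be illustrated with a small sketch. It is written in the spirit of the description above and is not HGTector's actual code: the function names are hypothetical, bit scores are assumed to be normalized by each gene's self-alignment score, and the cutoffs are fixed by hand here, whereas the real tool derives them from the weight distributions. The self-group rule is left out for brevity.

```python
from collections import defaultdict


def group_weights(hits, self_scores, group_of):
    """Sum normalized bit scores per gene within the self, close, and distal groups.

    hits: iterable of (gene_id, subject_taxon, bit_score).
    self_scores: gene_id -> bit score of the gene aligned to itself (assumed normalizer).
    group_of: subject_taxon -> "self", "close", or "distal".
    """
    weights = defaultdict(lambda: {"self": 0.0, "close": 0.0, "distal": 0.0})
    for gene, taxon, bits in hits:
        weights[gene][group_of[taxon]] += bits / self_scores[gene]
    return dict(weights)


def putative_hgt(weights, close_cutoff, distal_cutoff):
    """Flag genes with an atypically low close weight and a typical distal weight."""
    return sorted(gene for gene, w in weights.items()
                  if w["close"] < close_cutoff and w["distal"] >= distal_cutoff)


# Hypothetical taxa, scores, and cutoffs.
group_of = {"strainB": "self", "sisterGenus": "close",
            "farPhylum1": "distal", "farPhylum2": "distal"}
self_scores = {"g1": 1000.0, "g2": 500.0}
hits = [
    ("g1", "strainB", 980.0), ("g1", "sisterGenus", 900.0), ("g1", "farPhylum1", 300.0),
    ("g2", "strainB", 490.0), ("g2", "farPhylum1", 400.0), ("g2", "farPhylum2", 350.0),
]
weights = group_weights(hits, self_scores, group_of)
flagged = putative_hgt(weights, close_cutoff=0.25, distal_cutoff=0.5)
print(flagged)
```

In this toy input, g2 has no hits in the close group but strong hits in the distal group, which is the pattern the rules above are designed to catch.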

Alignment-free techniques are a recent addition to the computational toolkit for detecting and inferring lateral transfer. These methods are sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of horizontal gene transfer, and robust against genome rearrangements. One of their key advantages is that they do not assume a preserved linear order of homology within the compared sequences, an assumption that sequence rearrangements and horizontal gene transfer events can violate. DIVE is an example of a compositional parametric method that falls into this category. It was developed by Dr. Julia Salzman’s group at Stanford University and was specifically designed to detect mobile genetic elements (MGEs) and diversity-generating mechanisms in microbial genomes. This reference-free approach works by directly analyzing sequencing reads, sidestepping the need for a reference genome or conventional assembly. Short DNA sequences of a specified length, referred to as k-mers, are used as anchors to identify surrounding sequences with high variability, a hallmark of MGE activity and potential HGT events. The k-mer length can be configured by the user to balance memory requirements against sequence specificity. By examining the diversity of sequences adjacent to these anchors, DIVE identifies regions of genetic mobility and variability, such as transposons and CRISPR arrays. This allows for the detection of both known MGEs and novel or poorly characterized ones. Like DIVE, other alignment-free algorithms have been applied to diverse areas of genomic research, including metagenomics, whole-proteome viral phylogenies, and the detection of genome mosaicism over short evolutionary distances. These applications highlight the versatility of alignment-free methods in addressing a range of biological questions.
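
The anchor idea can be made concrete with a toy example. The sketch below is a deliberate simplification and not DIVE's statistical procedure: it only counts how many distinct k-mers ("targets") follow each anchor k-mer across a set of reads. The function names, the reads, and the `min_distinct` threshold are all hypothetical.

```python
from collections import defaultdict


def anchor_target_diversity(reads, k):
    """Map each anchor k-mer to the number of distinct k-mers found right after it."""
    targets = defaultdict(set)
    for read in reads:
        # Each position needs room for an anchor plus the k-mer that follows it.
        for i in range(len(read) - 2 * k + 1):
            anchor = read[i:i + k]
            target = read[i + k:i + 2 * k]
            targets[anchor].add(target)
    return {anchor: len(seen) for anchor, seen in targets.items()}


def variable_anchors(reads, k, min_distinct):
    """Anchors followed by at least `min_distinct` different downstream k-mers."""
    diversity = anchor_target_diversity(reads, k)
    return sorted(a for a, n in diversity.items() if n >= min_distinct)


# Toy reads: "ACGT" is followed by three different sequences, "TTGA" by only one.
reads = ["ACGTAAAA", "ACGTCCCC", "ACGTGGGG", "TTGACATG", "TTGACATG"]
diversity = anchor_target_diversity(reads, k=4)
hotspots = variable_anchors(reads, k=4, min_distinct=3)
print(diversity, hotspots)
```

An anchor with many distinct neighbors, like "ACGT" here, is the kind of boundary one might expect next to a mobile element or a CRISPR array, although a real analysis needs a statistical test to separate true variability from sequencing error.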

Applications of Topological Data Analysis

Topological data analysis (TDA) has emerged as a powerful tool for inferring microbial genetic recombination and studying antibiotic resistance, given its ability to model evolutionary processes beyond strict tree-based objects. TDA is a branch of applied mathematics that uses concepts from algebraic topology to study the underlying structure of high-dimensional datasets, such as those generated by microbial genetic sequencing. Persistent homology, a key technique in TDA, analyzes the shape of data by tracking its topological features across multiple scales. In brief, it examines how clusters and voids in the data appear and disappear as a scale parameter is varied, in a procedure known as a “multi-scale filtration” [19].

For example, genetic information can be visualized as a cloud of data points in which each point is isolated at a fine scale. Gradually increasing the scale parameter merges points into larger clusters and forms holes between them. Voids that persist over a wide range of scales are considered significant and are thought to represent more fundamental characteristics of the underlying data. In contrast, transient features appear and disappear quickly as the scale changes and are often regarded as noise. Persistent and transient features are plotted together in so-called barcode diagrams, where each feature is represented as a line that starts at the scale where the feature appears and ends at the scale where it disappears [19].
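
To make the barcode idea concrete, the sketch below computes the simplest case only: dimension-0 features (clusters) of a small point cloud, using a union-find structure. Every point starts as its own cluster at scale 0, and a bar ends at the distance where its cluster merges into another. Holes and higher-dimensional voids require a full simplicial-complex library and are not covered here. The function name and the points are hypothetical.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return (birth, death) bars for connected components of a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process point pairs from the shortest distance to the longest.
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    bars = []
    for length, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b      # two clusters merge at this scale
            bars.append((0.0, length))   # one of them stops existing here
    if points:
        bars.append((0.0, math.inf))     # the final cluster never dies
    return bars


# Two tight groups of points separated by a large gap.
cloud = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1)]
bars = h0_barcode(cloud)
for birth, death in bars:
    print(f"[{birth}, {death})")
```

In this toy cloud, three bars end at scale 1 (points merging within each group) and one bar survives until scale 9, when the two groups finally join. The short bars correspond to the transient features described above, and the long one to a persistent feature: the data really does contain two clusters.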

In a paper titled “Topological Data Analysis Highlights Novel Geographical Signatures of the Human Gut Microbiome,” Lymberopoulos et al. used TDA coupled with two algorithms, Mapper and SAFE (Spatial Analysis of Functional Enrichment), to interrogate variability in the human gut microbiome across more than 4,400 samples from 12 countries [23]. The Mapper algorithm projected the data into a low-dimensional space and summarized it as a network, revealing clusters and connections that represent similarities in microbiome profiles. The SAFE algorithm then mapped variables onto this network to show the enrichment of specific taxa with respect to external variables including geographical location, sex, and age. This integrative framework revealed the underlying geometric structure of the dataset and identified non-linear relationships that conventional methods would have overlooked. The authors acknowledge that their study has limitations, including potential sampling biases and the inability to control for factors like diet. Despite these challenges, as more population-level studies are carried out, TDA is likely to become an increasingly valuable tool for finding associations between microbial composition and human health.
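
The general shape of the Mapper construction can be sketched as follows. This is a minimal, generic version and not the pipeline or parameters used in the paper: a filter function projects each point to a number, overlapping intervals cover the filter range, points in each interval are clustered, and clusters that share points are linked. The function names, the toy points, and all parameter values are hypothetical, and the SAFE enrichment step is not shown.

```python
import math
from itertools import combinations


def single_linkage(points, members, eps):
    """Group `members` (point indices) into clusters of points chained within `eps`."""
    unvisited = set(members)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_fn, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters, edges link clusters sharing points."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    nodes = []
    for j in range(n_intervals):
        # Widen each interval so that neighboring intervals overlap.
        start = lo + j * width - overlap * width
        end = lo + (j + 1) * width + overlap * width
        members = [i for i, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage(points, members, eps))
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges


# Toy data: five points along a line, filtered by their x coordinate.
samples = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
nodes, edges = mapper_graph(samples, filter_fn=lambda p: p[0],
                            n_intervals=2, overlap=0.25, eps=1.1)
print(nodes, edges)
```

With real microbiome profiles, the points would be samples in a high-dimensional abundance space and the filter would typically be a dimensionality-reduction coordinate, but the resulting object is the same kind of network onto which metadata such as country or age can be overlaid.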

Conclusion

Horizontal gene transfer in bacteria remains a critical aspect of microbial evolution, impacting fields such as medicine, agriculture, and biotechnology. Understanding its mechanisms, refining statistical detection methods, and employing advanced analytical tools like topological data analysis are important steps toward disentangling the role of HGT in bacterial adaptation, pathogenicity, and the spread of antibiotic resistance.

References

  1. Thomas CM, Nielsen KM. (2005). Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nature Reviews Microbiology, 3(9), 711–721.
  2. Griffith F. (1928). The significance of pneumococcal types. Journal of Hygiene, 27(2), 113–159.
  3. Dagan T, Martin W. (2007). Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proceedings of the National Academy of Sciences, 104(3), 870–875.
  4. Pevsner, J. (2005). Basic Local Alignment Search Tool (BLAST).
  5. Zielezinski, A., Vinga, S., Almeida, J.S., & Karłowski, W.M. (2017). Alignment-free sequence comparison: benefits, applications, and tools. Genome Biology, 18.
  6. Domazet-Lošo, M., & Domazet-Lošo, T. (2016). gmos: Rapid Detection of Genome Mosaicism over Short Evolutionary Distances. PLoS ONE, 11.
  7. Wu, G.A., Jun, S., Sims, G.E., & Kim, S. (2009). Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proceedings of the National Academy of Sciences, 106, 12826–12831.
  8. Yamashita, A., Sekizuka, T., & Kuroda, M. (2014). Characterization of Antimicrobial Resistance Dissemination across Plasmid Communities Classified by Network Analysis. Pathogens, 3, 356–376.
  9. Abante, J., Wang, P. L., & Salzman, J. (2023). DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery. Genome Biology, 24(1), 240. https://doi.org/10.1186/s13059-023-03038-0
  10. Bergman, J.M., Fineran, P.C., Petty, N.K., & Salmond, G.P. (2019). Transduction: The Transfer of Host DNA by Bacteriophages. Reference Module in Biomedical Sciences.
  11. Nema, V. (2019). The Role and Future Possibilities of Next-Generation Sequencing in Studying Microbial Diversity. Microbial Diversity in the Genomic Era.
  12. Meneses, B.D., & Délio, R. (2012). ADN — Recombinante: biotecnologia, ontologia e protologia.
  13. Yuan, L., Lu, H., Li, F., Nielsen, J., & Kerkhoven, E.J. (2023). HGTphyloDetect: facilitating the identification and phylogenetic analysis of horizontal gene transfer. Briefings in Bioinformatics, 24.
  14. Zhu, Q., Kosoy, M.Y., & Dittmar, K. (2014). HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Genomics, 15.
  15. Fitz-Gibbon, S.T., & House, C.H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Research, 27(21), 4218–4222.
  16. Juhas, M., van der Meer, J.R., Gaillard, M., Harding, R.M., Hood, D.W., & Crook, D.W. (2008). Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiology Reviews, 33, 376–393.
  17. Wiedenbeck, J.K., & Cohan, F. (2011). Origins of bacterial diversity through horizontal genetic transfer and adaptation to new ecological niches. FEMS Microbiology Reviews, 35(5), 957–976.
  18. Watson, B.N., Staals, R.H., & Fineran, P.C. (2018). CRISPR-Cas-Mediated Phage Resistance Enhances Horizontal Gene Transfer by Transduction. mBio, 9.
  19. Koutsovoulos, G.D., Noriot, S.G., Bailly-Bechet, M., Danchin, E.G., & Rancurel, C. (2022). AvP: A software package for automatic phylogenetic detection of candidate horizontal gene transfers. PLOS Computational Biology, 18.
  20. Bernard, G., Chan, C.X., & Ragan, M.A. (2016). Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Scientific Reports, 6.
  21. Carlsson, G., & Vejdemo-Johansson, M. (2021). Topological data analysis with applications. Cambridge University Press.
  22. Rabadán, R., & Blumberg, A. J. (2019). Topological data analysis for genomics and evolution: topology in biology. Cambridge University Press.
  23. Lymberopoulos, E., Gentili, G. I., Alomari, M., & Sharma, N. (2021). Topological data analysis highlights novel geographical signatures of the human gut microbiome. Frontiers in Artificial Intelligence, 4, 680564.
