The fields of microbial ecology and environmental microbiology are generating large volumes of data, mainly nucleic acid sequence data, owing to the extensive use of amplicon sequencing and metagenomics and an increasing use of transcriptomics. To increase our understanding of microorganisms in terrestrial ecosystems, multiple concerted efforts to collect large numbers of samples for analyses of microbial communities were initiated more than 15 years ago (Fierer & Jackson, 2006; Lozupone & Knight, 2007) but have expanded rapidly in recent years, with The Earth Microbiome Project Consortium being one of the first major endeavours for bacteria across all biomes (Thompson et al., 2017) and the work by Tedersoo et al. (2014) for soil fungi. The majority of the investigations have a biogeography focus based on a single sampling occasion, and the word ‘global’ is frequently used in the titles of these soil microbial catalogues and surveys (Bahram et al., 2018; Delgado-Baquerizo et al., 2018; Gobbi et al., 2022). Similar efforts have been made for many other biomes. Although largely descriptive, they have contributed to a better understanding of microbial diversity and the distribution of microbial taxa and their functions at an unprecedented spatial scale. Further, correlative analyses have indicated direct or indirect drivers of the observed patterns as well as the role of microbial communities in ecosystem functioning (Bahram et al., 2018; Delgado-Baquerizo et al., 2020; Garland et al., 2021). The massive amount of complex data is not only an opportunity but also a major challenge when it comes to meaningful interpretation. The field of computational biology, at the intersection of computer science and biology, is rapidly expanding and developing new methods for this purpose.
Artificial intelligence (AI), including machine learning (ML) and to some extent also deep learning (DL) methods, is promising for dealing with big data in microbial ecology and environmental microbiology (Ghannam & Techtmann, 2021; McElhinney et al., 2022). ML approaches in particular are increasingly adopted by ecologists, and many of these methods will soon become routine tools for analyses of complex microbial omics data. They can be used to categorize and find patterns in uncategorized data as well as to analyse data that we know how to categorize. There are several advantages to using ML methods in microbiome studies: for example, they can deal with non-linear relationships, make better use of the full depth of high-dimensional data, and can be used to build predictive models based on environmental and community data. Predictive modelling is very attractive in microbial ecology. Among the ML methods, random forests have become frequently applied in microbiome studies in the last decade (Jones et al., 2014; Ryo & Rillig, 2017). They are predominantly used to identify the best predictors for a given response variable and have, for example, been used to rank the environmental variables determining the major microbial phyla in wetlands (Bahram et al., 2022) and the diversity of ammonia-oxidizing archaea across European soils (Saghaï
Received: 27 September 2022 Accepted: 28 September 2022
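To make the idea of ranking predictors with a random forest concrete, the following is a minimal sketch in plain Python, not the method of any of the cited studies. It assumes a deliberately simplified forest of depth-1 regression trees (stumps), each fitted to a bootstrap sample with a random subset of predictors, and scores each predictor by its accumulated reduction in squared error. The variable names (`pH`, `moisture`, `temperature`, `total_N`), the synthetic data, and the function names are all hypothetical and chosen only for illustration; real analyses would typically rely on an established implementation such as the R package randomForest or scikit-learn.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets, feature_ids):
    """Return (feature, gain) for the best single-threshold split among feature_ids."""
    parent = sse(targets)
    best_feature, best_gain = None, 0.0
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            gain = parent - sse(left) - sse(right)
            if gain > best_gain:
                best_feature, best_gain = f, gain
    return best_feature, best_gain


def rank_predictors(rows, targets, names, n_trees=100, seed=0):
    """Rank predictors by normalised error reduction over a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(rows), len(names)
    m = max(1, round(p ** 0.5))  # number of predictors tried per tree
    importance = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(p), m)             # random predictor subset
        f, gain = best_split([rows[i] for i in idx],
                             [targets[i] for i in idx], feats)
        if f is not None:
            importance[f] += gain
    total = sum(importance) or 1.0
    scored = [(names[j], importance[j] / total) for j in range(p)]
    return sorted(scored, key=lambda item: -item[1])


# Synthetic, purely illustrative data: the response depends strongly on pH,
# weakly on moisture, and not at all on temperature or total_N.
names = ["pH", "moisture", "temperature", "total_N"]
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in names] for _ in range(80)]
targets = [3.0 * r[0] + 1.5 * r[1] + data_rng.gauss(0.0, 0.1) for r in rows]

ranking = rank_predictors(rows, targets, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The ranking is relative: the scores sum to one and indicate which predictors the forest relied on most, not effect sizes or causal drivers. Full random forests grow deep trees and often use permutation-based importance instead, so this sketch should be read as an illustration of the principle only.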
               