Machine learning (ML) is viewed as a promising tool for predicting aerobic biodegradation, one of the most important elimination pathways of organic chemicals from the environment. However, available models are built on small datasets (<3200 records), make only binary classification predictions, evaluate only ready biodegradability, and do not incorporate experimental conditions (e.g., system setup and reaction time). This study addressed all these limitations by first compiling a large database of 12,750 records covering both ready and inherent biodegradation under different conditions, and then developing regression and classification models using different chemical representations and ML algorithms. The best regression model (R² = 0.54, root mean square error of 0.25) and the best classification model (prediction accuracy of 85.1%) achieved very good performance. Model interpretation indicated that the models correctly captured the effects of chemical substructures, following the order C═O > O═C-O > OH > CH3 > halogen > branching > N > 6-member ring. Accounting for chemical speciation based on pKa and α notations did not affect the regression model performance but significantly improved the classification model performance (accuracy increased to 87.6%). The models also showed large applicability domains and provided reasonable predictions for more than 98% of over 850,000 environmentally relevant chemicals in the Distributed Structure-Searchable Toxicity database. These robust, trustworthy models were finally made widely accessible through two free online predictors with graphical user interfaces.
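The abstract does not state how the α notation was computed. As a minimal sketch, assuming the standard textbook definition for monoprotic species (the neutral fraction follows from the Henderson-Hasselbalch relation), the speciation of a chemical at a given pH could be estimated as below. The function names and the example pKa and pH values are hypothetical illustrations, not taken from the study.

```python
def neutral_fraction_acid(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid HA present in the neutral form.

    Assumes the textbook relation alpha_HA = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def neutral_fraction_base(pka: float, ph: float) -> float:
    """Fraction of a monoprotic base B present in the neutral form.

    Here pka is the pKa of the conjugate acid BH+, so
    alpha_B = 1 / (1 + 10**(pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


if __name__ == "__main__":
    # Hypothetical example: an acid with pKa 4.76 at an assumed test pH of 7.0.
    alpha_neutral = neutral_fraction_acid(pka=4.76, ph=7.0)
    alpha_ionized = 1.0 - alpha_neutral
    print(f"neutral fraction: {alpha_neutral:.4f}")
    print(f"ionized fraction: {alpha_ionized:.4f}")
```

In a workflow of this kind, such fractions could serve as additional model inputs or as a rule for choosing which species (neutral or ionized) to featurize; which of these the authors did is not specified in the abstract.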
               