The patent literature is a potentially valuable source of bioactivity data. The SureChEMBL database (https://www.surechembl.org/) is a publicly available, large-scale resource containing compounds extracted daily from the full text, images and attachments of patent documents by an automated text- and image-mining pipeline. In this paper we describe a process to prioritise 3.7 million life-science-relevant patents obtained from SureChEMBL by how likely they were to contain bioactivity data for potent small molecules on less-studied targets, as defined by the classification developed by the Illuminating the Druggable Genome (IDG) project. The overall goal was to select a smaller number of patents that could be manually curated and incorporated into the ChEMBL database. We describe the approach taken, present the results obtained, and provide some illustrative examples.
               