Recently, the digitization of biomedical processes has accelerated, in no small part due to the use of machine learning techniques, which require large amounts of labeled data. This chapter focuses on the prerequisite steps to training any algorithm: data collection and labeling. In particular, we address how data collection can be set up for scalability and security, avoiding costly and delaying bottlenecks. Unprecedented amounts of data are now available to companies and academics, but digital tools in the biomedical field face a problem of scale, since high-throughput workflows such as high-content imaging and sequencing can create several terabytes per day. Consequently, data transport, aggregation, and processing are challenging. A second challenge is maintaining data security. Biomedical data can be personally identifiable, may constitute important trade secrets, and can be expensive to produce. Furthermore, human biomedical data is often immutable, as is the case with genetic information. These factors make securing this type of data imperative and urgent. Here we address best practices for achieving security, with a focus on practicality and scalability. We also address the challenge of obtaining usable, rich metadata from the collected data, a major challenge in the biomedical field because of the use of fragmented and proprietary formats. We detail tools and strategies for extracting metadata from biomedical scientific file formats and show how this underutilized metadata plays a key role in creating labeled data for training neural networks.
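As a toy illustration of the kind of metadata extraction described above, the sketch below pulls key-value header fields out of an instrument export so they can be reused as labels. The file format, field names, and `extract_metadata` helper here are hypothetical examples, not tools or formats from the chapter; real biomedical formats (e.g. proprietary imaging or sequencing outputs) typically require dedicated parsers.

```python
import re

def extract_metadata(text):
    """Collect '# key: value' header lines from a hypothetical
    plate-reader export into a dict usable as training labels."""
    meta = {}
    for line in text.splitlines():
        m = re.match(r"#\s*(\w+)\s*:\s*(.+)", line)
        if m:
            meta[m.group(1)] = m.group(2).strip()
    return meta

# Hypothetical export: metadata header followed by measurements.
sample = """\
# instrument: HCS-9000
# plate_id: P-0042
# channels: DAPI,GFP
0.12\t0.98
"""

print(extract_metadata(sample))
```

In practice, the same pattern (parse once, attach the recovered fields to every derived record) is what turns otherwise unlabeled raw files into supervised training examples.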