Heterogeneity within and between data sets is one of the primary impediments to sound and efficient data analysis. This heterogeneity can arise from many sources: data collection practices can change over time, even within the same study; rigorous standards for data encoding may be missing, leading to inter-individual heterogeneity in data encoding; or the data may originally have been collected for a different purpose (e.g. EHR data). Additionally, narrative text, such as medical statements, is inherently unstructured and requires curation before analysis. Whatever the source of the heterogeneity, curation of the data is necessary for efficient, accurate and reproducible data analysis. Curation includes not only correcting errors and standardizing encoding, but also enriching the data by mapping it to relevant medical ontologies, such as SNOMED CT, or to other standardized terminologies, such as Research Resource Identifiers. To solve the most common issues in clinical and phenotypic data curation, we have developed the Accurate™ data curation and ontology mapping solution. It combines an intuitive web-based user interface for data cleaning with efficient solutions for semi-automated ontology mapping of both structured data and narrative text. For structured data, tokenized and stemmed data items are mapped against ontologies indexed in Elasticsearch. Term names, their synonyms and the local ontology structure are then used to query the target ontology, and a list of best matches is returned along with a quality score for each mapping. Ontology tagging of narrative text is based on a sentence-based deep learning approach, which analyzes sentences to classify identified text units and map them to ontology terms. In practice, this combines a bidirectional long short-term memory network with a conditional random field model in a named entity recognition system (bio-NER).
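The structured-data mapping step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the Elasticsearch index with a small in-memory dictionary, uses a deliberately crude plural-stripping stemmer, scores matches by Jaccard overlap of stemmed tokens, and ignores the local ontology structure. The identifiers `TOY:0001` to `TOY:0003`, the function names and the scoring rule are all assumptions made for illustration.

```python
import re

# Hypothetical toy ontology; IDs and contents are illustrative only.
TOY_ONTOLOGY = {
    "TOY:0001": {"name": "Myocardial infarction",
                 "synonyms": ["Heart attack", "Cardiac infarction"]},
    "TOY:0002": {"name": "Diabetes mellitus", "synonyms": ["Diabetes"]},
    "TOY:0003": {"name": "Hypertension", "synonyms": ["High blood pressure"]},
}


def crude_stem(token):
    """Strip a trailing plural 's'; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stems(text):
    """Lowercase, tokenize on alphanumeric runs, and stem each token."""
    return {crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())}


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_item(item, ontology=TOY_ONTOLOGY, top_n=3):
    """Return up to top_n (term_id, name, score) tuples, best match first."""
    item_stems = stems(item)
    scored = []
    for term_id, term in ontology.items():
        labels = [term["name"], *term["synonyms"]]
        # A term's score is its best-matching label (name or synonym).
        best = max(jaccard(item_stems, stems(label)) for label in labels)
        if best > 0:
            scored.append((term_id, term["name"], round(best, 3)))
    return sorted(scored, key=lambda row: row[2], reverse=True)[:top_n]


if __name__ == "__main__":
    for raw in ["heart attacks", "Diabetes, type 2", "unmapped value"]:
        print(raw, "->", map_item(raw))
```

In this sketch a data item such as "heart attacks" matches the toy term through its synonym rather than its name, which is the reason synonyms are queried alongside term names. A production system would presumably rely on the search engine's own analyzers and relevance scoring instead of the hand-rolled functions shown here.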
Preliminary benchmarking of the bio-NER system on the MIMIC III data set suggests good specificity and sensitivity for identification of biomedically relevant concepts. In summary, we here present an intuitive and highly efficient solution for curating clinical and phenotypic data, as well as enriching it using ontology mapping of both structured and narrative data.

Citation Format: Henrik Edgren, Beatriz Mano, Maria Laaksonen. Efficient curation and ontology mapping of clinical and phenotypic data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2276.
               