LAUSR.org creates dashboard-style pages of related content for over 1.5 million academic articles. Sign Up to like articles & get recommendations!

The Automatic Detection of Dataset Names in Scientific Articles

Photo by sarahdorweiler from unsplash

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we… Click to show full abstract

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule based tagger performed remarkably well and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open access available) articles together with the annotation guidelines and all code used in the experiments, is available on GitHub.

Keywords: names scientific; automatic detection; detection dataset; scientific articles; dataset names

Journal Title: Data
Year Published: 2021

Link to full text (if available)


Share on Social Media:                               Sign Up to like & get
recommendations!

Related content

More Information              News              Social Media              Video              Recommended



                Click one of the above tabs to view related content.