Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus

A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. In the literature, such tasks have been studied and conducted many times, but relying only on journal information due to the massive volume of indexed publications. With paper-level matching, missing or erroneous items in one source can be completed from the other, and the overlap can be measured more reliably. In this context, we focus on measuring the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. Our focus is on detecting exact matches, that is, no false positives are tolerated at all. To this end, we follow a twofold matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied in order to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only the cited publications are retained. The overlapping WoS records are also broken down by Institute for Scientific Information subject categories (SCs). Of those, three large SCs whose overlap ratios are relatively low are chosen and examined in detail. Last but not least, it takes just about an hour to match 14.2 million versus 19.6 million publications from the publication year range 2004–2013 in a high-performance computing environment.
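
The twofold procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which LSH variant or which heuristics were used, so the code below assumes MinHash signatures over character 3-grams of titles with banding for the candidate step, and a hypothetical strict check (equal normalised title and equal year) for the verification step. The record fields, the constants and the toy records are all made up for illustration.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 32              # signature length (assumed value)
BANDS = 8                    # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS   # signature rows per band


def normalise(title):
    """Lower-case a title and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def shingles(title, k=3):
    """Set of character k-grams of the normalised title."""
    text = normalise(title)
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def bands(signature):
    """Yield (band index, band slice) keys used as LSH buckets."""
    for b in range(BANDS):
        yield b, signature[b * ROWS:(b + 1) * ROWS]


def candidate_pairs(wos, scopus):
    """Step 1: suggest WoS-Scopus pairs that share at least one LSH bucket."""
    buckets = defaultdict(list)
    for sid, rec in scopus.items():
        for key in bands(minhash(shingles(rec["title"]))):
            buckets[key].append(sid)
    pairs = set()
    for wid, rec in wos.items():
        for key in bands(minhash(shingles(rec["title"]))):
            for sid in buckets.get(key, ()):
                pairs.add((wid, sid))
    return pairs


def is_same_publication(a, b):
    """Step 2 (hypothetical heuristic): accept only exact agreement."""
    return a["year"] == b["year"] and normalise(a["title"]) == normalise(b["title"])


# Toy records, invented for this example.
wos = {"W1": {"title": "Graph Matching at Scale: A Toy Example", "year": 2010}}
scopus = {
    "S1": {"title": "graph matching at scale - a toy example", "year": 2010},
    "S2": {"title": "An unrelated study of soil chemistry", "year": 2010},
    "S3": {"title": "Graph Matching at Scale: A Toy Example", "year": 2012},
}

matches = sorted(
    (wid, sid)
    for wid, sid in candidate_pairs(wos, scopus)
    if is_same_publication(wos[wid], scopus[sid])
)
```

The candidate step is deliberately permissive, since near-identical titles are very likely to share a bucket, while the verification step is strict, which mirrors the stated goal of tolerating no false positives. In this toy data, S3 is suggested as a candidate because its title is identical, but it is rejected because the year differs.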

Keywords: locality sensitive; match; sensitive hashing; web science

Journal Title: Scientometrics
Year Published: 2017
