Visual information retrieval from historical document images

Abstract: Information retrieval from documentary heritage is challenging because of the documents’ unique structures and levels of degradation. Historically, text characters in printed documents have been accompanied by typographical objects. Retrieving and tracing these visual typographical elements, which inform the content of historical manuscripts, can help us better understand our documentary cultural heritage. Extracting these visual objects helps us understand and convey more about the different practices of representation in historical documents and their effects on current publishing trends. Two important typographical objects related to the history of knowledge and information are footnotes and tables: the former are critical elements that demonstrate authority and link the manuscript to its sources, and the latter summarize information in a compact and organized manner essential to the growth of scientific knowledge. To the best of our knowledge, no existing work considers in depth the automated detection of these two typographical objects in large collections of historical documents, which would enable further historical study. This article addresses the problem of detecting the presence of these two visual elements in historical printed documents and establishes two frameworks. The footnote detection framework uses a set of layout-based methods to extract features related to font and appearance, and the table detection framework extracts spectral-based features from the images. These frameworks are tested on a large collection of 18th-century printed documents comprising more than 32 million images, and the results show their effectiveness and generalization power.

Keywords: typographical objects; visual information; printed documents; information retrieval; information

Journal Title: Journal of Cultural Heritage
Year Published: 2019
