Tuesday, March 24, 2015

What is the Dig into Data of Global Literature Project about?



The Dig into Data project focuses on developing approaches to analysis and understanding that are culture- and language-agnostic. In other words, it pursues the idea that literary cultures evolve according to global mechanisms that are independent of the particular language, culture, and time period in which the evolution takes place. The project is agnostic in a second sense as well: the methodologies it develops are expected to transfer across languages and cultures in a straightforward way.

The project's database comprises several collections of manuscripts, each restricted to a specific language/culture and period of time. The period covered by each collection is selected so that it highlights a transitional or evolutionary phase of the associated literary culture.

The objectives of the project are approached using methodologies that leverage singularities in the manuscripts at various levels. Because they rely on such singular features rather than on the full content, the methodologies should transfer between collections with minimal re-modeling effort. Starting from the main collection of the project, the German European Enlightenment Collection, footnote objects were chosen as the singular events across the manuscripts. A complete set of document image processing methods has been developed to address the challenges of processing this collection. The methods range from preprocessing and denoising, to layout analysis and correction, to typeface identification, to detection and extraction of footnote markers and bodies, and finally to retrieval of the titles of the manuscripts cited in the footnotes. For many of these steps, such as preprocessing and layout analysis, our in-house state-of-the-art methodologies have been generalized and modified to handle the new collection. Across the whole collection, which consists of more than 1,300 manuscripts, more than 37,000 footnotes were detected and extracted, and this set is now being used to build a high-level understanding of the relations among manuscripts within the collection. It is worth mentioning that, from the understanding perspective, the amount of data provided by the singular features, i.e., the footnote objects, is small and possibly negligible compared to the total amount of visual data that the collection carries. As a proof of the generalizability of the approach, the methodologies were transferred in a straightforward manner from the German collection to the Chinese collection, the Collection of Chinese Women’s Writing from the Ming-Qing Dynasties, to detect and extract annotation markers and other singular features present on the manuscript pages of that collection.
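To make the last step of that pipeline more concrete, here is a minimal sketch of how extracted footnote citations could be turned into a network of relations among manuscripts. It is only an illustration under stated assumptions, not the project's actual implementation: the manuscript identifiers, the cited titles, and the helper names `build_citation_graph` and `citation_counts` are all hypothetical, and title matching is reduced to simple lowercasing and whitespace trimming.

```python
from collections import defaultdict


def build_citation_graph(footnotes):
    """Map each citing manuscript to the set of titles its footnotes cite.

    `footnotes` is an iterable of (citing_id, cited_title) pairs, assumed to
    come from an earlier footnote-extraction step (hypothetical format).
    """
    graph = defaultdict(set)
    for citing_id, cited_title in footnotes:
        # Crude normalization so trivially different spellings match.
        graph[citing_id].add(cited_title.strip().lower())
    return graph


def citation_counts(graph):
    """Count how many distinct manuscripts cite each title."""
    counts = defaultdict(int)
    for cited_titles in graph.values():
        for title in cited_titles:
            counts[title] += 1
    return dict(counts)


# Hypothetical toy data standing in for extracted footnotes.
footnotes = [
    ("ms_001", "Work A"),
    ("ms_001", "Work B"),
    ("ms_002", "work a "),
    ("ms_003", "Work A"),
]

graph = build_citation_graph(footnotes)
counts = citation_counts(graph)
```

On real data the matching of cited titles would need to be far more robust than this, since titles recovered from historical footnotes are typically abbreviated and noisy.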

New visual methodologies are being developed, based on novel representations and modeling, in order to digest the whole set of document images as a big, complex network of relations among rich objects. Each of these objects represents a fractional, incomplete, but complementary part of the image-content data encoded in the manuscripts of a collection. In particular, spatial-patch graphs, error-bounded sparse representations, and multi-state (and quantum-state) state machines, among other approaches, are on our road map toward partially addressing the challenges of understanding the documented human heritage.
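As a rough illustration of what a graph over spatial patches might look like, the sketch below tiles a page image into fixed-size patches and links each patch to its right and lower neighbours. This is an assumption about one simple reading of the term, not a description of the representation the project is developing; the function name `patch_graph`, the page dimensions, and the patch size are hypothetical.

```python
def patch_graph(height, width, patch):
    """Build a grid graph over non-overlapping square patches of a page.

    Nodes are (row, col) patch indices; edges join horizontally and
    vertically adjacent patches. Partial patches at the borders are dropped.
    """
    rows, cols = height // patch, width // patch
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    edges = []
    for r, c in nodes:
        if c + 1 < cols:  # neighbour to the right
            edges.append(((r, c), (r, c + 1)))
        if r + 1 < rows:  # neighbour below
            edges.append(((r, c), (r + 1, c)))
    return nodes, edges


# Hypothetical page of 64 x 96 pixels cut into 32 x 32 patches.
nodes, edges = patch_graph(height=64, width=96, patch=32)
```

In a fuller version, each node would carry a descriptor of its patch and edges could also link similar patches across pages, which is closer to the idea of a network of complementary objects.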
