The Dig into Data project focuses on developing approaches to analysis and
understanding that are culture- and language-agnostic. In other words,
it pursues the idea that literary cultures evolve according to global mechanisms
that are independent of the particular language, culture, and time period in
which the evolution takes place. The project is agnostic in a second sense as
well: the methodologies it develops are expected to transfer
across languages and cultures in a straightforward way.
The project's database comprises several
collections of manuscripts, each restricted to a specific language/culture
and period of time. The period of each collection is selected so
that it highlights a transitional or evolutionary period of the associated
literary culture.
The project's objectives are pursued with methodologies
that exploit singular features of the manuscripts at various levels.
Because they rely on such features, the methodologies can be transferred among
collections with minimal re-modeling effort. Starting from the main collection
of the project, the German European Enlightenment Collection, footnote objects
were chosen as the singular events across the manuscripts. A complete set of
document image processing methods has been developed to address the challenges of
processing this collection. The methods range from preprocessing and denoising,
to layout analysis and correction, to typeface identification, to footnote
marker and body detection and extraction, and to retrieval of the titles of
the manuscripts cited in the footnotes. For many of these steps, such as
preprocessing and layout analysis, our in-house state-of-the-art methodologies
have been generalized and adapted to the new collection. Across the whole collection,
which consists of more than 1,300 manuscripts, more than 37,000 footnotes
were detected and extracted; these are now being used to build a high-level
understanding of the relations among manuscripts within the collection. Note
that the amount of data provided by the singular features,
i.e., the footnote objects, may be negligible compared with the total amount of
visual data in the collection that is relevant to understanding it.
As a proof of the approach's generalizability, the methodologies were
transferred in a straightforward manner from the German collection to the Chinese
collection, the Collection of Chinese Women’s Writing from the Ming-Qing
Dynasties, to detect and extract annotation markers and other singular features
present on the manuscript pages of that collection.
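To make the idea of building relations among manuscripts from extracted footnotes more concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual pipeline: the function name `build_citation_graph`, the input layout (pairs of citing manuscript id and cited title text), the fuzzy title matching via the standard library's `difflib`, and the `cutoff` value are all assumptions made for this example, and the titles are invented toy data.

```python
from collections import defaultdict
from difflib import get_close_matches


def build_citation_graph(footnotes, titles, cutoff=0.8):
    """Build a weighted citation graph from extracted footnote titles.

    footnotes: iterable of (citing_manuscript_id, cited_title_text) pairs
               (hypothetical output of a footnote-extraction step).
    titles:    dict mapping manuscript_id -> title for the collection.
    cutoff:    similarity threshold in [0, 1] for fuzzy title matching
               (an assumed value, chosen to tolerate small OCR errors).
    Returns a dict: citing_id -> {cited_id: number_of_citations}.
    """
    # Index the collection by lower-cased title for case-insensitive matching.
    by_title = {title.lower(): mid for mid, title in titles.items()}
    candidates = list(by_title)

    graph = defaultdict(lambda: defaultdict(int))
    for source_id, cited_text in footnotes:
        match = get_close_matches(cited_text.lower(), candidates, n=1, cutoff=cutoff)
        if not match:
            continue  # cited work is not (recognizably) in the collection
        target_id = by_title[match[0]]
        if target_id != source_id:  # ignore self-references
            graph[source_id][target_id] += 1
    return {src: dict(targets) for src, targets in graph.items()}


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    toy_titles = {
        "m1": "Treatise on Light",
        "m2": "History of the Winds",
        "m3": "Letters on Taste",
    }
    toy_footnotes = [
        ("m2", "Treatise on Lihgt"),   # simulated OCR error
        ("m2", "treatise on light"),
        ("m3", "History of the Winds"),
        ("m1", "Unrelated pamphlet"),  # not in the collection
    ]
    print(build_citation_graph(toy_footnotes, toy_titles))
```

The edge weights count how often one manuscript cites another, so even a comparatively small amount of footnote data yields a graph on which relations within a collection can be studied.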
New visual methodologies are being developed, based on novel
representations and modeling, to digest the whole set of document
images as one large, complex network of relations among rich
objects, each representing a fractional and incomplete but complementary part of
the image-content data encoded in the manuscripts of a collection. In particular,
spatial-patch graphs, error-bounded sparse representations, and multi-state (and
quantum-state) state machines, among other approaches, are on our road map
toward partially addressing the challenges of understanding the documented
human heritage.