Wednesday, August 26, 2015

Interesting readings

[BianneBernard2011] - Bianne-Bernard, A.-L.; Menasri, F.; Mohamad, R.-H.; Mokbel, C.; Kermorvant, C. & Likforman-Sulem, L. Dynamic and Contextual Information in HMM Modeling for Handwritten Word Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33, 2066-2080.

[Biederman1987] - Biederman, I. Recognition-by-components: A theory of human image understanding, Psychological Review, 1987, 94, 115-147.

[Boschetti2010] - Boschetti, F. A Corpus-based Approach to Philological Issues, University of Trento, 2010.

[Chen1998] - Chen, F. R. & Bloomberg, D. S. Summarization of Imaged Documents without OCR, Computer Vision and Image Understanding, 1998, 70, 307-320.

[Cheriet2013] - Cheriet, M.; Farrahi Moghaddam, R. & Hedjam, R. A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, 2013, 117, 269-280.

[Doetsch2012] - Doetsch, P.; Hamdani, M.; Ney, H.; Gimenez, A.; Andres-Ferrer, J. & Juan, A. Comparison of Bernoulli and Gaussian HMMs Using a Vertical Repositioning Technique for Off-Line Handwriting Recognition, ICFHR'12, 2012, 3-7.

[Farrahi2011] - Farrahi Moghaddam, R. & Cheriet, M. Beyond pixels and regions: A non-local patch means (NLPM) method for content-level restoration, enhancement, and reconstruction of degraded document images, Pattern Recognition, 2011, 44, 363-374.

[Farrahi2012] - Farrahi Moghaddam, R. & Cheriet, M. Real-Time Knowledge-Based Processing of Images: Application of the Online NLPM Method to Perceptual Visual Analysis, IEEE Transactions on Image Processing, 2012, 21, 3390-3404.

[Farrahi2012a] - Farrahi Moghaddam, R.; Farrahi Moghaddam, F. & Cheriet, M. A new framework based on signature patches, micro registration, and sparse representation for optical text recognition, ISSPA'12, 2012, 1259-1265.

[Farrahi2015] - Farrahi Moghaddam, R. & Cheriet, M. Modified Hausdorff Fractal Dimension (MHFD), arXiv preprint arXiv:1409.0876, May 2015, http://arxiv.org/abs/1409.0876.

[Farrahi2015a] - Farrahi Moghaddam, R. & Cheriet, M. A Multiple-Expert Binarization Framework for Multispectral Images, ICDAR'15, 2015.

[Kanai1998] - Kanai, J. & Baird, H. S. Special Issue on Document Image Understanding and Retrieval, Computer Vision and Image Understanding, 1998, 70, 285-286.

[Konya2012] - Konya, I. Adaptive Methods for Robust Document Image Understanding, University of Bonn, 2012.

[Konya2014] - Konya, I. & Eickeler, S. Logical structure recognition for heterogeneous periodical collections, DATeCH'14, 2014, 185-192.

[Lorang2015] - Lorang, E.; Soh, L.-K.; Datla, M. V. & Kulwicki, S. Developing an Image-Based Classifier for Detecting Poetic Content in Historic Newspaper Collections, D-Lib Magazine, 2015, 21.
 
[RodriguezSerrano2010] - Rodríguez-Serrano, J. A.; Perronnin, F.; Sánchez, G. & Lladós, J. Unsupervised writer adaptation of whole-word HMMs with application to word-spotting, Pattern Recognition Letters, Award winning papers from ICPR'10, 2010, 31, 742-749.

[Romanello2013] - Romanello, M. & Pasin, M. Citations and annotations in classics: old problems and new perspectives, DH-CASE'13, 2013, 1-8.

[Romanello2014] - Romanello, M. Mining Citations, Linking Texts, ISAW Papers 7 (Current Practice in Linked Open Data for the Ancient World), ISAW, ISAW and NYU, 2014.

[Saund2003] - Saund, E.; Fleet, D.; Mahoney, J. & Lamer, D. Rough and Degraded Document Interpretation by Perceptual Organization, in Doermann, D. (Ed.), Proceedings 5th Symposium on Document Image Understanding Technology (SDIUT), UMD, 2003.

[Skulimowski2014] - Skulimowski, M. On expanded citations, i-KNOW'14, ACM, 2014, 1-4.

[Vila2013] - Vila, K.; Fernández, A.; Gómez, J. M.; Ferrández, A. & Díaz, J. Noise-tolerance feasibility for restricted-domain Information Retrieval systems, Data & Knowledge Engineering, 2013, 86, 276-294.

[Vinciarelli2004] - Vinciarelli, A.; Bengio, S. & Bunke, H. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26, 709-720.

[Yu2013] - Yu, L.; Schwier, J.; Craven, R.; Brooks, R. & Griffin, C. Inferring Statistically Significant Hidden Markov Models, IEEE Transactions on Knowledge and Data Engineering, 2013, 25, 1548-1558.

[Zhang2014] - Zhang, M.; Farrahi Moghaddam, R. & Cheriet, M. Degraded Document Images Enhancement and Reconstruction Based on Non-local Sparse Representation, RL'14 in ECML/PKDD'14, 2014.

Wednesday, April 22, 2015

White Paper: The State of the Text Analytics Industry - 2015

The State of the Text Analytics Industry - 2015 (White Paper)

Authors: Brian Parke and Geoff Whiting

Introduction:

Text analytics has stepped off the sidelines and is playing a major role in applications all across the size scale. As it reaches a pervasive deployment stage, text analytics continues to bring significant wins to the companies who employ it. From sentiment analysis of real‐time social posts to long‐term trend spotting via deep data mining, practitioners and deployments are now comfortable enough with the data to specialize for each business case. ...

http://incitemc.com/text/docs/the_state_of_the_text_analytics_industry_2015_white_paper.pdf

Thursday, April 2, 2015

Interesting reading: Analysis of Pre- and Post-Monsoon Suspended Sediments in the Gulf of Kachchh, Gujarat, India Using Remote Sensing

Interesting reading:

Analysis of Pre- and Post-Monsoon Suspended Sediments in the Gulf of Kachchh, India Using Remote Sensing

Author: Mukesh Gupta

Arxiv Preprint (fulltext): http://arxiv.org/abs/1503.08369

Abstract

A comprehensive study of satellite-derived suspended sediment concentration (SSC) during pre- and post-monsoon has been conducted with full-month cycles of tidal responses to study the suspended sediment dynamics in the Gulf of Kachchh. Tidal data were interpreted in conjunction with the OCEANSAT-1 ocean color monitor (OCM)-derived SSC for pre- and post-monsoon. The analysis of the data shows that the Gulf is predominantly affected by the tidal changes. The average SSC during pre-monsoon were 30.8 mg/l (high tide) and 24.1 mg/l (low tide); and during post-monsoon 19.7 mg/l (high tide) and 21.8 mg/l (low tide). The only little monsoonal influence is seen when Indus River discharges sediments during pre-monsoon due to increased sediment flux from its origin, Himalayas in spring (February–April) as compared to less sediment discharge observed during winter (November–December). The pre-monsoon SSC images show overall high suspended sediments whereas post-monsoon SSC images show comparatively low SSC. The use of enhanced resolution ocean color satellite data (<360 m spatial resolution) for deriving higher SSC (>40 mg/l) for moderate/normal/high monsoon years under similar tidal conditions, and for quantifying sediment dispersal and dynamics and its validation is suggested as a future avenue of research.

Conclusions

In this paper, a comprehensive study of satellite-derived SSC during pre- and post-monsoon was conducted with full-month cycles of tidal responses to suspended sediment dynamics in the Gulf of Kachchh. The analysis is based on 10 (pre- and post-monsoon) SSC images used for sediment extent and SSC computations; and 31 SSC images used for sediment dispersal studies. Tidal data was interpreted in conjunction with the OCM-derived SSC for pre-monsoon (February, March, and April) and post-monsoon (November and December) periods. The area of extent of suspended sediments was derived for the same period using the SSC images. The hypothesis that the SSC increases after the monsoon in the Gulf is found incorrect. In fact, the SSC reduces after the monsoon, and also the effect of monsoon on Gulf sediment dynamics appears non-significant. The following conclusions are drawn from the study: towards objective-1, Gulf of Kachchh undergoes tremendous tidal influences in addition to external/other factors such as winds. The pre- and post-monsoon analyses of the data showed that the Gulf seemed to be more affected by the tidal changes than the monsoonal changes in 2004. We utilize the unique opportunity that arose due to the absence of normal monsoon in 2004 to study the role of external factors other than monsoon contributing to the suspended sediments into the Gulf. This led to a finding that the SSC in the Gulf increases during pre-monsoon and decreases during post-monsoon, contrary to what is expected in a normal monsoon. However, the SSC is expected to remain consistent throughout the year in the absence of normal monsoon, pointing toward the re-suspension of sediments due to tidal and wind forcing. The only little monsoonal influence is seen when Indus River discharges sediments during pre-monsoon due to increased sediment flux from its origin, Himalayas in spring (February–April) as compared to less sediment discharge observed during winter (November–December).
The average SSC during pre-monsoon were 30.8 mg/l (high tide) and 24.1 mg/l (low tide); and during post-monsoon 19.7 mg/l (high tide) and 21.8 mg/l (low tide). The highest surface extents of suspended sediments during pre-monsoon were 21,206.1 (low tide) and 10,454.7 km² (high tide); and during post-monsoon were 13,201.0 (low tide) and 18,891.5 km² (high tide). Towards objective-2, it is demonstrated that sequential OCM images can be helpful in delineating suspended sediment plumes. The behavior of these plumes and sediment extent depends on myriads of external factors. The inconsistent and abrupt SSC values and surface extents, observed from OCM images, result from the combined influence of the deficient monsoon in 2004, prevailing surface winds, and re-suspension of bottom sediments due to tidal currents in the Gulf. In situ measurements of SSC and its surface extents can be a future validation exercise. The pre-monsoon SSC images show overall high suspended sediments whereas post-monsoon SSC images show comparatively low SSC. The entire suspended sediment dynamics occurs within the 50 m depth contour. The re-suspension of sediments due to wind forcing within the Gulf affects the regular behavior of the sediment dispersal patterns observed on the satellite images. It is hoped that similar studies for multiple and consecutive years, with moderate or 'normal' monsoon under similar tidal conditions, will affirm and/or generalize the findings of this paper. The use of enhanced resolution ocean color satellite data (<360 m spatial resolution) for deriving higher SSC (>40.0 mg/l) and to study the sediment dispersal patterns and dynamics and its validation is suggested as a future avenue of research.
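
To make the size of the seasonal drop concrete, here is a minimal Python sketch that uses only the four average SSC values quoted above. The unweighted mean over high and low tide is my own simplification for illustration, not a statistic reported by the paper, and the function names are hypothetical.

```python
# Average SSC values (mg/l) exactly as quoted in the abstract and conclusions.
ssc = {
    "pre-monsoon": {"high tide": 30.8, "low tide": 24.1},
    "post-monsoon": {"high tide": 19.7, "low tide": 21.8},
}


def seasonal_mean(season):
    """Unweighted mean over the tide states (an assumption, not the paper's method)."""
    values = list(ssc[season].values())
    return sum(values) / len(values)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


pre = seasonal_mean("pre-monsoon")
post = seasonal_mean("post-monsoon")
change = percent_change(pre, post)

print(f"pre-monsoon mean:  {pre:.2f} mg/l")   # 27.45
print(f"post-monsoon mean: {post:.2f} mg/l")  # 20.75
print(f"relative change:   {change:.1f} %")   # -24.4
```

Under that simplification, the post-monsoon mean is roughly a quarter lower than the pre-monsoon mean, which is consistent with the paper's statement that the SSC reduces after the monsoon.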

Tuesday, March 24, 2015

Literal Culture World: Beyond the Culture-Agnostic Evolution of Literature



As mentioned in previous posts, the Dig into Data Project (DiDP) is about modeling and understanding the evolution of literal cultures across apparently different cultures. Such modeling not only allows us to spot and highlight the key literal works and writers of a specific culture in a specific period of its evolution, it also provides a means to show that all literal cultures are to some degree “the same”, at least from an evolution-of-literature (enlightenment) perspective. We frame this similarity as a culture-agnostic literature evolution mechanism that can be projected onto all cultures over their enlightenment periods.

The outcome of the DiDP would be a great achievement because it provides a new angle on understanding literal cultures, especially by hinting that all cultures are in essence the same. This would lead to implicit but significant conclusions that could have applications beyond the DiDP.

Despite these applications, the goal of the DiDP would remain highly scientific, because in the real world no culture evolves in isolation. Although the level and speed of communication among cultures was much lower in the past than it is today, a single spark initiated by the translation of an influential manuscript from one culture into another that was ready and eager for it would be sufficient to set off a significant evolution in the receiving literal culture. Such an ignition could be seen as a big bang effect. Observing such interactions at the level of cultures is critical to understanding how to accelerate literal evolution and co-existence across the globe. The significant challenges would be aligning the evolution periods of different cultures and modeling interactions across literal domains.

The DiDP database is rich in terms of cultures and literal domains. However, the domains, the periods of time, and the geographical locations are so scattered that the chance of capturing a big bang event is very small. Enlarging the scope of the data to cover all literal data across the globe would be interesting and an ultimate goal. However, considering the resources available to the project, this seems infeasible. A good test bed for evaluating the concept of Inter-Culture Evolution could instead be a small world of a finite number of literal cultures that have interacted intensely through various exchanges and migration, and that have shared the same greater geographical area for a considerable period of time. If the period covered by the available data of such a small world is long enough to capture a few inter-cultural big bang events, we can spot and highlight those inter-cultural evolutions using a moderate amount of research resources. Those highly intense inter-cultural interactions could then be studied and analyzed in depth and in detail in order to develop robust models for inter-cultural exchanges and phenomena. Such models can be leveraged at various levels toward a sustainable future of co-existing but highly evolving cultures across the globe.

Interestingly, in another project in which we have been involved, i.e., the Indian Ocean World Project (IOWP), a very clear example of a small world has been studied for a long time. Several cultures around the Indian Ocean, such as Arabic, African, Chinese, Indian, Japanese, and Persian, have shared the same small world of ecology, economy, and people. This Indian Ocean World (IOW) has been highly dependent on significant climatic phenomena, such as the tropical monsoon climate, and therefore some sort of ‘synchronization’ could be expected among its cultures.

Thanks to the IOWP, a considerable amount of data, not limited to literature, has been collected and is hosted by the organizations participating in the project, particularly McGill University. This large body of data could be combined with the robust methodologies developed in the DiDP to open new horizons of understanding and insight into the cultural evolution of the region, which could then be leveraged toward homogeneous development across the IOW and across the globe, despite the continuously increasing scarcity of resources such as water.

We use the term Literal Culture World to name this combination of the DiDP outcomes and the IOWP data. Literal Culture World could be seen as a step beyond both the Culture-Agnostic Evolution of Literature and the Global Economy of the IOW. Although we will start with the literal data, the combination will not be limited to literature.


The Indian Ocean World Project

The Indian Ocean World Centre (IOWC) is a research initiative and resource base at McGill University. It has been established to promote the study of the history, economy and cultures of the lands and peoples of the Indian Ocean world (IOW). The IOW ranges from China to Southeast and South Asia, the Middle East and Africa.

For the official web site, please visit:

http://indianoceanworldcentre.com/


What is the Dig into Data of Global Literature Project about?



The Dig into Data project focuses on developing approaches to analysis and understanding that are culture/language agnostic. In other words, it pursues the idea that literal cultures evolve according to global mechanisms that are independent of the actual language, culture, and time period of the evolution. There is another agnostic facet to this project, in the sense that the developed methodologies are also expected to be transferable across languages and cultures in a straightforward way.

The database of the project comprises several collections of manuscripts, each restricted to a specific language/culture and period of time. The period of each collection is selected so that it highlights a transitional/evolutionary period of the associated literal culture.

The objectives of the project are approached using methodologies that leverage the singularities in the manuscripts at various levels. In this way, the developed methodologies can be transferred among collections with minimal re-modeling effort. Starting from the main collection of the project, the German European Enlightenment Collection, footnote objects were chosen as the singular events across the manuscripts. A complete set of document image processing methods has been developed to address the challenges of processing this collection. The methods range from preprocessing and denoising, to layout analysis and correction, to typeface identification, to footnote marker and body detection and extraction, and to retrieval of the titles of the manuscripts cited in the footnotes. For many of these steps, such as preprocessing and layout analysis, our in-house state-of-the-art methodologies have been generalized and modified to address the new collection. On the whole collection, consisting of more than 1,300 manuscripts, a set of more than 37,000 footnotes was detected and extracted, which is now being used to build a high-level understanding of the relations among manuscripts within the collection. It is worth mentioning that the amount of data provided by the singular features, i.e., the footnote objects, could be negligible compared to the total amount of visual data that the collection carries from the understanding perspective. As a proof of the generalizability of the approach, the methodologies were transferred in a straightforward manner from the German collection to the Chinese collection, the Collection of Chinese Women’s Writing from the Ming-Qing Dynasties, to detect and extract annotation markers and other singular features present on the manuscript pages of that collection.
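
As an illustration of how extracted footnotes could feed a high-level view of relations among manuscripts, here is a minimal Python sketch. It assumes that the earlier pipeline stages have already produced pairs of (citing manuscript, cited title); the identifiers, titles, and function names below are hypothetical toy values, not project data or the project's actual implementation.

```python
from collections import defaultdict

# Hypothetical toy input: (citing manuscript id, title recovered from a footnote).
footnote_citations = [
    ("ms_001", "Title A"),
    ("ms_001", "Title B"),
    ("ms_002", "Title A"),
    ("ms_003", "Title A"),
    ("ms_003", "Title C"),
]


def build_citation_graph(citations):
    """Map each citing manuscript to the set of titles it cites."""
    graph = defaultdict(set)
    for citing, cited in citations:
        graph[citing].add(cited)
    return graph


def citation_counts(graph):
    """Count how many distinct manuscripts cite each title."""
    counts = defaultdict(int)
    for cited_titles in graph.values():
        for title in cited_titles:
            counts[title] += 1
    return dict(counts)


def shared_citations(graph, ms_a, ms_b):
    """Titles cited by both manuscripts: a simple relatedness signal."""
    return graph[ms_a] & graph[ms_b]


graph = build_citation_graph(footnote_citations)
counts = citation_counts(graph)
most_cited = max(counts, key=counts.get)

print(most_cited)                                   # Title A
print(shared_citations(graph, "ms_001", "ms_002"))  # {'Title A'}
```

In practice the titles recovered from footnotes would first need normalization and matching against the collection's catalogue, which this sketch leaves out.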

New visual methodologies are being developed based on novel representations and modeling in order to digest the whole set of document images in the form of a big, complex network of relations among the rich objects representing fractional, incomplete but complementary parts of the image-content data encoded in the manuscripts of a collection. In particular, spatial-patch graphs, error-bounded sparse representations, and multi-state (and quantum-state) state machines, among other approaches, are on our road map toward partially addressing the challenges of understanding the documented human heritage.

Introduction

This blog is for my ideas on the "Global Currents: Cultures of Literary Networks, 1050-1900" project, also known as the "Digging into Data: Global Currents" project. The project is led by Prof Andrew Piper as the co-Lead PI along with Prof Mohamed Cheriet, Prof Elaine Treharne, and Prof Lambert Schomaker. This project is a Digging into Data Research Project with McGill University, ETS, Stanford University, and the University of Groningen. For the official web pages of the project, please visit:

http://globalcurrents.ca/2014/04/19/a-brief-introduction-to-the-global-currents-project/

http://txtlab.org/?cat=43

http://txtlab.org/?p=234

http://www.mcgill.ca/newsroom/channels/news/taking-big-data-challenge-232451