Posted in Conferences, Companies on June 19, 2007

Customizing OCR Collections
Google Tech Talks
May 31, 2007


Efficient book scanning and increasingly sophisticated OCR have laid the foundation for collections that are far larger, but also much less structured, than the carefully curated collections of literary and historical materials on which demanding study has depended. This talk describes how high-performance services can be built on top of these large collections. The challenge is to provide mechanisms whereby particular communities can customize the content and the services that underlie very large collections, since no centralized entity can optimize its services for every community. We need mechanisms with which communities can extend OCR (e.g., by adding new language models and/or character sets), multilingual services (e.g., by adding language-specific modules such as morphological analyzers, and by adding or pointing to knowledge sources such as machine-readable dictionaries and parallel texts), and named entity identification and information extraction services (e.g., by adding domain-specific gazetteers, biographical encyclopedias, etc.). In such an environment, communities need not only APIs to search and visualization services but also reasonable methods with which to improve the performance of core services on their particular materials. This talk will review services on which demanding work within several areas of the humanities will depend.

