This tutorial covers the use of Human Language Technologies for the Semantic Web and Web Services. It includes sections on HLT and Text Mining for the Semantic Web, various forms of Information Extraction, Ontology Population and Semantic Metadata Creation, and Evaluation.
The tutorial begins with an introduction to Human Language Technology, covering its background and development, and then situates it within the context of text mining and other tasks that involve knowledge discovery from large collections of unstructured text, all of which are necessary for the development of the Semantic Web.
The second section concerns information extraction, a major component of text mining. Information extraction involves extracting facts and structured information from unstructured data. We contrast this with information retrieval, which concerns retrieving documents from large text collections, and with data mining, which concerns discovering patterns in structured data. We introduce GATE, an architecture for language engineering, and its resources for information extraction. We then expand the idea of traditional information extraction to focus on Semantic Web-enabled technology such as ontology population and semantic metadata creation, both of which rely on ontology-based information extraction. We look at some current state-of-the-art semantic annotation systems such as KIM, Magpie, MnM and OntoMat.
In the third section, we discuss evaluation methods for such technology, starting from the idea that traditional methods are insufficient when applied to Semantic Web technology, because the information involved is hierarchical (ontological) rather than flat. We also take a brief look at usability issues of annotation systems.
Finally, the tutorial gives demonstrations of two examples of HLT in use for the Semantic Web. First, we present RichNews, which aims to automate the annotation of news programs by segmenting, describing and classifying news broadcasts from transcripts. Second, we present work on ontology-based and mixed-initiative information extraction carried out in the context of SEKT.
Author: Diana Maynard, University of Sheffield