Key Phrase Indexing With Controlled Vocabularies
Olena Medelyan is a grad student who has just started on a Google-funded PhD scholarship, looking at keyphrase extraction using lexical and linguistic techniques.
Keyphrases are widely used in information retrieval as a brief but precise summary of documents. They are usually selected by professional human indexers. The more consistent the indexers are with each other, the higher the retrieval efficiency.
- We describe an experiment where six professionals assigned keyphrases from a controlled vocabulary to the same documents, and evaluate their indexing consistency. Interesting patterns discovered in this experiment helped in developing an automatic approach for this task.
- The keyphrase extraction algorithm KEA++ extracts phrases from the documents and maps them onto index terms from a domain-specific thesaurus. A machine learning scheme determines the most significant phrases based on their statistical, syntactic and semantic properties. The evaluation reveals that KEA++ is almost as consistent with the indexers as they are with each other.
- It is important that a keyphrase set covers all main topics of a document.
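
To make the two ideas above concrete, here is a minimal sketch in Python. It is not the KEA++ implementation: the function names, the crude plural stripping, and the toy thesaurus are all assumptions made for illustration, and the machine learning step that ranks candidates is left out entirely. The consistency function uses Rolling's measure, 2C / (A + B), a common choice for inter-indexer consistency; the abstract does not say which measure the experiment used.

```python
import re

def normalize(phrase):
    """Hypothetical normalization: lowercase, crude plural stripping, word order ignored."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(sorted(stems))

def candidate_terms(text, thesaurus, max_len=3):
    """Map word n-grams in the text onto thesaurus terms and count the matches."""
    index = {normalize(term): term for term in thesaurus}
    words = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            key = normalize(" ".join(words[i:i + n]))
            if key in index:
                term = index[key]
                counts[term] = counts.get(term, 0) + 1
    return counts

def rolling_consistency(terms_a, terms_b):
    """Rolling's measure: 2C / (A + B), where C is the number of shared terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as fully consistent
    return 2 * len(a & b) / (len(a) + len(b))

# Toy data, invented for this example.
thesaurus = ["soil erosion", "fertilizers", "crop yield"]
text = "Erosion of soils reduces crop yields. Fertilizer use affects soil erosion."
print(candidate_terms(text, thesaurus))

indexer_1 = {"soil erosion", "crop yield"}
indexer_2 = {"soil erosion", "fertilizers"}
print(rolling_consistency(indexer_1, indexer_2))
```

In a full system, the counts returned by `candidate_terms` would be only one of several features passed to a learned ranking model, and the same consistency function could compare the algorithm's term set against each human indexer's set.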