Library of Congress Subject Headings (LCSH) are widely used for indexing library records. We studied whether LCSH can be assigned automatically, both by training classifiers for terms used frequently in a large collection of abstracts of the literature on hand and by extracting headings from those abstracts. The resulting classifiers reach an acceptable level of precision but fall short on recall, partly because we could train classifiers for only a small number of LCSH. Extraction, i.e., the matching of headings in the text, produces better recall but extremely low precision. We found that combining both methods leads to a significant improvement in recall and a slight improvement in F1 score, with only a small decrease in precision.
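The measures named in this abstract can be made concrete with a small sketch. It computes precision, recall and F1 for the set of headings assigned to one record and shows one possible way to merge the output of a classifier with the output of an extractor. The function names and the toy labels are hypothetical, and the plain set union is only an assumed stand-in: the abstract does not say how the two methods were combined.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for one record's predicted headings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def combine_union(classifier_headings, extracted_headings):
    """Assumed combination rule: keep every heading proposed by either method."""
    return set(classifier_headings) | set(extracted_headings)


# Toy data, invented for illustration only.
gold = {"A", "B", "C"}
from_classifier = {"A"}            # precise, but misses headings
from_extraction = {"B", "D", "E"}  # finds more, but with false hits

for name, headings in [
    ("classifier", from_classifier),
    ("extraction", from_extraction),
    ("combined", combine_union(from_classifier, from_extraction)),
]:
    p, r, f = precision_recall_f1(headings, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

A plain union inherits all false hits of the extractor, so a combination that loses only a little precision presumably filters or weights the extracted headings in some way; that step is not shown here.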
The Dutch Parliament uses a specialized thesaurus for indexing archived documents. Good results in full-text retrieval and automatic classification turn out to depend on adding more synonyms to the existing thesaurus terms. In the present work we investigate ways of finding synonyms for terms of the parliament's thesaurus automatically. We propose to use distributional similarity (DS). In an experiment with pairs of synonyms and non-synonyms, we train and test a classifier using distributional similarity and string similarity. Using ten-fold cross-validation, we were able to classify 75% of a set of 6000 word pairs correctly.
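As a rough illustration of the two feature types mentioned in this abstract, the sketch below derives a distributional similarity and a string similarity for a word pair and produces ten-fold train/test splits. All specifics are assumptions made for the example: cosine similarity over co-occurrence counts, `difflib.SequenceMatcher` as the string measure, the window size, and every function name. The abstract does not state which measures or which classifier were used.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from math import sqrt


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def string_similarity(a, b):
    """Surface similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def pair_features(a, b, vectors):
    """Feature tuple (distributional similarity, string similarity) for a pair."""
    empty = Counter()
    return (cosine(vectors.get(a, empty), vectors.get(b, empty)),
            string_similarity(a, b))


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test
```

The feature tuples can be passed to any binary classifier, trained on the training indices of each fold and scored on the held-out indices; accuracy averaged over the ten folds corresponds to the kind of figure the abstract reports.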
We compare the effect of different segmentation strategies on passage retrieval for user-generated internet video. We consider the retrieval of passages for rather abstract and complex queries that go beyond finding a certain object or constellation of objects in the visual channel. Hence the retrieval methods have to rely heavily on the recognized speech. Passage retrieval has mainly been studied as a means of improving document retrieval and enabling question answering. In these domains the best results were obtained using passages defined by the paragraph structure of the source documents or using arbitrary overlapping passages. For the retrieval of relevant passages in a video, no author-defined paragraph structure is available. We compare retrieval results for five different types of segments: segments defined by shot boundaries, prosodic segments, fixed-length segments, a sliding window, and semantically coherent segments based on speech transcripts. We evaluated the methods on the corpus of the MediaEval 2011 Rich Speech Retrieval task. Our main conclusions are (1) that fixed-length and coherent segments are clearly superior to segments based on speaker turns or shot boundaries; (2) that the retrieval results depend strongly on the right choice of segment length; and (3) that results using the segmentation into semantically coherent parts depend much less on the segment length. In particular, the quality of fixed-length and sliding-window segmentation drops quickly as the segment length increases, while the quality of the semantically coherent segments is much more stable. Thus, if coherent segments are defined, longer segments can be used, and consequently fewer segments have to be considered at retrieval time.
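Two of the segmentation strategies compared above, fixed-length segments and a sliding window, are simple enough to sketch. The code below is a minimal illustration that measures length in transcript tokens, which is an assumption: the abstract does not say whether length was measured in words or in time. The function names and the handling of the final partial window are likewise choices made for the example.

```python
def fixed_length_segments(tokens, length):
    """Split a transcript into consecutive, non-overlapping segments."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]


def sliding_window_segments(tokens, length, step):
    """Split a transcript into overlapping windows of `length` tokens.

    Windows start every `step` tokens; a final window is added when
    needed so that the end of the transcript is always covered.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    if not tokens:
        return []
    last_start = max(len(tokens) - length, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [tokens[s:s + length] for s in starts]


transcript = "so today we will look at how the new camera handles low light".split()
print(fixed_length_segments(transcript, 5))
print(sliding_window_segments(transcript, 5, 2))
```

With a step smaller than the window length, the sliding window produces more segments than the fixed-length split for the same transcript, so more candidates have to be scored at retrieval time.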