We present a simple method to find topics in user reviews that accompany ratings for products or services. Standard topic analysis performs suboptimally on such data, since the word distributions in the documents are determined not only by the topics but also by the sentiment. We reduce the influence of the sentiment on the topic selection by adding two explicit topics, representing positive and negative sentiment. We evaluate the proposed method on a set of over 15,000 hospital reviews. We show that the proposed method, Latent Semantic Analysis with explicit word features, finds topics with a much smaller bias for sentiments than other similar methods.
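The abstract does not say how the two explicit sentiment topics enter the analysis. One possible reading, sketched here purely as an illustration and not as the authors' implementation, is to treat a positive and a negative word list as fixed directions in term space and to remove each review's component along them before any latent topics are extracted. The vocabulary, the word lists and all function names below are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Gram-Schmidt: turn the explicit topic vectors into an orthonormal basis."""
    basis: List[List[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis


def remove_explicit_topics(doc: Sequence[float],
                           basis: List[List[float]]) -> List[float]:
    """Subtract the part of a document vector explained by the explicit topics."""
    residual = list(doc)
    for b in basis:
        c = dot(residual, b)
        residual = [r - c * bi for r, bi in zip(residual, b)]
    return residual


# Hypothetical toy vocabulary and sentiment word lists (assumptions).
vocab = ["good", "friendly", "bad", "rude", "nurse", "parking"]
positive = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # explicit "positive" topic
negative = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # explicit "negative" topic
basis = orthonormalize([positive, negative])

review = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]    # "good friendly nurse"
residual = remove_explicit_topics(review, basis)
```

In this toy case the residual keeps the weight on "nurse" while the sentiment words are projected out. Under the same assumption, a standard decomposition such as a truncated SVD would then be applied to the residual vectors of all reviews.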
Image captions in scientific papers are usually complementary to the images. Consequently, the captions contain many terms that do not refer to concepts visible in the image. We conjecture that it is possible to distinguish between these two types of terms in an image caption by analysing the text only. To examine this, we evaluated different features. The dataset we used to compute tf.idf values, word embeddings and concreteness values contains over 700,000 scientific papers with over 4.6 million images. The evaluation was done with a manually annotated subset of 329 images. Additionally, we trained a support vector machine to predict whether or not a term is likely to be visible. We show that the concreteness of terms is a very important feature for identifying terms in captions and context that refer to concepts visible in images.
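Of the features named above, tf.idf is the easiest to illustrate. The sketch below uses one common variant (relative term frequency times the logarithm of the inverse document frequency); the abstract does not state which weighting the authors used, and the toy captions and the function name are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(captions: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized caption.

    weight = (count of term in caption / caption length)
             * log(number of captions / captions containing the term)
    """
    n_docs = len(captions)
    doc_freq: Counter = Counter()
    for tokens in captions:
        doc_freq.update(set(tokens))  # count each term once per caption

    weights: List[Dict[str, float]] = []
    for tokens in captions:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy captions, already tokenized and lower-cased.
captions = [
    ["tumor", "cell", "image"],
    ["cell", "growth", "curve"],
    ["image", "of", "cell"],
]
weights = tf_idf(captions)
```

A term that occurs in every caption, such as "cell" here, receives weight zero, while a term confined to one caption receives the highest weight. Such values could serve as one input column for a classifier next to embedding and concreteness features.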
We compare the effect of different text segmentation strategies on speech-based passage retrieval of video. Passage retrieval has mainly been studied to improve document retrieval and to enable question answering. In these domains, the best results were obtained using passages defined by the paragraph structure of the source documents or by using arbitrary overlapping passages. For the retrieval of relevant passages in a video using speech transcripts, no author-defined segmentation is available. We compare retrieval results from four different types of segments based on the speech channel of the video: fixed-length segments, a sliding window, semantically coherent segments and prosodic segments. We evaluated the methods on the corpus of the MediaEval 2011 Rich Speech Retrieval task. Our main conclusion is that the retrieval results depend strongly on the right choice of segment length. However, results using the segmentation into semantically coherent parts depend much less on the segment length. In particular, the quality of fixed-length and sliding-window segmentation drops quickly when the segment length increases, while the quality of the semantically coherent segments is much more stable. Thus, if coherent segments are defined, longer segments can be used and consequently fewer segments have to be considered at retrieval time.
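The two simplest strategies in this comparison, fixed-length segments and a sliding window, can be sketched as follows. This is a minimal illustration that assumes segment length is measured in transcript tokens; the abstract does not specify the unit (tokens or time), and the function names and parameters are hypothetical.

```python
from typing import List


def fixed_length_segments(tokens: List[str], length: int) -> List[List[str]]:
    """Split a transcript into consecutive, non-overlapping segments."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]


def sliding_window_segments(tokens: List[str], length: int,
                            step: int) -> List[List[str]]:
    """Split a transcript into overlapping segments that start every `step` tokens."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    segments: List[List[str]] = []
    start = 0
    while start < len(tokens):
        segments.append(tokens[start:start + length])
        if start + length >= len(tokens):
            break  # the last window already reaches the end of the transcript
        start += step
    return segments


# Hypothetical toy transcript.
transcript = "a b c d e f g".split()
fixed = fixed_length_segments(transcript, length=3)
sliding = sliding_window_segments(transcript, length=3, step=2)
```

With a step smaller than the length, neighbouring windows overlap, so a relevant passage that straddles a fixed-length boundary is still contained in at least one window. The price is a larger number of segments to index for the same segment length.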