Scientific papers from all disciplines contain many abbreviations and acronyms. In many cases these acronyms are ambiguous. We present a method to choose the contextually correct definition of an acronym that does not require training for each acronym and thus can be applied to a large number of different acronyms with only a few instances each. We constructed a set of 19,954 examples of 4,365 ambiguous acronyms from image captions in scientific papers from different domains, along with their contextually correct definitions. We learn word embeddings for all words in the corpus and compare the averaged context vector of the words in the expansion of an acronym with the weighted average vector of the words in the context of the acronym. We show that this method clearly outperforms (classical) cosine similarity. Furthermore, we show that word embeddings learned from a 1 billion word corpus of scientific texts outperform word embeddings learned from much larger general corpora.
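A minimal sketch of this comparison step, assuming pre-trained embeddings are available as a plain word-to-vector dictionary and that the context words already carry weights (e.g. by distance to the acronym); all names are illustrative and not the implementation used in the paper:

```python
import numpy as np

def avg_vector(words, emb):
    """Unweighted average of the embeddings of the expansion's words."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else None

def weighted_context_vector(context_words, weights, emb):
    """Weighted average of the embeddings of the acronym's context words."""
    pairs = [(emb[w], wt) for w, wt in zip(context_words, weights) if w in emb]
    if not pairs:
        return None
    vecs, wts = zip(*pairs)
    return np.average(vecs, axis=0, weights=wts)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(context_words, weights, expansions, emb):
    """Pick the expansion whose averaged word vector is closest to the
    weighted average vector of the acronym's context words."""
    ctx = weighted_context_vector(context_words, weights, emb)
    scored = [(cosine(avg_vector(e.split(), emb), ctx), e)
              for e in expansions if avg_vector(e.split(), emb) is not None]
    return max(scored)[1]
```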
The reuse of scientific raw data is a key demand of Open Science. In the project NOA we foster the reuse of scientific images by collecting them and uploading them to Wikimedia Commons. In this paper we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or for uploading images to Wikimedia Commons. The annotation consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of open access images that were manually annotated.
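A rough sketch of the two phases, with a naive frequency-based keyword extractor and a hypothetical keyword-to-category lookup standing in for the actual NOA components:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "for", "is", "are", "with"}

def extract_keywords(text, top_n=5):
    """Phase 1: pick the most salient (here: most frequent non-stopword) terms."""
    tokens = [t.lower().strip(".,;:()") for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def propose_categories(text, keyword_to_categories, top_n=5):
    """Phase 2: map the extracted keywords to Wikipedia categories."""
    categories = []
    for kw in extract_keywords(text, top_n):
        categories.extend(keyword_to_categories.get(kw, []))
    return sorted(set(categories))

# Hypothetical usage:
# propose_categories("Fluorescence microscopy image of stained cells",
#                    {"microscopy": ["Category:Microscopy"],
#                     "cells": ["Category:Cell biology"]})
```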
In the present paper we sketch an automated procedure for comparing different versions of a contract. The contract texts used for this purpose are PDF files with differing structural composition that are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that support the manual comparison of changes across document versions. The main challenges are dealing with OCR errors and with differing layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well.
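The alignment and similarity-classification step could look roughly like the following sketch, which uses difflib and illustrative thresholds rather than the trained components described above:

```python
from difflib import SequenceMatcher

def similarity_class(a, b):
    """Bin a pair of aligned fragments into a coarse similarity class."""
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio > 0.95:
        return "identical"
    if ratio > 0.7:
        return "minor changes"
    if ratio > 0.3:
        return "major changes"
    return "unrelated"

def align_versions(fragments_v1, fragments_v2):
    """Greedy alignment: match each fragment of version 1 with its most
    similar counterpart in version 2 and label the pair."""
    pairs = []
    for frag in fragments_v1:
        best = max(fragments_v2,
                   key=lambda g: SequenceMatcher(None, frag, g).ratio())
        pairs.append((frag, best, similarity_class(frag, best)))
    return pairs
```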
We compare the effect of different segmentation strategies for passage retrieval of user-generated internet video. We consider retrieval of passages for rather abstract and complex queries that go beyond finding a certain object or constellation of objects in the visual channel. Hence the retrieval methods have to rely heavily on the recognized speech. Passage retrieval has mainly been studied to improve document retrieval and to enable question answering. In these domains the best results were obtained using passages defined by the paragraph structure of the source documents or by using arbitrary overlapping passages. For the retrieval of relevant passages in a video, no author-defined paragraph structure is available. We compare retrieval results from five different types of segments: segments defined by shot boundaries, prosodic segments, fixed-length segments, a sliding window, and semantically coherent segments based on speech transcripts. We evaluated the methods on the corpus of the MediaEval 2011 Rich Speech Retrieval task. Our main conclusions are (1) that fixed-length and coherent segments are clearly superior to segments based on speaker turns or shot boundaries; (2) that the retrieval results depend highly on the right choice of segment length; and (3) that results using the segmentation into semantically coherent parts depend much less on the segment length. In particular, the quality of fixed-length and sliding-window segmentation drops quickly as the segment length increases, while the quality of the semantically coherent segments is much more stable. Thus, if coherent segments are defined, longer segments can be used and consequently fewer segments have to be considered at retrieval time.
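Two of the simpler strategies, fixed-length segments and an overlapping sliding window over the transcript, can be sketched as follows; segment length and step size are free parameters and the values below are only examples:

```python
def fixed_length_segments(words, length=100):
    """Non-overlapping segments of a fixed number of transcript words."""
    return [words[i:i + length] for i in range(0, len(words), length)]

def sliding_window_segments(words, length=100, step=50):
    """Overlapping segments obtained by sliding a fixed-size window."""
    last_start = max(len(words) - length, 0)
    return [words[i:i + length] for i in range(0, last_start + 1, step)]
```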
A regional knowledge map is a tool recently demanded by actors at the institutional level to support regional policy and innovation in a territory. Moreover, knowledge maps facilitate the interaction between the actors of a territory and collective learning. This paper reports work in progress on a research project whose objective is to define a methodology for efficiently designing territorial knowledge maps by extracting information from large volumes of data contained in diverse sources of information related to a region. Knowledge maps facilitate the management of intellectual capital in organisations. This paper investigates the value of applying this tool to a territorial region to manage the structures, infrastructures and resources that enable regional innovation and regional development. Their design involves identifying the information sources required to determine which knowledge is located in a territory, which actors are involved in innovation, and what the context for developing this innovation is (structures, infrastructures, resources and social capital). This paper summarizes the theoretical background and framework for the design of a methodology for the construction of knowledge maps, and gives an overview of the main challenges in designing regional knowledge maps.
The concreteness of words has been studied extensively in the psycholinguistic literature. A number of datasets have been created with average values for the perceived concreteness of words. We show that we can train a regression model on these data, using word embeddings and morphological features, that predicts these concreteness values with high accuracy. We evaluate the model on 7 publicly available datasets. Predictions of concreteness values are reported in the literature for only a few small subsets of these datasets. Our results clearly outperform the reported results for these datasets.
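A minimal sketch of such a regression setup, assuming an embedding lookup is available and using a few hand-picked morphological features and a Ridge regressor as illustrative stand-ins for the actual model:

```python
import numpy as np
from sklearn.linear_model import Ridge

def morphological_features(word):
    """A few simple surface features; the real feature set is richer."""
    return np.array([
        len(word),                                    # word length
        word.count("-"),                              # hyphenation
        int(word.endswith(("ness", "ity", "ion"))),   # abstract-noun suffixes
    ], dtype=float)

def features(word, embeddings):
    return np.concatenate([embeddings[word], morphological_features(word)])

def train_concreteness_model(rated_words, embeddings):
    """rated_words: dict mapping a word to its average concreteness rating."""
    X = np.vstack([features(w, embeddings) for w in rated_words])
    y = np.array(list(rated_words.values()))
    return Ridge(alpha=1.0).fit(X, y)
```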
The dependency of word similarity in vector space models on the frequency of words has been noted in a few studies, but has received very little attention. We study the influence of word frequency on a set of 10,000 randomly selected word pairs for a number of different combinations of feature weighting schemes and similarity measures. We find that the similarity of word pairs for all methods, except for the one using singular value decomposition to reduce the dimensionality of the feature space, is determined to a large extent by the frequency of the words. In a binary classification task of pairs of synonyms and unrelated words we find that for all similarity measures the results can be improved when we correct for this frequency bias.
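One simple way to correct for the frequency bias in the synonym classification task is to give a classifier the word frequencies alongside the similarity score instead of thresholding the similarity alone; the sketch below illustrates this idea and is not the specific correction used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(w1, w2, similarity, freq):
    """Similarity score plus log frequencies of both words."""
    return [similarity(w1, w2), np.log(freq[w1]), np.log(freq[w2])]

def train_synonym_classifier(pairs, labels, similarity, freq):
    """pairs: list of (word1, word2); labels: 1 = synonyms, 0 = unrelated."""
    X = np.array([pair_features(w1, w2, similarity, freq) for w1, w2 in pairs])
    return LogisticRegression().fit(X, np.array(labels))
```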
NOA is a search engine for scientific images from open access publications, based on full-text indexing of all text referring to the images and on filtering by discipline and image type. Images will be annotated with Wikipedia categories for better discoverability and for uploading to WikiCommons. Currently we have indexed approximately 2.7 million images from over 710,000 scientific papers from all fields of science.
In distributional semantics, words are represented by aggregated context features. The similarity of words can be computed by comparing their feature vectors. Thus, we can predict whether two words are synonymous or similar with respect to some other semantic relation. We show on six different datasets of pairs of similar and non-similar words that a supervised learning algorithm applied to feature vectors representing pairs of words outperforms cosine similarity between vectors representing single words. We compare different methods to construct a feature vector representing a pair of words. We show that simple methods like pairwise addition or multiplication give better results than a recently proposed method that combines different types of features. The semantic relation we consider is the relatedness of terms in thesauri for intellectual document classification. Thus our findings can be directly applied to the maintenance and extension of such thesauri. To the best of our knowledge this relation has not been considered before in the field of distributional semantics.
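The pair-representation idea can be sketched as follows, with element-wise addition or multiplication of the two word vectors and a linear SVM standing in for whichever supervised learner is actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pair_vector(v1, v2, mode="mul"):
    """Combine two word vectors element-wise into one pair representation."""
    return v1 + v2 if mode == "add" else v1 * v2

def train_relation_classifier(word_pairs, labels, embeddings, mode="mul"):
    """word_pairs: list of (word1, word2); labels: 1 = related in the thesaurus."""
    X = np.vstack([pair_vector(embeddings[a], embeddings[b], mode)
                   for a, b in word_pairs])
    return LinearSVC().fit(X, np.array(labels))
```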
Integrating distributional and lexical information for semantic classification of words using MRMF (2016)
Semantic classification of words using distributional features is usually based on the semantic similarity of words. We show on two different datasets that a classifier trained directly on the distributional features gives better results. We use Support Vector Machines (SVM) and Multirelational Matrix Factorization (MRMF) to train classifiers. Both give similar results. However, MRMF, which had not previously been used for semantic classification with distributional features, can easily be extended with additional matrices containing further information from different sources on the same problem. We demonstrate the effectiveness of the novel approach by including information from WordNet. Thus we show that MRMF provides an interesting approach for building semantic classifiers that (1) gives better results than unsupervised approaches based on vector similarity, (2) gives results similar to other supervised methods, and (3) can naturally be extended with other sources of information in order to improve the results.
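The SVM side of the comparison is straightforward to sketch: train a classifier directly on the distributional feature vectors rather than thresholding vector similarity. The MRMF variant, which additionally factorizes relation matrices (e.g. one derived from WordNet), is not reproduced here:

```python
import numpy as np
from sklearn.svm import SVC

def train_semantic_classifier(words, labels, embeddings):
    """words: list of words; labels: their semantic classes."""
    X = np.vstack([embeddings[w] for w in words])
    return SVC(kernel="linear").fit(X, np.array(labels))

def predict_class(model, word, embeddings):
    """Classify a single word from its distributional feature vector."""
    return model.predict(embeddings[word].reshape(1, -1))[0]
```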