In the present paper we sketch an automated procedure to compare different versions of a contract. The contract texts used for this purpose are PDF files with differing structural composition, which are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into similarity classes that support the manual comparison of changes between document versions. The main challenges are dealing with OCR errors and with differing layouts of identical or similar texts. We demonstrate the procedure on freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well.
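The alignment-and-bucketing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity measure (character-level ratio) and the class thresholds are assumptions chosen for the example.

```python
import difflib

# Hypothetical sketch: bucket each aligned pair of text fragments from two
# contract versions into a coarse similarity class. Thresholds are
# illustrative, not those used in the paper.
def similarity_class(a: str, b: str) -> str:
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    if ratio == 1.0:
        return "identical"
    if ratio >= 0.8:
        return "similar"   # e.g. OCR errors or minor edits
    return "changed"

old = ["The tenant pays 500 EUR monthly.", "Notice period is three months."]
new = ["The tenant pays 550 EUR monthly.", "Notice period is three months."]

for a, b in zip(old, new):
    print(similarity_class(a, b))
```

In a real pipeline the fragments would come from the XML conversion of the PDFs, and the alignment itself would have to tolerate reordered or split text boxes rather than assuming a one-to-one pairing.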
Concreteness of words has been studied extensively in the psycholinguistic literature. A number of datasets have been created with average values for the perceived concreteness of words. We show that a regression model trained on these data, using word embeddings and morphological features, can predict these concreteness values with high accuracy. We evaluate the model on 7 publicly available datasets. Predictions of concreteness values are reported in the literature only for a few small subsets of these datasets; our results clearly outperform the reported results for these datasets.
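The core idea, regressing a concreteness rating onto embedding features, can be sketched with a closed-form ridge regression. The data below is randomly generated stand-in material, and ridge is only one plausible choice of regressor, not necessarily the paper's model.

```python
import numpy as np

# Illustrative sketch: fit ridge regression mapping word-embedding
# features to concreteness ratings. Embeddings and ratings are synthetic.
rng = np.random.default_rng(0)
n_words, dim = 200, 50
X = rng.normal(size=(n_words, dim))              # stand-in word embeddings
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n_words)  # stand-in concreteness ratings

lam = 1.0  # ridge penalty
# Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)

pred = X @ w
r = float(np.corrcoef(pred, y)[0, 1])
print(round(r, 3))
```

A real setup would append morphological indicator features (e.g. suffix flags) to the embedding vector and evaluate by correlation with held-out human ratings rather than on the training words.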
For the analysis of contract texts, validated model texts, such as model clauses, can be used to identify the contract clauses that were used. This paper investigates how the similarity between titles of model clauses and headings extracted from contracts can be computed, and which similarity measure is most suitable for this task. For the calculation of similarities between title pairs we tested various variants of string similarity and token-based similarity. We also compare two additional semantic similarity measures based on word embeddings, using both pre-trained embeddings and word embeddings trained on contract texts. The identification of the model clause title can serve as a starting point for mapping clauses found in contracts to verified clauses.
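The two non-embedding measure families mentioned above can be illustrated in a few lines; the example titles are invented and the exact similarity variants tested in the paper may differ.

```python
import difflib

# Sketch of two measure families for title matching:
# character-level string similarity and token-based Jaccard overlap.
def string_sim(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

title = "Haftung des Auftragnehmers"
heading = "Haftung des Auftragnehmers und Gewährleistung"
print(round(string_sim(title, heading), 2), round(token_jaccard(title, heading), 2))
```

The embedding-based measures would replace these with a cosine similarity between (averaged) title vectors, computed once with pre-trained vectors and once with vectors trained on contract text.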
Scientific papers from all disciplines contain many abbreviations and acronyms. In many cases these acronyms are ambiguous. We present a method to choose the contextually correct definition of an acronym that does not require training for each acronym and can thus be applied to a large number of different acronyms with only a few instances each. We constructed a set of 19,954 examples of 4,365 ambiguous acronyms from image captions in scientific papers, along with their contextually correct definitions, from different domains. We learn word embeddings for all words in the corpus and compare the averaged context vector of the words in the expansion of an acronym with the weighted average vector of the words in the context of the acronym. We show that this method clearly outperforms (classical) cosine similarity. Furthermore, we show that word embeddings learned from a 1 billion word corpus of scientific texts outperform word embeddings learned from much larger general corpora.
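The vector-averaging comparison can be sketched as follows. The tiny 3-dimensional "embeddings" are invented for illustration, and the sketch uses a plain (unweighted) context average where the paper weights the context words.

```python
import numpy as np

# Toy sketch: pick the acronym expansion whose averaged word vector is
# closest (by cosine) to the averaged context vector of the occurrence.
emb = {  # invented 3-d embeddings
    "magnetic":  np.array([1.0, 0.1, 0.0]),
    "resonance": np.array([0.9, 0.2, 0.1]),
    "imaging":   np.array([0.8, 0.0, 0.2]),
    "machine":   np.array([0.0, 1.0, 0.1]),
    "readable":  np.array([0.1, 0.9, 0.0]),
    "scan":      np.array([0.9, 0.1, 0.1]),
    "patient":   np.array([0.7, 0.2, 0.3]),
}

def avg_vec(words):
    return np.mean([emb[w] for w in words if w in emb], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def disambiguate(context, expansions):
    ctx = avg_vec(context)
    return max(expansions, key=lambda e: cosine(avg_vec(e.split()), ctx))

best = disambiguate(["scan", "patient"],
                    ["magnetic resonance imaging", "machine readable imaging"])
print(best)
```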
This paper summarizes the results of a comprehensive statistical analysis of a corpus of open access articles and the figures they contain. It gives insight into quantitative relationships between illustrations or types of illustrations, caption lengths, subjects, publishers, author affiliations, article citations and others.
The reuse of scientific raw data is a key demand of Open Science. In the project NOA we foster the reuse of scientific images by collecting and uploading them to Wikimedia Commons. In this paper we present a text-based annotation method that proposes Wikipedia categories for open access images. The assigned categories can be used for image retrieval or to upload images to Wikimedia Commons. The annotation basically consists of two phases: extracting salient keywords and mapping these keywords to categories. The results are evaluated on a small set of open access images that were manually annotated.
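The two phases can be sketched as below. The stopword list, the keyword scoring (plain term frequency), and the keyword-to-category table are all invented stand-ins; the actual NOA mapping is derived from Wikipedia rather than a hand-written dictionary.

```python
import re
from collections import Counter

# Illustrative two-phase annotation:
# 1) extract salient keywords from caption text by frequency,
# 2) map keywords to Wikipedia categories via a (hypothetical) lookup.
STOPWORDS = {"the", "a", "of", "in", "and", "is", "on"}
CATEGORY_MAP = {  # hypothetical keyword -> category table
    "neuron": "Category:Neurons",
    "microscopy": "Category:Microscopy",
    "graph": "Category:Graphs",
}

def extract_keywords(text, k=3):
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [w for w, _ in Counter(tokens).most_common(k)]

def propose_categories(text):
    return [CATEGORY_MAP[w] for w in extract_keywords(text) if w in CATEGORY_MAP]

caption = "Fluorescence microscopy image of a neuron; the neuron body is stained."
cats = propose_categories(caption)
print(cats)
```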
NOA is a search engine for scientific images from open access publications, based on full-text indexing of all text referring to the images and on filtering by discipline and image type. Images will be annotated with Wikipedia categories for better discoverability and for uploading to Wikimedia Commons. Currently we have indexed approximately 2.7 million images from over 710,000 scientific papers from all fields of science.
Editorial for the 17th European Networked Knowledge Organization Systems Workshop (NKOS 2017)
(2017)
Knowledge Organization Systems (KOS), in the form of classification systems, thesauri, lexical databases, ontologies, and taxonomies, play a crucial role in digital information management and applications generally. Carrying semantics in a well-controlled and documented way, Knowledge Organization Systems serve a variety of important functions: tools for the representation and indexing of information and documents, knowledge-based support for information searchers, semantic road maps to domains and disciplines, communication tools providing a conceptual framework, and a conceptual basis for knowledge-based systems, e.g. automated classification systems. New networked KOS (NKOS) services and applications are emerging, and we have reached a stage where many KOS standards exist and the integration of linked services is no longer just a future scenario. This editorial describes the workshop outline and gives an overview of the papers presented at the 17th European Networked Knowledge Organization Systems Workshop (NKOS 2017), which was held during the TPDL 2017 Conference in Thessaloniki, Greece.
The number of papers published yearly has been increasing for decades. Libraries need to make these resources accessible and available, with classification being an important part of this process. This paper analyzes the prerequisites and possibilities of automatic classification of medical literature. We explain the selection, preprocessing and analysis of data consisting of catalogue datasets from the library of the Hanover Medical School, Lower Saxony, Germany. In the present study, 19,348 documents, represented by notations of library classification systems such as the Dewey Decimal Classification (DDC), were classified into 514 different classes from the National Library of Medicine (NLM) classification system. The algorithm used was k-nearest neighbours (kNN). A correct classification rate of 55.7% was achieved. To the best of our knowledge, this is not only the first research on the use of the NLM classification in automatic classification, but also the first approach that exclusively considers notations already assigned from other classification systems for this purpose.
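A kNN classifier over already-assigned notations can be sketched as follows. The notation sets, NLM classes, and the overlap-based neighbour scoring are invented for illustration; the study's actual feature representation and distance measure may differ.

```python
from collections import Counter

# Toy kNN: documents are represented as sets of notations from other
# classification systems; the predicted NLM class is the majority class
# among the k training documents with the largest notation overlap.
def knn_predict(train, query, k=3):
    # train: list of (set_of_notations, nlm_class)
    neighbours = sorted(train, key=lambda item: len(item[0] & query), reverse=True)
    votes = Counter(cls for _, cls in neighbours[:k])
    return votes.most_common(1)[0][0]

train = [  # invented catalogue records
    ({"610", "WB"},       "WB 100"),
    ({"610", "WB", "QZ"}, "WB 100"),
    ({"540", "QD"},       "QV 4"),
    ({"540"},             "QV 4"),
]
print(knn_predict(train, {"610", "WB"}))
```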
For indexing archived documents the Dutch Parliament uses a specialized thesaurus. For good full-text retrieval and automatic classification results, it turns out to be important to add more synonyms to the existing thesaurus terms. In the present work we investigate the possibilities of finding synonyms for terms of the parliament's thesaurus automatically. We propose to use distributional similarity (DS). In an experiment with pairs of synonyms and non-synonyms we train and test a classifier using distributional similarity and string similarity. Using ten-fold cross-validation we were able to classify 75% of a set of 6,000 word pairs correctly.
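The two feature types fed to such a classifier can be sketched as below. The distributional similarity is passed in as a stand-in value (a real system would compute it from corpus context vectors), and the max-plus-threshold decision rule is a deliberately simple substitute for the trained classifier.

```python
import difflib

# Sketch of the two features for synonym detection: character-level
# string similarity and a (stand-in) distributional similarity score.
def string_sim(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

def classify_pair(a, b, dist_sim, threshold=0.6):
    # Either a strong string signal or a strong distributional signal
    # suffices in this sketch; a trained classifier would weight both.
    score = max(string_sim(a, b), dist_sim)
    return "synonym" if score >= threshold else "non-synonym"

print(classify_pair("auto", "automobiel", dist_sim=0.9))
print(classify_pair("auto", "fiets", dist_sim=0.1))
```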