The Dutch Parliament uses a specialized thesaurus for indexing archived documents. Good results in full-text retrieval and automatic classification turn out to depend on adding more synonyms to the existing thesaurus terms. In the present work we investigate how synonyms for terms of the parliament's thesaurus can be found automatically. We propose to use distributional similarity (DS). In an experiment with pairs of synonyms and non-synonyms, we train and test a classifier that uses distributional similarity and string similarity as features. Using ten-fold cross-validation, we were able to classify 75% of a set of 6000 word pairs correctly.
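The two kinds of features named in this abstract can be illustrated with a minimal sketch. It assumes simple co-occurrence counts in a fixed window as context vectors, cosine similarity as the distributional measure, and the `difflib` matching ratio as the string measure; the abstract does not specify any of these choices, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def context_vector(word, sentences, window=2):
    """Count the tokens that occur within `window` positions of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_features(w1, w2, sentences):
    """Return (distributional similarity, string similarity) for a word pair."""
    ds = cosine(context_vector(w1, sentences), context_vector(w2, sentences))
    ss = SequenceMatcher(None, w1, w2).ratio()
    return ds, ss


# Toy corpus (assumption): two words used in identical contexts.
sentences = [
    ["the", "minister", "signed", "the", "law"],
    ["the", "secretary", "signed", "the", "law"],
]
ds, ss = pair_features("minister", "secretary", sentences)
```

Feature pairs like these could then be passed to any binary classifier that separates synonym pairs from non-synonym pairs.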
Automatic classification of scientific records using the German Subject Heading Authority File (SWD)
(2012)
This paper presents an automatic text classification method that does not require training documents. The method uses the German Subject Heading Authority File (SWD), provided by the linked data service of the German National Library. The SWD was recently enriched with notations of the Dewey Decimal Classification (DDC). As a consequence, the subject headings can be used as textual representations of the DDC notations. In essence, we derive the classification of a text from the classification that the thesaurus assigns to the words in the text. The method was tested by classifying 3826 OAI records from 7 different repositories. Mean reciprocal rank and recall were chosen as evaluation measures. A direct comparison with a machine learning method shows that this method is competitive. We therefore conclude that the enriched version of the SWD provides high-quality information with broad coverage for the classification of German scientific articles.
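The idea of deriving a text's class from the classes of its words, and the mean reciprocal rank used for evaluation, can be sketched as follows. This is an illustration only: the mapping `heading_to_ddc` is a hypothetical toy stand-in for the enriched SWD, and simple vote counting is an assumed scoring scheme, not necessarily the one used in the paper.

```python
from collections import Counter


def rank_ddc_classes(tokens, heading_to_ddc):
    """Rank DDC notations by how many words of the text point to them."""
    votes = Counter()
    for tok in tokens:
        for notation in heading_to_ddc.get(tok.lower(), ()):
            votes[notation] += 1
    # Most votes first; ties are broken by notation for a stable order.
    return [n for n, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct class; 0 when it is not ranked at all."""
    total = 0.0
    for ranking, correct in zip(rankings, gold):
        if correct in ranking:
            total += 1.0 / (ranking.index(correct) + 1)
    return total / len(rankings) if rankings else 0.0


# Toy mapping (assumption): subject headings to DDC notations.
heading_to_ddc = {
    "thesaurus": ["020"],
    "klassifikation": ["020", "000"],
}
ranking = rank_ddc_classes(["Thesaurus", "Klassifikation", "und"], heading_to_ddc)
mrr = mean_reciprocal_rank([ranking, ["000", "020"]], ["020", "020"])
```

No training documents are involved: the only resource is the mapping from headings to notations.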
Lemmatization is a central task in many NLP applications. Despite this importance, the number of freely available and easy-to-use tools for German is very limited. To fill this gap, we developed a simple lemmatizer that can be trained on any lemmatized corpus. For a full-form word, the tagger tries to find the sequence of morphemes that is most likely to generate that word. From this sequence of tags we can easily derive the stem, the lemma and the part of speech (PoS) of the word. We show (i) that the quality of this approach is comparable to state-of-the-art methods and (ii) that the results of PoS tagging improve when the morphological analysis of each word is included.
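Finding the most likely morpheme sequence for a word can be illustrated with a small dynamic program. Note the simplification: this sketch scores each morpheme independently (a unigram model), whereas the abstract describes a tagger over sequences of tags; the probabilities and the function name are invented for illustration.

```python
import math


def best_segmentation(word, morpheme_probs):
    """Return the most probable split of `word` into known morphemes, or None."""
    # best[i] holds (log probability, morphemes) for the prefix word[:i].
    best = [(-math.inf, [])] * (len(word) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morpheme_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(morpheme_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1] if best[-1][0] > -math.inf else None


# Toy probabilities (assumption), as they might be estimated from a corpus.
probs = {"lauf": 0.4, "en": 0.3, "laufen": 0.01, "ge": 0.2, "t": 0.1}
segments = best_segmentation("laufen", probs)
```

If each morpheme in the lexicon also carried a tag such as stem or suffix, the stem and lemma could be read off the winning sequence.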
Library of Congress Subject Headings (LCSH) are popular for indexing library records. We studied the possibility of assigning LCSH automatically in two ways: by training classifiers for headings that are used frequently in a large collection of abstracts of the literature on hand, and by extracting headings from those abstracts. The resulting classifiers reach an acceptable level of precision but fail in terms of recall, partly because we could only train classifiers for a small number of LCSH. Extraction, i.e., the matching of headings in the text, produces better recall but extremely low precision. We found that combining both methods leads to a significant improvement in recall and a slight improvement in F1 score, with only a small decrease in precision.
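The trade-off described here can be made concrete with a short sketch. It assumes that extraction is plain case-insensitive substring matching and that the two methods are combined by taking the union of their suggestions; the abstract does not say how the combination was done, and all data below are invented.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one record."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def extract_headings(text, headings):
    """Return every heading that literally occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return {h for h in headings if h.lower() in lowered}


def combine(classifier_headings, extracted_headings):
    """Union of both suggestion sets (assumed combination strategy)."""
    return set(classifier_headings) | set(extracted_headings)


# Toy record (assumption).
gold = {"Information retrieval", "Thesauri"}
vocabulary = ["Information retrieval", "Thesauri", "Libraries"]
abstract = "We study information retrieval with thesauri in libraries."
from_classifier = {"Information retrieval"}
from_extraction = extract_headings(abstract, vocabulary)

classifier_scores = precision_recall_f1(from_classifier, gold)
combined_scores = precision_recall_f1(combine(from_classifier, from_extraction), gold)
```

In this toy record the classifier alone is precise but misses a heading, while the union recovers it at the cost of one spurious suggestion.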