Search

4 search hits

A Probabilistic Morphology Model for German Lemmatization (2019)

Lemmatization is a central task in many NLP applications. Despite its importance, the number of (freely) available and easy-to-use tools for German is very limited. To fill this gap, we developed a simple lemmatizer that can be trained on any lemmatized corpus. For a full-form word, the tagger tries to find the sequence of morphemes that is most likely to generate that word. From this sequence of tags we can easily derive the stem, the lemma and the part of speech (PoS) of the word. We show (i) that the quality of this approach is comparable to state-of-the-art methods and (ii) that we can improve the results of PoS tagging when we include the morphological analysis of each word.

Distributional Similarity of Words with Different Frequencies (2013)

Wartena, Christian

Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. The similarity of words can hence be derived from the similarity of their contexts. The contexts of a word are usually represented as vectors of the words appearing near that word in a corpus. Previous research observed that similarity measures for the context vectors of two words depend on the frequency of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose, as an alternative similarity measure, the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the frequencies of the words. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words, taken from WordNet, have to be classified as being synonyms or not.
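To make the baseline measure named in this abstract concrete, the following is a minimal sketch of the plain Jensen-Shannon divergence between the context distributions of two words. It is not the paper's frequency-corrected measure, and everything about the setup is an assumption made for illustration: the function names, the symmetric window of two tokens, the toy sentence, and the use of base-2 logarithms.

```python
import math
from collections import Counter


def context_distribution(tokens, target, window=2):
    """Relative frequencies of the words within `window` positions of `target`.

    The window size is an illustrative assumption, not a value from the paper.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions given as dicts.

    With base-2 logarithms the value lies between 0 (identical) and 1 (disjoint).
    """
    jsd = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution M = (P + Q) / 2
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd


# Toy corpus, purely hypothetical.
tokens = "the cat sat on the mat the dog sat on the rug".split()
cat_ctx = context_distribution(tokens, "cat")
dog_ctx = context_distribution(tokens, "dog")
divergence = jensen_shannon_divergence(cat_ctx, dog_ctx)
```

A lower divergence means more similar contexts. Because rare words have sparser context distributions, the raw value also reflects word frequency, which is the dependency the paper models.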

Reproduzierbarkeit von Studien in der Computerlinguistik [Reproducibility of Studies in Computational Linguistics] (2019)

Andresen, Amelie

The reproducibility of studies is important for verifying their results. Research that builds on earlier results also sometimes requires access to the original data or source code. This work analyzes studies from computational linguistics with regard to their reproducibility. First, the terminology of this specific field is defined; next, a dataset is created in which selected open-access studies from 2018 are assessed against previously defined criteria. These include the accessibility of the material used, of the methods applied, and of the results. In addition to the criteria, hypotheses about this dataset are formulated. Finally, the results are visualized and interpreted with respect to these hypotheses. According to the resulting evaluation, most of the studies are reproducible. The outlook discusses possible continuations and extensions of this investigation.

The Hanover Tagger (Version 1.1.0) - Lemmatization, Morphological Analysis and POS Tagging in Python (2023)

Wartena, Christian

HanTa, or Hanover Tagger, is an open-source Python program for lemmatization and part-of-speech tagging. This document describes its functionality, introduces the ideas and techniques used, and gives some information on the annotated training data for Dutch, English and German.
