Distributional Similarity of Words with Different Frequencies

Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. The similarity of two words can hence be derived from the similarity of their contexts. The context of a word is usually represented as a vector of the words appearing near that word in a corpus. Previous research observed that similarity measures for the context vectors of two words depend on the frequencies of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose, as an alternative similarity measure, the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the word frequencies. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words taken from WordNet have to be classified as synonyms or non-synonyms.
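The two baseline measures named in the abstract can be illustrated with a minimal Python sketch. It assumes that contexts are plain co-occurrence counts within a fixed token window and that the divergence is computed with base-2 logarithms; the function names, the window size, and the absence of any weighting or smoothing are illustrative assumptions, not details taken from the paper. The frequency-based expectation model is not specified in the abstract and is therefore not sketched here.

```python
import math
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count the words within `window` positions of each occurrence of `target`.

    The window size is an illustrative assumption, not a value from the paper.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def normalize(counts):
    """Turn raw context counts into a probability distribution."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence of two distributions, in bits (range 0 to 1)."""
    div = 0.0
    for w in set(p) | set(q):
        pw = p.get(w, 0.0)
        qw = q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution at w
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / mw)
    return div


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse context vectors."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)
```

With these helpers, two words could be compared by calling `normalize(context_counts(tokens, word))` for each word and passing the resulting distributions to `jensen_shannon_divergence` or `cosine_similarity`. Note that a low divergence indicates similar contexts, whereas a high cosine does.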

Metadata
Author:	Christian Wartena
URN:	urn:nbn:de:bsz:960-opus-4077
DOI:	https://doi.org/10.25968/opus-335
Document Type:	Working Paper
Language:	English
Year of Completion:	2013
Publishing Institution:	Hochschule Hannover
Release Date:	2013/04/29
Tag:	Distributional Semantics
GND Keyword:	Synonymy; Semantics; Computational linguistics; Linguistic information science; Corpus <Linguistics>
Source:	Proceedings of the 13th edition of the Dutch-Belgian Information Retrieval Workshop (DIR 2013), 2013, pp. 8-11
Link to catalogue:	744013437
Institutes:	Fakultät III - Medien, Information und Design
DDC classes:	020 Library and information sciences
Licence:	Creative Commons - Attribution-NonCommercial-NoDerivs 3.0

Open Access