To learn a subject, acquiring the associated technical language is important. Despite the widely accepted importance of learning the technical language, hardly any published studies describe the characteristics of the technical languages that students are supposed to learn. This is likely due largely to the absence of specialized text corpora for studying such languages at the lexical, syntactic and textual levels. In the present paper we describe a corpus of German physics texts that can be used to study the language used in physics. A large and a small variant were compiled. The small version of the corpus consists of 5.3 million words and is available on request.
In this poster we present the ongoing development of an integrated free and open source toolchain for semantic annotation of digitised cultural heritage. The toolchain development involves the specification of a common data model that aims to increase interoperability across diverse datasets and to enable new collaborative research approaches.
Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of a collection to be analyzed have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types. After that, we can train a classifier for each of the three page types and thereby improve the accuracy on a manually annotated collection of 70 legal documents consisting of 20,938 text elements. When we split by page type, we achieve an improvement from 0.95 to 0.98 for single-column pages with left marginalia and from 0.95 to 0.96 for double-column pages. We developed our own feature-based method for page layout detection, which we benchmark against a standard implementation of a CNN image classifier. The approach presented here is based on a corpus of freely available German contracts and general terms and conditions.
Both the corpus and all manual annotations are made freely available. The method is language agnostic.
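A minimal sketch of the two-stage idea described in this abstract: first detect a page's layout type from global properties, then route each text element to a classifier specialised for that type. All feature names, thresholds and labels below are invented for illustration; the paper's actual features and classifiers are not specified here.

```python
# Hypothetical two-stage classification sketch. Features, thresholds
# and labels are illustrative, not the authors' actual model.

def detect_page_type(n_columns, left_margin_ratio):
    """Assign one of three coarse layout types from global page features."""
    if n_columns == 2:
        return "double_column"
    if left_margin_ratio > 0.15:          # wide left margin -> marginalia
        return "single_column_marginalia"
    return "single_column"

def classify_element_marginalia(x0_rel, font_size):
    """Element classifier used only on single-column pages with marginalia."""
    if x0_rel < 0.15:                     # element sits in the margin strip
        return "marginal_note"
    return "heading" if font_size >= 13 else "body"

# One specialised classifier per page type; the other two are omitted.
CLASSIFIERS = {
    "single_column_marginalia": classify_element_marginalia,
}

page_type = detect_page_type(n_columns=1, left_margin_ratio=0.25)
label = CLASSIFIERS[page_type](x0_rel=0.05, font_size=9.0)
print(page_type, label)   # -> single_column_marginalia marginal_note
```

Splitting by page type lets each element classifier learn decision boundaries that would conflict if all layouts were pooled, which is the intuition behind the reported accuracy gains.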
Data and Information Science: Book of Abstracts at BOBCATSSS 2022 Hybrid Conference, 23rd - 25th of May 2022, Debrecen.
This year marks the 30th anniversary of BOBCATSSS, an international annual symposium for librarians and information professionals in a rapidly changing environment. Over the past 30 years, the conference has featured exciting topics, great venues, interested guests and engaging presenters.
This year we present the topics of the many papers in the Book of Abstracts for the first time both in person at the University of Debrecen and in hybrid form. The Book of Abstracts provides an overview of all presentations given at BOBCATSSS. Presentations are listed in alphabetical order by title and include speeches, Pecha Kuchas, posters and workshops.
The theme of BOBCATSSS is Data and Information Science. Data and information are the basis for decisions and processes in business, politics and science, and they are particularly important in the current era of digital transformation. This is exactly where this year's subthemes come in: they deal with data science, openness, and institutional roles.
In this paper we investigate how concreteness and abstractness are represented in word embedding spaces. Using data for English and German, we show that concreteness and abstractness can be determined independently and turn out to be completely opposite directions in the embedding space. Various methods can be used to determine the direction of concreteness, always resulting in roughly the same vector. Although concreteness is a central aspect of word meaning and can be detected clearly in embedding spaces, it does not seem as easy to add or subtract concreteness to obtain other words or word senses, as can be done, for example, with a semantic property such as gender.
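One common way to derive such a concreteness direction, sketched here with invented three-dimensional toy vectors, is to subtract the mean vector of abstract seed words from the mean vector of concrete seed words and then score any word by its projection onto the result. This illustrates the general idea only, not the authors' exact method or seed lists.

```python
# Illustrative sketch: a "concreteness direction" as the normalised
# difference between the centroids of concrete and abstract seed words.
# The toy 3-d vectors below are invented for demonstration.
import numpy as np

toy_vectors = {
    "stone":   np.array([0.9, 0.1, 0.0]),
    "table":   np.array([0.8, 0.2, 0.1]),
    "freedom": np.array([0.1, 0.9, 0.2]),
    "justice": np.array([0.0, 0.8, 0.3]),
    "apple":   np.array([0.85, 0.15, 0.05]),
}

concrete_seeds = ["stone", "table"]
abstract_seeds = ["freedom", "justice"]

direction = (
    np.mean([toy_vectors[w] for w in concrete_seeds], axis=0)
    - np.mean([toy_vectors[w] for w in abstract_seeds], axis=0)
)
direction /= np.linalg.norm(direction)

def concreteness_score(word):
    """Projection of a word vector onto the concreteness direction."""
    return float(toy_vectors[word] @ direction)

# A concrete word projects higher than an abstract one.
print(concreteness_score("apple") > concreteness_score("justice"))  # True
```

Because different seed sets and estimation methods tend to yield roughly the same vector, the projection score is a stable proxy for a word's concreteness.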
Generalized legal documents, in which the positions in the text of a contract's individual particulars are known, can be used, first, to support the approval process for new contracts automatically and, second, to serve as a contract generator that provides preselected new legal documents. In this paper we show, using known legal texts, how formulaic text sections can be identified and how frequent individual particulars can be classified so that they can be used as template sections. We present areas of application and point out existing potential for legal tech applications.
Image captions in scientific papers are usually complementary to the images. Consequently, the captions contain many terms that do not refer to concepts visible in the image. We conjecture that it is possible to distinguish between these two types of terms in an image caption by analysing the text alone. To examine this, we evaluated different features. The dataset we used to compute tf.idf values, word embeddings and concreteness values contains over 700,000 scientific papers with over 4.6 million images. The evaluation was done with a manually annotated subset of 329 images. Additionally, we trained a support vector machine to predict whether a term is likely visible or not. We show that the concreteness of terms is a very important feature for identifying terms in captions and context that refer to concepts visible in images.
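The classification step can be sketched as follows. The feature values, example terms and labels are invented for illustration, and the model is a plain linear SVM rather than the authors' tuned setup; only the combination of a tf.idf weight with a concreteness rating as features is taken from the abstract.

```python
# Toy sketch: each caption term is described by a small feature vector
# (an invented tf.idf weight and a concreteness rating on a 1..5 scale),
# and an SVM predicts whether the term likely refers to something
# visible in the image. All data below is illustrative.
from sklearn.svm import SVC

# [tf.idf weight, concreteness rating]
X = [
    [0.10, 4.8],  # e.g. "microscope"   -> visible
    [0.20, 4.5],  # e.g. "cell"         -> visible
    [0.30, 1.5],  # e.g. "hypothesis"   -> not visible
    [0.05, 1.2],  # e.g. "significance" -> not visible
]
y = ["visible", "visible", "not_visible", "not_visible"]

clf = SVC(kernel="linear").fit(X, y)

# A highly concrete unseen term is predicted as visible.
print(clf.predict([[0.15, 4.6]])[0])
```

On this toy data the decision boundary is dominated by the concreteness feature, which mirrors the abstract's finding that concreteness is the most important signal.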
Wikidata and Wikibase as complementary research data management services for cultural heritage data (2022)
The NFDI (German National Research Data Infrastructure) consortia are associations of various institutions within a specific research field, which work together to develop common data infrastructures, guidelines, best practices and tools that conform to the principles of FAIR data. Within the NFDI, a common question is: What is the potential of Wikidata to be used as an application for science and research? In this paper, we address this question by tracing current research use cases and applications for Wikidata, its relation to standalone Wikibase instances, and how the two can function as complementary services to meet a range of research needs. This paper builds on lessons learned through the development of open data projects and software services within the Open Science Lab at TIB, Hannover, in the context of NFDI4Culture – the consortium including participants across the broad spectrum of the digital libraries, archives, and museums field, and the digital humanities.
A new FOSS (free and open source software) toolchain and associated workflow is being developed in the context of NFDI4Culture, a German consortium of research and cultural heritage institutions working towards a shared infrastructure for research data that meets the needs of 21st century data creators, maintainers and end users across the broad spectrum of the digital libraries and archives field, and the digital humanities. This short paper and demo present how the integrated toolchain connects: 1) OpenRefine - for data reconciliation and batch upload; 2) Wikibase - for linked open data (LOD) storage; and 3) Kompakkt - for rendering and annotating 3D models. The presentation is aimed at librarians, digital curators and data managers interested in learning how to manage research datasets containing 3D media, and how to make them available within an open data environment with 3D-rendering and collaborative annotation features.
In order to ensure validity in legal texts such as contracts and case law, lawyers rely on standardised formulations that are written carefully but also represent a kind of code with a meaning and function known to all legal experts. Using directed (acyclic) graphs to represent standardised text fragments, we are able to capture variations concerning time specifications, slight rephrasings, names, places and also OCR errors. We show how we can find such text fragments by sentence clustering, pattern detection and clustering of patterns. To test the proposed methods, we use two corpora of German contracts and court decisions, specially compiled for this purpose. However, the entire process for representing standardised text fragments is language-agnostic. We analyze and compare both corpora, give a quantitative and qualitative analysis of the text fragments found, and present a number of examples from both corpora.
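The branching representation can be illustrated with a naive alignment of two boilerplate variants: tokens shared by both sentences form a single path, while the positions where they differ (dates, names, amounts) branch out. The common-prefix/suffix split below is a deliberately simplified stand-in for the sentence clustering and pattern detection described in the abstract.

```python
# Minimal sketch of the DAG idea for standardised text fragments.
# Two near-identical boilerplate sentences share prefix and suffix
# nodes; the differing middle parts become parallel branches.

def merge_into_dag(sent_a, sent_b):
    """Naively align two sentences into prefix / variant branches / suffix."""
    a, b = sent_a.split(), sent_b.split()
    # longest common prefix
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # longest common suffix, not overlapping the prefix
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    return {
        "prefix": a[:i],
        "variants": [a[i:len(a) - j], b[i:len(b) - j]],
        "suffix": a[len(a) - j:],
    }

dag = merge_into_dag(
    "The contract ends on 31 December 2020 .",
    "The contract ends on 30 June 2021 .",
)
print(dag["prefix"])    # ['The', 'contract', 'ends', 'on']
print(dag["variants"])  # [['31', 'December', '2020'], ['30', 'June', '2021']]
```

A real implementation would merge many sentences into one graph and use fuzzy matching to absorb OCR errors and slight rephrasings, but the shared-path-plus-branches structure is the same.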