Displaying 1 - 7 of 7
-
Seidlmayer, E., Voß, J., Melnychuk, T., Galke, L., Tochtermann, K., Schultz, C., & Förstner, K. U. (2020). ORCID for Wikidata. Data enrichment for scientometric applications. In L.-A. Kaffee, O. Tifrea-Marciuska, E. Simperl, & D. Vrandečić (
Eds. ), Proceedings of the 1st Wikidata Workshop (Wikidata 2020). Aachen, Germany: CEUR Workshop Proceedings.Abstract
Due to its numerous bibliometric entries of scholarly articles and connected information Wikidata can serve as an open and rich
source for deep scientometrical analyses. However, there are currently certain limitations: While 31.5% of all Wikidata entries represent scientific articles, only 8.9% are entries describing a person and the number
of entries researcher is accordingly even lower. Another issue is the frequent absence of established relations between the scholarly article item and the author item although the author is already listed in Wikidata.
To fill this gap and to improve the content of Wikidata in general, we established a workflow for matching authors and scholarly publications by integrating data from the ORCID (Open Researcher and Contributor ID) database. By this approach we were able to extend Wikidata by more than 12k author-publication relations and the method can be
transferred to other enrichments based on ORCID data. This is extension is beneficial for Wikidata users performing bibliometrical analyses or using such metadata for other purposes. -
Galke, L., Vagliano, I., & Scherp, A. (2019). Can graph neural networks go „online“? An analysis of pretraining and inference. In Proceedings of the Representation Learning on Graphs and Manifolds: ICLR2019 Workshop.
Abstract
Large-scale graph data in real-world applications is often not static but dynamic,
i. e., new nodes and edges appear over time. Current graph convolution approaches
are promising, especially, when all the graph’s nodes and edges are available dur-
ing training. When unseen nodes and edges are inserted after training, it is not
yet evaluated whether up-training or re-training from scratch is preferable. We
construct an experimental setup, in which we insert previously unseen nodes and
edges after training and conduct a limited amount of inference epochs. In this
setup, we compare adapting pretrained graph neural networks against retraining
from scratch. Our results show that pretrained models yield high accuracy scores
on the unseen nodes and that pretraining is preferable over retraining from scratch.
Our experiments represent a first step to evaluate and develop truly online variants
of graph neural networks. -
Galke, L., Melnychuk, T., Seidlmayer, E., Trog, S., Foerstner, K., Schultz, C., & Tochtermann, K. (2019). Inductive learning of concept representations from library-scale bibliographic corpora. In K. David, K. Geihs, M. Lange, & G. Stumme (
Eds. ), Informatik 2019: 50 Jahre Gesellschaft für Informatik - Informatik für Gesellschaft (pp. 219-232). Bonn: Gesellschaft für Informatik e.V. doi:10.18420/inf2019_26. -
Mai, F., Galke, L., & Scherp, A. (2019). CBOW is not all you need: Combining CBOW with the compositional matrix space model. In Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019). OpenReview.net.
Abstract
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities to encode word content, CBOW embeddings perform well on a wide range of downstream tasks while being efficient to compute. However, CBOW is not capable of capturing the word order. The reason is that the computation of CBOW's word embeddings is commutative, i.e., embeddings of XYZ and ZYX are the same. In order to address this shortcoming, we propose a
learning algorithm for the Continuous Matrix Space Model, which we call Continual Multiplication of Words (CMOW). Our algorithm is an adaptation of word2vec, so that it can be trained on large quantities of unlabeled text. We empirically show that CMOW better captures linguistic properties, but it is inferior to CBOW in memorizing word content. Motivated by these findings, we propose a hybrid model that combines the strengths of CBOW and CMOW. Our results show that the hybrid CBOW-CMOW-model retains CBOW's strong ability to memorize word content while at the same time substantially improving its ability to encode other linguistic information by 8%. As a result, the hybrid also performs better on 8 out of 11 supervised downstream tasks with an average improvement of 1.2%. -
Seidlmayer, E., Galke, L., Melnychuk, T., Schultz, C., Tochtermann, K., & Förstner, K. U. (2019). Take it personally - A Python library for data enrichment for infometrical applications. In M. Alam, R. Usbeck, T. Pellegrini, H. Sack, & Y. Sure-Vetter (
Eds. ), Proceedings of the Posters and Demo Track of the 15th International Conference on Semantic Systems co-located with 15th International Conference on Semantic Systems (SEMANTiCS 2019).Abstract
Like every other social sphere, science is influenced by individual characteristics of researchers. However, for investigations on scientific networks, only little data about the social background of researchers, e.g. social origin, gender, affiliation etc., is available.
This paper introduces ”Take it personally - TIP”, a conceptual model and library currently under development, which aims to support the
semantic enrichment of publication databases with semantically related background information which resides elsewhere in the (semantic) web, such as Wikidata.
The supplementary information enriches the original information in the publication databases and thus facilitates the creation of complex scientific knowledge graphs. Such enrichment helps to improve the scientometric analysis of scientific publications as they can also take social backgrounds of researchers into account and to understand social structure in research communities. -
Galke, L., Mai, F., Schelten, A., Brunch, D., & Scherp, A. (2017). Using titles vs. full-text as source for automated semantic document annotation. In O. Corcho, K. Janowicz, G. Rizz, I. Tiddi, & D. Garijo (
Eds. ), Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017). New York: ACM.Abstract
We conduct the first systematic comparison of automated semantic
annotation based on either the full-text or only on the title metadata
of documents. Apart from the prominent text classification baselines
kNN and SVM, we also compare recent techniques of Learning
to Rank and neural networks and revisit the traditional methods
logistic regression, Rocchio, and Naive Bayes. Across three of our
four datasets, the performance of the classifications using only titles
reaches over 90% of the quality compared to the performance when
using the full-text. -
Galke, L., Saleh, A., & Scherp, A. (2017). Word embeddings for practical information retrieval. In M. Eibl, & M. Gaedke (
Eds. ), INFORMATIK 2017 (pp. 2155-2167). Bonn: Gesellschaft für Informatik. doi:10.18420/in2017_215.Abstract
We assess the suitability of word embeddings for practical information retrieval scenarios. Thus, we assume that users issue ad-hoc short queries where we return the first twenty retrieved documents after applying a boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover’s distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models’ sensitivity to document length by using either only the title or the full-text of the documents for the retrieval task. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive to the TF-IDF baseline and even outperforms it in case of the news domain with a relative percentage of 15%.
Share this page