- Galke, L., Mai, F., Schelten, A., Brunsch, D., & Scherp, A. (2017). Using titles vs. full-text as source for automated semantic document annotation. In O. Corcho, K. Janowicz, G. Rizzo, I. Tiddi, & D. Garijo (Eds.), Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017). New York: ACM.
  Abstract
  We conduct the first systematic comparison of automated semantic annotation based on either the full-text or only the title metadata of documents. Apart from the prominent text classification baselines kNN and SVM, we also compare recent Learning to Rank and neural network techniques and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. Across three of our four datasets, classification using only titles reaches over 90% of the quality achieved when using the full-text.
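  To make the comparison concrete, here is a minimal sketch that trains the same baseline classifier once on titles and once on full-texts, using scikit-learn's TF-IDF features with a linear SVM (one of the paper's baselines). The toy documents and labels are hypothetical stand-ins, not the paper's datasets.

    # Sketch: title-only vs. full-text input for an SVM baseline.
    # The documents and labels below are hypothetical placeholders;
    # the paper evaluates on four scientific-document datasets with
    # subject-heading annotations.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    titles = [
        "Deep learning for text classification",
        "Neural ranking models for retrieval",
        "Word embeddings in practice",
        "Graph databases for knowledge management",
        "Query languages for graph data",
        "Scaling graph stores",
    ]
    # In reality the full-text differs substantially from the title;
    # here we merely simulate longer input.
    full_texts = [t + " ... imagine the full paper body here ..." for t in titles]
    labels = ["ml", "ml", "ml", "db", "db", "db"]

    for name, docs in [("title", titles), ("full-text", full_texts)]:
        clf = make_pipeline(TfidfVectorizer(), LinearSVC())
        scores = cross_val_score(clf, docs, labels, cv=2)
        print(name, scores.mean())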
- Galke, L., Saleh, A., & Scherp, A. (2017). Word embeddings for practical information retrieval. In M. Eibl & M. Gaedke (Eds.), INFORMATIK 2017 (pp. 2155-2167). Bonn: Gesellschaft für Informatik. doi:10.18420/in2017_215.
  Abstract
  We assess the suitability of word embeddings for practical information retrieval scenarios. Specifically, we assume that users issue ad-hoc short queries and that we return the first twenty retrieved documents after applying a Boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover’s distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models’ sensitivity to document length by using either only the title or the full-text of the documents for the retrieval task. We conclude that word centroid similarity is the strongest competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain by a relative margin of 15%.
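  A minimal sketch of the IDF re-weighted word centroid similarity described in the abstract, under stated assumptions: random toy vectors stand in for a pretrained embedding (word2vec or GloVe in practice), a plain log(N/df) IDF is used, and helper names like `centroid` are illustrative rather than the authors' implementation.

    # Sketch: IDF re-weighted word centroid similarity (illustrative).
    # Each document and the query are reduced to the mean of their
    # word vectors, with each vector weighted by the word's IDF;
    # documents are then ranked by cosine similarity to the query.
    import numpy as np
    from collections import Counter

    docs = [["economy", "market", "stocks"],
            ["football", "match", "goal"],
            ["market", "crash", "economy", "bank"]]
    vocab = {w for d in docs for w in d}
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in vocab}   # toy embedding vectors
    N = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequencies
    idf = {w: np.log(N / df[w]) for w in vocab}

    def centroid(tokens, reweight=True):
        # Aggregate word vectors; term frequency is implicit in the token
        # list, and each vector is optionally scaled by its IDF.
        vecs = [(idf[w] if reweight else 1.0) * emb[w]
                for w in tokens if w in emb]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    query = ["economy", "bank"]
    q = centroid(query)
    ranked = sorted(range(N), key=lambda i: -cosine(q, centroid(docs[i])))
    print(ranked)  # document indices, most similar first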