Paul Trilsbeek

Publications

  • Drude, S., Trilsbeek, P., Sloetjes, H., & Broeder, D. (2014). Best practices in the creation, archiving and dissemination of speech corpora at the Language Archive. In S. Ruhi, M. Haugh, T. Schmidt, & K. Wörner (Eds.), Best Practices for Spoken Corpora in Linguistic Research (pp. 183-207). Newcastle upon Tyne: Cambridge Scholars Publishing.
  • Drude, S., Broeder, D., & Trilsbeek, P. (2014). The Language Archive and its solutions for sustainable endangered languages corpora. Book 2.0, 4, 5-20. doi:10.1386/btwo.4.1-2.5_1.

    Abstract

    Since the late 1990s, the technical group at the Max-Planck-Institute for Psycholinguistics has worked on solutions for important challenges in building sustainable data archives, in particular, how to guarantee long-term availability of digital research data for future research. The support for the well-known DOBES (Documentation of Endangered Languages) programme has greatly inspired and advanced this work, and led to the ongoing development of a whole suite of tools for annotating, cataloguing and archiving multi-media data. At the core of the LAT (Language Archiving Technology) tools is the IMDI metadata schema, now being integrated into a larger network of digital resources in the European CLARIN project. The multi-media annotator ELAN (with its web-based cousin ANNEX) is now well known not only among documentary linguists. We aim at presenting an overview of the solutions, both achieved and in development, for creating and exploiting sustainable digital data, in particular in the area of documenting languages and cultures, and their interfaces with other related developments.
  • Jung, D., Klessa, K., Duray, Z., Oszkó, B., Sipos, M., Szeverényi, S., Várnai, Z., Trilsbeek, P., & Váradi, T. (2014). Languagesindanger.eu - Including multimedia language resources to disseminate knowledge and create educational material on less-resourced languages. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of LREC 2014: 9th International Conference on Language Resources and Evaluation (pp. 530-535).

    Abstract

    The present paper describes the development of the languagesindanger.eu interactive website as an example of including multimedia language resources to disseminate knowledge and create educational material on less-resourced languages. The website is a product of INNET (Innovative networking in infrastructure for endangered languages), a European FP7 project. Its main functions can be summarized as related to the following three areas: (1) raising students' awareness of language endangerment and arousing their interest in linguistic diversity, language maintenance and language documentation; (2) informing both students and teachers about these topics and showing how they can enlarge their knowledge further, with a special emphasis on information about language archives; (3) helping teachers include these topics in their classes. The website has been localized into five language versions with the intention of being accessible to both scientific and non-scientific communities such as (primarily) secondary school teachers and students, beginning university students of linguistics, journalists, the interested public, and also members of speech communities who speak minority languages.
  • Klatter-Folmer, J., Van Hout, R., Van den Heuvel, H., Fikkert, P., Baker, A., De Jong, J., Wijnen, F., Sanders, E., & Trilsbeek, P. (2014). Vulnerability in acquisition, language impairments in Dutch: Creating a VALID data archive. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of LREC 2014: 9th International Conference on Language Resources and Evaluation (pp. 357-364).

    Abstract

    The VALID Data Archive is an open multimedia data archive (under construction) with data from speakers suffering from language impairments. We report on a pilot project in the CLARIN-NL framework in which five data resources were curated. For all data sets concerned, written informed consent from the participants or their caretakers has been obtained. All materials were anonymized. The audio files were converted into wav (linear PCM) files and the transcriptions into CHAT or ELAN format. Research data that consisted of test, SPSS and Excel files were documented and converted into CSV files. All data sets obtained appropriate CMDI metadata files. A new CMDI metadata profile for this type of data resources was established and care was taken that ISOcat metadata categories were used to optimize interoperability. After curation all data are deposited at the Max Planck Institute for Psycholinguistics Nijmegen where persistent identifiers are linked to all resources. The content of the transcriptions in CHAT and plain text format can be searched with the TROVA search engine.
  • Trilsbeek, P., & Koenig, A. (2014). Increasing the future usage of endangered language archives. In D. Nathan, & P. Austin (Eds.), Language Documentation and Description vol 12 (pp. 151-163). London: SOAS. Retrieved from http://www.elpublishing.org/PID/142.
  • Van den Heuvel, H., Sanders, E., Klatter-Folmer, J., Van Hout, R., Fikkert, P., Baker, A., De Jong, J., Wijnen, F., & Trilsbeek, P. (2014). Data curation for a VALID archive of Dutch language impairment data. Dutch journal of applied linguistics, 3(2), 127-135. doi:10.1075/dujal.3.2.02heu.

    Abstract

    The VALID Data Archive is an open multimedia data archive in which data from children and adults with language and/or communication problems are brought together. A pilot project, funded by CLARIN-NL, was carried out in which five existing data sets were curated. This pilot enabled us to build up experience in conserving different kinds of pathological language data in a searchable and persistent manner. These data sets reflect current research in language pathology rather well, both in the range of designs and the variety in pathological problems, such as Specific Language Impairment, deafness, dyslexia, and ADHD. In this paper, we present the VALID initiative, explain the curation process and discuss the materials of the data sets.
  • Wittenburg, P., Trilsbeek, P., & Wittenburg, F. (2014). Corpus archiving and dissemination. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford Handbook of Corpus Phonology (pp. 133-149). Oxford: Oxford University Press.
  • Trilsbeek, P., Broeder, D., Van Valkenhoef, T., & Wittenburg, P. (2008). A grid of regional language archives. In N. Calzolari (Ed.), Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008) (pp. 1474-1477). European Language Resources Association (ELRA).

    Abstract

    About two years ago, the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, started an initiative to install regional language archives in various places around the world, particularly in places where a large number of endangered languages exist and are being documented. These digital archives make use of the LAT archiving framework [1] that the MPI has developed over the past nine years. This framework consists of a number of web-based tools for depositing, organizing and utilizing linguistic resources in a digital archive. The regional archives are in principle autonomous archives, but they can decide to share metadata descriptions and language resources with the MPI archive in Nijmegen and become part of a grid of linked LAT archives. By doing so, they will also take advantage of the long-term preservation strategy of the MPI archive. This paper describes the reasoning behind this initiative and how in practice such an archive is set up.
  • Van Uytvanck, D., Dukers, A., Ringersma, J., & Trilsbeek, P. (2008). Language-sites: Accessing and presenting language resources via geographic information systems. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). Paris: European Language Resources Association (ELRA).

    Abstract

    The emerging area of Geographic Information Systems (GIS) has proven to add an interesting dimension to many research projects. Within the language-sites initiative we have brought together a broad range of links to digital language corpora and resources. Via Google Earth's visually appealing 3D-interface users can spin the globe, zoom into an area they are interested in and access directly the relevant language resources. This paper focuses on several ways of relating the map and the online data (lexica, annotations, multimedia recordings, etc.). Furthermore, we discuss some of the implementation choices that have been made, including future challenges. In addition, we show how scholars (both linguists and anthropologists) are using GIS tools to fulfill their specific research needs by making use of practical examples. This illustrates how both scientists and the general public can benefit from geography-based access to digital language data.
  • Trilsbeek, P., & Wittenburg, P. (2007). "Los acervos lingüísticos digitales y sus desafíos". In J. Haviland, & F. Farfán (Eds.), Bases de la documentación lingüística (pp. 359-385). Mexico: Instituto Nacional de Lenguas Indígenas.

    Abstract

    This chapter describes the challenges that modern digital language archives are faced with. One essential aspect of such an archive is to have a rich metadata catalog such that the archived resources can be easily discovered. The challenge of the archive is to obtain these rich metadata descriptions from the depositors without creating too much overhead for them. The rapid changes in storage technology, file formats and encoding standards make it difficult to build a long-lasting repository, therefore archives need to be set up in such a way that a straightforward and automated migration process to newer technology is possible whenever certain technology becomes obsolete. Other problems arise from the fact that there are many different groups of users of the archive, each of them with their own specific expectations and demands. Often conflicts exist between the requirements for different purposes of the archive, e.g. between long-term preservation of the data versus direct access to the resources via the web. The task of the archive is to come up with a technical solution that works well for most usage scenarios.