Spoken Dutch Corpus (Corpus Gesproken Nederlands - CGN)

Last update: 27-Feb-2001

 

Introduction

The Spoken Dutch Corpus Project aims to construct a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. Upon completion, the corpus will contain approximately ten million words, two thirds of which originate from the Netherlands and one third from Flanders. The corpus comprises a large number of samples of recorded speech, about 1,000 hours in all.

 

References

The Spoken Dutch Corpus (CGN-project)

 

Metadata Overview

Corpus Header contains general information about the project and/or information that is the same for all samples
type * type of document the header belongs to (CORPUS)
creator * name of the (final) producer of the header.
version * version of the header that is adapted.
update * date on which the header was last modified
fileDesc ?
titleStmt Information about the contents of the corpus
title ?
respStmt ?
respType describes the task for which somebody/institute was responsible
respName name of the responsible institute
editionStmt number of the release
release * ?
version * ?
extent size of the corpus
wordCount total number of words in the corpus
secCount total duration of the corpus in seconds
byteCount total size of the corpus in bytes
extNote additional information about how the counts were made, e.g. how punctuation marks are treated
tempoAv average speaking rate in the corpus
wph * average number of words per hour
(id) * id of the discerning component
publicationStmt information about the publication and distribution of the corpus
distributor name of the distributor
pubAddress address of the distributor
telephone telephone number of the distributor
fax fax number of the distributor
eAddress email address of the distributor
availability distribution region of the (current version of the) corpus
region *
status *
pubDate date of distribution
copyright name of the copyright holder
encodingDesc Documents the relationship between the texts and the sources
projectDesc Description of the CGN project
samplingDecl Description of the sampling method
editorialDecl Information about the state of affairs during digitisation and annotation of the text
transduction Describes the digitization and transcription process of recordings
segmentation Describes segmentation principles in the corpus, e.g. division by speakers, utterances, sentences, words etc.
refDecl Explains how (parts of) fragments are named and how they relate
classDecl Description of the classification of the samples in the corpus

1..N

category ?
catDesc ?
profileDesc Specific information about the corpus
langUsage describes the language (variety) which is included in the corpus
wsdUsage Contains one or more <writingSystem> sub-elements

1..N

writingSystem indicates which ISO character set is used
revDesc Documents the applied changes

1..N

change ?
date date on which the update was made
respStmt ?
respType description of the task
resp description of the change
respName name of the responsible institute
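
The `wph` value under `tempoAv` can be related to the `wordCount` and `secCount` values under `extent`. The following is a minimal sketch, assuming that `wph` is simply the word count divided by the duration and scaled to one hour; the source does not state how the figure is actually derived, and the function name `words_per_hour` is hypothetical.

```python
def words_per_hour(word_count: int, sec_count: float) -> float:
    """Average speaking rate in words per hour.

    Assumption: wph = wordCount * 3600 / secCount. The CGN documentation
    does not spell out the formula; this is only one plausible reading.
    """
    if sec_count <= 0:
        raise ValueError("sec_count must be positive")
    # Multiply before dividing so that round figures stay exact.
    return word_count * 3600 / sec_count


# Figures from the introduction: about ten million words in about 1,000 hours.
corpus_rate = words_per_hour(10_000_000, 1_000 * 3600)
print(corpus_rate)  # 10000.0
```

With the approximate figures from the introduction, this gives roughly 10,000 words per hour for the corpus as a whole.
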
Text Header ?
type * type of document the header belongs to (TEXT)
creator * name of the (final) producer of the header.
version * version of the header that is adapted.
update * date on which the header was last modified
fileDesc contains a (bibliographic) description of the corpus
titleStmt information about the contents of the fragment
title ?
respStmt ?
respType Describes the task for which somebody/institute was responsible
respName Name of the responsible institute
extent size of the fragment
wordCount total number of words in the fragment
secCount total duration of the fragment in seconds
byteCount total size of the fragment in bytes
extNote additional information about how the counts were made, e.g. how punctuation marks are treated
tempoAv average speaking rate
wph * average number of words per hour
publicationStmt information about the distribution of the fragment
distributor name of the distributor
availability distribution of the fragment
corpus ?
cd ?
date ?
sourceDesc bibliographic description of the source
biblStr bibliographic description
author first initial(s) and last name (of the writer)
title title
pubName name (of the publisher)
pubPlace place (of publishing)
pubDate year of publication of the edition used
rec ?
date * date of recording
time * time of recording
source indicates from where the material originates
producent producer of the recording
encodingDesc Documents the relation between texts and sources
editorialDecl Provides details of editorial principles and practices applied during the encoding of a text
correction  
type * ?
status * checked YES/NO
profileDesc Specific information about the corpus
textClass indicates which classifications are relevant for the text
catRef ?
target * one or more catDesc values for the fragment
keywords keyword chosen from a limited list
term * ?
particDesc ?
person ?
id * speaker identification code
role * speaker's role
age * interpretation of the age of the speaker during the recording
interaction interaction between participants
type * ?
active * number of active (identified) speakers
passive * number of passive (unidentified) speakers
relation relation between the speakers
active * speaker identification of the active speaker in a directional relation or all speakers in a non-directional relation
desc * description of the relation
mutual * indicates whether the relation holds for all speakers or is directional
settDesc ?
region province where the recording was made
locName place where the recording was made
locale description of the space where the recording was made
activity brief description of what the speakers are doing
recCondition ?
recMedium ?
type * medium of recording
microphone type of microphone used to make the recording
micDistance  
person speaker ID
dist ?
cm distance in centimeters
noise description of the background noise in the recording
digitisation ?
opname analog / digital ?
verwerking analog / digital ?
status analog / digital ?
revDesc Documents the changes that are applied

1..N

change ?
date Date on which the update was made
respStmt ?
respType description of the task
resp description of the change
respName name of the responsible institute
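
To illustrate how the `particDesc` part of a text header might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It rests on several assumptions that the listing above does not confirm: that the header is serialised as XML, that fields marked `*` are attributes, that `interaction` is nested inside `particDesc`, and that `active` equals the number of `person` elements. The sample fragment, the speaker codes and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical text-header fragment. The nesting, the attribute syntax and
# all values (speaker codes, roles, interaction type) are illustrative
# assumptions, not taken from the CGN specification.
SAMPLE_TEXT_HEADER = """
<header type="TEXT">
  <profileDesc>
    <particDesc>
      <person id="SPK-A" role="interviewer"/>
      <person id="SPK-B" role="guest"/>
      <interaction type="dialogue" active="2" passive="0"/>
    </particDesc>
  </profileDesc>
</header>
"""


def list_speakers(header_xml: str) -> list[tuple[str, str]]:
    """Return (id, role) pairs for every <person>, in document order."""
    root = ET.fromstring(header_xml)
    return [(p.get("id", ""), p.get("role", "")) for p in root.iter("person")]


def active_count_matches(header_xml: str) -> bool:
    """Check that interaction/@active equals the number of <person> elements."""
    root = ET.fromstring(header_xml)
    interaction = root.find(".//interaction")
    if interaction is None:
        return False
    return int(interaction.get("active", "-1")) == len(list(root.iter("person")))


print(list_speakers(SAMPLE_TEXT_HEADER))
print(active_count_matches(SAMPLE_TEXT_HEADER))
```

A consistency check of this kind is only meaningful if every active speaker really has its own `person` entry, which should be verified against actual header files.
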
Participant Header ?
type * type of document the header belongs to (PARTICIPANT)
creator * name of the (final) producer of the header.
version * version of the header that is adapted.
update * date on which the header was last modified
particDesc Gives general information about the speaker
person ?
id * speaker identification code
sex * speaker's gender
birth ?
year * speaker's year of birth
place * speaker's place of birth
reg * region where the speaker was born
language ?
firstLang language variant in which the speaker was raised
lang * ?
dialect * ?
homeLang language variant the speaker uses at home
lang * ?
dialect * ?
workLang language variant the speaker uses at work
lang * ?
dialect * ?
residence ?
place * speaker's place of residence
reg * region where the speaker lives
size * indication of the population size of the place where the speaker lives
education ?
place * place where the speaker received his/her education
reg * region where the speaker received his/her education
opleiding * highest education the speaker completed
level * level of education
occupation ?
job * speaker's job
level * job level indication
notes Other remarks concerning the speaker, e.g. participation in other projects, other places of residence, etc.
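
The `language` block of the participant header records three language variants per speaker (`firstLang`, `homeLang`, `workLang`), each with `lang` and `dialect`. As a minimal sketch of how this could be modelled in code, the following uses Python dataclasses. The class names, the helper method and the sample values are hypothetical, and the actual value sets for `lang` and `dialect` are not given in the listing above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageVariant:
    """One lang/dialect pair, as listed under firstLang, homeLang and workLang."""
    lang: str
    dialect: Optional[str] = None  # assumption: a dialect may be absent


@dataclass
class SpeakerLanguages:
    """The three language variants recorded for a speaker."""
    first_lang: LanguageVariant
    home_lang: LanguageVariant
    work_lang: LanguageVariant

    def differs_between_home_and_work(self) -> bool:
        # True when the speaker reports different variants at home and at work.
        return self.home_lang != self.work_lang


# Illustrative values only; real CGN codes may look different.
speaker = SpeakerLanguages(
    first_lang=LanguageVariant("nl", "example-dialect"),
    home_lang=LanguageVariant("nl", "example-dialect"),
    work_lang=LanguageVariant("nl"),
)
print(speaker.differs_between_home_and_work())  # True
```

Making `LanguageVariant` frozen keeps the pairs hashable and comparable by value, which is convenient when grouping speakers by variant.
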