Spoken Dutch Corpus (Corpus Gesproken Nederlands - CGN)
Last update: 27-Feb-2001
The Spoken Dutch Corpus project aims to build a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. Upon completion, the corpus will contain approximately ten million words, two thirds of which originate from the Netherlands and one third from Flanders. It comprises a large number of samples of recorded speech, amounting to about 1,000 hours in all.
The Spoken Dutch Corpus (CGN-project)
Corpus Header | contains general information about the project and/or information that is the same for all samples
type * | type of document the header belongs to (CORPUS)
creator * | name of the (final) producer of the header
version * | version of the header as adapted
update * | date on which the header was last modified
fileDesc | ?
titleStmt | information about the contents of the corpus
title | ?
respStmt | ?
respType | task for which a person or institute was responsible
respName | name of the responsible institute
editionStmt | number of the release
release * | ?
version * | ?
extent | size of the corpus
wordCount | total number of words in the corpus
secCount | total duration of the corpus in seconds
byteCount | total size of the corpus in bytes
extNote | additional information about the counting method(s), e.g. regarding punctuation marks
tempoAv | average speaking rate in the corpus
wph * | average number of words per hour
(id) * | id of the distinguished component
publicationStmt | information about the publication and distribution of the corpus
distributor | name of the distributor
pubAddress | address of the distributor
telephone | telephone number of the distributor
fax | fax number of the distributor
eAddress | email address of the distributor
availability | distribution region of the (current version of the) corpus
region * |
status * |
pubDate | date of distribution
copyright | name of the copyright holder
encodingDesc | documents the relationship between the texts and their sources
projectDesc | description of the CGN project
samplingDecl | description of the sampling method
editorialDecl | information about the procedures followed during digitisation and annotation of the text
transduction | describes how the recordings were digitised and transcribed
segmentation | describes the segmentation principles used in the corpus, e.g. division by speakers, utterances, sentences, words
refDecl | explains how fragments (or parts of them) are named and how they relate to each other
classDecl | description of the classification of the samples in the corpus
1..N | category | ?
catDesc | ?
profileDesc | specific information about the corpus
langUsage | language (variety) included in the corpus
wsdUsage | contains one or more <writingSystem> sub-elements
1..N | writingSystem | indicates which ISO character set is used
revDesc | documents the changes that have been applied
1..N | change | ?
date | date on which the update was made
respStmt | ?
respType | description of the task
resp | description of the change
respName | name of the responsible institute
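To illustrate how some of the Corpus Header elements might fit together, the following Python sketch builds a small header with the standard-library `xml.etree.ElementTree` and derives the `wph` value from the word and second counts. The root element name `header`, the nesting of `tempoAv` under `extent`, the reading of names marked `*` as XML attributes, and all attribute values are assumptions made for the example; the table above does not fix them.

```python
import xml.etree.ElementTree as ET


def words_per_hour(word_count: int, sec_count: int) -> int:
    """Average speaking rate in words per hour, rounded to the nearest integer."""
    return round(word_count * 3600 / sec_count)


def build_corpus_header(word_count: int, sec_count: int) -> ET.Element:
    # "header" is a hypothetical root name; names marked * in the table
    # are assumed here to be XML attributes. All values are placeholders.
    header = ET.Element("header", {"type": "CORPUS", "creator": "example",
                                   "version": "0.1", "update": "2001-02-27"})
    file_desc = ET.SubElement(header, "fileDesc")
    extent = ET.SubElement(file_desc, "extent")
    ET.SubElement(extent, "wordCount").text = str(word_count)
    ET.SubElement(extent, "secCount").text = str(sec_count)
    # Assumption: tempoAv sits under extent and carries the rate in wph.
    ET.SubElement(extent, "tempoAv",
                  {"wph": str(words_per_hour(word_count, sec_count))})
    return header


# Illustrative figures based on the project's target size:
# about ten million words in about 1,000 hours of speech.
corpus_header = build_corpus_header(10_000_000, 1_000 * 3600)
print(ET.tostring(corpus_header, encoding="unicode"))
```

With these target figures the derived rate comes to 10,000 words per hour; a real header would use the actual counts of the release.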
Text Header | ?
type * | type of document the header belongs to (TEXT)
creator * | name of the (final) producer of the header
version * | version of the header as adapted
update * | date on which the header was last modified
fileDesc | contains a (bibliographic) description of the corpus
titleStmt | information about the contents of the fragment
title | ?
respStmt | ?
respType | task for which a person or institute was responsible
respName | name of the responsible institute
extent | size of the fragment
wordCount | total number of words in the fragment
secCount | total duration of the fragment in seconds
byteCount | total size of the fragment in bytes
extNote | additional information about the counting method(s), e.g. regarding punctuation marks
tempoAv | average speaking rate
wph * | average number of words per hour
publicationStmt | information about the distribution of the fragment
distributor | name of the distributor
availability | distribution of the fragment
corpus | ?
cd | ?
date | ?
sourceDesc | bibliographic description of the source
biblStr | bibliographic description
author | initial(s) and last name of the author
title | title
pubName | name of the publisher
pubPlace | place of publication
pubDate | year of publication of the edition used
rec | ?
date * | date of recording
time * | time of recording
source | indicates where the material originates from
producent | producer of the recording
encodingDesc | documents the relationship between the texts and their sources
editorialDecl | provides details of the editorial principles and practices applied during the encoding of a text
correction |
type * | ?
status * | whether the text has been checked (YES/NO)
profileDesc | specific information about the corpus
textClass | indicates which classifications are relevant for the text
catRef | ?
target * | one or more catDesc values for the fragment
keywords | keyword chosen from a limited list
term * | ?
particDesc | ?
person | ?
id * | speaker identification code
role * | speaker's role
age * | interpretation of the speaker's age at the time of recording
interaction | interaction between participants
type * | ?
active * | number of active (identified) speakers
passive * | number of passive (unidentified) speakers
relation | relation between the speakers
active * | speaker id of the active speaker in a directional relation, or of all speakers in a non-directional relation
desc * | description of the relation
mutual * | indicates whether the relation holds for all speakers or is directional
settDesc | ?
region | province where the recording was made
locName | place where the recording was made
locale | description of the space where the recording was made
activity | brief description of what the speakers are doing
recCondition | ?
recMedium | ?
type * | medium of recording
microphone | type of microphone used to make the recording
micDistance |
person | speaker ID
dist | ?
cm | distance in centimeters
noise | description of the background noise in the recording
digitisation | ?
opname | analog / digital ?
verwerking | analog / digital ?
status | analog / digital ?
revDesc | documents the changes that have been applied
1..N | change | ?
date | date on which the update was made
respStmt | ?
respType | description of the task
resp | description of the change
respName | name of the responsible institute
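The participant-related fields of the Text Header refer to one another: `interaction` counts the active speakers, and `relation` names speakers by their id. The Python sketch below shows one way such a fragment could be checked for internal consistency. The XML fragment, the placement of `interaction` and `relation` inside `particDesc`, the whitespace-separated list of ids in `active`, and all ids, roles and values are invented for the example and are not taken from the CGN specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical text-header fragment; ids, roles, ages and values are invented.
SAMPLE = """
<particDesc>
  <person id="S001" role="interviewer" age="34"/>
  <person id="S002" role="interviewee" age="51"/>
  <interaction type="dialogue" active="2" passive="0"/>
  <relation active="S001 S002" desc="colleagues" mutual="YES"/>
</particDesc>
"""


def check_participants(xml_text: str) -> list[str]:
    """Return the problems found; an empty list means the fragment is consistent."""
    root = ET.fromstring(xml_text)
    speaker_ids = {p.get("id") for p in root.findall("person")}
    problems = []

    # The number of active speakers should equal the number of person elements.
    interaction = root.find("interaction")
    if interaction is not None and int(interaction.get("active", "0")) != len(speaker_ids):
        problems.append("interaction/@active does not match the number of person elements")

    for relation in root.findall("relation"):
        # Assumption: several speaker ids are separated by whitespace.
        for speaker in relation.get("active", "").split():
            if speaker not in speaker_ids:
                problems.append(f"relation refers to unknown speaker {speaker}")
    return problems


print(check_participants(SAMPLE))
```

Returning a list of messages rather than raising on the first mismatch makes it possible to report every inconsistency in a header in one pass.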
Participant Header | ?
type * | type of document the header belongs to (PARTICIPANT)
creator * | name of the (final) producer of the header
version * | version of the header as adapted
update * | date on which the header was last modified
particDesc | general information about the speaker
person | ?
id * | speaker identification code
sex * | speaker's gender
birth | ?
year * | speaker's year of birth
place * | speaker's place of birth
reg * | region where the speaker was born
language | ?
firstLang | language variety in which the speaker was raised
lang * | ?
dialect * | ?
homeLang | language variety the speaker uses at home
lang * | ?
dialect * | ?
workLang | language variety the speaker uses at work
lang * | ?
dialect * | ?
residence | ?
place * | speaker's place of residence
reg * | region where the speaker lives
size * | indication of the population size of the place where the speaker lives
education | ?
place * | place where the speaker was educated
reg * | region where the speaker was educated
opleiding * | highest education the speaker completed
level * | level of education
occupation | ?
job * | speaker's job
level * | indication of job level
notes | other remarks concerning the speaker, e.g. participation in other projects or other places of residence
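For programs that work with speaker metadata, the Participant Header fields can be mirrored in a small record type. The Python sketch below models a subset of the fields with dataclasses and derives an approximate age for a given recording year. The class and field names, the coded values (`"F"`, `"nl"`, `"Limburgs"`), and the idea of deriving age from the birth year are assumptions for illustration; the table does not specify how the `age` attribute of the Text Header is obtained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageUse:
    """Mirrors the lang and dialect attributes of firstLang, homeLang and workLang."""
    lang: str
    dialect: Optional[str] = None


@dataclass
class Participant:
    """A subset of the Participant Header fields (hypothetical Python names)."""
    id: str
    sex: str
    birth_year: int
    birth_place: str = ""
    first_lang: Optional[LanguageUse] = None
    home_lang: Optional[LanguageUse] = None
    work_lang: Optional[LanguageUse] = None
    notes: str = ""

    def age_at(self, recording_year: int) -> int:
        """Approximate age in the year of recording (ignores month and day)."""
        if recording_year < self.birth_year:
            raise ValueError("recording year precedes birth year")
        return recording_year - self.birth_year


# Invented example speaker.
speaker = Participant(id="S001", sex="F", birth_year=1965,
                      first_lang=LanguageUse(lang="nl", dialect="Limburgs"))
print(speaker.age_at(2000))
```

Keeping the three language fields as separate optional records reflects that each of `firstLang`, `homeLang` and `workLang` has its own `lang` and `dialect` attributes in the table.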