Spoken Dutch Corpus (Corpus Gesproken Nederlands - CGN)

Last update: 27-Feb-2001

Introduction

The Spoken Dutch Corpus project aims to construct a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. Upon completion, the corpus will contain approximately ten million words, two thirds of which originate from the Netherlands and one third from Flanders. It comprises a large number of samples of recorded speech, about 1,000 hours in all.

References

The Spoken Dutch Corpus (CGN-project)

Metadata Overview

Corpus Header	contains general information about the project and/or information that is the same for all samples
	type *	type of document the header belongs to (CORPUS)
	creator *	name of the (final) producer of the header.
	version *	version of the header, as last adapted.
	update *	date on which the header was last modified
	fileDesc	?
		titleStmt	Information about the contents of the corpus
			title	?
			respStmt	?
				respType		describes the task for which somebody/institute was responsible
				respName		name of the responsible institute
		editionStmt	number of the release
			release *	?
			version *	?
		extent	size of the corpus
			wordCount	total number of words in the corpus
			secCount	total duration of the corpus in seconds
			byteCount	total size of the corpus in bytes
			extNote	additional information about how the counts were made, e.g. regarding punctuation marks.
		tempoAv	average speaking rate in the corpus
			wph *	average number of words per hour
			(id) *	id of the discerning component
		publicationStmt	information about the publication and distribution of the corpus
			distributor	name of the distributor
			pubAddress	address of the distributor
			telephone	telephone number of the distributor
			fax	fax number of the distributor
			eAddress	email address of the distributor
			availability	distribution region of the (current version of the) corpus
				region *
				status *
			pubDate	date of distribution
			copyright	name of the copyright holder
	encodingDesc	Documents the relationship between the texts and the sources
		projectDesc	Description of the CGN project
		samplingDecl	Description of the sampling method
		editorialDecl	Information about the state of affairs during digitisation and annotation of the text
			transduction	Describes the digitisation and transcription of the recordings
			segmentation	Describes segmentation principles in the corpus, e.g. division by speakers, utterances, sentences, words etc.
		refDecl	Explains how (parts of) fragments are named and how they relate
		classDecl	Description of the classification of the samples in the corpus
		1..N	category	?
				catDesc	?
	profileDesc	Specific information about the corpus
		langUsage	describes the language (variety) included in the corpus
		wsdUsage	Contains one or more <writingSystem> sub-elements
		1..N	writingSystem	indicates which ISO character set is used.
	revDesc	Documents the changes that were applied
	1..N	change	?
			date	date on which the update was made
			respStmt	?
				respType		description of the task
				resp		description of the change
				respName		name of the responsible institute
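
The Corpus Header layout above can be illustrated with a short reading sketch. The listing gives only element names, so several things here are assumptions: that the header is serialized as XML, that items marked "*" are attributes while the others are child elements, and that the root element is called header. The sample values are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization. Assumptions: XML syntax, a root element named
# "header", and "*" items stored as attributes. All values are made up.
SAMPLE = """
<header type="CORPUS" creator="example" version="1.0" update="2001-02-27">
  <fileDesc>
    <extent>
      <wordCount>1200</wordCount>
      <secCount>480</secCount>
      <byteCount>15360000</byteCount>
    </extent>
  </fileDesc>
</header>
"""


def read_extent(xml_text):
    """Return the header type and the three extent counts as integers."""
    root = ET.fromstring(xml_text.strip())
    extent = root.find("fileDesc/extent")
    if extent is None:
        raise ValueError("header has no fileDesc/extent")
    counts = {}
    for name in ("wordCount", "secCount", "byteCount"):
        node = extent.find(name)
        # A missing or empty element is reported as None rather than guessed.
        counts[name] = int(node.text) if node is not None and node.text else None
    return root.get("type"), counts


header_type, counts = read_extent(SAMPLE)
```

If the real headers use a different syntax or attribute placement, only the lookup paths in read_extent would need to change.
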
Text Header	?
	type *	type of document the header belongs to (TEXT).
	creator *	name of the (final) producer of the header.
	version *	version of the header, as last adapted.
	update *	date on which the header was last modified
	fileDesc	contains a (bibliographic) description of the corpus
		titleStmt	information about the contents of the fragment
			title	?
			respStmt	?
				respType		Describes the task for which somebody/institute was responsible
				respName		Name of the responsible institute
		extent	size of the fragment
			wordCount	total number of words in the fragment
			secCount	total duration of the fragment in seconds
			byteCount	total size of the fragment in bytes
			extNote	additional information about how the counts were made, e.g. regarding punctuation marks.
		tempoAv	average speaking rate
			wph *	average number of words per hour
		publicationStmt	information about the distribution of the fragment
			distributor	name of the distributor
			availability	distribution of the fragment
				*corpus*	?
				cd	?
				*date*	?
		sourceDesc	bibliographic description of the source
			biblStr	bibliographic description
				author	first initial(s) and last name (of the writer)
				title	title
				pubName	name (of the publisher)
				pubPlace	place (of publishing)
				pubDate	year of publication of the edition used
			rec	?
				date *	date of recording
				time *	time of recording
			source	indicates where the material originates from
			producent	producer of the recording
	encodingDesc	Documents the relation between texts and sources
		editorialDecl	Provides details of editorial principles and practices applied during the encoding of a text
			correction
				type *	?
				status *	checked YES/NO
	profileDesc	Specific information about the corpus
		textClass	indicates which classifications are relevant for the text
			catRef	?
				target *	one or more catDesc values for the fragment
				keywords	keyword chosen from a limited list
				term *	?
		particDesc	?
			person	?
				id *	speaker identification code
				role *	speaker's role
				age *	interpretation of the speaker's age at the time of the recording
			interaction	interaction between participants
				type *	?
				active *	number of active (identified) speakers
				passive *	number of passive (unidentified) speakers
			relation	relation between the speakers
				active *	speaker identification of the active speaker in a directional relation or all speakers in a non-directional relation
				desc *	description of the relation
				mutual *	indicates whether the relation holds for all speakers or is directional
		settDesc	?
			region	province where the recording was made
			locName	place where the recording was made
			locale	description of the space where the recording was made
			activity	brief description of what the speakers are doing
		recCondition	?
			recMedium	?
				type *	medium of recording
				microphone	type of microphone used to make the recording
				micDistance
					person		speaker ID
					dist		?
					cm		distance in centimeters
				noise	description of the background noise in the recording
			digitisation	?
				opname	analog / digital ?
				verwerking	analog / digital ?
				status	analog / digital ?
	revDesc	Documents the changes that were applied
	1..N	change	?
			date	Date on which the update was made
			respStmt	?
				respType		description of the task
				resp		description of the change
				respName		name of the responsible institute
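
Both the Corpus Header and the Text Header list wordCount and secCount alongside tempoAv/wph. The listing does not say how wph is derived; the following is a minimal sketch assuming it is simply the word count divided by the duration in hours. The function name words_per_hour is hypothetical.

```python
def words_per_hour(word_count, sec_count):
    """Average speaking rate in words per hour.

    Assumption: wph = wordCount / (secCount / 3600). The source lists the
    fields but does not define the calculation.
    """
    if sec_count <= 0:
        raise ValueError("sec_count must be positive")
    return word_count * 3600 / sec_count


# A fragment of 1200 words lasting 480 seconds (8 minutes).
rate = words_per_hour(1200, 480)
```

By the same arithmetic, the corpus-wide figures from the introduction (about ten million words in about 1,000 hours) correspond to roughly 10,000 words per hour.
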
Participant Header	?
	type *	type of document the header belongs to (PARTICIPANT).
	creator *	name of the (final) producer of the header.
	version *	version of the header, as last adapted.
	update *	date on which the header was last modified
	particDesc	Gives general information about the speaker
		person	?
			id *	speaker identification code
			sex *	speaker's gender
		birth	?
			year *	speaker's year of birth
			place *	speaker's place of birth
			reg *	region where the speaker was born
		language	?
			firstLang	language variant in which the speaker was raised
				lang *	?
				dialect *	?
			homeLang	language variant the speaker uses at home
				lang *	?
				dialect *	?
			workLang	language variant the speaker uses at work
				lang *	?
				dialect *	?
		residence	?
			place *	speaker's place of residence
			reg *	region where the speaker lives
			size *	indication of the population size of the place where the speaker lives
		education	?
			place *	place where the speaker was educated
			reg *	region where the speaker was educated
			opleiding *	highest education the speaker completed
			level *	level of education
		occupation	?
			job *	speaker's job
			level *	job level indication
		notes	Other remarks concerning the speaker, e.g. participation in other projects, other places of residence, etc.
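
The Text Header records an age for each speaker and a recording date, while the Participant Header records only a year of birth, linked through the speaker id. The listing does not state how the age value is determined. As a sketch, assuming it is estimated from those two fields, the calculation could look as follows; the function name approximate_age is hypothetical.

```python
from datetime import date


def approximate_age(birth_year, recording_date):
    """Estimate a speaker's age at recording time from the year of birth.

    Assumption: only the birth year is known (as in the Participant Header),
    so the result may be one year too high if the recording took place
    before the speaker's birthday in that year.
    """
    age = recording_date.year - birth_year
    if age < 0:
        raise ValueError("recording date precedes year of birth")
    return age


# Hypothetical speaker born in 1960, recorded on 17 May 2000.
age = approximate_age(1960, date(2000, 5, 17))
```
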