Corpus Encoding Standard (CES)


Last update: 30-Aug-2000

 

Introduction

The Corpus Encoding Standard (CES) is an encoding standard for corpus-based work in the language engineering community. The CES is an application of SGML and conforms to the TEI guidelines.

 

References

Information about the CES was taken from the CES document version 1.4 (Nancy Ide, 1996).

 

Corpus Structure

 

Corpus Information

A CES-encoded corpus (cesCorpus) contains a single corpus header (cesHeader) and one or more documents (cesDoc). Each document contains a single text header and a text. The cesCorpus element can also be nested recursively, and sequences of it can appear at any level of nesting, in order to identify sub-corpora.
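
The nesting described above can be sketched in Python. This is only an illustration under stated assumptions: CES is an SGML application, and the sketch uses Python's XML tooling merely to show the tree shape. The helper name make_corpus is hypothetical, the text content of each document is left out, and whether documents and sub-corpora may be mixed at one level is not stated here.

```python
import xml.etree.ElementTree as ET


def make_corpus(doc_count, sub_corpora=()):
    """Build a cesCorpus element with one header, documents, and sub-corpora."""
    corpus = ET.Element("cesCorpus")
    # Exactly one corpus header per cesCorpus.
    ET.SubElement(corpus, "cesHeader", {"type": "CORPUS"})
    for _ in range(doc_count):
        doc = ET.SubElement(corpus, "cesDoc")
        # Each document carries its own text header (text content omitted).
        ET.SubElement(doc, "cesHeader", {"type": "TEXT"})
    # Nested cesCorpus elements identify sub-corpora.
    for sub in sub_corpora:
        corpus.append(sub)
    return corpus


part_a = make_corpus(doc_count=2)
part_b = make_corpus(doc_count=1)
root = make_corpus(doc_count=1, sub_corpora=[part_a, part_b])
print(ET.tostring(root, encoding="unicode"))
```

Each sub-corpus is itself a complete cesCorpus with its own header, which is what makes the recursion possible.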

 

Document Information

A document, defined by cesDoc, contains a header (cesHeader) followed by either a <body> element or a <group> element.
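
As a rough illustration, the content rule for cesDoc given above can be checked with a few lines of Python. The function name check_ces_doc is hypothetical, and the check covers only the ordering stated here (a header followed by either body or group). It is not a substitute for validation against the CES DTD.

```python
import xml.etree.ElementTree as ET


def check_ces_doc(doc):
    """Return a list of problems found in a cesDoc element (empty if none)."""
    if doc.tag != "cesDoc":
        return [f"expected cesDoc, found {doc.tag}"]
    problems = []
    children = [child.tag for child in doc]
    if len(children) != 2:
        problems.append(f"expected 2 children, found {len(children)}")
    if not children or children[0] != "cesHeader":
        problems.append("first child must be cesHeader")
    if len(children) < 2 or children[1] not in ("body", "group"):
        problems.append("cesHeader must be followed by body or group")
    return problems


good = ET.fromstring("<cesDoc><cesHeader/><body/></cesDoc>")
bad = ET.fromstring("<cesDoc><body/><cesHeader/></cesDoc>")
print(check_ces_doc(good))
print(check_ces_doc(bad))
```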

 

Header Information

The header (cesHeader) provides information about the electronic text that has been encoded, including not only its title, author etc. but also information about its encoding. The elements in the header are:

 

Metadata Overview

type * The kind of document to which the header is attached. CORPUS when the header is attached to the corpus and TEXT when attached to a single text.
creator * The agency responsible for creating the header.
version * The version and revision of the CES header.elt used to encode this header. This number is found near the top of the header.elt itself
status * The revision status of the header. NEW when it is the first version of the header and UPDATE when the header has been updated.
date.created * The date on which the header content was created.
date.updated * The date on which the header content was last updated.
fileDesc Contains a full bibliographic description of the corpus itself or of a text within it. The elements contained are: titleStmt, editionStmt, extent, publicationStmt and sourceDesc. Of these, titleStmt, publicationStmt and sourceDesc are required.
titleStmt Groups information concerning the title of the corpus or the individual text and its constituent texts.
h.title The title of the electronic file, including alternative titles or subtitles.
respStmt Supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.
respType Contains a phrase describing the nature of a person's or institution's intellectual responsibility
respName The publisher of the corpus or text expressed as the proper name of a person, place or institution.
editionStmt Contains any additional information relating to a particular version of a text.
version
extent Provides the size of the electronic text as stored on some carrier medium.
wordCount Contains the count of words in the text
byteCount Contains the count of bytes in the file containing the text together with its markup.
units * Gives the unit in which the byte count is measured (BYTES : bytes, KB : kilobytes, MB : megabytes, GB : gigabytes)
extNote A descriptive note supplying additional information of any kind relating to the extent information provided within a corpus or text header.
publicationStmt Groups information concerning the publication or distribution of the corpus and its constituent texts.
distributor Gives the name of the person or institution who distributes the text or corpus
pubAddress Contains the postal address of the distributor
telephone Gives the telephone number of the person or institution who distributes the text or corpus, in a format conformant to ITU-T/CCITT Recommendation E.123
fax Gives the fax number of the person or institution who distributes the text or corpus, in a format conformant to ITU-T/CCITT Recommendation E.123
eAddress Gives an electronic address of the person or institution who distributes the text or corpus. More than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included
type * Gives the type of the electronic address (email address, web site, ftp site, etc.)
availability Supplies information about the availability of a text, for example, any restrictions on its use or distribution, its copyright status, etc
region * Specifies the territories within which rights in the electronic text apply
status * Supplies a code identifying the current availability of the text
idno Supplies a number (e.g., ISBN) used to identify a bibliographic item
pubDate The publication date expressed in any format
value * Specifies standard value for this date in ISO 8601 (Representation of dates and times) format
sourceDesc Supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated
biblStruct (1..N) Contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order
analytic Contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph, journal, or periodical and not as an independent publication
monogr Contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object).
h.title The title of a work
h.author In a bibliographic reference, contains the name of an author (personal or corporate) of a work; names should be given in a canonical form, with surnames preceding forenames
respStmt Supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription
edition Provides bibliographic details for an edition of some text
imprint groups information relating to the publication or distribution of a bibliographic item
idno Supplies a standard (e.g., ISBN) number used to identify a bibliographic item
type * A name or abbreviation (e.g., ISBN) identifying what type of identifying number is given. Unless provided explicitly, the default value is ISBN
biblScope Defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
type * Identifies the type of information conveyed by the element (PP : page number or page range, VOL : volume number, ISSUE : issue number)
biblNote A descriptive note supplying additional information of any kind relating to a bibliographic item described within a corpus or text header
publisher Proper name of a person, place or institution
type * Categorises the name (PERSON : name of a person, PLACE : name of a place, ORG : name of an organization)
pubDate A calendar date in any format
value * Specifies standard value for this date in ISO 8601 format
pubPlace Place of publication for a book, article, etc
encodingDesc Documents the relationship between an electronic text and the source or sources from which it was derived
projectDesc Describes in detail the purpose for which an electronic file was encoded
samplingDecl Contains a prose description of the rationale and the methods used in sampling text in the creation of the corpus
editorialDecl Provides details of editorial principles and practices applied during the encoding of a text
conformance Provides the CES level of conformance for the text or corpus
level * Gives the level of CES conformance (legal values are 1, 2 or 3)
transduction Describes the principles according to which the text has been transduced, either in transcribing it from audio tape to written form, or in converting from an electronic original
correction Specifies a set of correction practices applied in creating one or more components of the corpus
quotation Specifies editorial practice adopted with respect to quotation marks in the original
marks * Indicates whether or not quotation marks are retained as tag content in the text (NONE : no quotation marks retained, SOME : some quotation marks retained, ALL : all quotation marks retained)
form * Specifies how quotation marks are indicated within the text (STD : use of quotation marks has been standardized; open and close quote marks are distinct, NONSTD : open and close quote marks are represented indiscriminately, UNKNOWN : use of quotation marks unknown)
hyphenation Summarizes the way in which end-of-line hyphenation in a source text has been treated in an encoded version of it
segmentation Describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc
normalization Specifies a set of normalization practices applied in creating one or more components of the corpus
method * Indicates whether normalization was made silently or marked with editorial tags (TAGS : normalization indicated with tags, SILENT : normalization made silently)
tagsDecl Provides detailed information about the tagging applied to an SGML document
tagUsage (1..N) Supplies information about the usage of a specific element within the corpus or text with which this header is associated
gi * The name (generic identifier) of the element indicated by the tag
occurs * Specifies the number of occurrences of this element within the text
wsd * Can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified character set
refsDecl Specifies how canonical references are constructed for this text
classDecl Contains a series of <category> elements, defining the classification codes used for texts within the corpus
taxonomy (1..N) Defines a typology used to classify texts
category (1..N) Contains an individual descriptive category or feature-value pair
catDesc Describes a category within the text typology, in the form of a brief prose description
profileDesc Provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it
creation Contains information about the origination of a text
langUsage Groups information describing the languages, sublanguages, registers, dialects etc. represented within a text
language (1..N) Characterizes a language, sublanguage, register, dialect, etc., used within a single text
iso639 * Gives the standard language code from ISO 639 in one of the following forms: a two-letter code from ISO 639, a three-letter code from ISO 639-2 or one of the above extended by a country code from ISO 3166
type * Indicates the type of language, e.g., sublanguage, dialect, etc
wsdUsage Groups information describing the character set(s) used within a text
writingSystem (1..N) Characterizes a character set used within a single text
textClass Groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc
catRef Specifies one or more defined categories within some taxonomy or text typology
target * Identifies the text category or categories, by means of an IDREF pointing to one or more <category> elements defined in the corpus header
scheme * Identifies the classification scheme
h.keywords Contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a term. A standard list will be provided by EAGLES/PAROLE
keyTerm (1..N) Contains a technical term or phrase, particularly in a list of descriptive keywords
translations Groups information about existing translations of the text
translation (1..N) Gives information about a translation of the text. The global lang attribute and the wsd attribute are required on this tag
trans.loc * Provides information (path/file name, URL, etc.) about the location of the translation
translator Gives the name of the translator
annotations Groups information about existing annotation files associated with the text
annotation (1..N) Gives information about an annotation file associated with the text
type * Indicates the type of annotation (SEGMENT : annotation file contains segmentation into sentences and words, GRAM : annotation file contains morpho-syntactic category information for the words in the text, ALIGN : annotation file contains alignment links to a parallel translation)
ann.loc * Provides information (path/file name, URL, etc.) about the location of the annotation file
trans.loc * For annotation files containing alignment information, provides information (path/file name, URL, etc.) about the location of the file containing the aligned text
revisionDesc Summarizes the revision history for a file
change (1..N) Summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers
changeDate Gives the date of the change
value * Specifies standard value for this date in ISO 8601 format
respName Specifies the person responsible for the change
h.item Specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element

* attribute
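
To make the overview more concrete, the following Python sketch builds a minimal text header and checks it against a few of the constraints listed above: the required children of fileDesc and the enumerated attribute values. It is an illustration under stated assumptions, not a CES validator. The function names build_header and check_header are hypothetical, the version value is a placeholder, the ISO 8601 form used for date.created and date.updated is an assumption (the overview does not give a format for them), and the required sub-elements are left empty for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Enumerated attribute values taken from the overview above.
ALLOWED = {
    ("cesHeader", "type"): {"CORPUS", "TEXT"},
    ("cesHeader", "status"): {"NEW", "UPDATE"},
    ("byteCount", "units"): {"BYTES", "KB", "MB", "GB"},
    ("conformance", "level"): {"1", "2", "3"},
    ("quotation", "marks"): {"NONE", "SOME", "ALL"},
    ("normalization", "method"): {"TAGS", "SILENT"},
}
REQUIRED_IN_FILEDESC = ("titleStmt", "publicationStmt", "sourceDesc")


def build_header(title, creator):
    """Build a minimal text header with the required fileDesc children."""
    today = date.today().isoformat()  # assumed date format
    header = ET.Element("cesHeader", {
        "type": "TEXT", "creator": creator, "version": "0.0",  # placeholder
        "status": "NEW", "date.created": today, "date.updated": today,
    })
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "h.title").text = title
    ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(file_desc, "sourceDesc")
    return header


def check_header(header):
    """Return a list of problems found in a header (empty if none)."""
    problems = []
    file_desc = header.find("fileDesc")
    for name in REQUIRED_IN_FILEDESC:
        if file_desc is None or file_desc.find(name) is None:
            problems.append(f"fileDesc is missing {name}")
    # Check every attribute that has an enumerated value set.
    for elem in header.iter():
        for attr, value in elem.attrib.items():
            allowed = ALLOWED.get((elem.tag, attr))
            if allowed is not None and value not in allowed:
                problems.append(f"{elem.tag}/@{attr}: unexpected value {value!r}")
    return problems


header = build_header("Sample text", "Example Lab")
print(check_header(header))
header.set("status", "DRAFT")
print(check_header(header))
```

Keeping the allowed values in a single table keyed by element and attribute makes it easy to extend the check to other enumerated attributes, such as the type attribute on annotation.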