Corpus Encoding Standard (CES)
Last update: 30-Aug-2000
The Corpus Encoding Standard (CES) is an encoding standard for corpus-based work in the language engineering community. The CES is an application of SGML and conforms to the TEI guidelines.
Information about the CES was taken from the CES document version 1.4 (Nancy Ide, 1996).
A CES-encoded corpus contains a single corpus header (cesHeader) and one or more documents (cesDoc). Each document contains a single text header and a text. Additionally, the cesCorpus element can be nested recursively, and sequences of this element can appear at any level of nesting, in order to identify sub-corpora.
A document, defined by cesDoc, contains a header (cesHeader) followed by either a <body> element or a <group> element.
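The nesting described above can be sketched as a small structural check in Python. This is only an illustration under stated assumptions: the sample is written as well-formed XML so that the standard-library parser can read it (the CES itself is an SGML application), the sample content is invented, and the function names `check_doc` and `check_corpus` are hypothetical. The check accepts a `<body>` or `<group>` at any depth below the document element, because the description does not say whether an intermediate wrapper is present.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, written as well-formed XML for illustration only.
SAMPLE = """
<cesCorpus>
  <cesHeader type="CORPUS"/>
  <cesCorpus>
    <cesHeader type="CORPUS"/>
    <cesDoc>
      <cesHeader type="TEXT"/>
      <body><p>First text.</p></body>
    </cesDoc>
  </cesCorpus>
  <cesDoc>
    <cesHeader type="TEXT"/>
    <group/>
  </cesDoc>
</cesCorpus>
"""


def check_doc(doc):
    """Return a list of problems found in one cesDoc element."""
    children = list(doc)
    problems = []
    if not children or children[0].tag != "cesHeader":
        problems.append("cesDoc must start with a cesHeader")
    # Accept <body> or <group> at any depth below the document element.
    has_text = any(
        el.tag in ("body", "group") for el in doc.iter() if el is not doc
    )
    if not has_text:
        problems.append("cesDoc needs a body or a group")
    return problems


def check_corpus(corpus):
    """Recursively check a cesCorpus element and its sub-corpora."""
    problems = []
    if len(corpus.findall("cesHeader")) != 1:
        problems.append("cesCorpus must have exactly one cesHeader")
    members = [c for c in corpus if c.tag in ("cesDoc", "cesCorpus")]
    if not members:
        problems.append("cesCorpus needs at least one cesDoc or sub-corpus")
    for member in members:
        if member.tag == "cesDoc":
            problems.extend(check_doc(member))
        else:
            problems.extend(check_corpus(member))
    return problems


root = ET.fromstring(SAMPLE)
problems = check_corpus(root)
doc_count = len(list(root.iter("cesDoc")))
print(doc_count, problems)
```

The recursion in `check_corpus` mirrors the recursive nesting of cesCorpus: a sub-corpus is checked by the same rule as the corpus that contains it.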
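The header (cesHeader) provides information about the electronic text that has been encoded, including not only its title, author, etc., but also information about its encoding. The elements in the header are listed below. Names marked with an asterisk (*) are attributes of the element listed before them (of cesHeader itself for the first six), and 1..N marks an element that may occur one or more times.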
type * | The kind of document to which the header is attached. CORPUS when the header is attached to the corpus and TEXT when attached to a single text. | ||||||
creator * | The agency responsible for creating the header. | ||||||
version * | The version and revision of the CES header.elt used to encode this header. This number is found near the top of header.elt itself. | ||||||
status * | The revision status of the header. NEW when it is the first version of the header and UPDATE when the header has been updated. | ||||||
date.created * | The date on which the header content was created. | ||||||
date.updated * | The date on which the header content was last updated. | ||||||
fileDesc | Contains a full bibliographic description of the corpus itself or of a text within it. The elements contained are: titleStmt, editionStmt, extent, publicationStmt and sourceDesc. Of these, titleStmt, publicationStmt and sourceDesc are required. | ||||||
titleStmt | Groups information concerning the title of the corpus or the individual text and its constituent texts. | ||||||
h.title | The title of the electronic file, including alternative titles or subtitles. | ||||||
respStmt | Supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription. | ||||||
respType | Contains a phrase describing the nature of a person's or institution's intellectual responsibility | ||||||
respName | the publisher of the corpus or text expressed as the proper name of a person, place or institution. | ||||||
editionStmt | Contains any additional information relating to a particular version of a text. | ||||||
version | |||||||
extent | provides the size of the electronic text as stored on some carrier medium. | ||||||
wordCount | contains the count of words in the text | ||||||
byteCount | contains the count of bytes in the file containing the text together with its markup. | ||||||
units * | Gives the unit in which the byteCount is measured (BYTES : bytes, KB : kilobytes, MB : megabytes, GB : gigabytes) | ||||||
extNote | A descriptive note supplying additional information of any kind relating to the extent information provided within a corpus or text header. | ||||||
publicationStmt | Groups information concerning the publication or distribution of the corpus and its constituent texts. | ||||||
distributor | Gives the name of the person or institution who distributes the text or corpus | ||||||
pubAddress | Contains the postal address of the distributor | ||||||
telephone | Gives the telephone number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123 | ||||||
fax | Gives the fax number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123 | ||||||
eAddress | Gives an electronic address of the person or institution who distributes the text or corpus. Note that more than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included | ||||||
type * | Gives the type of the electronic address (email address, web site, ftp site, etc.) | ||||||
availability | Supplies information about the availability of a text, for example, any restrictions on its use or distribution, its copyright status, etc | ||||||
region * | specifies the territories within which rights in the electronic text apply | ||||||
status * | supplies a code identifying the current availability of the text | ||||||
idno | Supplies a number (e.g., ISBN) used to identify a bibliographic item | ||||||
pubDate | The publication date expressed in any format | ||||||
value * | Specifies standard value for this date in ISO 8601 (Representation of dates and times) format | ||||||
sourceDesc | Supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated | ||||||
biblStruct (1..N) | Contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order | ||||||
analytic | Contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph, journal, or periodical and not as an independent publication | ||||||
monogr | Contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object). | ||||||
h.title | the title of a work | ||||||
h.author | in a bibliographic reference, contains the name of an author (personal or corporate) of a work; names should be given in a canonical form, with surnames preceding forenames | ||||||
respStmt | supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription | ||||||
edition | Provides bibliographic details for an edition of some text | ||||||
imprint | groups information relating to the publication or distribution of a bibliographic item | ||||||
idno | Supplies a standard (e.g., ISBN) number used to identify a bibliographic item | ||||||
type * | A name or abbreviation (e.g., ISBN) identifying what type of identifying number is given. Unless provided explicitly, the default value is ISBN | ||||||
biblScope | Defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work. | ||||||
type * | Identifies the type of information conveyed by the element (PP : page number or page range, VOL : volume number, ISSUE : issue number) | ||||||
biblNote | A descriptive note supplying additional information of any kind relating to a bibliographic item described within a corpus or text header | ||||||
publisher | Proper name of a person, place or institution | ||||||
type * | Categorises the name (PERSON : name of a person, PLACE : name of a place, ORG : name of an organization) | ||||||
pubDate | A calendar date in any format | ||||||
value * | Specifies standard value for this date in ISO 8601 format | ||||||
pubPlace | Place of publication for a book, article, etc | ||||||
encodingDesc | Documents the relationship between an electronic text and the source or sources from which it was derived | ||||||
projectDesc | Describes in detail the purpose for which an electronic file was encoded | ||||||
samplingDecl | Contains a prose description of the rationale and the methods used in sampling text in the creation of the corpus | ||||||
editorialDecl | Provides details of editorial principles and practices applied during the encoding of a text | ||||||
conformance | Provides the CES level of conformance for the text or corpus | ||||||
level * | Gives the level of CES conformance (legal values are 1, 2 or 3) | ||||||
transduction | Describes the principles according to which the text has been transduced, either in transcribing it from audio tape to written form, or in converting from an electronic original | ||||||
correction | Specifies a set of correction practices applied in creating one or more components of the corpus | ||||||
quotation | Specifies editorial practice adopted with respect to quotation marks in the original | ||||||
marks * | Indicates whether or not quotation marks are retained as tag content in the text (NONE : no quotation marks retained, SOME: some quotation marks retained, ALL : all quotation marks retained) | ||||||
form * | Specifies how quotation marks are indicated within the text (STD : use of quotation marks has been standardized; open and close quote marks are distinct, NONSTD : open and close quote marks are represented indiscriminately by the ????? , UNKNOWN : use of quotation marks unknown) | ||||||
hyphenation | Summarizes the way in which end-of-line hyphenation in a source text has been treated in an encoded version of it | ||||||
segmentation | Describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc | ||||||
normalization | Specifies a set of normalization practices applied in creating one or more components of the corpus | ||||||
method * | Indicates whether normalization was made silently or is marked with editorial tags (TAGS : normalization indicated with tags, SILENT : normalization made silently) | ||||||
tagsDecl | Provides detailed information about the tagging applied to an SGML document | ||||||
tagUsage (1..N) | Supplies information about the usage of a specific element within the corpus or text with which this header is associated | ||||||
gi * | The name (generic identifier) of the element indicated by the tag | ||||||
occurs * | Specifies the number of occurrences of this element within the text | ||||||
wsd * | Can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified character set | ||||||
refsDecl | Specifies how canonical references are constructed for this text | ||||||
classDecl | Contains a series of <category> elements, defining the classification codes used for texts within the corpus | ||||||
taxonomy (1..N) | Defines a typology used to classify texts | ||||||
category (1..N) | Contains an individual descriptive category or feature-value pair | ||||||
catDesc | Describes a category within the text typology, in the form of a brief prose description | ||||||
profileDesc | Provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it | ||||||
creation | Contains information about the origination of a text | ||||||
langUsage | Groups information describing the languages, sublanguages, registers, dialects etc. represented within a text | ||||||
language (1..N) | Characterizes a language, sublanguage, register, dialect, etc., used within a single text | ||||||
iso639 * | Gives the standard language code from ISO 639 in one of the following forms: a two-letter code from ISO 639, a three-letter code from ISO 639-2 or one of the above extended by a country code from ISO 3166 | ||||||
type * | Indicates the type of language, e.g., sublanguage, dialect, etc | ||||||
wsdUsage | Groups information describing the character set(s) used within a text | ||||||
writingSystem (1..N) | Characterizes a character set used within a single text | ||||||
textClass | Groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc | ||||||
catRef | Specifies one or more defined categories within some taxonomy or text typology | ||||||
target * | Identifies the text category or categories, by means of an IDREF pointing to one or more <category> elements defined in the corpus header | ||||||
scheme * | identifies the classification scheme | ||||||
h.keywords | Contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a term. A standard list will be provided by EAGLES/PAROLE | ||||||
keyTerm (1..N) | Contains a technical term or phrase, particularly in a list of descriptive keywords | ||||||
translations | Groups information about existing translations of the text | ||||||
translation (1..N) | Gives information about a translation of the text. The global lang attribute and the wsd attribute are required on this tag | ||||||
trans.loc * | Provides information (path/file name, URL, etc.) about the location of the translation | ||||||
translator | Gives the name of the translator | ||||||
annotations | Groups information about existing annotation files associated with the text | ||||||
annotation (1..N) | Gives information about an annotation file associated with the text | ||||||
type * | Indicates the type of annotation (SEGMENT : annotation file contains segmentation into sentences and words, GRAM : annotation file contains morpho-syntactic category information for the words in the text, ALIGN : annotation file contains alignment links to a parallel translation) | ||||||
ann.loc * | Provides information (path/file name, URL, etc.) about the location of the annotation file | ||||||
trans.loc * | For annotation files containing alignment information, provides information (path/file name, URL, etc.) about the location of the file containing the aligned text | ||||||
revisionDesc | Summarizes the revision history for a file | ||||||
change (1..N) | Summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers | ||||||
changeDate | Gives the date of the change | ||||||
value * | Specifies standard value for this date in ISO 8601 format | ||||||
respName | Specifies the person responsible for the change | ||||||
h.item | Specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element |
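Several constraints stated in the table (the closed value lists for some attributes, the three required children of fileDesc, and the ISO 8601 `value` attribute on dates) lend themselves to a mechanical check. The following Python sketch illustrates one way to do this. It is not part of the CES: the header is an invented example written as well-formed XML rather than SGML, the date pattern is a deliberately simplified stand-in for ISO 8601, and `check_header`, `ALLOWED`, `REQUIRED_IN_FILEDESC`, `DATE_ELEMENTS` and `ISO_DATE` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical header, written as well-formed XML for illustration only.
# Every name and value in it is an invented example.
HEADER = """
<cesHeader type="TEXT" creator="Example Lab" version="1.0" status="NEW"
           date.created="2000-08-30" date.updated="2000-08-30">
  <fileDesc>
    <titleStmt><h.title>Sample text</h.title></titleStmt>
    <extent><byteCount units="KB">12</byteCount></extent>
    <publicationStmt>
      <distributor>Example Lab</distributor>
      <pubDate value="2000-08-30">30 August 2000</pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct><monogr><h.title>Sample book</h.title></monogr></biblStruct>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <conformance level="1">CES level 1</conformance>
  </encodingDesc>
</cesHeader>
"""

# Closed value lists taken from the table, keyed by (element, attribute).
ALLOWED = {
    ("cesHeader", "type"): {"CORPUS", "TEXT"},
    ("cesHeader", "status"): {"NEW", "UPDATE"},
    ("byteCount", "units"): {"BYTES", "KB", "MB", "GB"},
    ("conformance", "level"): {"1", "2", "3"},
    ("quotation", "marks"): {"NONE", "SOME", "ALL"},
    ("normalization", "method"): {"TAGS", "SILENT"},
}
REQUIRED_IN_FILEDESC = ("titleStmt", "publicationStmt", "sourceDesc")
DATE_ELEMENTS = ("pubDate", "changeDate")
# Simplified ISO 8601 calendar date: YYYY, YYYY-MM or YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def check_header(header):
    """Return a list of problems found in a cesHeader element."""
    problems = []
    # Required children are only checked where a fileDesc is present.
    for file_desc in header.iter("fileDesc"):
        for name in REQUIRED_IN_FILEDESC:
            if file_desc.find(name) is None:
                problems.append(f"fileDesc lacks {name}")
    # iter() also visits the header element itself, so its attributes count.
    for el in header.iter():
        for attr, value in el.attrib.items():
            allowed = ALLOWED.get((el.tag, attr))
            if allowed is not None and value not in allowed:
                problems.append(f"{el.tag}/@{attr}: unexpected value {value!r}")
            if attr == "value" and el.tag in DATE_ELEMENTS:
                if not ISO_DATE.fullmatch(value):
                    problems.append(f"{el.tag}/@value: not an ISO 8601 date")
    return problems


header_problems = check_header(ET.fromstring(HEADER))
print(header_problems)
```

Keeping the value lists in a single dictionary means further attributes from the table, such as the `type` values of biblScope, can be added without touching the checking logic.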