Linguistic Data Consortium (LDC) Catalog

Last update: 13-Nov-2000

Introduction

The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. The LDC's Catalog contains 168 corpora of language data.

References

The Linguistic Data Consortium Catalog

LDC List of Catalog Fields (not used for this overview)

Catalog Structure

The catalog is an access structure layered on top of the corpora: its metadata describes the corpora it lists. Corpora are first divided into major categories according to the type of data they contain, and then further broken down into minor categories based on the source of the data.
(See http://morph.ldc.upenn.edu/Catalog/by_type.html)
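
The two-level breakdown described above (major category by data type, minor category by data source) can be sketched as a nested grouping. This is a minimal illustration in Python, not the LDC's own implementation. The function name `group_catalog`, the tuple layout, and the sample entries are hypothetical placeholders, not real catalog records.

```python
from collections import defaultdict


def group_catalog(entries):
    """Group (name, corpus_type, data_source) tuples by type, then by source."""
    grouped = defaultdict(lambda: defaultdict(list))
    for name, corpus_type, data_source in entries:
        grouped[corpus_type][data_source].append(name)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {ctype: dict(by_source) for ctype, by_source in grouped.items()}


# Placeholder entries for illustration only (assumed, not taken from the catalog).
sample = [
    ("Example Corpus A", "Speech", "broadcast"),
    ("Example Corpus B", "Speech", "conversation"),
    ("Example Corpus C", "Text", "broadcast"),
    ("Example Corpus D", "Speech", "broadcast"),
]

catalog = group_catalog(sample)
for corpus_type, by_source in catalog.items():
    for data_source, names in by_source.items():
        print(corpus_type, data_source, names)
```

Keeping the major category as the outer key mirrors the browsing order of the catalog: a reader picks a data type first and only then narrows by source.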

Metadata Overview

Catalog number	Contains a unique LDC catalog number
Name	Contains the name of the corpus
ISBN	Contains the ISBN
Data Sources	Contains the data source of the corpus (broadcast, conversation, microphone, etc.)
Research Project	Contains the projects in which the corpus was used
Recommended Application	Contains the recommended applications for which the corpus is useful
Language	Contains the language used in the corpus
Membership Year	Contains the year in which the corpus was released
Corpus Type	Defines the type of the corpus (Lexicon, Speech or Text)
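
The fields listed above could be modelled as a single record type. The following Python sketch is one possible representation, offered as an assumption rather than the LDC's actual schema. The class name `CatalogEntry`, the attribute names, the choice of which fields are optional or multi-valued, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The three corpus types named in the overview.
CORPUS_TYPES = {"Lexicon", "Speech", "Text"}


@dataclass
class CatalogEntry:
    catalog_number: str            # unique LDC catalog number
    name: str                      # name of the corpus
    corpus_type: str               # Lexicon, Speech or Text
    language: str                  # language used in the corpus
    membership_year: int           # year in which the corpus was released
    isbn: Optional[str] = None     # assumed optional for this sketch
    data_sources: List[str] = field(default_factory=list)
    research_projects: List[str] = field(default_factory=list)
    recommended_applications: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject any type outside the three listed in the overview.
        if self.corpus_type not in CORPUS_TYPES:
            raise ValueError(f"unknown corpus type: {self.corpus_type!r}")


# Placeholder record for illustration only (not a real catalog entry).
entry = CatalogEntry(
    catalog_number="LDC-EXAMPLE-0001",
    name="Example Speech Corpus",
    corpus_type="Speech",
    language="English",
    membership_year=2000,
    data_sources=["broadcast"],
)
print(entry)
```

Treating Data Sources, Research Project and Recommended Application as lists is a design assumption: the overview does not say whether these fields hold one value or several.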