Linguistic Data Consortium Catalog (LDC)


Last update: 13-Nov-2000

 

Introduction

The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. The LDC's Catalog contains 168 corpora of language data.

 

References

The Linguistic Data Consortium Catalog

LDC List of Catalog Fields (not used for this overview)

 

Catalog Structure

The catalog is an access structure layered on top of the corpora: its metadata describes the corpora listed in the catalog. Corpora are first divided into major categories according to the type of data they contain, and then further subdivided into minor categories according to the source of the data.
(See http://morph.ldc.upenn.edu/Catalog/by_type.html)
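
The two-level grouping described above (major category by data type, minor category by data source) can be sketched as a nested index. This is only an illustration: the corpus names and category assignments below are hypothetical placeholders, not actual catalog entries, and the function name group_by_category is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical sample entries, NOT real catalog records:
# (corpus name, major category = type of data, minor category = source of data)
SAMPLE_CORPORA = [
    ("Corpus A", "Speech", "broadcast"),
    ("Corpus B", "Speech", "conversation"),
    ("Corpus C", "Text", "broadcast"),
    ("Corpus D", "Speech", "broadcast"),
]


def group_by_category(corpora):
    """Build a {major category: {minor category: [corpus names]}} index."""
    index = defaultdict(lambda: defaultdict(list))
    for name, major, minor in corpora:
        index[major][minor].append(name)
    # Convert the nested defaultdicts to plain dicts for predictable lookups.
    return {major: dict(minors) for major, minors in index.items()}


catalog_index = group_by_category(SAMPLE_CORPORA)

# Walk the index the way a reader would browse the catalog by type.
for major, minors in catalog_index.items():
    for minor, names in minors.items():
        print(major, "/", minor, ":", ", ".join(names))
```

Looking up catalog_index["Speech"]["broadcast"] then returns the names of all sample corpora filed under that major and minor category.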

 

Metadata Overview

Field                    Description
Catalog number           Contains a unique LDC catalog number
Name                     Contains the name of the corpus
ISBN                     Contains the ISBN
Data Sources             Contains the corpus data source (broadcast, conversation, microphone etc.)
Research Project         Contains the projects in which the corpus was used
Recommended Application  Contains the recommended applications for which the corpus is useful
Language                 Contains the language used in the corpus
Membership Year          Contains the year in which the corpus was released
Corpus Type              Defines the type of the corpus (Lexicon, Speech or Text)
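
The fields listed above could be modelled as a single record per corpus. The sketch below is one possible representation, not the LDC's own schema: the class name CatalogEntry, the Python attribute names, the choice of lists for multi-valued fields, and every value in the example record are assumptions made for illustration. Only the field meanings and the three corpus types come from the overview.

```python
from dataclasses import dataclass
from typing import List

# The three corpus types named in the overview.
CORPUS_TYPES = ("Lexicon", "Speech", "Text")


@dataclass
class CatalogEntry:
    """One catalog record; attribute names are hypothetical."""
    catalog_number: str                  # unique LDC catalog number
    name: str                            # name of the corpus
    isbn: str                            # ISBN
    data_sources: List[str]              # e.g. broadcast, conversation, microphone
    research_projects: List[str]         # projects in which the corpus was used
    recommended_applications: List[str]  # applications the corpus is useful for
    language: str                        # language used in the corpus
    membership_year: int                 # year in which the corpus was released
    corpus_type: str                     # Lexicon, Speech or Text

    def __post_init__(self):
        # Reject corpus types other than the three listed in the overview.
        if self.corpus_type not in CORPUS_TYPES:
            raise ValueError(f"unknown corpus type: {self.corpus_type!r}")


# Placeholder record: every value here is invented for the example.
example = CatalogEntry(
    catalog_number="EXAMPLE-0001",
    name="Example Speech Corpus",
    isbn="0-00000-000-0",
    data_sources=["broadcast"],
    research_projects=["Example Project"],
    recommended_applications=["speech recognition"],
    language="English",
    membership_year=2000,
    corpus_type="Speech",
)
print(example.name, "-", example.corpus_type)
```

Validating the corpus type at construction time keeps malformed records out of any index built on top of these entries; the other fields are left unchecked because the overview does not state their permitted values.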