DATABASES & CORPORA

 

#
Name
Language
Type
Description
1
LDC Corpora
1.1
Boston University Radio Speech Corpus
English
Microphone speech
Applications: prosody, speech recognition, and speech synthesis.
It consists of professionally read radio news data, including speech and accompanying annotations, suitable for speech and language research.
1.2
CALLHOME American English Speech
English
Telephone speech
Application: speech recognition.
It consists of 120 unscripted telephone conversations between native speakers of English.
1.3
CALLHOME American English Transcripts
English
Conversation text
Application: speech recognition.
It includes transcripts and documentation files for 120 unscripted telephone conversations between native speakers of English.
1.4
CELEX2
English, Dutch, German
Varied lexicon
Applications: parsing, pronunciation modeling, speech synthesis.
It contains detailed information on orthography, phonology, morphology, syntax, and word frequency in three languages.
1.5
DSO Corpus of Sense-tagged English
English
Varied text
Application: natural language processing.
It contains sense-tagged word occurrences of 121 nouns and 70 verbs which are among the most frequently occurring and ambiguous words in English.
1.6
TIMIT Acoustic-Phonetic Continuous Speech Corpus
English
Microphone speech
Application: speech recognition.
It contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences.
1.7
Treebank-2
English
Varied text
Applications: natural language processing, parsing, tagging.
It contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech.
1.8
Treebank-3
English
Varied text
Applications: natural language processing, parsing, tagging.
It contains tagged, dysfluency-annotated, and parsed Switchboard text, and parsed Brown text, in addition to the Treebank-2 material.
1.9
Chinese Treebank Final Release
Chinese
Newswire text
Applications: natural language processing, parsing, tagging.
It is the final release of the Chinese Penn Treebank Project corpus and contains 100,000 Chinese words with syntactic bracketing.
1.10
CALLHOME Japanese Speech
Japanese
Telephone speech
Application: speech recognition.
It consists of 120 unscripted telephone conversations between native speakers of Japanese.
1.11
CALLHOME Japanese Transcripts
Japanese
Conversation text
Application: speech recognition.
It includes transcripts and documentation files. Each transcript covers a contiguous 5- or 10-minute segment taken from one of 120 unscripted telephone conversations between native speakers of Japanese.
1.12
Japanese Business News Text
Japanese
Newswire text
Applications: information retrieval, language modeling.
It comprises business and financial news from Nihon Keizai Shimbun (approximately 30 million words) and Dow Jones Telerate, a financial newswire produced by Kyodo News Service.
1.13
Korean Newswire
Korean
Newswire text
Applications: information retrieval, language modeling.
It is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000.
1.14
1998 HUB5 English Evaluation
English
Telephone speech
Application: speech recognition.
It contains forty SPHERE files encoded in two-channel interleaved mu-law, for a total of 605 Mbytes of SPHERE data.
1.15
Switchboard-2 Phase III Audio
English
Telephone speech
Application: speaker identification.
The data focuses on native speakers of English in the American South. It contains 2,728 telephone calls, each a 5-6 minute conversation, from 640 participants (292 male, 348 female) recorded under varied environmental conditions.
2
British National Corpus (BNC) World Edition
English
Varied speech and text
It contains 100 million words of written and spoken language sampled from a wide range of sources, designed to represent a wide cross-section of current British English.
3
Child Language Data Exchange System (CHILDES) 2001
Various languages
Transcripts of naturalistic speech of children
It provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video.
4
Corpus of Spoken Professional American English tagged version
English
Conversation text
The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each.
5
ICAME (International Computer Archive of Modern and Medieval English)
English
Written, spoken, historical, and parsed corpora
It provides information on English-language material available for computer processing and on linguistic research completed or in progress on that material. It also compiles an archive of English text corpora in machine-readable form and makes the material available to research institutions.
6
ICE-GB Sample (The British Component of the International Corpus of English)
English
Varied text
ICE-GB consists of a million words of spoken and written English and adheres to the common corpus design. 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, permitting complex and detailed searches across the whole corpus.
7
IViE (Intonational Variation in English)
English
Varied speech
It documents cross-varietal and stylistic variation in English intonation, with data from nine urban varieties of English spoken in the British Isles. Intonation is investigated from a number of angles, such as intonational meaning, the role of intonation in discourse, intonation synthesis, focus structure, acoustic structure, and phonology.
8
ToBI Corpora
English, Japanese, Korean
Varied speech
ToBI is a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety.
9
UCLA Speech Error Corpus
Various languages
Database
It contains more than 10,000 speech errors in English, French, and other languages, from normal and aphasic individuals. The corpus is constantly expanding as new errors are collected and researchers from around the world send in errors to be entered.
10
ANC First Release
English (American)
11
CALLHOME German Speech
German
12
Chinese Gigaword
Chinese
13
Korean Telephone Conversations Speech
Korean
14
SLX Corpus of Classic Sociolinguistic Interviews
English
15
21st Century SEJONG Project
Korean
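
Note on entry 1.14: its audio is described as two-channel interleaved mu-law. As a rough illustration of what that encoding implies for anyone reading the sample data, the sketch below de-interleaves a byte stream into its two channels and expands each mu-law byte to a linear sample using the standard G.711 expansion. It assumes the file header has already been stripped and that samples alternate one byte per channel; the function names and the four-byte payload are hypothetical and are not part of the corpus or its tools.

```python
def mulaw_to_linear(byte_value: int) -> int:
    """Decode one 8-bit G.711 mu-law code to a signed linear sample."""
    u = ~byte_value & 0xFF                  # mu-law codes are stored bit-inverted
    magnitude = ((u & 0x0F) << 3) + 0x84    # 4-bit mantissa plus the bias (132)
    magnitude <<= (u & 0x70) >> 4           # scale by the 3-bit exponent
    # The top bit of the inverted code is the sign bit.
    return 0x84 - magnitude if u & 0x80 else magnitude - 0x84


def split_channels(samples: bytes) -> tuple[list[int], list[int]]:
    """De-interleave A,B,A,B,... mu-law bytes and decode each channel."""
    channel_a = [mulaw_to_linear(b) for b in samples[0::2]]
    channel_b = [mulaw_to_linear(b) for b in samples[1::2]]
    return channel_a, channel_b


if __name__ == "__main__":
    # Hypothetical payload (assumption): two frames, one byte per channel.
    raw = bytes([0xFF, 0x00, 0x80, 0x7F])
    side_a, side_b = split_channels(raw)
    print(side_a, side_b)
```

In a two-sided telephone recording each channel would normally hold one side of the call, so splitting the channels first keeps the two speakers apart for later processing.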

Maintained by the weblings. Revised 04/19/02