DATABASES & CORPORA

| # | Name | Language | Type | Description |
| 1 | | | | |
| 1.1 | | English | Microphone speech | Applications: prosody, speech recognition, and speech synthesis. It consists of professionally read radio news data, including speech and accompanying annotations, suitable for speech and language research. |
| 1.2 | | English | Telephone speech | Application: speech recognition. It consists of 120 unscripted telephone conversations between native speakers of English. |
| 1.3 | | English | Conversation text | Application: speech recognition. It includes transcripts and documentation files for 120 unscripted telephone conversations between native speakers of English. |
| 1.4 | | English Dutch German | Varied lexicon | Applications: parsing, pronunciation modeling, speech synthesis. It contains detailed information on orthography, phonology, morphology, syntax, and word frequency in three languages. |
| 1.5 | | English | Varied text | Application: natural language processing. It contains sense-tagged word occurrences of 121 nouns and 70 verbs which are among the most frequently occurring and ambiguous words in English. |
| 1.6 | | English | Microphone speech | Application: speech recognition. It contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. |
| 1.7 | | English | Varied text | Applications: natural language processing, parsing, tagging. It contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. |
| 1.8 | | English | Varied text | Applications: natural language processing, parsing, tagging. It contains Switchboard tagged, dysfluency-annotated, and parsed text, and Brown parsed text, in addition to the Treebank-2 material. |
| 1.9 | | Chinese | Newswire text | Applications: natural language processing, parsing, tagging. It contains the Chinese Penn Treebank Project corpus Final Release and includes 100,000 Chinese words with syntactic bracketing. |
| 1.10 | | Japanese | Telephone speech | Application: speech recognition. It consists of 120 unscripted telephone conversations between native speakers of Japanese. |
| 1.11 | | Japanese | Conversation text | Application: speech recognition. It includes transcripts and documentation files. The transcripts cover a contiguous 5- or 10-minute segment taken from each of 120 unscripted telephone conversations between native speakers of Japanese. |
| 1.12 | | Japanese | Newswire text | Applications: information retrieval, language modeling. It consists of business and financial news from Nihon Keizai Shimbun (approximately 30 million words) and Dow Jones Telerate, a financial newswire produced by Kyodo News Service. |
| 1.13 | | Korean | Newswire text | Applications: information retrieval, language modeling. It is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. |
| 1.14 | | English | Telephone speech | Application: speech recognition. It contains forty SPHERE files encoded in two-channel interleaved mu-law, for a total of 605 Mbytes of SPHERE data. |
| 1.15 | | English | Telephone speech | Application: speech identification. The data focuses on native speakers of English in the American South. It contains 2,728 telephone calls, each a 5-6 minute conversation, from 640 participants (292 male, 348 female) under varied environmental conditions. |
| 2 | | English | Varied speech and text | It contains 100 million words of written and spoken language samples from a wide range of sources, designed to represent a wide cross-section of current British English. |
| 3 | | Various languages | Transcripts of naturalistic speech of children | It provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. |
| 4 | | English | Conversation text | The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each. |
| 5 | | English | Written, spoken, historical, and parsed corpora | It provides information on English-language material available for computer processing and on linguistic research completed or in progress on that material. It also compiles an archive of English text corpora in machine-readable form and makes the material available to research institutions. |
| 6 | | English | Varied text | ICE-GB consists of a million words of spoken and written English and adheres to the common corpus design. 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, permitting complex and detailed searches across the whole corpus. |
| 7 | | English | Varied speech | It documents cross-varietal and stylistic variation in English intonation. It contains data from nine urban varieties of English spoken in the British Isles. Intonation is investigated from a number of angles, such as intonational meaning, the role of intonation in discourse, intonation synthesis, focus structure, acoustic structure, and phonology. |
| 8 | | | Varied speech | ToBI is a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety. |
| 9 | | Various languages | Database | It contains more than 10,000 speech errors in English, French, and other languages, from normal and aphasic individuals. The corpus is constantly expanding as new errors are collected and researchers from around the world send in errors to be entered. |
| 10 | ANC First Release | English (American) | | |
| 11 | CALLHOME German Speech | German | | |
| 12 | Chinese Gigaword | Chinese | | |
| 13 | Korean Telephone Conversations Speech | Korean | | |
| 14 | SLX Corpus of Classic Sociolinguistic Interviews | English | | |
| 15 | 21st Century SEJONG Project | Korean | | |
Maintained by the weblings. Revised 04/19/02