Guides: Text Mining & Analysis @ Pitt: Language Corpora

Language Corpora

Language corpora are a subset of text corpora: sets of texts (usually large, unstructured, and electronically stored and processed) that are used for statistical and computational research, analysis, testing, and algorithm training. The term language corpus may refer generally to any collection of linguistic data (written, spoken, signed, or multimodal) or more specifically to collections that have been organized or collected with a particular end in view (e.g., to characterize one or more languages).
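
As a rough illustration of the kind of statistical analysis a corpus supports, the following minimal sketch counts word frequencies across a tiny set of texts. The three sample sentences and the function names (tokenize, word_frequencies) are hypothetical and chosen only for this example; real corpora are far larger and usually call for language-aware tokenization rather than the simple ASCII-letter pattern assumed here.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
corpus = [
    "The court shall hear the case.",
    "The case was heard by the court.",
    "A corpus is a collection of texts.",
]

def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

frequencies = word_frequencies(corpus)

# Show the three most frequent tokens with their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

The same pattern extends to any of the corpora listed below once their texts have been downloaded and read into a list of strings, although non-English corpora would need a tokenizer suited to the language and script.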

Below is a list of language corpora covering various languages and purposes:

Acquis Communautaire (AC)
The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. The corpus currently comprises selected texts written from the 1950s to the present, provided as a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
Language Data Commons of Australia (LDaCA)
The Language Data Commons of Australia is a discovery service that collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials.
BYU Law & Corpus Linguistics
Designed specifically for lawyers and scholars, the new Law and Corpus Linguistics Technology Platform for linguistic analysis includes:
- The Corpus of Founding Era American English;
- The Corpus of Early Modern English;
- The Corpus of Supreme Court Opinions of the United States.
Chinese corpora
A collection of Chinese corpora and frequency lists provided by Leeds University.
Chinese-English Parallel Corpora
Aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
corpus.byu.edu
The most widely used online corpora, used by more than 130,000 distinct researchers, teachers, and students each month.
Demo corpora for teaching
Small to moderately sized text collections for teaching, text-analysis workshops, etc.
European language corpora
Collated by Humboldt University Berlin, Faculty of Language, Literature and Humanities.
Japanese corpora
Corpora built by the National Institute for Japanese Language and Linguistics.
Korean corpora
A 70-million-eojeol Korean text corpus, together with a POS-annotated corpus, a tree-annotated corpus, a Korean-Chinese parallel corpus, and a Korean-English parallel corpus.
Parallel corpora
Scroll down past navigation. This page is your 'shopping list' for parallel texts.
Research Centre for Professional Communication in English - Corpora resources
Department of English, The Hong Kong Polytechnic University.
SEAlang Library
SEAlang Library resources include bilingual and monolingual dictionaries, monolingual text corpora, aligned bitext corpora, and a variety of tools for manipulating, searching, and displaying complex scripts.
Virtual Language Observatory
Search through hundreds of thousands of language resources; browse and use facets to narrow down to your language of interest and resource type (corpora).
Wikipedia - list of text corpora
A list of text corpora in various languages collated by Wikipedia.

University of Pittsburgh Library System