Text Mining & Analysis @ Pitt

An introduction to text mining and analysis, with resources for finding text data, preparing it for analysis, methods and tools for analyzing it, and further readings on text mining and its various methods.

Language Corpora

Language corpora make up a subdomain of all text corpora, which are sets of texts (usually large, unstructured, and electronically stored and processed) that are used to do statistical/computational research and analysis, testing, and algorithm training. The term language corpus may refer generally to any collection of linguistic data (written, spoken, signed, or multimodal) or more specifically to collections that have been organized or collected with a particular end in view (e.g., to characterize one or more languages).
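As a rough illustration of the kind of statistical analysis a corpus supports, the sketch below builds a simple word-frequency list in Python. The three-sentence `corpus` and the helper names `tokenize` and `frequency_list` are invented for this example and are not drawn from any of the resources in this guide; real corpora are far larger and usually need language-specific tokenization.

```python
from collections import Counter
import re

# Hypothetical toy corpus: three short "documents" invented for illustration.
corpus = [
    "The court held that the statute applies.",
    "The statute was amended by the legislature.",
    "A court may review the amended statute.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_list(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

freqs = frequency_list(corpus)

# Print the three most frequent tokens with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

The regular expression here only handles unaccented Latin letters, so it would not suit the Chinese, Japanese, or Korean corpora without a different tokenizer.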

Below is a list of language corpora, covering various languages and purposes:

  • Acquis Communautaire (AC)
    The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. The corpus currently comprises selected texts written between the 1950s and now, provided as a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.

  • Language Data Commons of Australia (LDaCA)
    The Language Data Commons of Australia is a discovery service that collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials.

  • BYU Law & Corpus Linguistics
    Designed specifically for lawyers and scholars, the new Law and Corpus Linguistics Technology Platform for linguistic analysis includes:

    • The Corpus of Founding Era American English;
    • The Corpus of Early Modern English;
    • The Corpus of Supreme Court Opinions of the United States.

  • Chinese corpora
    A collection of Chinese corpora and frequency lists provided by Leeds University.

  • Chinese-English Parallel Corpora
    Aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.

    The most widely used online corpora, used by more than 130,000 distinct researchers, teachers, and students each month.

  • Demo corpora for teaching
    Small to moderately sized text collections for teaching, text-analysis workshops, etc.

  • European language corpora
    Collated by Humboldt University Berlin, Faculty of Language, Literature and Humanities.

  • Japanese corpora
    Corpora built by the National Institute for Japanese Language and Linguistics.

  • Korean corpora
    Includes a 70-million-eojeol Korean text corpus, a POS-annotated corpus, a tree-annotated corpus, a Korean-Chinese parallel corpus, and a Korean-English parallel corpus.

  • Parallel corpora
    Scroll down past navigation. This page is your 'shopping list' for parallel texts.

  • Research Centre for Professional Communication in English - Corpora resources
    Department of English, The Hong Kong Polytechnic University.

  • SEAlang Library
    SEAlang Library resources include bilingual and monolingual dictionaries, monolingual text corpora, aligned bitext corpora, and a variety of tools for manipulating, searching, and displaying complex scripts.

  • Virtual Language Observatory
    Search through hundreds of thousands of language resources, then browse and use facets to narrow down to your language of interest and resource type (corpora).

  • Wikipedia - list of text corpora
    A list of text corpora in various languages collated by Wikipedia.