Interlingua Corpus Project

Welcome to the Interlingua Corpus Project!

Interlingua (ISO 639 language codes ia, ina) is a naturalistic planned Italic international auxiliary language (IAL), developed between 1937 and 1951 by the International Auxiliary Language Association (IALA). Its vocabulary and grammar are derived from a wide range of western European natural languages. Interlingua was developed to combine a simple, mostly regular grammar with a vocabulary common to English, French, Italian, Spanish and Portuguese. These characteristics make it especially easy to learn for those whose native languages were sources of Interlingua's vocabulary and grammar. Interlingua can also be used as a rapid introduction to many natural languages. Written Interlingua is largely comprehensible to the hundreds of millions of people who speak Romance languages.

The goal of the Interlingua Corpus Project is to aggregate a large collection of Interlingua sentences as a community resource. The sentences in this corpus, including matched pairs of Interlingua-English sentences, are collected automatically from public websites and documents by a web crawler. The project focuses on sentences because they show vocabulary in context and represent the grammar as it is used in various styles of writing. The collection is intended to support a variety of uses: enumerating Interlingua's core vocabulary, compiling comprehensive frequency dictionaries for language learners, corpus linguistics studies, and developing computational linguistics resources such as language models and machine translation tools for Interlingua.

The data collected in the Interlingua Corpus Project also provides vital resources for creating and training the Interlingua-English Translator. Links to the released Interlingua-English Translators are in the "More Resources" section below.


Interlingua Corpus Release 1.0

Released: August 14th, 2021

Release Notes:

Jason Ding on August 14th, 2021

First release of the Interlingua corpus is out!

The initial release of the corpus contains four data files. The first contains over 1.2 million quality-controlled Interlingua sentences. The second contains over 80,000 quality-controlled parallel English-Interlingua sentences. The third contains Interlingua token frequencies. The fourth contains dictionary pairs parsed from the Interlingua-English Dictionary by Alexander Gode.
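
As a rough illustration of how token frequencies can be derived from a list of sentences, here is a minimal Python sketch. It is not the project's actual pipeline: the tokenizer rule, the `token_frequencies` helper, and the sample sentences are all assumptions made for this example, and the format of the released files is not described here.

```python
import re
from collections import Counter

# Assumed tokenizer (not the project's): a token is a run of letters,
# optionally joined by internal hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def token_frequencies(sentences):
    """Count lowercase tokens across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(TOKEN_RE.findall(sentence.lower()))
    return counts


# Illustrative sample sentences, not taken from the corpus.
sample = [
    "Interlingua es un lingua auxiliar international.",
    "Le lingua es facile a apprender.",
]

freqs = token_frequencies(sample)

# Print the most frequent tokens first.
for token, count in freqs.most_common(3):
    print(token, count)
```

With a real sentence file, the same function could be fed the file's lines one at a time, so the full corpus never needs to be held in memory.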

Please email interlinguacorpus@gmail.com for questions and suggestions.