Frequently Asked Questions

Interlingua (ISO 639 language codes ia, ina) is a naturalistic planned Italic international auxiliary language (IAL), developed between 1937 and 1951 by the International Auxiliary Language Association (IALA). Its vocabulary and grammar are derived from a wide range of western European natural languages. Interlingua was developed to combine a simple, mostly regular grammar with a vocabulary common to English, French, Italian, Spanish and Portuguese. These characteristics make it especially easy to learn for those whose native languages were sources of Interlingua's vocabulary and grammar. Interlingua can also be used as a rapid introduction to many natural languages. Written Interlingua is largely comprehensible to the hundreds of millions of people who speak Romance languages.

The goal of the Interlingua Corpus Project is to aggregate a large collection of Interlingua sentences as a community resource. The sentences in this corpus, including matched pairs of Interlingua-English sentences, are automatically collected from public websites and documents using a web crawler. The emphasis of this project is on sentences because sentences provide vocabulary in context and represent the grammar as used in various styles of writing. This collection of Interlingua sentences will provide a foundation for a variety of uses including enumeration of Interlingua's core vocabulary and development of comprehensive frequency dictionaries for language learners, corpus linguistics studies, and for the development of computational linguistics resources such as language models and machine translation tools for Interlingua.

Yes, there is! You can find a link to it, as well as source code, in the "more resources" tab.

Written by Jason Ding on August 14th, 2021

I perform all work on this research project. I am directly supervised by Dr. Todd Mockler, a Principal Investigator at the Danforth Science Center.

The project was started in May of 2020 when I was an incoming junior in high school. The final goal of the project was to create an Interlingua-English Translator.

I came into the project with no knowledge of either Interlingua or neural networks. However, I devised and followed a plan to bring the translator to life.

The project has three main phases.

First, I needed to construct a large collection of Interlingua sentences by automatically extracting data from the Internet. To do this, I learned how to create a web crawler (i.e., a computer program that automatically searches through the Internet) that extracts any sentences written in Interlingua from both a website's HTML and its downloadable documents. I taught myself to use various Python modules, such as BeautifulSoup, requests, and os, while also learning and inventing techniques for tasks such as accurately separating paragraphs into sentences. The final version of my web crawler program visited 6,373,297 websites and collected over 1,200,000 unique Interlingua sentences.
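
As a rough illustration of this first phase, the sketch below pulls the visible text out of an HTML page, splits it into sentences, and keeps only sentences that have not been seen before. It is not the project's actual crawler. It uses Python's standard-library html.parser in place of BeautifulSoup so that it runs without extra installs, the names TextExtractor, split_sentences, and extract_unique_sentences are hypothetical, and the splitting rule is deliberately naive. It does not fetch pages or check that a sentence is written in Interlingua.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumed rule: a sentence ends at ., ! or ? followed by whitespace and a capital.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Naive splitter; abbreviations such as 'Dr. Smith' would be split wrongly."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(paragraph) if s.strip()]


def extract_unique_sentences(html, seen=None):
    """Return the sentences in `html` that are not already in the `seen` set."""
    seen = set() if seen is None else seen
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    fresh = []
    for chunk in parser.chunks:
        for sentence in split_sentences(chunk):
            if sentence not in seen:
                seen.add(sentence)
                fresh.append(sentence)
    return fresh
```

Passing the same `seen` set to every call is one simple way to keep a collection free of duplicates across many pages.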

Second, I needed to collect as many matched pairs of Interlingua-English sentences as I could. To do this, I created a parallel sentence extractor program. The program takes as input a pair of texts that are near translations of each other and outputs the individual pairs of parallel sentences between the two texts. For example, I have used my program to extract parallel sentences from the Bible and the Book of Mormon. The key challenge in writing the program was identifying and rectifying edge cases that cause false positives and false negatives, such as when one of the parallel texts skips a sentence and the other doesn’t. Thus far, I have used my program to extract over 80,000 parallel English-Interlingua sentences. (As a note, the link to the Interlingua Corpus Project, which provides the data collected in these first two steps for free, can be found below in the "More Resources" section.)
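
To make the skipped-sentence problem concrete, here is a minimal sketch of one common way to align two sentence lists: a dynamic-programming alignment that compares sentence lengths and lets either side skip a sentence at a fixed cost. This is a swapped-in, simplified technique and not necessarily how the project's extractor works. The function name align_sentences, the skip_penalty value, and the example sentences are all assumptions made for illustration.

```python
def align_sentences(source, target, skip_penalty=0.5):
    """Align two sentence lists, allowing either side to skip sentences.

    Returns a list of (source_sentence, target_sentence) pairs.
    """

    def mismatch(a, b):
        # 0.0 for equal lengths, approaching 1.0 for very different lengths.
        return abs(len(a) - len(b)) / max(len(a), len(b), 1)

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:
                pair_cost = mismatch(source[i - 1], target[j - 1])
                candidates.append((cost[i - 1][j - 1] + pair_cost, "pair"))
            if i > 0:
                candidates.append((cost[i - 1][j] + skip_penalty, "skip_source"))
            if j > 0:
                candidates.append((cost[i][j - 1] + skip_penalty, "skip_target"))
            if candidates:
                cost[i][j], back[i][j] = min(candidates)

    # Walk back from the end to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "skip_source":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Illustrative sentences only; the English text has one extra sentence.
english = [
    "The sun is very bright today and the sky is clear.",
    "Yes.",
    "We read a long book about the history of languages in the library.",
]
interlingua = [
    "Le sol es multo brillante hodie e le celo es clar.",
    "Nos lege un libro longe super le historia del linguas in le bibliotheca.",
]
aligned = align_sentences(english, interlingua)
```

Sentence length alone is a weak signal, so a practical extractor would likely combine it with other cues such as verse numbers, shared names, or a bilingual word list.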

Third, I used the data gathered in the first two steps to train a neural machine translation (NMT) system, or more specifically, a recurrent neural network (RNN) translator, that can translate between English and Interlingua. I taught myself how to use PyTorch and CUDA, and I learned from scratch how to use git, terminal shell commands, Jupyter, HTML, and Google Colab, deploy Python applications on Heroku, use the Google and Dropbox APIs, and connect to remote servers.
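
Before sentence pairs can be fed to an RNN translator, they generally have to be converted into sequences of integer ids. The sketch below shows one typical way to do that preprocessing step in plain Python. It is an assumption about the general approach, not the project's own code: the tokenizer, the special tokens, and the names tokenize, build_vocab, and encode are hypothetical.

```python
import re
from collections import Counter

# Assumed special tokens: padding, start of sentence, end of sentence, unknown.
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def tokenize(sentence):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def build_vocab(sentences, min_count=1):
    """Map each token seen at least `min_count` times to an integer id."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for tok, count in counts.most_common():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to ids, wrapped in <sos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(sentence)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


vocab = build_vocab(["Le sol es clar.", "Le celo es clar."])
ids = encode("Le luna es clar.", vocab)  # "luna" is not in the vocabulary
```

In a PyTorch pipeline, lists like `ids` would then be padded to a common length and converted to tensors before being passed to the encoder.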

In August of 2021, after around 16 months of work, I was able to release both the first version of the translator and the Interlingua corpus.

Please contact or for questions and suggestions.

The data collected in the corpus and translator projects can be used for free by anyone, for any purpose. However, I ask that you credit the projects or link to the project pages.

Please email for questions and suggestions.