A new algorithm has been shown to automatically decipher a lost language without advance knowledge of its relationship to other languages.

The end goal of the team of researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) is for the system to decipher lost languages that have eluded linguists for decades, using only a few thousand words.

Developed by a team led by MIT Professor Regina Barzilay, the system rests on several principles drawn from historical linguistics, such as the fact that languages generally evolve only in certain predictable ways. For example, while a given language rarely adds or removes an entire sound, certain sound substitutions are likely to occur. A word with a “p” in the parent language may change to a “b” in the descendant language, but a change to a “k” is less likely because of the significant gap in pronunciation.

By incorporating these and other linguistic constraints, Barzilay and MIT doctoral student Jiaming Luo developed a decipherment algorithm that can handle the vast space of possible transformations and the scarcity of a guiding signal in the input.

The algorithm learns to embed the sounds of a language in a multidimensional space where differences in pronunciation are reflected in the distance between the corresponding vectors. This design allows it to capture patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to their counterparts in a related language.
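The idea that pronunciation differences show up as distances between vectors can be illustrated with a minimal sketch. It is not the researchers' model: in the system described above the embeddings are learned, whereas the `FEATURES` table here is a hand-picked, hypothetical set of two articulatory features (place of articulation and voicing) chosen only to reproduce the “p”, “b”, “k” example.

```python
import math

# Hypothetical, hand-set feature vectors: (place of articulation, voicing).
# place: 0 = lips, 1 = tongue tip (alveolar), 2 = back of the tongue (velar)
# voicing: 0 = voiceless, 1 = voiced
# A real system would learn these vectors from data; this table is an assumption.
FEATURES = {
    "p": (0.0, 0.0),
    "b": (0.0, 1.0),
    "t": (1.0, 0.0),
    "d": (1.0, 1.0),
    "k": (2.0, 0.0),
    "g": (2.0, 1.0),
}


def sound_distance(a: str, b: str) -> float:
    """Euclidean distance between the feature vectors of two sounds."""
    return math.dist(FEATURES[a], FEATURES[b])


def more_likely_substitution(source: str, option_a: str, option_b: str) -> str:
    """Return whichever option is closer to the source sound.

    A smaller distance is treated here as a more plausible sound change.
    """
    if sound_distance(source, option_a) <= sound_distance(source, option_b):
        return option_a
    return option_b


if __name__ == "__main__":
    print(sound_distance("p", "b"))
    print(sound_distance("p", "k"))
    print(more_likely_substitution("p", "b", "k"))
```

Under these assumed features, “p” and “b” differ only in voicing, while “p” and “k” are two steps apart in place of articulation, so the sketch ranks “p” to “b” as the more plausible change.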

The project builds on a paper Barzilay and Luo wrote last year that deciphered Ugaritic (a dead Semitic language) and Linear B (the script used to write Mycenaean Greek), the latter of which originally took decades to decode. A key difference with that project, however, was that the team knew these languages were related to early forms of Hebrew and Greek, respectively.

The case of the Basque language

With the new system, the algorithm itself infers the relationship between languages. This question is one of the biggest challenges in decipherment. In the case of Linear B, it took several decades to discover the correct known descendant. For Iberian, scholars still cannot agree on the related language: some argue for Basque, while others reject this hypothesis and claim that Iberian is not related to any known language.

The proposed algorithm can assess the proximity between two languages; in fact, when tested on known languages, it can even accurately identify language families. The team applied the algorithm to Iberian, considering Basque as well as less likely candidates from the Romance, Germanic, Turkic and Uralic families. While Basque and Latin were closer to Iberian than the other languages, they were still too different to be considered related.
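To give a rough sense of what scoring the proximity between two languages could look like, here is a toy sketch. It swaps in plain edit distance over spellings, which is far simpler than the approach described in this article, and the function names and word lists are invented placeholders, not real Iberian, Basque or Latin data.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def proximity(lost_words: list[str], known_words: list[str]) -> float:
    """Score in [0, 1]; higher means the vocabularies look more alike.

    For each lost-language word, take its best (lowest) length-normalized
    edit distance to any known word, then average. Assumes both lists are
    non-empty and contain no empty strings.
    """
    total = 0.0
    for word in lost_words:
        total += min(edit_distance(word, known) / max(len(word), len(known))
                     for known in known_words)
    return 1.0 - total / len(lost_words)


if __name__ == "__main__":
    # Invented placeholder vocabularies, for illustration only.
    lost = ["pata", "miru"]
    candidates = {
        "candidate_A": ["bata", "mirau"],
        "candidate_B": ["xyzzy", "qqq"],
    }
    for name, words in candidates.items():
        print(name, round(proximity(lost, words), 3))
```

A score like this only ranks candidates relative to one another. As the Iberian result shows, the closest candidate can still be too distant to count as related, so any cutoff for "related" would have to be calibrated on language pairs whose relationship is already known.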

Identifying semantic meaning

In future work, the team hopes to expand beyond connecting texts to related words in a known language, an approach known as “cognate-based decipherment.” This paradigm assumes that such a known language exists, but the Iberian example shows that this is not always the case. The team's new approach would involve identifying the semantic meaning of words, even if they do not know how to read them.

“For example, we can identify all references to people or places in the document, which can then be further investigated in light of known historical evidence,” Barzilay says in a statement. “These methods of ‘entity recognition’ are commonly used in various text-processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”