Press release from OpenAI
Iceland, an island nation in the middle of the North Atlantic, has a vibrant technology industry and booming tourism. Yet while most of its roughly 370,000 citizens speak English or another second language, its integration with the United States and Europe has put the country’s native tongue, Icelandic, at risk. Today there’s increasing worry that in a few generations, if Icelandic can’t remain the country’s default language in the face of rapid digitalization, the language might face de facto extinction.
Icelanders cherish their language for its rich cultural heritage and connection to the nation’s identity. The country’s government maintains a Language Planning Department that coins Icelandic terms for new ideas, rather than adopting so-called “loanwords” from other languages. A computer, for instance, is a tölva (“number prophetess”). Through these efforts, the language preserves its distinctive character and remains closely connected to its Old Norse roots.
On the initiative of the country’s President, HE Guðni Th. Jóhannesson, and with the help of private industry, Iceland has partnered with OpenAI to use GPT-4 in the preservation effort of the Icelandic language—and to turn a defensive position into an opportunity to innovate.
The partnership was envisioned not only as a way to boost GPT-4’s ability to service a new corner of the world, but also as a step towards creating resources that could serve to promote the preservation of other low-resource languages.
“We want to make sure that artificial intelligence will be used not only to help preserve language, culture and history, but also to underpin economic prosperity,” says Jóhanna Vigdís Guðmundsdóttir, chief executive officer of Almannarómur, a non-profit language technology center. “Icelandic is a bit of a peculiar case,” she admits, “but we saw the opportunity to become a trailblazer for other long-tail languages, help create a level playing field, and avoid an ‘AI divide’. It is not enough to gather data and build software locally for Icelandic; we have to get our language into the software and applications people use every day.”
OpenAI’s GPT models are trained on vast quantities of text on the internet. Accordingly, most of the model’s training set is in English and other major languages, which means that GPT doesn’t have the same abilities or breadth of understanding in smaller languages. The models have improved over time, but they don’t consistently produce clear and correct Icelandic translations.
“Now we can just jump directly to the general capabilities of the large models,” Þorsteinsson says, “and enable things with our language that used to require a lot of manual labor, data preparation and resource collection for each use case.” RLHF produces results with just 100 examples, which makes it more feasible for other low-resource languages, with less digital language data available, to replicate the process.
Any language model, depending on the strength and breadth of data training, has a scale of tasks it can perform. Simple language models are limited to simple tasks like assessing the sentiment of a tweet or identifying company names. More complex models are multi-taskers, able to work with abstract concepts, account for social and environment contexts and even exhibit degrees of creativity. GPT-4, for instance, can generate a poem about the vagaries of modern life in the style of Völuspá, an ancient Icelandic poem from the Poetic Edda of Norse mythology.
English to Icelandic
Hvað heitir Donald Duck á íslensku?
What is Donald Duck called in Icelandic?
Donald Duck heitir á íslensku Donaldi Kjáni
Donald Duck is called in Icelandic Donaldi the Fool
Donald Duck heitir sama nafni á íslensku og ensku
Donald Duck has the same name in Icelandic and English
Donald Duck heitir Andrés Önd á íslensku
Donald Duck is called Andrés Önd in Icelandic
Even GPT-4, while much better at Icelandic than GPT-3.5, produces Icelandic with grammatical errors, “translationese”, and incorrect cultural knowledge. To change this, Vilhjálmur Þorsteinsson, chief executive at Miðeind ehf, a language technology company involved in the initiative, assembled a team of 40 volunteers to train GPT-4 on proper Icelandic grammar and cultural knowledge.
In a process called Reinforcement Learning from Human Feedback, or RLHF, human testers give GPT-4 a prompt, and four possible completions are generated. Testers then select the best answer from the four responses and edit it to create an ideal completion. The data from this process is then used to further train GPT-4 to produce better responses in the future.
Prior to RLHF, the process of fine-tuning a model was labor and data-intensive. Þorsteinsson’s team attempted to fine-tune a GPT-3 model with 300,000 Icelandic language examples, but the results were disappointing.
GPT-4 gives a different answer to the same question depending on which language you ask in.
Hvað sitja margir á þingi?
Á Alþingi, sem er lögþing Íslands, sitja 63 þingmenn.
In Alþingi, which is the Icelandic parliament, there are 63 parliamentarians.
How many parliamentarians are there?
It depends on which country you are referring to. For example: / • In the United Kingdom, the Parliament has 650 members in the House of Commons and approximately 800 members in the House of Lords. / • In Canada, there are 338 members of Parliament in the House of Commons and 105 senators in the Senate. / • In India, there are 545 members in the Lok Sabha (House of the People) and 245 members in the Rajya Sabha (Council of States). / • In Australia, there are 151 members of the House of Representatives and 76 senators in the Senate.
Hver er forseti núna?
Forseti Íslands núna (2021) er Guðni Th. Jóhannesson.
The current (2021) President of Iceland is Guðni Th. Jóhannesson.
Who is president now?
As of 2021, the President of the United States is Joe Biden.
With one round of RLHF complete, the model still has some room for improvement. This is the ongoing work for the Iceland team: continue to train GPT-4 with enough examples that the model can power the most complex and creative applications in Icelandic, rather than defaulting to English, so that Icelandic companies can soon deploy GPT-4 in Icelandic in their interactive applications.
Room for improvement
Er það algengt að íslensk börn séu látin sofa úti í vagni?
Is it common for Icelandic babies to be put down for a nap outside in a pram?
Já, það er algengt að íslensk börn séu látin sofa úti í vagni, sérstaklega á vori og sumri. Margir foreldrar telja að útiloft og fersk hrein vindur séu góðir fyrir börnin og auki séu þau oftast rolegri við að sofna úti. Æskilegt er að fylgjast vel með veðurfari og hitastigi, og passa að barnið sé nægur klætt eftir aðstæðum.
Completion corrected for grammar and spelling
Já, það er algengt að íslensk börn séu látin sofa úti í vagni, sérstaklega á vori og sumri. Margir foreldrar telja að útiloft og ferskur hreinn vindur sé góður fyrir börnin og að auki séu þau oftast rólegri við að sofna úti. Æskilegt er að fylgjast vel með veðurfari og hitastigi, og passa að barnið sé nægur klætt eftir aðstæðum.
Yes, it is common for Icelandic babies to nap outside in a pram, especially in spring and summer. Many parents think that outside air and fresh clean wind are good for the babies and, furthermore, that they are usually calmer when they fall asleep outside. It is best to pay close attention to weather and temperature and to make sure that the baby is enough dressed according to circumstances.
Not only will Þorsteinsson’s team have trained the world’s most powerful language engine in their language, but they will have provided an entire country with the opportunity to interact with OpenAI’s models in their own language. Thus, using GPT-4 as its backend, Embla—Miðeind’s voice assistant app—will be able to have conversations with users in fluent Icelandic, as well as offering translations to other languages. As another example, Icelandic companies won’t have to rely on an English-speaking chatbot on their website, but can have one in Icelandic instead.
Originally published on 14 March.
Announcements are published as a service to readers. The sender is responsible for all content.
Announcements for publication can be submitted to [email protected].