OpenSpeaks
before <>
Indigenous, endangered, and many other languages lack human, technical, financial, and other resources, and this lack hinders their sustenance.
Surveys and many other forms of engagement have helped us learn more about the challenges faced by speakers of many low-resource languages. For instance, we organized a panel titled “Building the wiki-way for low-resource languages” at the Internet Governance Forum 2021. The panelists shared four key takeaways that are relevant for indigenous, endangered, and other low-resource languages.
First, internet stakeholders must work collaboratively to support language communities with low or limited resources, addressing accessibility issues and removing entry-level barriers to platforms.
Second, stakeholders must support the creation of Open Educational Resources (OER) for new and potential contributors who speak such languages, so that barriers to open and collaborative platforms such as Wikipedia are removed.
Third, language technology developers and other experts who are not native speakers must work closely with native speakers and develop language technology based on their advice.
Fourth, spaces for peer learning and exchange can be a very powerful tool for protecting and growing the use of many low-resource languages, and stakeholders must emphasize the creation of such spaces.
“Before <>” was born with all four of these evolving issue areas in mind. The third one, which concerns building resources, is key to this pilot. When we think of the building blocks of foundational technologies that are essential for most low-resource languages, we cannot ignore the challenges that many native speaker communities face. Their low access to funding, technical education, and mentorship, often a result of historical oppression by dominant communities, hinders even the foundational layer of their language technologies. This initiative runs in tandem with our OpenSpeaks project, which aims at building resources for language multimedia archivists, but it focuses on the foundational layers of technology.
Some known examples of such layers are:
a. a wordlist of all the unique headwords (also known as lemmas; these are typically the words for which dictionary definitions can be found, and different forms can be created from such words) in a written language
b. pronunciation of words as audio recordings to help with speech technology such as text-to-speech or speech-to-text, especially for languages with no writing system
c. scanned images of printed publications for Optical Character Recognition (OCR)
d. a growing body of oral knowledge in audio, image, and video forms
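As an illustration of layer (a), the following is a minimal sketch of how a list of unique word forms could be collected from plain text. It is not the project's actual tooling: the function names `tokenize` and `unique_word_forms` are hypothetical, and the sketch assumes that words can be separated wherever a character is neither a letter nor a combining mark. Combining marks are kept because in scripts such as Odia the vowel signs belong to the word.

```python
import unicodedata
from collections import Counter

# Zero-width (non-)joiners can occur inside words in Indic scripts.
JOINERS = {"\u200c", "\u200d"}


def tokenize(text):
    """Split text into words, keeping letters and combining marks together."""
    words, current = [], []
    for ch in unicodedata.normalize("NFC", text):
        is_word_char = unicodedata.category(ch)[0] in ("L", "M")
        if is_word_char or (ch in JOINERS and current):
            current.append(ch)
        elif current:
            # Any other character (space, punctuation, digit) ends the word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def unique_word_forms(texts):
    """Return a sorted list of unique word forms and their frequencies."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return sorted(counts), counts


if __name__ == "__main__":
    forms, counts = unique_word_forms(["open speaks, open!"])
    print(forms, counts["open"])
```

This collects surface forms as they appear in the text. Reducing them to headwords (lemmas) needs language-specific knowledge, such as a morphological analyzer or review by native speakers, which this sketch does not attempt.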
Instead of trying to solve a much larger societal problem, we play a role in documenting the issues. We also try to create a space by inviting and equipping language digital activists, who are often native speakers of different low-resource languages. We are mindful of staying within the boundary of serving communities: we play a catalytic role instead of trying to save them. Our slow and participatory approach helps us offer strategic and technical support while drawing attention to the issues and the scope for innovation.
A set of frameworks for creating the AI/ML building blocks for low-resource languages.
As a part of the MozFest Trustworthy AI Working Groups program, we are piloting resource development that would be foundational for AI research and development in a few low-resource languages.
A research project to study blockchain and web content, with support from Grant for the Web.
We have been studying the potential impact that Distributed Ledger Technology (DLT) would have on indigenous language ecosystems in India if it were implemented.
Odia is spoken by nearly 45 million people, primarily residing in the Indian state of Odisha. Since 2017, we have been building two large speech corpora of audio recordings: a) words and phrases, primarily using Lingua Libre, and b) sentences. Lingua Libre and Mozilla Common Voice are the primary platforms we currently use; before this, we also co-developed and deployed Kathabhidhana and Spell4wiki. As we build this dataset, we are also documenting the strategy and process so that others can replicate them and create their own workflow.
The first corpus contains pronunciations of words and phrases that can be useful for training automatic speech recognition (ASR) models. Our current word and phrase corpus includes words mostly taken from the Odia Wikipedia, news, and other online publications, providing contemporary vocabulary on various topics. The corpus also includes many words and phrases, which may or may not be in current use, from the 1931-1941 lexicon Purnnachandra Ordia Bhashakosha, covering many topics. Within the larger Odia speech corpus is a smaller corpus of Baleswari, the northern dialect of Odia.
The second corpus includes the pronunciation of sentences. The sentences as text and the pronunciations are released under CC0 1.0 (Public Domain equivalent). Most sentences are taken from creative literature in the Public Domain, mostly by the noted author Fakir Mohan Senapati, who is also known for incorporating the rich spoken vocabulary of rural Odisha. There are also newly created sentences covering social sciences and computing in the context of the Odia language. So far, 4,400 sentences have been recorded and 787 sentences validated after review (listening).
Status: Nearly 66,000 words (~22 hours) in Odia under a CC0 1.0 (Public Domain) License on Wikimedia Commons (August 2022); another 8 hours of voice data of recording of sentences under CC0 1.0 on Mozilla Common Voice (March 2022).
Language and Dialect(s) | Current size | Online Release | Physical Media Release
Odia (Mugalbandi and Baleswari) | 66,000 words recorded | June 2023 | December 2022 DVD Audio ISBN: 978-81-958409-1-5
Odia (Baleswari) | 6,200 words in Baleswari-Odia recorded | March 2023 | December 2022 DVD Audio ISBN: 978-81-958409-0-8
Mugalbandi | 4,400 clips recorded (8 hours); 787 clips validated | March 2022 |
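The reported sizes can be turned into a few rough per-item figures. The sketch below is simple arithmetic on the numbers stated above and nothing more: the constant and function names are hypothetical, and because the source marks several figures as approximate ("nearly", "~"), the results are estimates.

```python
# Figures as reported in the status line and the table above.
WORD_RECORDINGS = 66_000   # "nearly 66,000 words"
WORD_HOURS = 22            # "~22 hours"
SENTENCE_CLIPS = 4_400
SENTENCE_HOURS = 8
VALIDATED_CLIPS = 787


def seconds_per_item(hours, items):
    """Average duration of one recording, in seconds."""
    return hours * 3600 / items


def validated_share(validated, total):
    """Fraction of recorded clips that have been validated."""
    return validated / total


print(f"Average word recording: {seconds_per_item(WORD_HOURS, WORD_RECORDINGS):.1f} s")
print(f"Average sentence clip: {seconds_per_item(SENTENCE_HOURS, SENTENCE_CLIPS):.1f} s")
print(f"Validated sentence clips: {validated_share(VALIDATED_CLIPS, SENTENCE_CLIPS):.1%}")
```

On these figures, a word recording averages a little over one second, a sentence clip about six and a half seconds, and roughly 18% of the recorded sentence clips have been validated.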
Research
Publications and news coverage
UN Internet Governance Forum (IGF) 2021 | 8 December 2021
We organized a panel to discuss language digital activism, volunteer-led movements such as Wikipedia, and academic and other research initiatives for the growth of low-resource languages.
OpenSpeaks before AI: Frameworks for Creating the AI/ML Building Blocks for Low-Resource Languages
Subhashish Panigrahi | 1 May 2023
As a part of its work around trustworthy AI, Mozilla started the MozFest Trustworthy AI Working Group. As members of the 2021 working group cohort, we at the O Foundation piloted an experimental framework called OpenSpeaks Before AI.
Building a Public Domain Voice Database for Odia
Subhashish Panigrahi | 16 August 2022
Projects like Mozilla Common Voice were born to address the challenges of unavailability of voice data or the high cost of available data for use in speech technology such as Automatic Speech Recognition (ASR) research and application development. The pilot detailed in this paper is about creating a large freely-licensed public repository of transcribed speech in the Odia language as such a repository was not known to be available. The strategy and methodology behind this process are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in the northern Odia dialect Baleswari. No known public listing of words in this dialect was found by the author prior to this pilot. This repository is arguably the most extensive transcribed speech corpus in Odia that is also available publicly under any free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using many open source tools such as Lingua Libre, which can be helpful in building text and speech data for different low-medium-resource languages.
Meet the young Indians who are bringing an Adivasi language into the digital age
Karishma Mehrotra | 15 February 2022
Ho was languishing on the sidelines of the internet – until a few youngsters took it upon themselves to tear down the digital divide.
Low-resource languages, and their open source AI/ML solutions through a radical empathy lens
Subhashish Panigrahi and Sailesh Patnaik | 9 March 2022
The AI/ML infrastructure is very business driven as opposed to civil society driven. That is one of the key reasons why the majority of the minorized (indigenous, endangered and low-resource) languages are sidelined. In its current state it has become a labyrinth, and anyone who wants to become a first-generation digital language activist finds it difficult to understand “where do I start?”…
Building a 50,000 pronunciation data repository in the Odia language
Subhashish Panigrahi | 10 March 2022
We started a pilot under the OpenSpeaks project to build voice data as a foundational layer for speech synthesis research and application development. Recently, the pilot reached a milestone of 55,000 pronunciations. The repository also includes pronunciations of 5,600 words in Baleswari, the northern dialect of Odia. These recordings make up the largest repository of Public Domain voice data in Odia, and they add to another 4,000+ recordings of sentences in Odia on Mozilla Common Voice.