OpenSpeaks
before <>

Indigenous, endangered, and many other languages lack human, technical, financial, and other resources, and this lack hinders their sustenance.

Context

Surveys and many other forms of engagement have helped us learn more about the challenges that speakers of many low-resource languages face. For instance, we organized a panel titled “Building the wiki-way for low-resource languages” at the Internet Governance Forum 2021. The panelists shared four key takeaways that are relevant for indigenous, endangered, and other low-resource languages.

First, internet stakeholders must work collaboratively to support language communities with limited resources, addressing accessibility issues and removing entry-level barriers to platforms.

Second, stakeholders must support the creation of Open Educational Resources (OER) for new and potential contributors who speak such languages, to remove the barriers they face on open and collaborative platforms such as Wikipedia.

Third, language technology developers and other experts who are not native speakers must work closely with native speakers and develop language technology based on their advice.

Fourth, spaces for peer learning and exchange can be a powerful tool for protecting and growing the use of many low-resource languages, and stakeholders must emphasize the creation of such spaces.

“Before <>” was born with all four of these evolving issue areas in mind. The third one, which is about building resources, is key to this pilot. When we think of the foundational technologies that are essential for most low-resource languages, we cannot ignore the challenges that many native speaker communities face. Their limited access to funding, technical education, and mentorship, often a result of historical oppression by dominant communities, hinders even the foundational layer of their language technologies. This initiative runs in tandem with our OpenSpeaks project, which aims to build resources for language multimedia archivists, but it focuses on the foundational layers of technology.

Some known examples of such layers are:

a. a wordlist of all the unique headwords (also known as lemmas; these are typically the words for which dictionary definitions can be found, and different forms can be created from such words) in a written language

b. pronunciation of words as audio recordings to help with speech technology such as text-to-speech or speech-to-text, especially for languages with no writing system

c. scanned images of printed publications for Optical Character Recognition (OCR)

d. a growing body of oral knowledge in audio, image, and video forms
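As a rough illustration of layer (a), the sketch below collects the unique word forms found in a piece of Odia text. It is only a starting point and rests on assumptions: the names `ODIA_WORD` and `unique_word_forms` are hypothetical, words are approximated as runs of characters from the Odia Unicode block, and the output is a list of word forms, not headwords. Reducing word forms to headwords (lemmas) is language-specific work that needs native speakers.

```python
import re
import unicodedata

# Assumption: an Odia word is approximated as a run of characters from the
# Odia Unicode block (letters, vowel signs, virama), excluding Odia digits.
ODIA_WORD = re.compile(r"[\u0B01-\u0B63\u0B71]+")


def unique_word_forms(text: str) -> list[str]:
    """Return the sorted unique Odia word forms found in `text`."""
    # Normalize so that visually identical words compare as equal strings.
    normalized = unicodedata.normalize("NFC", text)
    return sorted(set(ODIA_WORD.findall(normalized)))


if __name__ == "__main__":
    # Sample text with a repeated word, punctuation, and ASCII digits.
    sample = "\u0B2D\u0B3E\u0B37\u0B3E \u0B2D\u0B3E\u0B37\u0B3E, \u0B15\u0B25\u0B3E 123"
    for word in unique_word_forms(sample):
        print(word)
```

A list produced this way would still need review by native speakers before it can serve as a headword list, since it keeps inflected forms and any misspellings present in the source text.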

Collective strategy

Instead of identifying and trying to solve a much larger societal problem, we play a role in documenting the issues. We also try to create a space by inviting and equipping language digital activists, who are often native speakers of different low-resource languages. We are mindful to serve communities by playing a catalytic role rather than trying to save them. Our slow and participatory approach helps us offer strategic and technical support while drawing attention to the issues and the scope for innovation.

AI/ML Frameworks

A set of frameworks for creating the AI/ML building blocks for low-resource languages.

As a part of the MozFest Trustworthy AI Working Groups program, we are piloting resource development that would be foundational for AI research and development in a few low-resource languages.

Blockchain-based Web Monetization

A research project to study blockchain and web content, with support from Grant for the Web.

We have been studying the potential impact that Distributed Ledger Technology (DLT), if implemented, could have on indigenous language ecosystems in India.

 

OpenSpeaks Voice: Voice Data for Automatic Speech Recognition (ASR)

Voice data in Odia macrolanguage

Odia is spoken by nearly 45 million people, primarily in the Indian state of Odisha. We have been building two large corpora of audio recordings since 2017. The first corpus contains pronunciations of words and phrases that can be useful for training automatic speech recognition (ASR) models. Lingua Libre and Mozilla Common Voice are the primary platforms we currently use; before this, we also co-developed and deployed Kathabhidhana and Spell4wiki. As we build this dataset, we are also documenting the strategy and process so that others can replicate them and create their own workflows. Our current word and phrase corpus includes words taken mostly from the Odia Wikipedia, news, and other online publications, covering contemporary vocabulary on various topics. The corpus also includes many words and phrases, which might or might not be in use today, from the 1931-1941 lexicon “Purnnachandra Ordia Bhashakosha”, which covers many topics. Within the larger Odia speech corpus is a smaller corpus of Baleswari, the northern dialect of Odia. The second corpus contains pronunciations of sentences. Both the sentence texts and their recordings are released under CC0 1.0 (Public Domain equivalent). Most sentences are taken from creative literature in the Public Domain, mostly by the noted author Fakir Mohan Senapati, who is also known for incorporating the rich spoken vocabulary of rural Odisha. There are also newly created sentences covering social sciences and computing in the context of the Odia language.

Status: Nearly 61,000 words (~22 hours) in Odia under a CC0 1.0 (Public Domain) license on Wikimedia Commons (August 2022); another 8 hours of recorded sentences under CC0 1.0 on Mozilla Common Voice (March 2022).

Dialect | Current size | Online release | Physical media release
Mugalbandi | 56,000 words recorded | August 2022, Wikimedia Commons | September 2022, DVD Audio, ISBN:
Baleswari | 5,600 words in Baleswari-Odia recorded | March 2022, Wikimedia Commons | September 2022, DVD Audio, ISBN:
Mugalbandi | 4,400 clips recorded (8 hours); 700 clips validated | March 2022, Mozilla Common Voice |

 

Research

Panigrahi, Subhashish. “Building a Public Domain Voice Database for Odia.” Companion Proceedings of the Web Conference 2022, Association for Computing Machinery, 2022, pp. 1331–38. ACM Digital Library, https://doi.org/10.1145/3487553.3524931.
Panigrahi, Subhashish. “Building a 50,000 Pronunciation Data Repository in the Odia Language.” Diff, 10 Mar. 2022, https://diff.wikimedia.org/2022/03/10/building-a-50000-pronunciation-data-repository-in-the-odia-language/.
 

Publications and news coverage

UN Internet Governance Forum (IGF) 2021 | 8 December 2021

We organized a panel to discuss language digital activism, volunteer-led movements such as Wikipedia, and academic and other research initiatives for the growth of low-resource languages.

ACM Digital Library

Building a Public Domain Voice Database for Odia

Subhashish Panigrahi | 16 August 2022

Projects like Mozilla Common Voice were born to address the challenges of unavailability of voice data or the high cost of available data for use in speech technology such as Automatic Speech Recognition (ASR) research and application development. The pilot detailed in this paper is about creating a large freely-licensed public repository of transcribed speech in the Odia language as such a repository was not known to be available. The strategy and methodology behind this process are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in the northern Odia dialect Baleswari. No known public listing of words in this dialect was found by the author prior to this pilot. This repository is arguably the most extensive transcribed speech corpus in Odia that is also available publicly under any free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using many open source tools such as Lingua Libre, which can be helpful in building text and speech data for different low-medium-resource languages.

Scroll

Meet the young Indians who are bringing an Adivasi language into the digital age

Karishma Mehrotra | 15 February 2022

Ho was languishing on the sidelines of the internet – until a few youngsters took it upon themselves to tear down the digital divide.

MozFest

Low-resource languages, and their open source AI/ML solutions through a radical empathy lens

Subhashish Panigrahi and Sailesh Patnaik | 9 March 2022

The AI/ML infrastructure is very business-driven as opposed to civil-society-driven. That is one of the key reasons why the majority of the minorized (indigenous, endangered and low-resource) languages are sidelined. In its current state it has become a labyrinth, and anyone who wants to become a first-generation digital language activist finds it difficult to understand “where do I start?”

Diff

Building a 50,000 pronunciation data repository in the Odia language

Subhashish Panigrahi | 10 March 2022

We started a pilot under the OpenSpeaks project to build voice data as a foundational layer for speech synthesis research and application development. Recently, the pilot hit a milestone of 55,000 pronunciations. The repository also includes pronunciations of 5,600 words in Baleswari, the northern dialect of Odia. These recordings make up the largest repository of Public Domain voice data in Odia, and add to another 4,000+ recordings of sentences in Odia on Mozilla Common Voice.

ISBN: 978-93-5620-550-5
