OpenSpeaks
before <>

Indigenous, endangered, and many other languages lack human, technical, financial, and other resources, and this lack hinders their sustenance.

Context

Surveys and many other forms of engagement have helped us learn more about the challenges faced by speakers of many low-resource languages. For instance, we organized an Internet Governance Forum 2021 panel titled “Building the wiki-way for low-resource languages”. The panelists shared four key takeaways that are relevant for indigenous, endangered, and other low-resource languages.

First, internet stakeholders must work collaboratively to support language communities with limited resources by addressing accessibility issues and removing entry-level barriers to platforms.

Second, to remove these barriers, stakeholders must support the creation of Open Educational Resources (OER) for new and potential contributors who speak such languages and want to join open and collaborative platforms such as Wikipedia.

Third, language technology developers and other experts who are not native speakers must work closely with native speakers and develop language technology based on their advice.

Fourth, creating spaces for peer learning and exchange can be a very powerful way to protect and grow the use of many low-resource languages, and stakeholders must emphasize the creation of such spaces.

“Before <>” was born with all four of these evolving issue areas in mind. The third one, which is about building resources, is key to this pilot. When we think of the building blocks of foundational technologies that are essential for most low-resource languages, we cannot ignore the challenges that many native speaker communities face. Their limited access to funding, technical education, and mentorship, often a result of historical oppression by dominant communities, hinders even the foundational layer of their language technologies. This initiative works in tandem with our OpenSpeaks project, which aims to build resources for language multimedia archivists, but it focuses on the foundational layers of technology.

Some of the known examples of such layers can be:

a. a wordlist of all the unique headwords (also known as lemmas; these are typically the words for which dictionary definitions can be found, and different forms can be created from such words) in a written language

b. pronunciation of words as audio recordings to help with speech technology such as text-to-speech or speech-to-text, especially for languages with no writing system

c. scanned images of printed publications for Optical Character Recognition (OCR)

d. a growing body of oral knowledge in audio, image, and video forms
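
As an illustration of item (a), a first pass at such a wordlist can be sketched in Python. This is a minimal sketch, not the project's actual tooling: the function name `build_wordlist` and the punctuation set are assumptions, and it only collects unique surface forms as candidate headwords. Reducing inflected forms to lemmas needs language-specific work guided by native speakers.

```python
import unicodedata

# Punctuation to trim from word edges (an assumed set; it includes the
# danda and double danda used in many Indic scripts).
EDGE_PUNCTUATION = ".,;:!?\"'()[]\u0964\u0965"


def build_wordlist(lines):
    """Return a sorted list of unique candidate words from lines of text."""
    seen = set()
    for line in lines:
        for token in line.split():
            # Normalize to NFC so that the same word typed with different
            # code point sequences is counted only once.
            word = unicodedata.normalize("NFC", token).strip(EDGE_PUNCTUATION)
            if word:
                seen.add(word)
    return sorted(seen)


if __name__ == "__main__":
    sample = ["caf\u00e9 cafe\u0301, word.", "word"]
    for word in build_wordlist(sample):
        print(word)
```

Unicode normalization matters here because the same visible word can be stored as different code point sequences, which would otherwise inflate the list with duplicates.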

Collective strategy

We are mindful of not interrupting the efforts of language digital activists who are native speakers of different low-resource languages. Instead, we pay attention to their issues and inputs. We document what we hear. We also build long-term collaborations with them. This gradual approach helps us serve them using our strategic and technical abilities and privileges.

before AI

A set of frameworks for creating the AI/ML building blocks for low-resource languages.

As a part of the MozFest Trustworthy AI Working Groups program, we are piloting resource development that would be foundational for AI research and development in a few low-resource languages.

before blockchain

A research project to study blockchain and web content, with support from Grant for the Web.

We have been studying the impact that Distributed Ledger Technology (DLT), if implemented, could have on indigenous language ecosystems in India.

 

Before ASR

 

Pilot to build pronunciation voice data in the Odia language

Automatic speech recognition (ASR) requires large transcribed voice datasets for training. We are building a large corpus of audio recordings containing pronunciations of words, phrases, and sentences. Lingua Libre and Mozilla Common Voice are our friends. As we build this dataset, we are also documenting the strategy and process so that others can replicate it and create their own workflows.
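
One way to keep such a corpus replicable is a plain manifest that pairs each recording with its transcription and license. The sketch below is hypothetical: the `Clip` fields, the column names, and the file name are assumptions for illustration, not the format used by Lingua Libre or Mozilla Common Voice.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Clip:
    """One recording and the text spoken in it (hypothetical schema)."""
    audio_file: str
    transcript: str
    license: str = "CC0-1.0"


def write_manifest(clips, out):
    """Write clips as tab-separated rows with a header to a text stream."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["audio_file", "transcript", "license"])
    for clip in clips:
        writer.writerow([clip.audio_file, clip.transcript, clip.license])


if __name__ == "__main__":
    # Illustrative entry only; the file name is made up.
    buffer = io.StringIO()
    write_manifest([Clip("ghara.wav", "ଘର")], buffer)
    print(buffer.getvalue(), end="")
```

Keeping the transcript and license next to each file name makes it easier for others to check, reuse, or extend the dataset.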

Current size: Nearly 56,000 words in Odia under a CC0 1.0 (Public Domain) license on Wikimedia Commons; another 8 hours of recorded sentences under CC0 1.0 on Common Voice. (March 2022)

Publications and news coverage

UN Internet Governance Forum (IGF) 2021 | 8 December 2021

We organized a panel to discuss language digital activism, volunteer-led movements such as Wikipedia, and academic and other research initiatives for the growth of low-resource languages.

Scroll.in

Meet the young Indians who are bringing an Adivasi language into the digital age

Karishma Mehrotra | 15 February 2022

Ho was languishing on the sidelines of the internet – until a few youngsters took it upon themselves to tear down the digital divide.

MozFest (Mozilla)

Low-resource languages, and their open source AI/ML solutions through a radical empathy lens

Subhashish Panigrahi and Sailesh Patnaik | 9 March 2022

The AI/ML infrastructure is very business driven as opposed to civil society driven. That is one of the key reasons why the majority of the minorized (indigenous, endangered, and low-resource) languages are sidelined. In its current state it has become a labyrinth: anyone who wants to become a first-generation digital language activist finds it difficult to understand “where do I start?”…

Diff (Wikimedia blog)

Building a 50,000 pronunciation data repository in the Odia language

Subhashish Panigrahi | 10 March 2022

We started a pilot under the OpenSpeaks project to build voice data as a foundational layer for speech synthesis research and application development. Recently, the pilot hit a milestone of 55,000 pronunciations. The repository also includes pronunciations of 5,600 words in Baleswaria, the northern dialect of Odia. These recordings form the largest repository of public-domain voice data in Odia, and they add to another 4,000+ recordings of sentences in Odia on Mozilla Common Voice.
