Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
In this article, the authors diagnose the problems of machine translation systems for low-resourced languages by reflecting on which agents and interactions are necessary for a sustainable machine translation research process.
Nekoto et al. (2020). Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160. Online. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.195
Data collection and analysis.
Participatory research.
Community consultations and meetings.
Interviews with community members.
Literature review of relevant academic content on the issue.
In this article, the authors address the important role that machine translation (MT) plays in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT remains centered on a few high-resourced languages, leaving many Indigenous languages under-resourced. In their research project, the authors demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. They conclude that implementing participatory research in this context yields novel translation datasets and MT benchmarks, and enables participants without formal training to make a unique scientific contribution to the field.
The workshops and programs described in the article were conducted to high ethical standards, shown through a strong commitment to respecting and engaging with Indigenous values and to building community members' capacity to participate in and conduct further research on the topic.
The authors diagnose the problems of machine translation systems for low-resourced languages by reflecting on which agents and interactions are necessary for a sustainable machine translation research process. They conclude that participatory research is needed to involve these agents, facilitate the required interactions, and build sustainable machine translation research networks for low-resourced languages.
Documents produced in community consultations and meetings.
Community knowledge, in material (written, visual, and audio) or oral history.
Academic publications on linguistics, machine learning, and machine translation.
The experiences from the participatory research and the project activities are well documented and available on an open-source website and in the cited published article. Moreover, as a result of mentorship and knowledge exchange between agents of the translation process, the implementation of participatory research has produced artifacts for NLP research, namely datasets, benchmarks, and models, which are publicly available online. Additionally, over 10 participants have gone on to publish works addressing language-specific challenges at conferences and workshops.
“Through the lens of a machine learning researcher, “low-resourced” identifies languages for which few digital or computational data resources exist, often classified in comparison to another language. However, to the sociolinguist, “low-resourced” can be broken down into many categories: low density, less commonly taught, or endangered, each carrying slightly different meanings. In this complex definition, the “low-resourced”-ness of a language is a symptom of a range of societal problems, e.g., authors oppressed by colonial governments have been imprisoned for writing novels in their languages impacting the publications in those languages, or that fewer PhD candidates come from oppressed societies due to low access to tertiary education. This results in fewer linguistic resources and researchers from those regions to work on NLP for their language. Therefore, the problem of “low-resourced”-ness relates not only to the available resources for a language, but also to the lack of geographic and language diversity of NLP researchers themselves.” (p. 2145)
Science and Technology Studies, Linguistics, Machine Learning, Digital Humanities, Information Studies, Education Studies