Publications and Research

Document Type

Working Paper

Publication Date

Spring 4-2026

Abstract

Artificial intelligence tools for education and language support are increasingly framed as scalable responses to access gaps in under-resourced communities. Yet the infrastructure underlying these tools (training corpora, tokenization schemes, evaluation benchmarks, and deployment architectures) encodes assumptions that systematically disadvantage speakers of underrepresented languages before a single model is trained. This paper examines those assumptions through the lens of Bengali, one of the world’s most widely spoken languages with roughly 285 million speakers (Ethnologue, 2025; International Communication and Leadership School, 2026), and the structural barriers that emerge when attempting to build AI-assisted educational tools for Bengali-speaking learners in low-connectivity environments. We identify four interlocking structural failures: a severe web presence gap, in which Bengali accounts for fewer than 0.5% of global web content although its speakers make up nearly 4% of the global population (Pimienta, 2024; W3Techs, 2026); a 67:1 training token deficit between English and Bengali in major multilingual corpora (Khan et al., 2024; Langlais et al., 2025); a tokenization penalty rooted in Bengali’s alphasyllabary script, which compounds the data deficit by raising token fertility, the number of tokens needed to encode each word (Shahriar and Barbosa, 2024); and a connectivity exclusion that renders cloud-dependent AI tools inaccessible to the rural populations most likely to benefit from them, where individual internet penetration stands at 36.5%, compared with 71.4% in urban areas (Bangladesh Bureau of Statistics, 2025; The Daily Star, 2025). These failures reflect recurring patterns in infrastructure design that cannot be resolved through isolated technical adjustments alone. They are the downstream consequences of longstanding resource allocation decisions, institutional priorities, and design defaults that did not center certain languages in mainstream AI development.
Dataset scarcity should therefore be understood as a structural barrier shaped by those decisions rather than as a limitation of individual researchers. By the same reasoning, offline-first design functions as an equity-oriented infrastructure strategy rather than a secondary technical compromise. We close with specific directions for how the linguistics and AI communities might respond.

Comments

  • ORCiD (Avijit Roy): https://orcid.org/0009-0007-8036-0952
  • ORCiD (Proma Roy): https://orcid.org/0009-0004-1060-9116
  • DOI: http://dx.doi.org/10.2139/ssrn.6522858
