Publications and Research
Document Type
Working Paper
Publication Date
Spring 4-2026
Abstract
Artificial intelligence tools for education and language support are increasingly framed as scalable responses to access gaps in under-resourced communities. Yet the infrastructure underlying these tools (training corpora, tokenization schemes, evaluation benchmarks, and deployment architectures) encodes a set of assumptions that systematically disadvantages speakers of underrepresented languages before a single model is trained. This paper examines those assumptions through the lens of Bengali, one of the world’s most widely spoken languages, with roughly 285 million speakers (Ethnologue, 2025; International Communication and Leadership School, 2026), and the structural barriers that emerge in attempts to build AI-assisted educational tools for Bengali-speaking learners in low-connectivity environments. We identify four interlocking structural failures: a severe web presence gap, with Bengali accounting for fewer than 0.5% of global web content although its speakers represent nearly 4% of the global population (Pimienta, 2024; W3Techs, 2026); a 67:1 training token deficit between English and Bengali in major multilingual corpora (Khan et al., 2024; Langlais et al., 2025); a tokenization penalty rooted in Bengali’s alphasyllabary script, which compounds the data deficit by requiring more tokens per word, that is, a higher token fertility rate (Shahriar and Barbosa, 2024); and a connectivity exclusion that renders cloud-dependent AI tools inaccessible to the rural populations most likely to benefit from them, given that individual internet penetration stands at 36.5% in rural areas compared to 71.4% in urban areas (Bangladesh Bureau of Statistics, 2025; The Daily Star, 2025). These failures reflect recurring patterns in infrastructure design that cannot be resolved through isolated technical adjustments alone. They are the downstream consequences of longstanding resource allocation decisions, institutional priorities, and design defaults that did not center certain languages in mainstream AI development.
Dataset scarcity should therefore be understood as a structural barrier shaped by those decisions rather than as a limitation of individual researchers. Offline-first design, in turn, functions as an equity-oriented infrastructure strategy rather than a secondary technical compromise. We close with specific directions for how the linguistics and AI communities might respond.
Included in
Anthropological Linguistics and Sociolinguistics Commons, Applied Linguistics Commons, Artificial Intelligence and Robotics Commons, Computational Linguistics Commons, Digital Humanities Commons, Instructional Media Design Commons, Science and Technology Studies Commons

Comments