Publications and Research

Structural Silence: When AI Infrastructure Fails Speakers of Underrepresented Languages

Document Type

Working Paper

Publication Date

Spring 4-2026

Abstract

Artificial intelligence tools for education and language support are increasingly framed as scalable responses to access gaps in under-resourced communities. Yet the infrastructure underlying these tools (training corpora, tokenization schemes, evaluation benchmarks, and deployment architectures) encodes assumptions that systematically disadvantage speakers of underrepresented languages before a single model is trained. This paper examines those assumptions through the lens of Bengali, one of the world’s most widely spoken languages, with roughly 285 million speakers (Ethnologue, 2025; International Communication and Leadership School, 2026), and traces the structural barriers that emerge when building AI-assisted educational tools for Bengali-speaking learners in low-connectivity environments. We identify four interlocking structural failures: a severe web presence gap, in which Bengali accounts for fewer than 0.5% of global web content although its speakers represent nearly 4% of the global population (Pimienta, 2024; W3Techs, 2026); a 67:1 training token deficit between English and Bengali in major multilingual corpora (Khan et al., 2024; Langlais et al., 2025); a tokenization penalty rooted in Bengali’s alphasyllabary script, which compounds the data deficit by raising token fertility, that is, the number of tokens needed per word (Shahriar and Barbosa, 2024); and a connectivity exclusion that renders cloud-dependent AI tools inaccessible to the rural populations most likely to benefit from them, where individual internet penetration stands at 36.5%, compared to 71.4% in urban areas (Bangladesh Bureau of Statistics, 2025; The Daily Star, 2025). These failures reflect recurring patterns in infrastructure design that cannot be resolved through isolated technical adjustments. They are the downstream consequences of longstanding resource allocation decisions, institutional priorities, and design defaults that did not center certain languages in mainstream AI development. Dataset scarcity should therefore be understood as a structural barrier shaped by those decisions rather than as a limitation of individual researchers. Likewise, offline-first design functions as an equity-oriented infrastructure strategy rather than a secondary technical compromise. We close with specific directions for how the linguistics and AI communities might respond.
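
As a rough illustration of the tokenization penalty named above, the following is a minimal sketch and not the paper's own measurement. It counts UTF-8 bytes per word as a crude stand-in for the fertility of a byte-level tokenizer that has learned no merges. The function name `byte_fertility` and the two sample sentences are hypothetical, chosen only for illustration. Bengali code points occupy three bytes each in UTF-8, whereas ASCII letters occupy one, so a short Bengali sentence costs several times more byte-level units than its English counterpart.

```python
def byte_fertility(text: str) -> float:
    """Return the average number of UTF-8 bytes per whitespace-separated word.

    This is only a proxy for token fertility (tokens per word): it matches a
    byte-level tokenizer with no learned merges, which is an assumption here.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    total_bytes = sum(len(word.encode("utf-8")) for word in words)
    return total_bytes / len(words)


# Illustrative sample sentences (assumed, not taken from the paper).
english = "I eat rice"
bengali = "আমি ভাত খাই"  # the same sentence in Bengali

english_rate = byte_fertility(english)
bengali_rate = byte_fertility(bengali)

print(f"English bytes per word: {english_rate:.2f}")
print(f"Bengali bytes per word: {bengali_rate:.2f}")
print(f"Bengali / English ratio: {bengali_rate / english_rate:.2f}")
```

Real subword tokenizers learn merges, so measured fertility depends on the tokenizer and its training data. For a byte-level tokenizer, the byte count is only an upper bound on the number of tokens.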

Comments

ORCiD (Avijit Roy): https://orcid.org/0009-0007-8036-0952
ORCiD (Proma Roy): https://orcid.org/0009-0004-1060-9116
DOI: http://dx.doi.org/10.2139/ssrn.6522858

Included in

Anthropological Linguistics and Sociolinguistics Commons, Applied Linguistics Commons, Artificial Intelligence and Robotics Commons, Computational Linguistics Commons, Digital Humanities Commons, Instructional Media Design Commons, Science and Technology Studies Commons
