Date of Degree


Document Type


Degree Name





Virginia Valian


Martin Chodorow

Committee Members

Kyle Gorman

Sandeep Prasada

Subject Categories

Cognitive Psychology | Computational Linguistics | Developmental Psychology | First and Second Language Acquisition


language acquisition, corpus linguistics, computational models, utterance length, syntactic development


How early do children produce multiword utterances? Do children's early utterances reflect abstract syntactic knowledge, or are they the result of data-driven learning? We examine these questions through corpus analysis, computational modeling, and adult simulation experiments. Chapter 1 investigates when children start producing multiword utterances; we use corpora to establish how multiword utterances develop and a probabilistic computational model to account for the quantitative change in early multiword utterances. We find that multiword utterances of different lengths appear early in acquisition and increase together, and that the growth in utterance length can be viewed as a probabilistic and dynamic process.

Chapter 2 asks whether very early combinatorial speech reflects abstract syntactic knowledge or simply item-based learning driven by linguistic input. We use different language models (LMs) to track syntactic and lexical development separately. The results show that the syntactic structure behind children's early combinatorial speech may outpace the development of word combinations acquired from the learning input.

Chapter 3 investigates whether the ungrammatical utterances that children produce at an early age (such as 'key-open-door') have adult-like syntactic structure despite their incorrect word choices or missing words, or whether those sequences come from data-driven learning of words without syntactic knowledge. We ask a) adult native speakers, b) statistical LMs, and c) deep neural LMs to produce intelligible utterances from scrambled versions of children's multiword utterances (e.g., 'door-key-open'). We find that statistical LMs that rely on local statistical learning and are trained on child-directed speech can account for the production of those early multiword utterances. The predictive fit of a simple statistical model is as good as, or even better than, that of human subjects and of the neural model, which assumes more complex learning mechanisms and was trained on a larger dataset. Taken together, the three chapters provide a new, systematic account of when and how children's very early combinatorial speech develops.
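To make the unscrambling task of Chapter 3 concrete, the following is a minimal sketch of how a local statistical LM could reorder a scrambled utterance. It is not the model used in the dissertation: the bigram model, the add-one smoothing, the brute-force search over word orders, the toy corpus `TOY_CDS`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import permutations

# Hypothetical toy stand-in for child-directed speech; not data from the thesis.
TOY_CDS = [
    "open the door",
    "you open the box",
    "where is the key",
    "the key is here",
]


def train_bigram(utterances):
    """Count bigrams and their left contexts, using utterance-boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for utterance in utterances:
        tokens = ["<s>"] + utterance.split() + ["</s>"]
        vocab.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1
    return bigrams, contexts, len(vocab)


def log_prob(words, model):
    """Log probability of one word order under an add-one smoothed bigram model."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(w1, w2)] + 1) / (contexts[w1] + vocab_size))
        for w1, w2 in zip(tokens, tokens[1:])
    )


def unscramble(words, model):
    """Return the ordering of `words` that the model scores highest.

    Brute force over all permutations; only feasible for short utterances.
    """
    return list(max(permutations(words), key=lambda order: log_prob(order, model)))


MODEL = train_bigram(TOY_CDS)
```

Calling `unscramble(["door", "key", "open"], MODEL)` returns whichever order of the three words the toy model finds most probable. In a setup like the one described above, the model's preferred order would then be compared with the order the child actually produced; with a corpus this small, the result says nothing about real child-directed speech.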