Dissertations, Theses, and Capstone Projects

Date of Degree

2-2019

Document Type

Thesis

Degree Name

M.A.

Program

Linguistics

Advisor

William Sakas

Subject Categories

Computational Linguistics

Keywords

WGAN, Word Vector, NLG

Abstract

We explore the use of image generation techniques to generate natural language, applying Generative Adversarial Networks (GANs), which are normally used for image generation, to this task. To avoid discrete training data such as one-hot encoded vectors, whose dimensionality equals the vocabulary size, we train on word embeddings instead. The main motivation is that a sentence translated into a sequence of word embeddings (a “word matrix”) is analogous to the matrix of pixel values in an image. These word matrices can then be used to train a generative adversarial model. The generator’s outputs are word matrices, which are translated back into sentences by mapping each vector to the vocabulary word whose embedding has the highest cosine similarity. Four models were designed and trained using this method, including two Deep Convolutional Generative Adversarial Networks (DCGANs). Mode collapse was a common problem, as were generally ungrammatical outputs. However, by using Wasserstein GANs with gradient penalty (WGAN-GP), we were able to train models that showed no mode collapse and whose generator outputs were reasonably well-formed. Generator outputs were evaluated for well-formedness using a pretrained BERT language model and for uniqueness using an inter-sample BLEU score. Both WGAN-GP models performed well on these two metrics.
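The round trip between sentences and word matrices described above can be illustrated with a minimal sketch. It is not taken from the thesis code: the three-word vocabulary, the 3-dimensional vectors in `EMBEDDINGS`, and the helper names `sentence_to_matrix` and `matrix_to_sentence` are all hypothetical, chosen only to show how a noisy generator output might be decoded by highest cosine similarity.

```python
import math

# Hypothetical toy embedding table. Real experiments would use pretrained
# embeddings with far more dimensions and a much larger vocabulary.
EMBEDDINGS = {
    "the": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.2],
    "sat": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_to_matrix(tokens, embeddings):
    """Encode a tokenized sentence as a word matrix, one embedding per token."""
    return [list(embeddings[token]) for token in tokens]


def matrix_to_sentence(matrix, embeddings):
    """Decode each vector to the vocabulary word with the highest cosine similarity."""
    return [
        max(embeddings, key=lambda word: cosine_similarity(vector, embeddings[word]))
        for vector in matrix
    ]


# A made-up "generator output": vectors near, but not equal to, real embeddings.
noisy = [[0.8, 0.2, 0.1], [0.2, 0.7, 0.3], [0.1, 0.1, 0.8]]
decoded = matrix_to_sentence(noisy, EMBEDDINGS)
print(" ".join(decoded))
```

Because decoding always selects some vocabulary word, every generated matrix yields a sentence, which is one reason the outputs still need to be checked for well-formedness afterwards.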

All models were constructed and trained using PyTorch, a machine learning library for Python. All code used in the experiments can be found at https://github.com/robert-d-schultz/gan-word-embedding.

gan-word-embedding.zip (19 kB)
Copy of GitHub repository
