Date of Degree
Computer Sciences | Linguistics
alignment, bitext, corpora, parallel, translation
In this paper, we present a new sentence alignment system (Canvas), a Python implementation of a geometric approach to sentence alignment based on lexical cues. The Canvas system is designed mainly to handle parallel texts that exhibit complex misalignment patterns, specifically English-Arabic pairs from United Nations documents. The system relies heavily on pre-indexing the words/tokens in the source and target texts, and it creates correspondences between the token indexes. From this point onward, the alignment problem is reduced to the geometric problem of finding the path that runs through the True Correspondence Points (TCPs). The likelihood that a point is a TCP depends on how closely other points cluster around it, so we collect the most likely points and identify the shortest path containing the maximum number of these points, using a modified form of Dijkstra's algorithm. The results of the Canvas system are very promising: they demonstrate that it can handle intricate misalignment patterns in a completely automated fashion, with much better speed than other alignment approaches that use lexical cues and with good accuracy in general. The only drawback is that the system does not cover all the alignment segments, and its coverage is generally lower than that of other systems; improving coverage can be a subject of future research.
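The pipeline described in the abstract (index tokens, create correspondence points, keep the points with nearby neighbors, then trace a path through them) can be illustrated with a minimal Python sketch. This is not the Canvas implementation. The function names, the toy `lexicon` dictionary, and the `radius` and `min_neighbors` parameters are all assumptions made for illustration, and the path step uses a simple longest-monotone-chain dynamic program as a stand-in for the modified Dijkstra's algorithm mentioned above.

```python
from collections import defaultdict


def candidate_points(src_tokens, tgt_tokens, lexicon):
    """Pair each source token index with the indexes of its translations.

    `lexicon` is a hypothetical dict: source token -> list of target tokens.
    """
    tgt_index = defaultdict(list)  # target token -> positions where it occurs
    for j, tok in enumerate(tgt_tokens):
        tgt_index[tok].append(j)

    points = []
    for i, tok in enumerate(src_tokens):
        for translation in lexicon.get(tok, ()):
            for j in tgt_index.get(translation, ()):
                points.append((i, j))
    return points


def dense_points(points, radius=2, min_neighbors=1):
    """Keep points with at least `min_neighbors` other points within
    `radius` on both axes (a crude proxy for TCP likelihood)."""
    kept = []
    for (x, y) in points:
        neighbors = sum(
            1
            for (a, b) in points
            if (a, b) != (x, y) and abs(a - x) <= radius and abs(b - y) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append((x, y))
    return kept


def longest_monotone_chain(points):
    """Longest chain of points strictly increasing in both coordinates.

    O(n^2) dynamic program; a stand-in for the path search, not the
    modified Dijkstra's algorithm used by Canvas.
    """
    pts = sorted(set(points))
    if not pts:
        return []
    best = [1] * len(pts)   # best chain length ending at each point
    prev = [-1] * len(pts)  # back-pointer to the previous point in that chain
    for k, (x, y) in enumerate(pts):
        for m in range(k):
            a, b = pts[m]
            if a < x and b < y and best[m] + 1 > best[k]:
                best[k] = best[m] + 1
                prev[k] = m
    k = max(range(len(pts)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(pts[k])
        k = prev[k]
    return chain[::-1]


# Toy data: "b" has a spurious second translation that creates an off-path point.
src = ["a", "b", "c", "d", "e"]
tgt = ["A", "B", "x", "C", "D", "E"]
lexicon = {"a": ["A"], "b": ["B", "E"], "c": ["C"], "d": ["D"], "e": ["E"]}

points = candidate_points(src, tgt, lexicon)
path = longest_monotone_chain(dense_points(points))
```

In this toy run the spurious point `(1, 5)` survives the density filter but is excluded from the chain, because it cannot be ordered consistently with the other points on both axes.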
Ghaly, Hussein M., "Canvas: A fast and accurate geometric sentence alignment system using lexical cues within complex misalignment settings" (2014). CUNY Academic Works.