Dissertations, Theses, and Capstone Projects
Date of Degree
2-2024
Document Type
Dissertation
Degree Name
Ph.D.
Program
Computer Science
Advisor
Yingli Tian
Committee Members
Zhigang Zhu
Ioannis Stamos
Hassan Akbari
Subject Categories
Artificial Intelligence and Robotics | Other Computer Sciences
Keywords
Computer Vision, Pattern Recognition, Video Analysis, Scene Understanding, Action Detection, Action Recognition
Abstract
The understanding of human actions in videos holds immense potential for technological advancement and societal benefit. This thesis explores fundamental aspects of this field, including action recognition in trimmed clips and action localization in untrimmed videos. Trimmed videos contain a single action instance, with the moments before and after the action excluded. In contrast, the majority of videos captured in unconstrained environments, often referred to as untrimmed videos, are naturally unsegmented: they are typically lengthy and may contain multiple action instances, the moments preceding and following each action, and transitions between actions. In action recognition on trimmed clips, the primary objective is to classify the action category. Action detection in untrimmed videos, by contrast, aims to identify the starting and ending moments of each action while also assigning the corresponding action label.

Action understanding in videos has significant implications across various sectors. It is invaluable in surveillance for identifying potential threats and in healthcare for monitoring patient movements. Importantly, it serves as an indispensable tool for interpreting sign language, facilitating communication with the deaf and hard-of-hearing community.

This research presents novel frameworks for video-based action recognition and detection. Annotating temporal boundaries and action labels for all action instances in untrimmed videos is a labor-intensive and expensive process; to mitigate the need for exhaustive annotations, this work introduces frameworks that rely on limited supervision. The proposed models demonstrate significant performance improvements over the current state of the art on benchmark datasets. Furthermore, the applications of action understanding in sign language videos are explored by pioneering the automated detection of signing errors, and the effectiveness of the models is evaluated on sign language datasets collected for this research.
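To make the distinction between the two tasks concrete, the Python sketch below contrasts their input-output interfaces. It is illustrative only, not drawn from the dissertation's models: the function names, tensor shapes, and placeholder outputs are hypothetical assumptions.

    # Illustrative sketch of the two task interfaces; all names and
    # outputs here are hypothetical, not the dissertation's models.
    from typing import List, Tuple
    import numpy as np

    def recognize_action(clip: np.ndarray, class_names: List[str]) -> str:
        """Action recognition: a trimmed clip (T x H x W x C) maps to one label."""
        # A real model would compute class scores from the clip; uniform
        # scores stand in here to keep the sketch runnable.
        scores = np.ones(len(class_names)) / len(class_names)
        return class_names[int(np.argmax(scores))]

    def detect_actions(video: np.ndarray,
                       class_names: List[str]) -> List[Tuple[float, float, str]]:
        """Temporal action detection: an untrimmed video maps to a list of
        (start_time, end_time, label) segments, possibly several per video."""
        # Placeholder segments; a real detector would localize them.
        return [(2.0, 5.5, class_names[0]), (8.0, 12.3, class_names[1])]

    if __name__ == "__main__":
        classes = ["wave", "jump"]
        clip = np.zeros((16, 112, 112, 3))      # trimmed clip: 16 frames
        video = np.zeros((900, 112, 112, 3))    # untrimmed video: 900 frames
        print(recognize_action(clip, classes))  # -> one label
        print(detect_actions(video, classes))   # -> list of (start, end, label)

In this formulation, recognition is single-label classification over a pre-segmented clip, while detection must both localize segments in time and label them, which is why detection methods are typically evaluated with temporal intersection-over-union against ground-truth segments.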
Recommended Citation
Vahdani, Elahe, "Deep Learning-Based Human Action Understanding in Videos" (2024). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/5653