Dissertations, Theses, and Capstone Projects

Date of Degree

2-2024

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Yingli Tian

Committee Members

Zhigang Zhu

Ioannis Stamos

Hassan Akbari

Subject Categories

Artificial Intelligence and Robotics | Other Computer Sciences

Keywords

Computer Vision, Pattern Recognition, Video Analysis, Scene Understanding, Action Detection, Action Recognition

Abstract

Understanding human actions in videos holds immense potential for technological advancement and societal benefit. This thesis explores fundamental aspects of this field, including action recognition in trimmed clips and action detection in untrimmed videos. Trimmed videos contain a single action instance, with the moments before and after the action removed. In contrast, most videos captured in unconstrained environments, often referred to as untrimmed videos, are naturally unsegmented: they are typically lengthy and may contain multiple action instances, the moments preceding and following each action, and transitions between actions. For action recognition in trimmed clips, the primary objective is to classify the action category. Action detection in untrimmed videos, by contrast, aims to accurately identify the start and end of each action while also assigning the corresponding action label.

Action understanding in videos has significant implications across many sectors. It is invaluable in surveillance for identifying potential threats and in healthcare for monitoring patient movements. Importantly, it also serves as an indispensable tool for interpreting sign language, facilitating communication with the deaf and hard-of-hearing community.

This research presents novel frameworks for video-based action recognition and detection. Because annotating temporal boundaries and action labels for every action instance in untrimmed videos is labor-intensive and expensive, this work introduces frameworks that rely on limited supervision, mitigating the need for exhaustive annotations. The proposed models demonstrate significant performance improvements over the current state of the art on benchmark datasets. Furthermore, the application of action understanding to sign language videos is explored by pioneering the automated detection of signing errors, and the effectiveness of the models is evaluated on the collected sign language datasets.