Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/22275

| Title: | PEDESTRIAN INTENTION PREDICTION FOR AUTONOMOUS VEHICLES |
| Authors: | SHARMA, NEHA |
| Keywords: | PEDESTRIAN INTENTION; AUTONOMOUS VEHICLES; AV TECHNOLOGY; PREDICTION; CNN; GAN |
| Issue Date: | Sep-2025 |
| Series/Report no.: | TD-8266; |
| Abstract: | According to the Global Status Report on Road Safety 2023, vehicle crashes cause numerous deaths annually, particularly among vulnerable road users. Pedestrians, lacking protective gear, are highly vulnerable and face substantial injury risk in collisions. Consequently, Autonomous Vehicle (AV) technology is being advanced to enhance road safety and convenience for all users. AV technology can reduce accidents attributed to human errors such as fatigue, misperception, and inattention. Leading automotive manufacturers and tech giants, including BMW, Tesla, and Google, are actively advancing AV technology in this pursuit. Predicting pedestrians' road-crossing decisions is pivotal for achieving a reliable driverless experience through AVs. Initial studies emphasised pedestrian dynamics to anticipate crossing intent, yet analysing the trajectory alone proves inadequate for understanding underlying intentions. Beyond trajectory, various factors impact pedestrian road-crossing decisions. These factors fall into three primary modalities: pedestrian-specific (encompassing pose, appearance, etc.), context-specific (involving scene infrastructure and social interaction with co-pedestrians), and a hybrid modality encompassing comprehensive human cognitive aspects while observing a pedestrian on the road. Nonetheless, handling such diverse modalities necessitates an efficient multimodal fusion framework that can capture adequate discriminatory features for classification. Moreover, interpreting pedestrian interactions with the surrounding environment is highly challenging in a dynamic ego-centric setting. With the rise of deep learning, researchers started using deep neural networks (DNNs) to analyse large amounts of data and automatically learn features indicative of pedestrian intention. These models are trained on large datasets of pedestrian behaviour and show improved accuracy over traditional rule-based methods. 
This has led to the development of end-to-end models involving convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants that process raw sensory data, such as camera images or lidar point clouds, to make predictions. These approaches are seen as more robust and capable of handling complex scenarios where single-modality approaches may fail, as they can learn the relationships between different modalities and make predictions in a more integrated manner. This thesis explores deep learning-based approaches for predicting pedestrian intentions in autonomous vehicles. Pedestrian intention prediction is a multi-stage process comprising input acquisition, feature extraction and encoding, spatiotemporal modelling, multimodal fusion, and final decoding or classification. Each stage plays a crucial role in ensuring accurate predictions, with variations in approach depending on the specific output required, such as pedestrian crossing intent classification or trajectory anticipation. The first stage of the process involves acquiring input data in the form of video frames and trajectory coordinates spanning a specific time window. These inputs can be sourced from real-time surveillance systems or pre-recorded video sequences captured from multiple camera angles. This data undergoes pre-processing to extract spatial and temporal features aligned with model requirements. Convolutional Neural Networks (CNNs), such as EfficientNet, are used to derive spatial representations from RGB sequences and segmentation maps, capturing posture, orientation, and environmental cues. To model temporal dependencies, Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks process historical trajectory data, enabling the inference of motion trends for accurate behaviour prediction. 
Following feature extraction, the system proceeds to spatiotemporal modelling, which aims to capture the evolving interactions between pedestrians and their surrounding environment over time. This thesis investigates two distinct approaches for this task: Graph Convolutional Networks (GCNs) and Co-Learning Transformers. The GCN-based approach, incorporating a multi-head adjacency matrix, structures pedestrian trajectory data as a graph, enabling the model to learn relational dependencies among individuals. In contrast, the Co-Learning Transformer approach focuses on temporal modelling, capturing long-range dependencies and refining motion features through attention mechanisms. Given that pedestrian intention prediction depends on multiple input modalities, an effective fusion strategy is critical for integrating these diverse sources of information. This thesis employs several advanced fusion mechanisms to address this challenge. Adaptive Fusion dynamically adjusts the importance of features based on contextual cues, allowing the model to prioritize relevant information. Co-Learning Architectures enable different modalities to contribute distinct and informative perspectives, enhancing the overall representation. The Multi-Head Shared Weights Mechanism promotes feature consistency across modalities by sharing parameters, thereby reducing redundancy and improving generalization. Finally, the Progressive Denoising Attention Mechanism incrementally filters out irrelevant noise while emphasizing salient patterns, leading to more refined and robust feature representations. The final stage of the process involves decoding the fused feature representations to generate meaningful predictions about pedestrian behaviour. This thesis explores two primary decoding approaches. Pedestrian Intention Classification employs a classifier, such as a SoftMax layer, to infer whether a pedestrian intends to cross the street, based on their observed behaviour and contextual cues. 
Trajectory Prediction, on the other hand, utilizes generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) to forecast future trajectories by learning from historical motion patterns. The performance of each proposed pedestrian intention prediction approach is tested on various publicly available datasets and compared with earlier state-of-the-art algorithms. Finally, the research work is concluded, followed by future research directions and possible future applications, which are highlighted and discussed in detail. |
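The abstract's GCN-based spatiotemporal modelling stage, in which pedestrian trajectory data is structured as a graph and processed with a multi-head adjacency matrix, can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, feature sizes, and the choice of symmetric normalization with self-loops are illustrative assumptions based on standard GCN practice.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize an adjacency matrix: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def multi_head_gcn_layer(X, adjacencies, weights):
    """One GCN layer with a per-head adjacency matrix (illustrative).

    X           : (N, F) node features, one row per pedestrian
    adjacencies : list of (N, N) adjacency matrices, one per head
    weights     : list of (F, F_out) weight matrices, one per head
    Returns the concatenated head outputs, shape (N, F_out * num_heads).
    """
    heads = []
    for A, W in zip(adjacencies, weights):
        # Aggregate neighbour features, project, then apply ReLU.
        heads.append(np.maximum(normalize_adjacency(A) @ X @ W, 0.0))
    return np.concatenate(heads, axis=1)
```

Each head can learn a different relational pattern among pedestrians (e.g. proximity vs. heading similarity) because each gets its own adjacency matrix; concatenating the heads preserves those distinct views for the fusion stage.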
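The Adaptive Fusion idea described above, dynamically weighting each modality's contribution based on contextual cues, can be sketched as a small softmax gate over per-modality feature vectors. The gating parameterization (a linear layer over the concatenated features producing one score per modality) is an assumption for illustration, not the thesis's architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def adaptive_fusion(modal_feats, gate_W, gate_b):
    """Fuse per-modality features with context-dependent scalar gates.

    modal_feats : (M, F) one F-dim feature vector per modality
    gate_W      : (M, M*F) gating weights, gate_b : (M,) gating bias
    Returns the fused (F,) vector and the (M,) modality weights.
    """
    ctx = modal_feats.reshape(-1)        # concatenated context vector
    logits = gate_W @ ctx + gate_b       # one relevance score per modality
    w = softmax(logits)                  # weights sum to 1
    return w @ modal_feats, w
```

Because the gate sees all modalities at once, a noisy modality (e.g. an occluded pose stream) can be down-weighted per sample rather than averaged in with a fixed coefficient.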
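Finally, the decoding stage's SoftMax-based intention classifier reduces to a linear head over the fused features followed by a softmax over two classes (not-crossing vs. crossing). The two-class layout and the 0.5 decision threshold below are illustrative assumptions.

```python
import numpy as np

def decode_intention(fused, W, b, threshold=0.5):
    """Map a fused feature vector to a crossing decision (illustrative).

    fused : (F,) fused feature vector
    W     : (F, 2) classifier weights; columns = [not-crossing, crossing]
    b     : (2,) classifier bias
    Returns (P(crossing), crossing?) as a float and a bool.
    """
    logits = fused @ W + b
    logits = logits - logits.max()       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1]), bool(probs[1] >= threshold)
```

In a deployed pipeline the threshold would be tuned for the cost asymmetry of the task, since missing a crossing pedestrian is far costlier than a false alarm.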
| URI: | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22275 |
| Appears in Collections: | Ph.D. Electronics & Communication Engineering |
Files in This Item:
| File | Description | Size | Format |  |
|---|---|---|---|---|
| NEHA SHARMA Ph.D..pdf |  | 9.13 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.