PhD Position: 'Multimodal Multi-Hop Reasoning for Video Analysis' (M/F)
Overview
Orange Innovation brings together the research and innovation activities and expertise of the Group's entities and countries. We work every day to ensure that Orange is recognized by its customers as an innovative operator, and we create value for the Group in each of our projects. With 720 researchers and thousands of marketers, developers, designers and data analysts, it is the expertise of our 6,000 employees that fuels this ambition every day. Orange Innovation anticipates technological breakthroughs and supports the Group's countries and entities in making the best technological choices to meet the needs of our consumer and business customers.
Role
Your role is to pursue a PhD thesis on "Multimodal multi‑hop reasoning for video analysis".
Research Focus
Multimodal reasoning represents a major shift in AI, going beyond single‑modality approaches to jointly process visual, linguistic and auditory information. The main challenge is to integrate these heterogeneous sources, which differ in structure and representation. Recently, so‑called "omni" unified models have emerged that can account for multiple modalities simultaneously, but their use of each modality remains poorly understood. Videos particularly illustrate this complexity: they combine visual, audio and sometimes textual content (subtitles) and constitute a demanding evaluation domain. Multi‑hop video reasoning must link cues dispersed across different segments while ensuring temporal alignment, semantic coherence and robust intermodal fusion in the presence of asynchronous signals.
The thesis goal is to study the interaction between modalities in video analysis and to improve multi‑hop reasoning across distinct segments. Determining when and how multiple modalities contribute to reasoning represents just part of the challenge. Current models fail to guarantee consistent use of the full modality set, with some multimodal configurations underperforming unimodal reasoning. These findings suggest dataset biases, "modality collapse" phenomena, and fundamental limitations in modality alignment and exploitation.
Axis 1: Evaluation, Robustness and Interpretability
This axis involves characterizing the conditions under which models truly exploit multiple modalities and when they fall back to a single one, using probing, systematic analyses, modality ablations and controlled data manipulations (synthetic data, counterfactual examples, physics-informed scenarios). Robustness protocols (noise, suppression or misalignment of modalities) will make it possible to diagnose the causal role of each signal.
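To make the idea of a modality-ablation protocol concrete, here is a minimal, framework-free sketch. Everything in it is illustrative: the `toy_model`, the example data, and the modality names are hypothetical stand-ins, not the thesis's actual models or benchmarks.

```python
# Illustrative modality-ablation protocol: score the same questions with the
# full input and with each modality masked (set to None), to estimate how much
# the model causally relies on each signal.
from typing import Callable, Dict, List, Optional, Sequence

Modalities = Dict[str, Optional[list]]  # e.g. {"video": ..., "audio": ..., "text": ...}

def ablation_scores(
    model: Callable[[Modalities], str],   # returns a predicted answer string
    examples: List[Dict],                 # each: {"inputs": Modalities, "answer": str}
    modalities: Sequence[str] = ("video", "audio", "text"),
) -> Dict[str, float]:
    """Accuracy with the full input, and with each modality masked out."""
    def accuracy(mask: Optional[str]) -> float:
        correct = 0
        for ex in examples:
            inputs = {m: (None if m == mask else v) for m, v in ex["inputs"].items()}
            correct += int(model(inputs) == ex["answer"])
        return correct / len(examples)

    scores = {"full": accuracy(mask=None)}
    for m in modalities:
        scores[f"without_{m}"] = accuracy(mask=m)
    return scores

# Toy model: answers correctly only when both video and audio cues are present,
# mimicking a question whose evidence spans two modalities.
def toy_model(inputs: Modalities) -> str:
    return "yes" if inputs.get("video") and inputs.get("audio") else "unknown"

examples = [{"inputs": {"video": [1], "audio": [1], "text": [1]}, "answer": "yes"}] * 4
print(ablation_scores(toy_model, examples))
```

A large accuracy drop when one modality is masked suggests the model genuinely uses it; no drop suggests the modality is ignored, which is one symptom of the "modality collapse" phenomena mentioned above.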
Axis 2: Solutions and Training of Truly Multimodal Models
Based on the identified challenges, the thesis will aim to design and train architectures and learning procedures that promote collaboration between modalities (attention or routing mechanisms, intermodal coherence constraints, temporal grounding objectives). The ambition is to obtain truly multimodal, robust, efficient and interpretable multi‑hop video reasoning models that outperform their unimodal counterparts.
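As a rough intuition for the routing mechanisms mentioned above, the sketch below shows the simplest form of gated fusion: a softmax over per-modality gate scores produces weights for combining modality embeddings. The function names, fixed gate values and two-dimensional embeddings are purely illustrative; in practice the gate would be a learned, input-dependent network.

```python
# Minimal sketch of gated modality fusion: a softmax over gate logits yields
# per-modality weights, and the fused representation is the weighted sum of
# the modality embeddings.
import math
from typing import Dict, List, Tuple

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def gated_fusion(
    embeddings: Dict[str, List[float]],  # one embedding vector per modality
    gate_logits: List[float],            # one (learned) gate score per modality
) -> Tuple[List[float], Dict[str, float]]:
    """Return the fused embedding and the weight assigned to each modality."""
    weights = softmax(gate_logits)
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for w, emb in zip(weights, embeddings.values()):
        for i, v in enumerate(emb):
            fused[i] += w * v
    return fused, dict(zip(embeddings.keys(), weights))

fused, weights = gated_fusion(
    {"video": [1.0, 0.0], "audio": [0.0, 1.0]},
    gate_logits=[0.0, 0.0],  # equal logits -> equal weights
)
print(fused, weights)
```

Inspecting the per-modality weights is one simple way such an architecture stays interpretable: collapsed routing (all weight on one modality) is directly visible.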
Hard and Soft Skills Required for the Position
- Proficiency in Deep Learning techniques (text, image, audio or video processing).
- Programming skills, particularly in Python, with experience in deep learning frameworks such as PyTorch or TensorFlow.
- Ability to analyze and interpret complex data, with strong analytical skills.
- Personal qualities: scientific rigor, autonomy, curiosity, initiative, ability to work in a team.
- Strong oral and written communication skills in English for presenting research findings and drafting publications and research reports.
- Ability to present results clearly and pedagogically to different audiences.
Required Education
You hold a professional or research master's degree or have graduated from an engineering school in computer science or applied mathematics, preferably with a specialization in one or more fields of artificial intelligence.
Desired Experience
- Prior experience in research projects or internships in video processing or multimodality.
- Experience with vision‑language models (VLMs) and/or multimodal LLMs (MLLMs).
- Experience in Natural Language Processing (NLP).
- In‑depth understanding of LLMs and reasoning models.
- Participation in scientific publications or presentations in the field is a plus.