Joya Chen (陈卓)

Hi! I’m Joya, a third-year Ph.D. candidate at the Show Lab, National University of Singapore (NUS), advised by Prof. Mike Shou. I’m currently interning at ByteDance Seed, working with the Multimodal Interaction and World Model team.

Previously, I worked with Wei Li at TikTok AIIC, Huiyu Wang at FAIR, Meta, and Zhaoyang Lv at Reality Labs Research, Meta.

My research centers on large multimodal models for video, spanning data scaling, model architecture, pre-training, post-training, and benchmarking.

Education

I obtained my bachelor's degree from the School of Automotive Engineering, WUT. To pursue my AI dream, I took the National Postgraduate Entrance Examination and ranked 1st in the School of Computer Science and Technology, USTC, where I obtained my master's degree under the supervision of Prof. Enhong Chen, Prof. Tong Xu, and Prof. Dong Liu. I was also a research assistant in the CVML@NUS group, working closely with Prof. Angela Yao.

Nice to meet you: joyachen@u.nus.edu :)

Google Scholar  /  GitHub  /  Zhihu

Activity
Core organizer of LOVEU: LOng-form VidEo Understanding Towards Multimodal AI Assistant and Copilot Workshop @ CVPR'24.

We have uploaded the recorded videos:

Excellent talks by Prof. Dima Damen, Prof. Marc Pollefeys, and Dr. Chunyuan Li.

Great winner talks for Track 1: Long-Term Video Question Answering, Track 2A: Text-Guided Video Editing, and Track 2B: Text-to-Video Generation.
Research
Seed1.5-VL Technical Report
ByteDance Seed. I contributed to the streaming capability and the interactive demo.
arXiv, 2025
Homepage / HuggingFace Demo / GitHub / API
LiveCC: Learn Video LLM with Streaming Speech Transcription at Scale
Joya Chen*, Ziyun Zeng*, Yiqi Lin*, Wei Li, Zejun Ma, Mike Zheng Shou (*Equal contribution)
CVPR, 2025
All open-sourced! Checkpoints / Pre-training & SFT Datasets / Training Code / Evaluation Benchmark / Gradio Demo
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Shiwei Wu*, Joya Chen*, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou (*Equal contribution)
NeurIPS, 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
NeurIPS, 2024
Code
Learning Video Context as Interleaved Multimodal Sequences
Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
ECCV, 2024
Code
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
CVPR, 2024
Homepage: Paper / Code / Data / Demo / Checkpoints
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, ..., Mike Zheng Shou, Michael Wray
CVPR (Oral), 2024
https://ego-exo4d-data.org/
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
arXiv, 2023
Paper / Page
UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
ICCV, 2023
Paper / Code / Demo
Affordance Grounding from Demonstration Video to Target Image
Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou
CVPR, 2023
Paper / Code
DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training
Joya Chen*, Kai Xu*, Yuhui Wang, Yifei Cheng, Angela Yao (*Equal contribution)
ICLR, 2023
OpenReview / arXiv / Code
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
Benita Wong*, Joya Chen*, You Wu*, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou (*Equal contribution)
ECCV, 2022
Paper / Page / Code / Challenge@CVPR'22
Is Heuristic Sampling Necessary in Training Deep Object Detectors?
Joya Chen, Dong Liu, Tong Xu, Shiwei Wu, Yifei Chen, Enhong Chen
IEEE Transactions on Image Processing, 2021
Paper / Code
Linking the Characters: Video-oriented Social Graph Generation via Hierarchical-cumulative GCN
Shiwei Wu, Joya Chen, Tong Xu, Liyi Chen, Lingfei Wu, Yao Hu, Enhong Chen
ACM MM (Oral), 2021
Paper
Engineering
Ranked 1st on the HO-3D leaderboard in Mesh Error/AUC and F@15mm metrics in Dec. 2020
Ranked 1st on the PASCAL VOC Object Detection Competition 3 leaderboard in Sep. 2018
Internships
Worked as an LLM research intern at ByteDance from Aug. 2024 to May 2025
Worked as an AI research scientist intern at FAIR, Meta AI from Dec. 2023 to May 2024
Worked as a computer vision research intern at Tencent from Jun. 2018 to Nov. 2019

Thanks to Jon Barron for the website template!