Joya Chen (陈卓)

Hi! I’m Joya, a third-year Ph.D. candidate at the Show Lab, National University of Singapore (NUS), advised by Prof. Mike Shou. I’m currently interning at ByteDance Seed, working with the Multimodal Interaction and World Model team.

Previously, I worked with Wei Li at TikTok AIIC, Huiyu Wang at FAIR, Meta, and Zhaoyang Lv at Reality Labs Research, Meta.

My research centers on large multimodal models for video, spanning data scaling, model architecture, pre-training, post-training, and benchmarking.

Education

I obtained my bachelor's degree from the School of Automotive Engineering, WUT. To pursue my AI dream, I took the National Postgraduate Entrance Examination and ranked 1st in the School of Computer Science and Technology, USTC, where I obtained my master's degree under the supervision of Prof. Enhong Chen, Prof. Tong Xu, and Prof. Dong Liu. I was also a research assistant in the CVML@NUS group, working closely with Prof. Angela Yao.

Nice to meet you: joyachen@u.nus.edu :)

Google Scholar  /  GitHub  /  Zhihu

Activity
Core organizer of LOVEU: LOng-form VidEo Understanding Towards Multimodal AI Assistant and Copilot Workshop @ CVPR'24.

We have uploaded the recorded videos:

Excellent talks by Prof. Dima Damen, Prof. Marc Pollefeys, and Dr. Chunyuan Li.

Great winner talks for Track 1: Long-Term Video Question Answering, Track 2A: Text-Guided Video Editing, and Track 2B: Text-to-Video Generation.
Research
Seed1.5-VL Technical Report
ByteDance Seed. I contributed to the streaming capability and the interactive demo.
arXiv, 2025
Homepage / HuggingFace Demo / GitHub / API
LiveCC: Learn Video LLM with Streaming Speech Transcription at Scale
Joya Chen*, Ziyun Zeng*, Yiqi Lin*, Wei Li, Zejun Ma, Mike Zheng Shou (*Equal contribution)
CVPR, 2025
All open-sourced! Checkpoints / Pre-training & SFT Datasets / Training Code / Evaluation Benchmark / Gradio Demo
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Shiwei Wu*, Joya Chen*, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou (*Equal contribution)
NeurIPS, 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
NeurIPS, 2024
Code
Learning Video Context as Interleaved Multimodal Sequences
Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
ECCV, 2024
Code
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
CVPR, 2024
Homepage: Paper / Code / Data / Demo / Checkpoints
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, ..., Mike Zheng Shou, Michael Wray
CVPR (Oral), 2024
https://ego-exo4d-data.org/
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou
arXiv, 2023
Paper / Page
UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
ICCV, 2023
Paper / Code / Demo
Affordance Grounding from Demonstration Video to Target Image
Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou
CVPR, 2023
Paper / Code
DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training
Joya Chen*, Kai Xu*, Yuhui Wang, Yifei Cheng, Angela Yao (*Equal contribution)
ICLR, 2023
OpenReview / arXiv / Code
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
Benita Wong*, Joya Chen*, You Wu*, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou (*Equal contribution)
ECCV, 2022
Paper / Page / Code / Challenge@CVPR'22
Is Heuristic Sampling Necessary in Training Deep Object Detectors?
Joya Chen, Dong Liu, Tong Xu, Shiwei Wu, Yifei Chen, Enhong Chen
IEEE Transactions on Image Processing, 2021
Paper / Code
Linking the Characters: Video-oriented Social Graph Generation via Hierarchical-cumulative GCN
Shiwei Wu, Joya Chen, Tong Xu, Liyi Chen, Lingfei Wu, Yao Hu, Enhong Chen
ACM MM (Oral), 2021
Paper
Engineering
Ranked 1st on the HO-3D leaderboard in Mesh Error/AUC and F@15mm metrics in Dec. 2020
Ranked 1st on the PASCAL VOC Object Detection Competition 3 leaderboard in Sep. 2018
Internships
Worked as an LLM research intern at ByteDance from Aug. 2024 to May 2025
Worked as an AI research scientist intern at FAIR, Meta AI from Dec. 2023 to May 2024
Worked as a computer vision research intern at Tencent from Jun. 2018 to Nov. 2019

Thanks to Jon Barron for the website template!