
The paper addresses the "SFT plateau," a phenomenon in which Supervised Fine-Tuning (SFT) performance of Large Language Models (LLMs) stops improving even as the dataset size increases [11, 22]. The authors use a chart-to-code dataset to demonstrate this limitation and propose Multimodal Structured Reinforcement Learning (MSRL), a two-stage Reinforcement Learning (RL) process, as a solution [11, 22].

2. Methodology

Supervised Fine-Tuning (SFT) phase: the baseline model is Qwen2.5-VL-7B-Instruct [11, 22].

Training cost: the SFT stage requires 60 hours of training on 16 H800 GPUs; the RL stages take an additional 34 hours on 24 H800 GPUs [11].
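For a sense of scale, the quoted wall-clock times can be converted into total GPU-hours. Only the hours and GPU counts come from the paper; the calculation itself is a simple back-of-the-envelope sketch:

```python
# Back-of-the-envelope GPU-hour budget from the reported training times.
sft_gpu_hours = 60 * 16  # 60 h of SFT on 16 H800 GPUs
rl_gpu_hours = 34 * 24   # 34 h of RL on 24 H800 GPUs
total = sft_gpu_hours + rl_gpu_hours
print(sft_gpu_hours, rl_gpu_hours, total)  # 960 816 1776
```

So the two RL stages cost nearly as many GPU-hours (816) as the entire SFT run (960).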

Plateau evidence: increasing the data from 2M to 2.8M samples results in no further performance gains, confirming the plateau [22].

Multimodal Structured Reinforcement Learning (MSRL) phase: a two-stage RL process applied after SFT to break the plateau [11].
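The plateau claim can be illustrated with a small, hypothetical diminishing-returns check: given scores measured at increasing dataset sizes, flag a plateau once the marginal gain falls below a threshold. The 2M and 2.8M sizes come from the paper; the scores, threshold, and helper function below are invented for illustration:

```python
def has_plateaued(scores, min_gain=0.5):
    """Return True if the most recent data-scaling step improved the
    score by less than min_gain points (a simple plateau heuristic)."""
    if len(scores) < 2:
        return False
    return (scores[-1] - scores[-2]) < min_gain

# Hypothetical benchmark scores at growing SFT dataset sizes.
sizes = [1_000_000, 2_000_000, 2_800_000]  # 1M -> 2M -> 2.8M samples
scores = [71.0, 74.2, 74.3]                # invented numbers
print(has_plateaued(scores))  # True: only +0.1 gain from 2M -> 2.8M
```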
