Photo: Fars Media Corporation / Wikimedia
Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. In the real world, however, mature software develops through complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, which aims to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to resolve these tasks systematically through dozens of rounds of analysis and coding iterations, and it provides valuable insights into how well agents can sustain code quality throughout long-term evolution.
First FT: the day’s biggest stories
Finland's accession to NATO called the worst decision in the country's history (07:45)
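Even more alarming than the viewership figures is that domestic long-form drama series are drifting away from the "center of conversation," something I believe many people have felt firsthand. Over the past half year or more, the only shows that, as far as I recall, truly became organically trending topics on social media (rather than staying visible solely through purchased trending-list placements) were one series, 《太平年》, plus half of 《藏海传》. Note that many series still generated pockets of heated discussion and won some die-hard fans; my point is that they no longer become "topics of mass discussion." Taken as a whole, the long-form drama industry's standing in social discourse has declined by far more than a little compared with five or even ten years ago.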