SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Preview

arXiv:2606.18936v1 (announce type: new)

Abstract: Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disci
