When Sample Selection Bias Precipitates Model Collapse | AIChainDay

🇰🇷 Korean summary by Claude · June 15, 2026

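Recursive training on synthetic data alleviates data scarcity, but repeated training can trigger "model collapse," eroding the tails of the distribution and homogenizing outputs. Data selection, often regarded as the remedy, depends heavily on the reference distribution the verifier uses. The researchers show that in low-resource settings, where each verifier observes only a small, biased portion of the target manifold, the selection itself becomes biased. This situation arises in data silos such as medical consortia and financial institutions that cannot pool raw data: selection keeps only the samples that fit the local manifold and removes globally important tail modes, promoting collapse rather than preventing it. The authors present an initial remedy that builds a Wasserstein proxy reference across multiple silos without sharing raw data, mitigating the loss of diversity.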

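• Proves theoretically that in low-resource verification settings, data selection becomes biased and promotes model collapse rather than preventing it
• Siloed selection removes globally relevant tail modes, causing power-law diversity decay
• Arises naturally in silos such as medical consortia and financial institutions that cannot pool raw data
• Builds a Wasserstein proxy reference across multiple silos without sharing raw data, mitigating diversity loss
• Recursive synthetic-data pipelines need particular care when real-data coverage is fragmentary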
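The selection-bias mechanism summarized above can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the paper's method: the one-dimensional two-mode target, the nearest-neighbour scoring rule, and all names (`make_candidates`, `select_by_reference`, `tail_fraction`, the threshold of 4.0) are hypothetical choices made for illustration. A single silo holds reference data from the head mode only, and selection against that reference discards nearly every tail-mode candidate.

```python
import random


def make_candidates(n, tail_weight, rng):
    """Draw n synthetic candidates from a toy two-mode target.

    Assumption: a head mode N(0, 1) and a rare tail mode N(6, 0.5).
    """
    out = []
    for _ in range(n):
        if rng.random() < tail_weight:
            out.append(rng.gauss(6.0, 0.5))  # rare tail mode
        else:
            out.append(rng.gauss(0.0, 1.0))  # dominant head mode
    return out


def select_by_reference(candidates, reference, keep_fraction):
    """Keep the candidates closest to any reference point.

    Assumption: the verifier scores a candidate by its distance to the
    nearest point in its own (local) reference sample.
    """
    scored = sorted(candidates, key=lambda x: min(abs(x - r) for r in reference))
    k = int(len(candidates) * keep_fraction)
    return scored[:k]


def tail_fraction(samples, threshold=4.0):
    """Share of samples that fall in the tail region (x > threshold)."""
    return sum(1 for x in samples if x > threshold) / len(samples)


rng = random.Random(0)

# Synthetic candidates still cover the tail mode (about 10% of the mass).
candidates = make_candidates(2000, 0.10, rng)

# The silo's verifier sees only a small, biased slice: head-mode data only.
silo_reference = [rng.gauss(0.0, 1.0) for _ in range(20)]

kept = select_by_reference(candidates, silo_reference, keep_fraction=0.5)

before = tail_fraction(candidates)
after = tail_fraction(kept)
print(f"tail share before selection: {before:.3f}")
print(f"tail share after selection:  {after:.3f}")
```

In this toy setup the tail share drops sharply after a single round of selection, because tail candidates lie far from every point the silo has seen. Repeating generation and selection over several rounds would compound the loss, which is the direction of the effect the summary describes.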

AI · June 15, 2026 · AI score: 92%

When Sample Selection Bias Precipitates Model Collapse

Source: arXiv cs.AI

Article preview

arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of


#Model collapse #Synthetic data #Data selection #Model training

3 hours ago

Hyperdimensional computing for structured querying on tabular data embeddings

arXiv:2606.13871v1 Announce Type: new Abstract: Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches em

#Hyperdimensional computing #Embeddings #Tabular data


Related posts

Hyperdimensional computing for structured querying on tabular data embeddings

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance