Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese | AIChainDay

Summary by Claude · June 19, 2026

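Byte-Pair Encoding (BPE) tokenizers are statistically efficient for vocabulary compression, but they split structured technical entities such as physical quantities, numbers, units, and symbolic expressions into arbitrary subwords, which loses their meaning. TOTEN is a knowledge-based ontological framework that tokenizes by declarative classification grounded in an ontology of engineering entities (OEE) instead of statistical derivation. It is formalized as a triple of ontology, classification function, and instantiator, and it gains robustness by coupling deterministically with three external oracles: Pint (dimensions), the Unicode Character Database (notation), and RSLP (Portuguese morphology). The evaluation uses the internal benchmark EngQuant (N=800) and four external Brazilian Portuguese corpora. Numerical reconstruction accuracy was 0.775 to 0.904 on the external corpora, against 0.627 to 0.703 for the best baseline, Quantulum3, and 0.780 against 0.340 on EngQuant. The differences were statistically significant (McNemar test with Holm correction).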
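
The triple of ontology, classification function, and instantiator can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ONTOLOGY` table, the class names, and the small hard-coded unit list (a stand-in for the dimensional role the summary attributes to Pint) are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mini-ontology: (entity class, pattern), tried in priority order.
# Illustrative only; this is not the OEE described in the paper.
ONTOLOGY = [
    ("QUANTITY", re.compile(
        r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>kPa|MPa|kg|mm|m|V|A)\b")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)?")),
    ("WORD", re.compile(r"[^\W\d_]+")),
]


@dataclass(frozen=True)
class Token:
    cls: str
    text: str
    value: Optional[float] = None
    unit: Optional[str] = None


def classify(text, pos):
    """Classification function: first ontology class whose pattern matches at pos."""
    for cls, pattern in ONTOLOGY:
        match = pattern.match(text, pos)
        if match:
            return cls, match
    return None


def instantiate(cls, match):
    """Instantiator: build a typed token, parsing comma decimals as used in Portuguese."""
    if cls == "QUANTITY":
        value = float(match.group("value").replace(",", "."))
        return Token(cls, match.group(0), value, match.group("unit"))
    if cls == "NUMBER":
        return Token(cls, match.group(0), float(match.group(0).replace(",", ".")))
    return Token(cls, match.group(0))


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        hit = classify(text, pos)
        if hit is None:
            pos += 1  # skip whitespace and unclassified characters
            continue
        cls, match = hit
        tokens.append(instantiate(cls, match))
        pos = match.end()
    return tokens
```

In this sketch, `tokenize("A pressão é 12,5 kPa")` keeps `12,5 kPa` together as a single `QUANTITY` token with a numeric value and a unit, which is the kind of entity a subword tokenizer would typically fragment.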

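• Addresses BPE's splitting of physical quantities, units, and symbols into arbitrary subwords with ontology-based declarative classification
• Formalized as a triple of ontology, classification function, and instantiator, coupled with the Pint, Unicode, and RSLP oracles
• Numerical reconstruction on external corpora: 0.775 to 0.904 vs. 0.627 to 0.703 for the best baseline (Quantulum3)
• On EngQuant, 0.780 vs. 0.340, with statistical significance confirmed by the McNemar test with Holm correction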
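
For readers unfamiliar with the significance procedure named above, the following is a minimal sketch of an exact McNemar test on paired outcomes followed by Holm's step-down correction across several comparisons. The summary does not say which McNemar variant the paper uses, so the exact binomial form is an assumption, and the function names are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items system A got right and system B got wrong.
    c: items system B got right and system A got wrong.
    Under the null hypothesis, the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank), keep monotone.
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted
```

A typical use would compute one `mcnemar_exact` p-value per corpus and baseline pair, then pass the list to `holm` and compare each adjusted value against the chosen significance level.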

AI · June 19, 2026

Source: arXiv cs.AI

Abstract preview

arXiv:2606.19626v1 Announce Type: new Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE

Related posts

Deontic Policies for Runtime Governance of Agentic AI Systems

Diffusion Language Models: An Experimental Analysis

Hidden Anchors in Multi-Agent LLM Deliberation

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data