A comprehensive benchmark for evaluating Large Language Models on Korean Literary Sinitic (한문, 韓國漢文)
Large Language Models (LLMs) have shown limited performance on low-resource historical languages. Korean Literary Sinitic (KLS) is one such language: it is connected to modern Chinese through shared characters and to modern Korean through substantial lexical overlap (approximately 60%).
To address the absence of a comprehensive evaluation framework for LLM performance on KLS, we introduce KLSBench, a benchmark comprising 7,871 instances spanning five distinct tasks: classification, retrieval, punctuation, natural language inference, and translation.
The dataset incorporates materials from Joseon Dynasty civil service examinations and the Four Books (四書), presenting distinct challenges in understanding historical texts that require both cultural knowledge and logical reasoning capabilities.
Classification: classify the rhetorical style of classical texts (賦/詩/疑/義)
Retrieval: identify the source (book/chapter) of a given passage
Natural language inference: determine the logical relationship between two sentences (entailment/contradiction/neutral)
Translation: translate between Literary Sinitic, Korean, and English
Punctuation: restore appropriate punctuation to unpunctuated classical texts (白文)
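The label-prediction tasks above (classification, retrieval, NLI) can be scored with per-task exact-match accuracy; translation and punctuation would instead need text-similarity metrics such as BLEU or F1. The sketch below is illustrative only: the field names (`task`, `input`, `label`) are assumptions, not KLSBench's actual schema.

```python
from collections import defaultdict

# Hypothetical instance layout -- field names are illustrative,
# not the benchmark's actual schema.
SAMPLE = [
    {"task": "classification", "input": "...", "label": "賦"},
    {"task": "nli", "input": "...", "label": "entailment"},
    {"task": "retrieval", "input": "...", "label": "論語·學而"},
]

def score_exact_match(instances, predictions):
    """Per-task exact-match accuracy for label-prediction tasks."""
    correct, total = defaultdict(int), defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst["task"]] += 1
        correct[inst["task"]] += int(pred == inst["label"])
    return {t: correct[t] / total[t] for t in total}

# Example: a model gets classification and retrieval right, NLI wrong.
scores = score_exact_match(SAMPLE, ["賦", "neutral", "論語·學而"])
```

In practice each task would also carry its own prompt template, and translation/punctuation outputs would be compared against references rather than matched exactly.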
Evaluation results across 8 LLMs
Retrieval: models identify sources effectively when given sufficient contextual information
Classification: performance remains limited, as the task demands a comprehensive grasp of classical rhetorical conventions and stylistic variation
NLI: logical reasoning is constrained by gaps in historical and cultural background knowledge