KLSBench

Korean Literary Sinitic Benchmark

A comprehensive benchmark for evaluating Large Language Models on Korean Literary Sinitic (한문, 韓國漢文)

7,871
Instances
5
Tasks
8
Models Evaluated

About KLSBench

Large Language Models (LLMs) have demonstrated limited performance on low-resource historical languages. Korean Literary Sinitic (KLS) is one such language, yet it is closely connected to modern Chinese through shared characters and to Korean through substantial lexical overlap (approximately 60%).

To address the absence of a comprehensive evaluation framework for LLM performance on KLS, we introduce KLSBench, a benchmark comprising 7,871 instances spanning five distinct tasks: classification, retrieval, punctuation, natural language inference, and translation.

The dataset incorporates materials from Joseon Dynasty civil service examinations and the Four Books (四書), presenting distinct challenges in understanding historical texts that require both cultural knowledge and logical reasoning capabilities.

Data Sources

  • Joseon Dynasty Civil Service Exams
  • Four Books (Analects, Mencius, Great Learning, Doctrine of the Mean)

Key Findings

  • Retrieval: 83% accuracy
  • Classification: 4% accuracy
  • NLI: 23% accuracy

Benchmark Tasks

Classification

Classify the rhetorical style of classical texts (賦/詩/疑/義)

808 instances · Metric: Accuracy

Retrieval

Identify the source (Book/Chapter) of a given passage

1,209 instances · Metric: Accuracy

Natural Language Inference

Determine logical relationships between two sentences (entailment/contradiction/neutral)

1,854 instances · Metric: Accuracy

Translation

Translate between Literary Sinitic, Korean, and English

2,000 instances · Metric: BLEU Score
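Translation quality is reported as a BLEU score. As a rough illustration only (this is not the benchmark's official scorer, which would typically rely on a standardized implementation such as sacreBLEU), a minimal sentence-level BLEU with brevity penalty can be sketched as:

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Minimal sentence-level BLEU: geometric mean of n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Tokenizes on whitespace; production scorers add smoothing
    and standardized tokenization."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram level has no match
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Note that whitespace tokenization is a simplification: scoring Literary Sinitic output in particular usually requires character-level or otherwise language-appropriate tokenization.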

Punctuation

Restore appropriate punctuation to unpunctuated classical texts (白文)

2,000 instances · Metric: F1 Score
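The punctuation task is scored with F1 over the restored marks. A minimal sketch, assuming a predicted mark counts as correct when both the mark and its position in the unpunctuated base text match the gold annotation (the mark inventory below is illustrative, not KLSBench's official set):

```python
MARKS = set("。，、？！：；")  # illustrative punctuation inventory

def punctuation_f1(predicted: str, gold: str) -> float:
    """F1 for punctuation restoration. Both strings must share the
    same unpunctuated base text; each mark is identified by the pair
    (index in base text, mark character)."""
    def extract(text: str) -> set:
        spans, idx = set(), 0
        for ch in text:
            if ch in MARKS:
                spans.add((idx, ch))  # mark sits before base-text position idx
            else:
                idx += 1  # only non-punctuation characters advance the index
        return spans

    pred, ref = extract(predicted), extract(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that restores the final 。 but omits the medial ， scores recall 0.5 at precision 1.0, giving F1 ≈ 0.67.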

Data Explorer

Explore sample instances from each task


Model Performance

Evaluation results across 8 LLMs

Best Performance

Retrieval
83%

Models demonstrate effective source identification capabilities when provided with sufficient contextual information

Most Challenging Task

Classification
4%

Performance remains limited, requiring comprehensive understanding of classical rhetorical conventions and stylistic variations

Cultural Dependencies

NLI
23%

Logical reasoning performance is constrained by insufficient historical and cultural contextual knowledge

Key Findings

  • Models exhibit substantial performance degradation on tasks requiring cultural knowledge, despite potential advantages from cross-lingual transfer
  • Comprehension of Literary Sinitic necessitates specialized training incorporating historical and cultural context
  • Classical rhetorical understanding extends beyond simple cross-lingual transfer from modern Chinese or Korean