PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and from Google and DeepL translations. PETCI contains 4,310 Chinese idioms with 29,936 English translations. These translations capture diverse translation errors and paraphrase strategies.
We provide several baseline models to facilitate future research on this dataset.
An updated version of this dataset can be found at IdiomTranslate30. The updated version includes 9,066 idioms in 3 source languages and 2,719,800 translations in 10 target languages.
The Chinese idioms and their translations are in the ./data/json/raw.json file. Here is one example:
{
    "id": 0,
    "chinese": "一波未平,一波又起",
    "book": [
        "suffer a string of reverses",
        "hardly has one wave subsided when another rises",
        "one trouble follows another"
    ],
    "google": [
        "One wave is not flat, another wave is rising"
    ],
    "deepl": [
        "before the first wave subsides, a new wave rises"
    ]
}
- id is the index of the idiom in the dictionary
- chinese is the Chinese idiom
- book is the translations from the dictionary
- google is the translation from Google
- deepl is the translation from DeepL
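The file can be inspected with the standard json library. Below is a minimal sketch, assuming raw.json stores a list of entries shaped like the example above:

```python
import json

# Load the raw idiom-translation pairs.
with open("./data/json/raw.json", encoding="utf-8") as f:
    idioms = json.load(f)

entry = idioms[0]
print(entry["chinese"])                 # the Chinese idiom
print(entry["book"])                    # dictionary translations
print(entry["google"], entry["deepl"])  # machine translations
```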
In ./data/json/filtered.json, the machine translations that are the same as dictionary translations are removed, and the dictionary translations are split into gold and human translations.
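The actual filtering and gold/human splitting are done by the preprocessing code in this repository; purely to illustrate the duplicate-removal step, here is a rough sketch over raw.json (not the actual script; the string matching is simplified and the gold/human split is not reproduced):

```python
import json

with open("./data/json/raw.json", encoding="utf-8") as f:
    idioms = json.load(f)

filtered = []
for entry in idioms:
    book = {t.strip().lower() for t in entry["book"]}
    # Drop machine translations that exactly duplicate a dictionary translation.
    filtered.append({
        **entry,
        "google": [t for t in entry["google"] if t.strip().lower() not in book],
        "deepl": [t for t in entry["deepl"] if t.strip().lower() not in book],
    })
```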
Run pip install -r ./models/requirements.txt to install the required packages. Download glove.840B.300d.txt and place it in ./data/embedding. Download Stanford CoreNLP.
Before training, run java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -parse.binaryTrees to start the CoreNLP server, and run the following commands in the ./data folder to create the necessary datasets.
mkdir label simplify tree
python dataset.py
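The dataset-creation step above depends on the running CoreNLP server. As a quick connectivity check (not part of dataset.py), you can query the server's standard HTTP API on its default port 9000 using the requests package:

```python
import json
import requests

# Ask the local CoreNLP server for a constituency parse of a sample sentence.
props = {"annotators": "tokenize,ssplit,parse", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Hardly has one wave subsided when another rises.".encode("utf-8"),
)
resp.raise_for_status()
print(resp.json()["sentences"][0]["parse"])
```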
In the enclosing folder, run
./auto_train.sh
./auto_test.sh
In the enclosing folder, run
./auto_train.sh
./auto_test.sh
In the enclosing folder, run
SEED=45
HM=ghm
PART=5
python train.py --seed $SEED --train-set train-$HM-$PART --dev-set dev-$HM
MODEL=checkpoint-5000
python test.py --model $MODEL --test-set dev-$HM --seed $SEED --hm $HM --part $PART
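The SEED, HM, and PART values above are single examples. To sweep several dataset parts with the same settings, a small driver script can be used; the part indices below are hypothetical and should be adjusted to the files actually present in ./data:

```python
import subprocess

seed, hm = 45, "ghm"
for part in range(1, 6):  # hypothetical part indices; adjust to your data
    subprocess.run(
        ["python", "train.py",
         "--seed", str(seed),
         "--train-set", f"train-{hm}-{part}",
         "--dev-set", f"dev-{hm}"],
        check=True,
    )
```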
In the enclosing folder, run
onmt_build_vocab -config vocab.yaml -n_sample -1
onmt_train -config nts.yaml
BEST=checkpoints/checkpoint_step_300.pt
SRC=../../data/simplify/test-src.txt
OUTPUT=../test-output.txt
onmt_translate -model $BEST -src $SRC -output $OUTPUT -verbose -beam_size 5
In the figs folder, run python plot.py --model lstm, where the model name can be replaced by tree_lstm or bert.
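The vocab.yaml and nts.yaml files passed to onmt_build_vocab and onmt_train above live in the enclosing folder and are not reproduced here. For orientation only, an OpenNMT-py vocabulary config generally follows the shape below; the training and validation paths are hypothetical and should point to the files created in ./data/simplify:

```yaml
# vocab.yaml (sketch; paths are hypothetical)
save_data: run/petci
src_vocab: run/petci.vocab.src
tgt_vocab: run/petci.vocab.tgt
data:
    corpus_1:
        path_src: ../../data/simplify/train-src.txt
        path_tgt: ../../data/simplify/train-tgt.txt
    valid:
        path_src: ../../data/simplify/dev-src.txt
        path_tgt: ../../data/simplify/dev-tgt.txt
```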