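This project aims to provide a unified, native PyTorch implementation of the major current models for legal judgment prediction (LJP), covering preprocessing for several public datasets in multiple languages and implementations for several subtasks.
You can call torch_ljp/main.py directly from the command line, pass in arguments, and get the corresponding results. Before doing so, create a config.py file in the torch_ljp folder. (The contents of my real file are meaningless to other users, so I did not upload it; I uploaded a fakeconfig.py file instead, and you only need to fill in the parameters it asks for.)
See example.txt for the specific commands.
The op_examples folder holds example outputs; see the corresponding commands described in example.txt. The evaluation metrics and how they are computed are described in the metrics folder. The key package versions in my environment are listed in enviroment_v.txt.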

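The sections below cover the data this project can already analyze and process, the models, and, for the tasks that pair the two, a comparison between the results I obtained and those reported in the original papers or in other papers that cite them. (A great deal of this is not yet organized; I will fill it in over time.) (If you would like me to add a dataset or model, feel free to open an issue!)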

1. Data

Chinese:

CAIL (also known as the CAIL2018 dataset) (source: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction; download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip). In the CAIL2018 competition, the original task takes the fact text as input and, framed as classification, predicts the charge (accusation), the law article (law), and the prison term (imprisonment, in months; -1 for life imprisonment and -2 for the death penalty).
CAIL2021 (source: Equality before the law: Legal judgment consistency analysis for fairness; adapted from the CAIL dataset. Included in FairLex)
LJP-E (not yet fully public; I emailed the first author, who said all of it will be released. Source: Legal Judgment Prediction via Event Extraction with Constraints)
attribute_charge (source: Few-Shot Charge Prediction with Discriminative Legal Attributes)
LEVEN (source: LEVEN: A Large-Scale Chinese Legal Event Detection Dataset; download: https://cloud.tsinghua.edu.cn/d/6e911ff1286d47db8016/)
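The CAIL label convention described above (prison term in months, -1 for life imprisonment, -2 for the death penalty) can be illustrated with a small parsing sketch. The JSON field names used here (`fact`, `meta`, `accusation`, `relevant_articles`, `term_of_imprisonment` and its three keys) are assumptions about the layout of the released CAIL2018 files, and the helper names are hypothetical; this is not the preprocessing code of this project.

```python
import json

# Special label values, following the convention stated in this README.
LIFE_IMPRISONMENT = -1
DEATH_PENALTY = -2


def imprisonment_label(term: dict) -> int:
    """Collapse a term-of-imprisonment dict into one integer label.

    Assumed keys: "death_penalty" (bool), "life_imprisonment" (bool),
    "imprisonment" (int, months).
    """
    if term.get("death_penalty"):
        return DEATH_PENALTY
    if term.get("life_imprisonment"):
        return LIFE_IMPRISONMENT
    return int(term["imprisonment"])


def parse_record(line: str) -> dict:
    """Parse one JSON line into the input text and the three targets."""
    record = json.loads(line)
    meta = record["meta"]
    return {
        "fact": record["fact"],
        "accusation": meta["accusation"],
        "law": meta["relevant_articles"],
        "imprisonment": imprisonment_label(meta["term_of_imprisonment"]),
    }


# A made-up record in the assumed layout, for illustration only.
sample_line = json.dumps({
    "fact": "placeholder fact text",
    "meta": {
        "accusation": ["theft"],
        "relevant_articles": [264],
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 6,
        },
    },
})
parsed = parse_record(sample_line)
```

Checking the death penalty before life imprisonment is a design choice of this sketch, so that a record with both flags set maps to the more severe label.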

English:

LJP-MSJudge (source: Legal Judgment Prediction with Multi-Stage Case Representation Learning in the Real Court Setting)

English (United States):

ILLDM (the authors say in the paper that it will be released, but it is not yet in the GitHub project. Source: Interpretable Low-Resource Legal Decision Making)

English (Europe):

ECHR (source: Neural Legal Judgment Prediction in English; download: https://archive.org/download/ECHR-ACL2019/ECHR_Dataset.zip. Included in LexGLUE)
ECtHR (source: Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases; download: ecthr_cases · Datasets at Hugging Face. When using it, also cite Neural Legal Judgment Prediction in English. Included in FairLex and LexGLUE)

English (India):

ILDC (source: ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation)
ILSI (source: LeSICiN: A Heterogeneous Graph-Based Approach for Automatic Legal Statute Identification from Indian Legal Documents; download: Dataset and additional files/softwares required for the paper "LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents" | Zenodo; every file there except best_model.pt and ils2v.bin is data-related)

French (Belgium):

BSARD (source: A Statutory Article Retrieval Dataset in French; download: https://raw.githubusercontent.com/maastrichtlawtech/bsard/master/data/bsard_v1.zip)

Multilingual:

Swiss-Judgment-Predict dataset (Switzerland; German, French, and Italian. Source: Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark; download 1: SwissJudgmentPrediction | Zenodo; download 2: swiss_judgment_prediction · Datasets at Hugging Face. Included in FairLex)

2. Models

2.1 general-domain classification models (excluding purely pretrained models)

TFIDF+SVM (also known as LibSVM): nominal data, multi-class single-label setting. (TFIDF is from Term-weighting approaches in automatic text retrieval; SVM is from Least Squares Support Vector Machine Classifiers. Used as a baseline in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. Code reference: CAIL2018/baseline at master · thunlp/CAIL2018)
fastText (source: Bag of Tricks for Efficient Text Classification. Used as a baseline in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. Code reference: fastText/python at main · facebookresearch/fastText)
TextCNN (also known as CNN) (source: Convolutional neural networks for sentence classification. Used as a baseline in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction and in LADAN)
LSTM (source: Long short-term memory)
GRU (source: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation)
RCNN (source: Recurrent Convolutional Neural Networks for Text Classification)
HAN (also known as HARNN) (source: Hierarchical Attention Networks for Document Classification. Used as a baseline in LADAN)
DPCNN (source: Deep Pyramid Convolutional Neural Networks for Text Categorization)
Random forest

2.2 domain-specific classification models (excluding purely pretrained models)

2.3 Classification models based on pretrained models

2.3.1 general-domain

BERT (source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
RoBERTa (source: RoBERTa: A Robustly Optimized BERT Pretraining Approach)
DistilBERT (source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter)
XLNet (source: XLNet: Generalized Autoregressive Pretraining for Language Understanding)
NEZHA (source: NEZHA: Neural Contextualized Representation for Chinese Language Understanding)
Longformer (source: Longformer: The Long-Document Transformer)

2.3.2 domain-specific

LegalBert (source: LEGAL-BERT: The Muppets straight out of Law School)
Lawformer (source: Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents)

2.4 general-domain regression models

Linear regression

2.5 inductive link prediction models

DEAL (source: Inductive Link Prediction for Nodes Having Only Attribute Information. Used as a baseline in LeSICiN)

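3. Experimental results

3.1 Results reported in the original papers

3.1.1 CAIL dataset

CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction

Uses the original CAIL2018 task setting.

The training set is first_stage/train.json, and the test set is first_stage/test.json + restData/rest_data.json. (According to the paper, this configuration drops multi-defendant cases and keeps only single-defendant ones; removes charges and law articles that occur fewer than 30 times; and removes 102 law articles that are not relevant to specific charges. I did not understand what that last statement means.) Word segmentation uses THULAC; the optimizer is Adam with a learning rate of 0.001; the dropout rate is 0.5; the batch size is 128.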
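The first two filtering steps the paper describes (keep single-defendant cases, drop charges and law articles seen fewer than 30 times) can be sketched roughly as follows. This is only an illustration under stated assumptions: the record keys `criminals`, `accusation`, and `law` and the function name `filter_cases` are hypothetical, and the sketch assumes that a case carrying any rare label is dropped entirely, which the paper summary above does not confirm. The third step (the 102 law articles) is not covered.

```python
from collections import Counter


def filter_cases(cases, min_freq=30):
    """Keep single-defendant cases whose labels all occur at least min_freq times.

    Each case is assumed to be a dict with list-valued keys
    "criminals", "accusation", and "law" (hypothetical names).
    """
    # Step 1: keep only cases with exactly one defendant.
    single = [c for c in cases if len(c["criminals"]) == 1]

    # Step 2: count label frequencies over the remaining cases.
    charge_counts = Counter(a for c in single for a in c["accusation"])
    law_counts = Counter(l for c in single for l in c["law"])

    # Step 3: drop every case that carries a rare charge or law article.
    kept = []
    for c in single:
        charges_ok = all(charge_counts[a] >= min_freq for a in c["accusation"])
        laws_ok = all(law_counts[l] >= min_freq for l in c["law"])
        if charges_ok and laws_ok:
            kept.append(c)
    return kept


# Tiny made-up example with a lowered threshold so the effect is visible.
toy_cases = [
    {"criminals": ["A"], "accusation": ["theft"], "law": [264]},
    {"criminals": ["B"], "accusation": ["theft"], "law": [264]},
    {"criminals": ["C", "D"], "accusation": ["theft"], "law": [264]},
    {"criminals": ["E"], "accusation": ["fraud"], "law": [266]},
]
toy_kept = filter_cases(toy_cases, min_freq=2)
```

Frequencies are counted after the single-defendant filter here; counting before it would be an equally plausible reading and could change which labels survive.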

Baselines: ① TFIDF+SVM (linear-kernel SVM, feature dimension 5000, 200-dimensional word vectors trained with skip-gram) ② TextCNN (input length capped at 4096, filter widths (2, 3, 4, 5), filter size 64) ③ FastText

Metrics: accuracy, macro-precision, macro-recall
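For readers unfamiliar with the three metrics, here is a minimal pure-Python sketch for the single-label case. It is not the implementation in the metrics folder, which documents how this project actually computes them; the function name `classification_metrics` is hypothetical, and scoring a class with no predictions (or no true instances) as 0.0 is an assumption of this sketch.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-precision, and macro-recall for single-label data."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Macro averaging: compute each class's score, then take the unweighted mean.
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)

    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
    }


# Toy example: four cases, two classes.
toy_scores = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```

Because macro averaging weights every class equally, rare charges influence these scores as much as frequent ones, which matters on label distributions as skewed as those in judgment prediction.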

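Results:

3.2 Results I reproduced by running the official code

See the reappear_files folder.

3.3 Results reproduced with this project's code

The experiment configurations are given by the commands in example.txt.

Other notes:

The torch_ljp/dataset_utils/other_data folder holds files that are fairly small and whose creation is hard to describe, so I uploaded them directly with the GitHub project.
1. cn_criminal_law.txt: the 2021 edition of the Criminal Law of the People's Republic of China. Copied from the Word file downloaded from 中华人民共和国刑法（2022年最新版） - 中国刑事辩护网, with the notices reading "中国刑事辩护网提供……" removed.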


PolarisRisingWar/pytorch_ljp
