Evaluate on subsets: x-shot and synthetic clusters #48

Merged 1 commit on Feb 6, 2023


@simsa-st simsa-st commented Feb 1, 2023

By default, the report is now printed without the per-fieldtype breakdown and with these subsets: 0-shot, 1-3-shot and 4+-shot.
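For reference, a minimal sketch of how a document could be assigned to one of these x-shot subsets, following the definition in the report notes (the number of training documents available in the same layout cluster). The function name `x_shot_bucket` and the toy dictionaries are hypothetical and are not part of the docile API.

```python
def x_shot_bucket(n_train_docs: int) -> str:
    """Map the number of training documents in a layout cluster to a subset name."""
    if n_train_docs < 0:
        raise ValueError("n_train_docs must be non-negative")
    if n_train_docs == 0:
        return "0-shot"
    if n_train_docs <= 3:
        return "1-3-shot"
    return "4+-shot"


# Hypothetical toy data: training documents per layout cluster,
# and the layout cluster of each evaluated document.
train_docs_per_cluster = {"cluster_a": 2, "cluster_b": 7}
eval_doc_cluster = {"doc1": "cluster_a", "doc2": "cluster_b", "doc3": "cluster_c"}

subsets: dict[str, list[str]] = {}
for doc_id, cluster_id in eval_doc_cluster.items():
    # A cluster that never appears in training has 0 training documents.
    bucket = x_shot_bucket(train_docs_per_cluster.get(cluster_id, 0))
    subsets.setdefault(bucket, []).append(doc_id)

print(subsets)
```

Every document lands in exactly one bucket, so the three x-shot subsets together cover the whole evaluated split.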


simsa-st commented Feb 1, 2023

Default report:

poetry run docile_print_evaluation_report --evaluation-result-path data/example_val_results_KILE.json --dataset-path data/docile

Evaluation report for docile221221-0:val subsets

KILE

Primary metric (AP): 0.43411554364183247

subsets                       AP     f1     precision  recall  TP    FP    FN
docile221221-0:val            0.434  0.640  0.688      0.599   3510  1589  2352
Dataset(docile:val-0-shot)    0.299  0.520  0.552      0.491   623   505   647
Dataset(docile:val-1-3-shot)  0.441  0.640  0.698      0.592   783   339   540
Dataset(docile:val-4+-shot)   0.489  0.688  0.739      0.644   2104  745   1165

Notes:

  • '{dataset}-x-shot' means that the evaluation is restricted to documents from layout clusters with x documents for training available. Here 'training' means trainval for test and train for val.
  • '{dataset}-synth-clusters-only' means that the evaluation is restricted to documents from layout clusters for which synthetic data exists.
  • For AP all predictions are used. For f1, precision, recall, TP, FP and FN predictions explicitly marked with flag use_only_for_ap=True are excluded.
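The f1, precision and recall columns can be reproduced from the TP, FP and FN columns with the standard definitions. A minimal sketch, assuming that undefined ratios are reported as 0.0 (this matches rows with all-zero counts in the fieldtype report, but the helper name `prf_from_counts` is hypothetical and not taken from docile):

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts copied from the docile221221-0:val row of the report.
precision, recall, f1 = prf_from_counts(tp=3510, fp=1589, fn=2352)
print(f"{f1:.3f} {precision:.3f} {recall:.3f}")  # 0.640 0.688 0.599
```

AP is not reproduced here: it is computed over all predictions, including those marked with use_only_for_ap=True, and cannot be derived from these counts alone.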

With synthetic subsets:

poetry run docile_print_evaluation_report --evaluation-result-path data/example_val_results_KILE.json --dataset-path data/docile --evaluate-synthetic-subsets

Evaluation report for docile221221-0:val subsets

KILE

Primary metric (AP): 0.43411554364183247

subsets                                           AP     f1     precision  recall  TP    FP    FN
docile221221-0:val                                0.434  0.640  0.688      0.599   3510  1589  2352
Dataset(docile:val-synth-clusters-only)           0.509  0.698  0.743      0.658   1218  421   633
Dataset(docile:val-0-shot)                        0.299  0.520  0.552      0.491   623   505   647
Dataset(docile:val-1-3-shot)                      0.441  0.640  0.698      0.592   783   339   540
Dataset(docile:val-1-3-shot-synth-clusters-only)  0.482  0.671  0.710      0.636   456   186   261
Dataset(docile:val-4+-shot)                       0.489  0.688  0.739      0.644   2104  745   1165
Dataset(docile:val-4+-shot-synth-clusters-only)   0.527  0.715  0.764      0.672   762   235   372

Notes:

  • '{dataset}-x-shot' means that the evaluation is restricted to documents from layout clusters with x documents for training available. Here 'training' means trainval for test and train for val.
  • '{dataset}-synth-clusters-only' means that the evaluation is restricted to documents from layout clusters for which synthetic data exists.
  • For AP all predictions are used. For f1, precision, recall, TP, FP and FN predictions explicitly marked with flag use_only_for_ap=True are excluded.
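As a sanity check on the numbers in this report, the three x-shot rows appear to partition the val split: their TP, FP and FN counts add up to the full-dataset row. A small sketch of that check, with the counts copied from the report (the variable names are illustrative only):

```python
# (TP, FP, FN) copied from the x-shot rows of the report.
x_shot_counts = {
    "0-shot": (623, 505, 647),
    "1-3-shot": (783, 339, 540),
    "4+-shot": (2104, 745, 1165),
}

# Pool the counts column-wise across the three subsets.
tp, fp, fn = (sum(column) for column in zip(*x_shot_counts.values()))
print(tp, fp, fn)  # 3510 1589 2352

# f1 computed from the pooled counts (micro average over the subsets).
pooled_f1 = 2 * tp / (2 * tp + fp + fn)
print(f"{pooled_f1:.3f}")  # 0.640
```

The same holds for the synthetic rows in this report: the 1-3-shot and 4+-shot synth-clusters-only counts sum to the val-synth-clusters-only row (456 + 762 = 1218 TP).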

Fieldtypes without subsets:

poetry run docile_print_evaluation_report --evaluation-result-path data/example_val_results_KILE.json --evaluate-x-shot-subsets "" --evaluate-fieldtypes

Evaluation report for docile221221-0:val

KILE

Primary metric (AP): 0.43411554364183247

fieldtype                  AP     f1     precision  recall  TP    FP    FN
-> micro average           0.434  0.640  0.688      0.599   3510  1589  2352
account_num                0.000  0.000  0.000      0.000   0     4     9
amount_due                 0.579  0.750  0.791      0.713   371   98    149
amount_paid                0.533  0.622  0.667      0.583   14    7     10
amount_total_gross         0.514  0.697  0.710      0.684   355   145   164
amount_total_net           0.377  0.548  0.607      0.500   34    22    34
amount_total_tax           0.584  0.729  0.775      0.689   31    9     14
bank_num                   0.611  0.571  0.500      0.667   4     4     2
bic                        0.000  0.000  0.000      0.000   0     0     0
currency_code_amount_due   0.038  0.084  0.320      0.048   16    34    315
customer_billing_address   0.612  0.748  0.745      0.752   318   109   105
customer_billing_name      0.620  0.763  0.777      0.749   384   110   129
customer_delivery_address  0.182  0.333  0.296      0.381   8     19    13
customer_delivery_name     0.257  0.456  0.419      0.500   13    18    13
customer_id                0.601  0.726  0.748      0.705   122   41    51
customer_order_id          0.275  0.410  0.447      0.378   17    21    28
customer_other_address     0.594  0.760  0.864      0.679   19    3     9
customer_other_name        0.402  0.598  0.641      0.560   75    42    59
customer_registration_id   0.000  0.000  0.000      0.000   0     0     2
customer_tax_id            0.000  0.000  0.000      0.000   0     0     0
date_due                   0.731  0.812  0.867      0.765   65    10    20
date_issue                 0.732  0.819  0.835      0.803   411   81    101
document_id                0.610  0.740  0.729      0.753   341   127   112
iban                       0.000  0.000  0.000      0.000   0     0     0
order_id                   0.225  0.448  0.534      0.386   117   102   186
payment_reference          0.000  0.000  0.000      0.000   0     0     0
payment_terms              0.560  0.692  0.670      0.715   118   58    47
tax_detail_gross           0.371  0.588  0.645      0.541   20    11    17
tax_detail_net             0.298  0.529  0.600      0.474   18    12    20
tax_detail_rate            0.458  0.533  0.571      0.500   4     3     4
tax_detail_tax             0.460  0.615  0.667      0.571   24    12    18
vendor_address             0.250  0.489  0.541      0.447   244   207   302
vendor_email               0.253  0.488  0.512      0.467   21    20    24
vendor_name                0.264  0.507  0.575      0.453   290   214   350
vendor_order_id            0.336  0.529  0.474      0.600   9     10    6
vendor_registration_id     0.500  0.667  0.500      1.000   1     1     0
vendor_tax_id              0.324  0.517  0.508      0.525   31    30    28

Notes:

  • For AP all predictions are used. For f1, precision, recall, TP, FP and FN predictions explicitly marked with flag use_only_for_ap=True are excluded.

It is also possible to combine subsets and fieldtypes; the report then prints a subsets summary followed by an individual report for each subset (not shown here because the output is too long).

@simsa-st simsa-st force-pushed the sts-few-shot-clusters-eval branch from c61dba4 to 129f3d9 Compare February 1, 2023 14:40
@simsa-st simsa-st force-pushed the sts-few-shot-clusters-eval branch from 129f3d9 to eb10af9 Compare February 1, 2023 14:47
@simsa-st simsa-st requested a review from ahHamdi February 1, 2023 14:56
@simsa-st simsa-st merged commit 41c2db8 into main Feb 6, 2023
@simsa-st simsa-st deleted the sts-few-shot-clusters-eval branch February 6, 2023 15:02