Semantic Role Labeling is a task within the domain of natural language processing, that attempts to answer who, did what, to whom given a sentence. This repository is a partial implementation of a masters dissertation, (BELTRAO, 2016) that uses Propbank Br v1.1 as the golden set, Liblinear implemention for Support Vector Machines for multiclass classfication, Official Conll 2005 Shared Task scrips for evaluation and three sets of feature enginnering.
ID | FORM | LEMMA | GPOS | MORPH | DTREE | FUNC | CTREE | PRED | ARG |
---|---|---|---|---|---|---|---|---|---|
1 | A | o | art | F|S | 2 | >N | (FCL(NP* | - | * |
2 | série | série | n | F|S | 8 | SUBJ | * | - | (A1*) |
3 | exibida | exibir | v-pcp | F|S | 2 | N< | (ICL(VP*) | - | * |
4 | aqui | aqui | adv | - | 3 | ADVL | (ADVP*) | - | * |
5 | por | por | prp | - | 3 | PASS | (PP* | - | * |
6 | a | o | art | F|S | 7 | >N | (NP* | - | * |
7 | Cultura | cultura | n | F|S | 5 | P< | *))) | - | * |
8 | estreiou | estrear | v-fin | PS|3S|IND | 0 | STA | (VP*) | estreiar | * |
9 | em | em | prp | - | 8 | ADVL | (PP* | - | (AM-LOC*) |
10 | a | o | art | F|S | 11 | >N | (NP* | - | * |
11 | TVI | TVI | prop | F|S | 9 | P< | * | - | * |
12 | de | de | prp | _ | 11 | N< | (PP* | - | * |
13 | Portugal | Portugal | prop | M|S | 12 | P< | (NP*)))) | _ | * |
14 | . | . | pu | - | 8 | PUC | *) | - | * |
Num | Name | Description |
---|---|---|
1 | ID | Counter of tokens that begins at 1 for each new proposition. |
2 | FORM | Tokenized word or punctuation sign. |
3 | LEMMA | Lemma gold-standard for FORM. |
4 | GPOS | Post tagging gold-standard for FORM. |
5 | MORPH | Morphological features gold-standard. |
6 | DTREE | Dependency tree gold-standard. |
7 | FUNC | Syntactic function of the token for your regent in dependency tree. |
8 | CTREE | Syntactic tree gold-standard. |
9 | PRED | Semantic Predicates on preposition. |
10 | ARG | Semantic role label for the regent of the argument on DTREE according to PropBank annotations. |
Tag | Description |
---|---|
A0 | Usually the agent (actor). |
A1 | Usually the patient or theme (receiver). |
A2...A5 | Verb oriented tags. |
AM-ADV | Modifier adverb. |
AM-CAU | Cause. |
AM-DIR | Direction. |
AM-DIS | Discursive. |
AM-EXT | Extension. |
AM-MED | Non documented tag. |
AM-LOC | Location. |
AM-MNR | Manner. |
AM-NEG | Negation. |
AM-PNC | Purpose. |
AM-PRD | Secondary predication. |
AM-REC | Reciprocal. |
AM-TMP | Temporal. |
There are 5 groups of features described on the dissertation, 4 of which are implemented as follows:
All golden standard features except ARG.
Attribute | Description |
---|---|
FORM | FORM Tokenized word or punctuation sign. |
LEMMA | Lemma gold-standard for FORM. |
GPOS | Post tagging gold-standard for FORM. |
MORPH | Morphological features gold-standard. |
FUNC | Syntactic function of the token for your regent in dependency tree. |
Lead and lag features around the token.
Attribute | Description |
---|---|
LeftForm 1,2, and 3 | FORM from the 3 tokens to the left. |
RightForm 1,2 and 3 | FORM from the 3 tokens to the right. |
LeftFunc 1,2 and 3 | FUNC from the 3 tokens to the left. |
RightFunc 1,2 and 3 | FUNC from the 3 tokens to the right. |
LeftLemma 1,2 and 3 | LEMMA from the 3 tokens to the left. |
RightLemma 1,2, and 3 | LEMMA from the 3 tokens to the right. |
LeftGPOS 1,2 and 3 | GPOS from the 3 tokens to the left. |
RightGPOS 1,2 and 3 | GPOS from the 3 tokens to the right. |
Lead and lag features around predicate.
Attribute | Description |
---|---|
PredLemma | LEMMA of target verb |
PredLeftLemma | LEMMA of token to the left of target verb. |
PredRightLemma | LEMMA of token to the right of the target verb. |
PredGPOS | GPOS of target verb. |
PredLeftGPOS | GPOS of token to the left of target verb. |
PredRightGPOS | GPOS of token to the right of target verb. |
PredFunc | FUNC of target verb. |
PredLeftFunc | FUNC of the token to the left of the verb. |
PredRightFunc | FUNC of the token to the right of the verb. |
PredicateDistance | ID of the target verb minus ID of current token. |
PredMorph 1..n | Set of 32 MORPH for target verb. |
PassiveVoice | Passive voice indicator. True if verb has GPOS=v-pcp and is proceeded for token with LEMMA=ser having or not _token_with GPOS=adv between them. |
PosRelVerb | If token is before or after verb. |
Token depentencies and path to predicate.
Atribute | Description |
---|---|
DepLemmaParent | LEMMA from the father of the token. |
DepLemmaGrandparent | LEMMA from the grandfather of the token. |
DepLemmaChild 1, 2 and 3 | LEMMA from the 3 first children of the token. |
DepGPOSParent | GPOS from the father of the token. |
DepGPOSGrandParent | GPOS from the grandfather of the token. |
DepGPOSChild 1, 2 and 3 | GPOS from the 3 first children of the token. |
DepFuncParent | FUNC from the father of the token. |
DepFuncGrandParent | FUNC from the grandfather of the token. |
DepFuncChild 1, 2 and 3 | FUNC from the 3 first children of the token. |
DepPathFunc | Path of FUNC tags between token and target verb passing through minor common ancestor. |
DepPathGPOS | Path of GPOS tags between token and target verb passing through minor common ancestor. |
The following defines the current project structure tree:
- srlbr/
- datasets_1.1/
- conll/
- csvs/
- props/
- models/
- lib/
- __init__.py
- evaluator.py
- feature_factory.py
- svm.py
- utils.py
- srlconll-1.1/
- requirements.txt
- README.md
- srl.py
- .gitignore
Formatted Train, test and validation gold-standard in a format suitable for machine learning models. Originals can be found here.
Stored feature columns.
gold-standard propositions which are evaluated by conll script.
Liblinear executables and python liblinear.py and liblinearutil.py
Exports svm_srl function
Thin wrapper for the conll script uses subprocess and runs the official script.
Here the features are engineered
Wrapper of liblinear calls on liblinear lib and main function svm_srl.
Handles storing.
Official Conll 2005 Shared Task Pearl script
Python project requirements.
Command line callable script to run the script
> pip install -r requirements.txt
Conll 2005 shared task uses Pearl 5. Chances are that is already installed.
> perl -v
Set the PERL5LIB environment variable.
> setenv PERL5LIB $HOME/path/to/srlbr/srlconll-1.1/lib:$PERL5LIB
Usage:
> perl $HOME/path/to/srlbr/srlconll-1.1/bin/srl-eval.pl <gold_props> <predicted_props>
The full installation installation instructions can be found srlconll-1.1/
Download liblinear for python. Navegate to the folder.
> make
Place executables *.so. files at models/lib and liblinear.py and liblinearutils.py.
> python srl.py -h
Summons help
> python srl.py
Runs multi-class classification using the following solvers:
Solver | Description |
---|---|
0 | L2-regularized logistic regression (primal) |
1 | L2-regularized L2-loss support vector classification (dual) |
2 | L2-regularized L2-loss support vector classification (primal) |
3 | L2-regularized L1-loss support vector classification (dual) |
4 | support vector classification by Crammer and Singer |
5 | L1-regularized L2-loss support vector classification |
6 | L1-regularized logistic regression |
7 | L2-regularized logistic regression (dual),for regression |
Using only gold-standard and windows set of features.
> python srl.py 1 -context -dtree -windows
Runs L2-regularized L2-loss support vector classification (dual) with context dtree and windows set of features.
C | Precision | Recall | F1 | |
---|---|---|---|---|
(BELTRAO, 2016) | 0.0625 | 82.17% | 82.88% | 82.52% |
Features | C | S | Precision | Recall | F1 |
---|---|---|---|---|---|
context + dtree + window | 0.0625 | 1 | 76.80% | 75.67% | 76.23% |
dtree | 0.0625 | 4 | 76.31% | 75.13% | 75.72% |
context + window | 0.0625 | 4 | 57.13% | 53.96% | 55.50% |
context | 0.0625 | 4 | 57.37% | 46.42% | 51.32% |
window | 0.0625 | 2 | 40.61% | 26.79% | 32.28% |
(ALVA-MANCHEGO, 2013)
Anotação automática semissupervisionada de papéis semânticos para o português do Brasil
(BELTRAO, 2016)
Anotador de Papeis Semânticos para Português