This is a natural language classifier that uses keyword methods to classify future-referring sentences according to whether they use the present tense, the future tense, or express epistemic certainty or uncertainty.
It is recommended that `ftr_classifier` be used in a conda environment with Python 3.6 installed. Instructions on getting started with conda can be found here. Once conda is installed, create an appropriate environment with `conda create --name my_env_name python=3.6`, then activate the environment with `conda activate my_env_name` or `source activate my_env_name`. Before installing, I recommend you install `pandas>2.x` and spaCy `2.2.1`, e.g. with `conda install -c conda-forge spacy=2.2.1`. Then, to install, run:

```shell
pip install ftr-classifier
```
To install a version of the `ftr-classifier` which is designed to operate on naturally-occurring (rather than experimental) data, install from the `natural_ftr` branch. The classification logic is changed slightly there. This branch also includes a function which wraps a spaCy model that estimates whether, rather than how, a text datum refers to the future, described here. When processing naturally-occurring data, I recommend you use this first and then pass FTR (> 50%) statements to the `ftr-classifier`, as described in the linked pre-print above.

See the minimal examples for further explanation of usage.
```python
import pandas as pd

# import ftr_classifier
import ftr_classifier as ftr

# load data
df = pd.read_excel('data.xlsx')

# to estimate whether an item of text refers to the future (and the past), run:
ftr.estimate_ftr_ptr(df)

# to estimate how an FTR statement refers to the future, run:
class_df = ftr.classify_df(df)

# count lemmas
lemma_count = ftr.count_lemmas(class_df)

# clean spaCy docs
class_df = ftr.clean_spacy(class_df)

# save
class_df.to_excel('classified_data.xlsx', index=False)
lemma_count.to_excel('lemma_counts.xlsx', index=False)
```
`.xlsx` or `pickle` are the recommended filetypes, over `.csv`, because Excel tends to mangle non-ASCII (e.g. Dutch, German) characters when it opens `.csv` files. See the `pandas` docs on importing different file types.
`ftr_classifier` has two main purposes. First, it scores a `pandas` dataframe, `df`, containing English, Dutch, or German natural language strings in terms of how these data refer to the future, i.e. whether a future tense marker is used, whether the present tense is used, or whether some kind of epistemic modal expression is used. For explanation and justification of this classification scheme as regards English, Dutch, and German, see Robertson et al. (TKTK). Second, it provides counts of the lemmas/stems of the words in these data which are used in the keyword classification procedure.

The default column names are `response` for the column containing natural language strings, and `language` for the column indexing language. These can be altered to match the user's data by passing keyword arguments, as in `ftr.prepare(lang_col='new_language_col_name', text_col='new_response_col_name')` or `ftr.score(lang_col='new_language_col_name', text_col='new_response_col_name')`. Natural language responses should have been generated using the experimental methods described in Robertson et al. (TKTK).
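Alternatively, if you prefer not to pass keyword arguments on every call, you can rename your columns to the package defaults up front. A minimal sketch (the column names `text` and `lang` are invented for illustration):

```python
import pandas as pd

# Hypothetical data using non-default column names.
df = pd.DataFrame({'text': ['It will rain tomorrow.'],
                   'lang': ['english']})

# Rename to the package defaults ('response' and 'language') so no
# keyword arguments are needed afterwards.
df = df.rename(columns={'text': 'response', 'lang': 'language'})
```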
`ftr.prepare()` processes `df['response']` using `spacy` natural language processing models, and returns `df` with `df['spacy_doc']`, the spaCy-processed version of `df['response']`, and `df['final_sentence']`, a `spacy` document of the last sentence in `df['response']`. `ftr.score()` classifies `df['final_sentence']` in terms of how it refers to the future. `ftr.apply_dominance()` applies a dominance relationship to the output of `ftr.score()`, as described in Robertson et al. (TKTK). Any analyses on the results of this package should be performed on the dominance-subjected results, i.e. the columns ending in `_dom`. If there is no `_dom` column for a particular category, e.g. `verb_poss`, then this category is the dominant category and can be used in analysis.
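The dominance relationship can be sketched in plain `pandas`. This is an illustrative reconstruction of the logic described above, not the package's actual implementation, and the rows and the single modal column are invented:

```python
import pandas as pd

# Hypothetical scored rows: 1 = feature present, 0 = absent.
df = pd.DataFrame({
    'present':   [1, 1, 0],
    'future':    [0, 1, 1],
    'verb_poss': [0, 0, 1],   # stand-in for the *_cert / *_poss columns
})

modal_cols = ['verb_poss']

# present_dom: present tense used, and neither a future marker
# nor any epistemic modal expression.
df['present_dom'] = ((df['present'] == 1)
                     & (df['future'] == 0)
                     & (df[modal_cols] == 0).all(axis=1)).astype(int)

# future_dom: future marker used, and no epistemic modal expression.
df['future_dom'] = ((df['future'] == 1)
                    & (df[modal_cols] == 0).all(axis=1)).astype(int)
```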
Finally, `ftr.classify_df()` calls `ftr.prepare()`, `ftr.score()`, and `ftr.apply_dominance()` in sequence and returns a dataframe scored according to the descriptions in Robertson et al. (TKTK). This is the recommended approach, as given in the minimal example.

Calling `ftr.classify_df(df)` appends the following columns to `df`. Except when a language does not have the word category of the column in question, columns are scored as `1` when a given feature is present and `0` when it is not. When a language does not have the word category in question, all values for that feature for that language are scored `-999`.
- `response_clean`: a Python `list` of the tokens in the final sentence of the strings in `df['response']`.
- `present`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['present']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether it uses the present tense of the main verb in each response.
- `future`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['future']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether it uses a future tense marker.
- `verb_poss`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['verb_poss']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses a modal verb indicating uncertainty, e.g. could, might, or may, in English.
- `verb_cert`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['verb_cert']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses a modal verb indicating certainty, i.e. must, in English.
- `adv_adj_poss`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['adv_adj_poss']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses a modal adverb/adjective indicating uncertainty, e.g. possibly, maybe, or probably, in English.
- `adv_adj_cert`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['adv_adj_cert']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses a modal adverb/adjective indicating certainty, e.g. certain, definitely, or surely, in English.
- `mental_poss`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['mental_poss']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses an epistemic mental state predicate indicating uncertainty, e.g. think or believe, in English.
- `mental_cert`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['mental_cert']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether the response uses an epistemic mental state predicate indicating certainty, i.e. know, in English.
- `particle_poss`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['particle_poss']`, where `lang` == the language in `df['language']` and is in `['dutch','german']`, i.e. whether the response uses an epistemic modal particle indicating uncertainty, i.e. *wel* in Dutch, or *wohl* in German (English does not have modal particles, so all `df.loc[df['language']=='english','particle_poss']` == `-999`, for missing).
- `particle`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['particle']`, where `lang` == the language in `df['language']` and is in `['dutch','german']`, i.e. whether the response uses an epistemic modal particle apart from those indicating uncertainty, i.e. *toch* in Dutch or *doch* in German (English does not have modal particles, so all `df.loc[df['language']=='english','particle']` == `-999`, for missing).
- `will_future`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['will_future']`, where `lang` == the language in `df['language']` and is in `['english','dutch','german']`, i.e. whether it uses the future tense marker *will* (English), *zullen* (Dutch), or *werden* (German).
- `go_future`: indicates whether the response uses one of the words in `ftr.WORD_LISTS[lang]['go_future']`, where `lang` == the language in `df['language']` and is in `['english','dutch']`, i.e. whether it uses a form of the future tense marker *be going to* (English) or *gaan* (Dutch) (German does not have a future tense marker grammaticised from a motion verb, i.e. 'go', so all `df.loc[df['language']=='german','go_future']` == `-999`, for missing).
- `negated`: a column indicating whether a negation is present in `df['response']`.
- `present_dom`: indicates whether `df['present'] == 1` and not `df[['future','*_cert','*_poss']] == 1`, i.e. whether a response uses the present tense of the main verb and not also a future tense marker or an epistemic modal expression.
- `future_dom`: indicates whether `df['future'] == 1` and not `df[['*_cert','*_poss']] == 1`, i.e. whether a response uses a future tense marker and not also an epistemic modal expression.
- `will_future_dom`: indicates whether `df['will_future'] == 1` and not `df[['*_cert','*_poss']] == 1`, i.e. whether a response uses a future tense marker not based on 'go', and not also an epistemic modal expression.
- `go_future_dom`: indicates whether `df['go_future'] == 1` and not `df[['*_cert','*_poss']] == 1`, i.e. whether a response uses a future tense marker based on 'go', and not also an epistemic modal expression (`-999` for all German responses).
- `lexi_poss`: `1` if any in `df[['adv_adj_poss','particle_poss','mental_poss']] == 1`, else `0`, i.e. whether a response uses an expression indicating epistemic uncertainty which is not a modal verb.
- `lexi_cert`: `1` if any in `df[['adv_adj_cert','particle_cert','mental_cert']] == 1`, else `0`, i.e. whether a response uses an expression indicating epistemic certainty which is not a modal verb.

Note that if either `go_future` or `will_future` == `1`, then `future` == `1`.
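The keyword scoring behind the `0`/`1` columns above can be sketched in a few lines. The word lists below are tiny, invented stand-ins for the real `ftr.WORD_LISTS`, and `score_sentence` is a hypothetical helper, not a package function:

```python
# Illustrative miniature word lists, keyed by language then feature.
WORD_LISTS = {
    'english': {
        'future': ['will'],
        'verb_poss': ['might', 'could', 'may'],
    },
}

def score_sentence(tokens, lang):
    """Return {feature: 1 if any list word occurs in tokens, else 0}."""
    lists = WORD_LISTS[lang]
    return {feat: int(any(t in words for t in tokens))
            for feat, words in lists.items()}

scores = score_sentence(['it', 'might', 'rain'], 'english')
```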
Additionally, `ftr_classifier` includes functionality to count word occurrences according to the semantically relevant lemmas/stems of the words used to classify sentences. The function for doing this is `ftr.count_lemmas()`. If already created, the result of `ftr.prepare()` or `ftr.classify_df()` should be passed to `ftr.count_lemmas()`, as in `ftr.count_lemmas(df=df_class)`, where `df_class == ftr.prepare(df)` or `ftr.classify_df(df)`. If `ftr.count_lemmas()` does not find a column called `final_sentence`, it will create it by calling `ftr.prepare()`. `ftr.count_lemmas()` returns a dataframe containing the following columns:

- `language`: the language defined in `df['language']`.
- `feature`: which classification feature the lemma is defined as being a part of; see above.
- `lemma`: the lemma/stem in question. The custom lemmatiser is not strictly a lemmatiser, as words from different word classes (i.e. epistemic modal adverbs and adjectives) are "lemmatised" to their adjectival/nominal form. It therefore sometimes behaves more like a stemmer, but stems back to a real, in-the-dictionary lexeme. The choice of adjectival forms as the "lemma" is entirely arbitrary. The reason for the hybrid lemmatising/stemming approach is that, in Robertson et al. (TKTK), we are interested in the semantic domain associated with a given root more so than in the subtle differences delineated by similar derived/inflected versions of the same root. For instance, in German, we drop person and number marking of modal verbs but retain mood marking, as mood marking of modal adverbs alters their modal strength. The subjunctive/Konjunktiv II mood is indicated by the suffix `_SUBJ`, while the indicative is indicated with the suffix `_IND`.
- `count`: the count of the lemma within each language.
- `num_responses`: the number of responses within the language. This and the next column are included in case researchers wish to normalise counts when they have differing numbers of responses in each language.
- `num_words`: the sum of the number of words in `df['final_sentence']` for each language.
Finally, we provide a function which drops the `spacy` docs automatically appended to the dataframe when `ftr.prepare()` or `ftr.classify_df()` is called. These are memory-intensive and sometimes cause display difficulties in the resulting dataframes in some common IDEs, e.g. `spyder`. Usage is `clean_df = ftr.clean_spacy(df_class)`, where `df_class == ftr.prepare(df)` or `ftr.classify_df(df)`.
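The idea can be sketched with a plain `pandas` column drop (an illustration of the effect, not the package's implementation; the dataframe below is invented):

```python
import pandas as pd

# Hypothetical classified dataframe with memory-heavy spaCy columns.
df_class = pd.DataFrame({'response': ['It will rain.'],
                         'spacy_doc': [object()],
                         'final_sentence': [object()]})

# Drop the spaCy columns before saving or displaying.
# errors='ignore' tolerates frames that are already clean.
clean_df = df_class.drop(columns=['spacy_doc', 'final_sentence'],
                         errors='ignore')
```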
Minimal example scripts are located in `./minimal_examples/`. There are `IPython` and base `python` versions, which demonstrate the same function calls. Opening `minimal_example.ipynb` by clicking on it in the repository will show the interested reader examples of usage with printed results.
If researchers wish to change the word lists, they should clone this repo and open `word_lists.py`, altering the import paths so they point locally, as well as the default paths in `ftr.classify._save()` and `ftr.classify._load()`. Any additions/subtractions can be made as line edits in `word_lists.py`; when this script is run, it will prompt for the desired lemmas to be saved to the path defined in `ftr.classify._save()`. Eventually, functions to save/load local word lists will be added.