UniFunc is a text mining tool that processes and analyses the text similarity between a pair of protein function annotations. It is mainly used as a cross-linking mechanism or redundancy elimination tool when processing annotations without any sort of database identifiers.
Please cite https://doi.org/10.1515/hsz-2021-0125 when using UniFunc.
Installing `unifunc` from the conda-forge channel can be achieved by adding conda-forge to your channels with:

```
conda config --add channels conda-forge
conda config --set channel_priority strict
```

Once the conda-forge channel has been enabled, `unifunc` can be installed with:

```
conda install unifunc
```
UniFunc can be run in two modes:
The default mode returns the similarity score (a float) between the provided strings. To run it, use:

```
unifunc "this is string1" "this is string2"
```
The secondary mode requires the user to set a threshold (e.g. 0.95) with the argument `-t`; True is returned if the string similarity is above the threshold, and False otherwise. To run it, use:

```
unifunc string1 string2 -t 0.95
```
To use verbose mode, add the argument `-v`. To redirect output to a file, add the argument `-o file_path`.
To run a sample execution, use:

```
unifunc --example
```
At the moment, only one workflow is available: `cluster_function`. To use it, run `unifunc cluster_function -h` and you will get all the information regarding inputs.
To update UniFunc's resources (a partial automation sketch follows these steps):

- Delete all files in `UniFunc/Resources/`
- Go to https://www.uniprot.org/uniprot/?query=reviewed
- Search for all protein entries
- Choose the columns `Entry`, `Protein names`, and `Function [CC]`
- Apply columns
- Download the results in tab-separated format
- Check that the downloaded file has these 3 headers: `Entry`, `Protein names`, `Function [CC]`
- Rename the downloaded file to `uniprot.tab` and move it to `UniFunc/Resources/`
- Go to http://geneontology.org/docs/download-ontology/
- Download `go.obo`
- Move the file `go.obo` to `UniFunc/Resources/`
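A partial automation sketch of these steps (the `go.obo` URL below is the standard OBO Foundry location; the UniProt export must still be done manually through the website as described above):

```python
import urllib.request
from pathlib import Path

resources = Path("UniFunc/Resources")
for old_file in resources.glob("*"):
    old_file.unlink()  # delete all previous resource files

# ...export the reviewed entries from uniprot.org as uniprot.tab by hand,
# move it into UniFunc/Resources/, then fetch the gene ontology file:
urllib.request.urlretrieve("http://purl.obolibrary.org/obo/go.obo",
                           str(resources / "go.obo"))
```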
Here's an overview of the UniFunc workflow:
The natural language processing of functional descriptions entails several steps:
- Text pre-processing (a minimal sketch follows this list):
- Split functional descriptions into documents
- Remove identifiers
- Standardize punctuation
- Remove digits that are not attached to a token
- Standardize ion patterns
- Replace Roman numerals with Arabic numerals
- Divide document into groups of tokens
- Unite certain tokens (for example, "3" should be merged with "polymerase" into "polymerase 3")
- Part-of-speech tagging
- pos_tag with universal tagging (contextual)
- WordNet tagging (independent)
- Choose the best tag (WordNet takes priority)
- Removal of unwanted tags (determiners, pronouns, particles, and conjunctions)
- Token scoring
- Try to find synonyms (WordNet lexicon) shared between the two compared documents
- Build Term Frequency-Inverse Document Frequency (TF-IDF) vectors
- Similarity analysis
- Calculate cosine distance between the two scaled vectors
- Calculate Jaccard distance between the two sets of identifiers
- If the similarity score is above 0.8, consider it a match
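As a rough illustration of the pre-processing step above, here is a minimal sketch (illustrative code, not UniFunc's actual implementation; the truncated Roman-numeral map is an assumption):

```python
import re

# Truncated Roman numeral map; the real tool presumably covers a wider range.
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

def preprocess(description):
    # standardize punctuation and casing
    text = re.sub(r"[()\[\];:,]", " ", description.lower())
    # replace Roman numerals with Arabic numerals
    tokens = [ROMAN.get(t, t) for t in text.split()]
    merged = []
    for token in tokens:
        if token.isdigit() and merged:
            # unite free-standing digits with the preceding token
            merged[-1] = f"{merged[-1]} {token}"
        else:
            merged.append(token)
    return merged

print(preprocess("DNA polymerase III (subunit IV)"))
# ['dna', 'polymerase 3', 'subunit 4']
```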
Part-of-speech tagging (POST) is the method of lexically classifying tokens based on their definition and context. In the context of this application, the point is to eliminate tokens that are not relevant to the similarity analysis.
After pre-processing, tokens are tagged with a custom SequentialBackOffTagger, independently of context. This tagger uses WordNet's lexicon to identify the most common lexical category of any given token.
Should a token be present in WordNet's lexicon, a list of synonyms and their lexical categories is generated, for example:

`[(token, noun), (synonym1, noun), (synonym2, verb), (synonym3, adjective), (synonym4, noun)]`

The token is then assigned the most common tag, noun.
To adjust this lexicon to biological data, gene ontology tokens are also added.
Untagged tokens are then contextually classified with a Perceptron tagger. The classification obtained from this tagger is not optimal (as a pre-trained classifier is used); however, in the current context this is barely of consequence, as this tagger is merely used as a backup when no other tag is available. Optimally, a new model would be trained, but unfortunately this would require a heavy time investment in building a training dataset.
The tokens tagged as being determiners, pronouns, particles, or conjunctions are removed.
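As a minimal sketch of this tagging cascade (an illustration, not UniFunc's code; it assumes the NLTK `wordnet`, `averaged_perceptron_tagger`, and `universal_tagset` resources are downloaded):

```python
from collections import Counter
from nltk import pos_tag
from nltk.corpus import wordnet

# universal tags for determiners, pronouns, particles, and conjunctions
UNWANTED = {"DET", "PRON", "PRT", "CONJ"}
# WordNet lexical categories mapped onto the universal tagset
WN_TO_UNIVERSAL = {"n": "NOUN", "v": "VERB", "a": "ADJ", "s": "ADJ", "r": "ADV"}

def wordnet_tag(token):
    # majority vote over the lexical categories of the token's synonyms
    synsets = wordnet.synsets(token)
    if not synsets:
        return None
    return Counter(s.pos() for s in synsets).most_common(1)[0][0]

def tag_tokens(tokens):
    # contextual (perceptron) tags serve only as a backup; WordNet takes priority
    backup = dict(pos_tag(tokens, tagset="universal"))
    tagged = [(t, WN_TO_UNIVERSAL.get(wordnet_tag(t), backup[t])) for t in tokens]
    # drop tags that carry no weight in the similarity analysis
    return [(t, tag) for t, tag in tagged if tag not in UNWANTED]

print(tag_tokens("the polymerase catalyzes dna replication".split()))
```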
In this step, tokens are scored with the Term Frequency-Inverse Document Frequency (TF-IDF) technique. This allows analysing which tokens are more relevant to a certain annotation, which in turn allows for the identification of other annotations with the same similarly important tokens.
TF-IDF measures the importance of a token to a document in a corpus. To summarize:
- TF - Tokens that appear more often in a document should be more important. This is a local (document-wide) metric.
- IDF - Tokens that appear in too many documents should be less important. This is a global (corpus-wide) metric.
TF-IDF is calculated with the following equation:

`TF-IDF = TF × IDF = (NT / TT) × log(TD / DT)`

where:
- NT, number of times the token appears in the document
- TT, total number of tokens in the document
- TD, total number of documents in the corpus
- DT, number of documents in which the token appears (stored as a token frequency table)
The corpus used to build this metric comprises all 561,911 reviewed proteins from UniProt (as of 2020/04/14). After pre-processing, each protein annotation is split into tokens, and a token frequency table (DT) is calculated and saved to a file.
The TF-IDF score is then locally scaled (min-max scaling relative to the document) so that we can better understand which tokens are more relevant within the analysed document.
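A minimal sketch of this scoring, assuming a precomputed document-frequency table (`doc_freq` and the fallback for unseen tokens are illustrative assumptions):

```python
import math

def scaled_tfidf(tokens, doc_freq, total_docs):
    # TF-IDF = (NT / TT) * log(TD / DT) for each distinct token in the document
    tt = len(tokens)
    scores = {}
    for token in set(tokens):
        nt = tokens.count(token)
        dt = doc_freq.get(token, 1)  # assumption: unseen tokens count as one document
        scores[token] = (nt / tt) * math.log(total_docs / dt)
    # min-max scaling relative to this document
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {token: (score - lo) / span for token, score in scores.items()}
```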
Finally, we can compare annotations from different sources by calculating the cosine distance between each pair of scaled TF-IDF vectors. Should the tokens they contain, and the importance of those tokens within each document, be roughly the same, the annotations are classified as "identical". Identifiers within the free-text description are also taken into account via the Jaccard distance metric. A simple intersection is not used, as more general identifiers might lead to too many false positives.
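The comparison itself can be sketched as follows (again illustrative; UniFunc's exact weighting of the two metrics is not shown here):

```python
import math

def cosine_similarity(vec_a, vec_b):
    # vec_a, vec_b: token -> scaled TF-IDF score dictionaries
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(ids_a, ids_b):
    # ids_a, ids_b: sets of identifiers extracted from the free-text descriptions
    if not (ids_a or ids_b):
        return 0.0
    return len(ids_a & ids_b) / len(ids_a | ids_b)
```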
In this manner, we are able to construct groups of hits (from different sources) that match either via identifiers or via free-text descriptions. We then evaluate the quality of each group of consensuses and select the best one, taking into account:
- Percentage of the sequence covered by the hits in the consensus
- Significance of the hits (e-value) in the consensus
- Significance of the reference datasets
- Number of different reference datasets in the consensus
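A hypothetical way to combine these criteria into a single quality score (the formula and weights below are assumptions for illustration, not the published consensus metric):

```python
import math

def consensus_quality(coverage, evalues, dataset_weights):
    # coverage: fraction of the query sequence covered by the hits in the consensus
    # evalues: e-values of the hits in the consensus (lower means more significant)
    # dataset_weights: one significance weight per distinct reference dataset
    significance = sum(-math.log10(e + 1e-300) for e in evalues) / len(evalues)
    # summing per-dataset weights rewards both dataset significance and diversity
    return coverage * significance * sum(dataset_weights)
```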