add wrapper for training Spec2Vec in Galaxy #314

Merged on Jan 5, 2023 (60 commits)
Changes from 45 commits

Commits
7770b5a  init Spec2Vec Training tool  (maximskorik, Nov 29, 2022)
33e1b53  add dockerfile  (maximskorik, Nov 28, 2022)
9f2d3e3  add parameters  (maximskorik, Nov 29, 2022)
c03ef6d  add prefixes; move command to a proper tag  (maximskorik, Nov 30, 2022)
eba223f  combine checkpoints and epochs; fix filter; remove n_workers setter  (maximskorik, Nov 30, 2022)
8ac2acf  handle checkpoint in wrapper  (maximskorik, Nov 30, 2022)
bfe58c7  add help section  (maximskorik, Nov 30, 2022)
dd5271f  remove duplicated argument  (maximskorik, Nov 30, 2022)
38d91c4  fix args in command  (maximskorik, Nov 30, 2022)
ecfbba5  remove indentation  (maximskorik, Nov 30, 2022)
a24b910  fix reference to attribute  (maximskorik, Nov 30, 2022)
2c073a1  fix reference to envvar  (maximskorik, Nov 30, 2022)
205a8e9  remove unsupported operator  (maximskorik, Nov 30, 2022)
7568759  fix argname  (maximskorik, Dec 1, 2022)
b75080e  fix argnames in galaxy wrapper  (maximskorik, Dec 1, 2022)
4ea8cd4  add call to main func  (maximskorik, Dec 1, 2022)
9ccaf48  unquote numeric and bool args; fix references to attributes  (maximskorik, Dec 1, 2022)
2e71b92  fix args initialization  (maximskorik, Dec 1, 2022)
c757b69  remove quotes form var; fix boolean args  (maximskorik, Dec 1, 2022)
ec9ddfc  move cli to config  (maximskorik, Dec 1, 2022)
248dcd2  fix args in python wrapper  (maximskorik, Dec 1, 2022)
7321ed9  sync defaults with spec2vec defaults  (maximskorik, Dec 1, 2022)
c0c523a  fix workers arg  (maximskorik, Dec 1, 2022)
bb520c5  remove trim_rule arg  (maximskorik, Dec 1, 2022)
934bc4a  edit argnames  (maximskorik, Dec 1, 2022)
e19fe44  configure pickle output  (maximskorik, Dec 1, 2022)
d075726  add symlink for hardcoded `.npy` ext; change output argnames  (maximskorik, Dec 1, 2022)
905ccb9  add checkpoints discovery  (maximskorik, Dec 1, 2022)
1076870  edit output labels  (maximskorik, Dec 1, 2022)
1f0fca2  add missing param  (maximskorik, Dec 2, 2022)
f003622  change `cbow_mean` default  (maximskorik, Dec 2, 2022)
a0a06f8  rename var  (maximskorik, Dec 2, 2022)
8d2a5ab  change arg names  (maximskorik, Dec 2, 2022)
107d9f2  change `sample` default  (maximskorik, Dec 2, 2022)
1229c33  change data formats and fix symlink  (maximskorik, Dec 5, 2022)
6082db9  add filter for checkpoints output  (maximskorik, Dec 5, 2022)
46568ae  add test for pickle output  (maximskorik, Dec 6, 2022)
310c69c  add test data  (maximskorik, Dec 6, 2022)
aee664a  add test for outputting checkpoints  (maximskorik, Dec 6, 2022)
f0f6ad0  check num outputs to validate `filter` works  (maximskorik, Dec 6, 2022)
d80eba5  add validation for checkpoints  (maximskorik, Dec 6, 2022)
0a8b8ff  clarify inputs help; add validatiors  (maximskorik, Dec 6, 2022)
0a5cf23  add creator  (maximskorik, Dec 6, 2022)
0ce10fd  remove trailing slash from `.shed.yml`  (maximskorik, Dec 6, 2022)
48c10b6  fix collection name  (maximskorik, Dec 6, 2022)
f0b9710  remove heavy pickle test file  (maximskorik, Dec 7, 2022)
61ae8a9  fix label  (maximskorik, Dec 7, 2022)
4477ff4  trim test input to reduce output size  (maximskorik, Dec 7, 2022)
c53dc46  lint files  (maximskorik, Dec 7, 2022)
88b9ad5  Merge branch 'spec2vec_wrapper' of https://github.com/maximskorik/gal…  (maximskorik, Dec 7, 2022)
1810531  Revert "trim test input to reduce output size"  (maximskorik, Dec 7, 2022)
f2884d3  remove binary comparison since not consistent with GH CI  (maximskorik, Dec 7, 2022)
61318f4  add test for embedding size correspondece  (maximskorik, Dec 7, 2022)
f55bc72  update validator regex  (maximskorik, Dec 14, 2022)
70991af  remove suffixes due to tests failing  (maximskorik, Dec 19, 2022)
4e0fab4  move symlinks to command tag; quote script path  (maximskorik, Dec 19, 2022)
0cc4fa0  use proper input type for dynamic options  (maximskorik, Dec 19, 2022)
662d691  add numeric validators  (maximskorik, Dec 19, 2022)
ad63882  use `select` for dynamic option  (maximskorik, Dec 19, 2022)
8ec040a  add shebang  (maximskorik, Dec 19, 2022)
13 changes: 13 additions & 0 deletions tools/spec2vec/.shed.yml
@@ -0,0 +1,13 @@
owner: recetox
remote_repository_url: "https://github.com/RECETOX/galaxytools/tree/master/tools/spec2vec"
homepage_url: "https://github.com/iomega/spec2vec"
categories:
  - Metabolomics
repositories:
  spec2vec_training:
    description: "Train a Spec2Vec model for mass spectra similarity scoring."
    include:
      - spec2vec_training.xml
      - macros.xml
      - spec2vec_training_wrapper.py
      - test-data
12 changes: 12 additions & 0 deletions tools/spec2vec/Dockerfile
@@ -0,0 +1,12 @@
FROM python:3.9

ARG COMMIT_SHA=c9b54b950e0dbb8053ba95aabdb2d815e11e7503

WORKDIR /spec2vec

# download src
RUN wget -O /tmp/${COMMIT_SHA}.zip https://github.com/iomega/spec2vec/archive/${COMMIT_SHA}.zip && \
    unzip /tmp/${COMMIT_SHA}.zip

# install spec2vec
RUN pip install ./spec2vec-${COMMIT_SHA}
19 changes: 19 additions & 0 deletions tools/spec2vec/macros.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
<macros>
<token name="@COMMIT_SHA@">c9b54b9</token>
<token name="@TOOL_VERSION@">0.6.0</token>
<token name="@TOOL_DEV_VERSION@">0</token>

<xml name="creator">
<creator>
<person
givenName="Maksym"
familyName="Skoryk"
url="https://github.com/maximskorik"
identifier="0000-0003-2056-8018" />
<organization
url="https://www.recetox.muni.cz/"
email="GalaxyToolsDevelopmentandDeployment@space.muni.cz"
name="RECETOX MUNI" />
</creator>
</xml>
</macros>
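
Note on the tokens above: Galaxy substitutes macro tokens textually wherever they appear in a tool file, so these three values determine the version string and the container tag used by the tool XML in this PR. The following is a minimal Python sketch of that substitution for illustration only; `expand_tokens` is a hypothetical helper, not Galaxy's implementation.

```python
# Hypothetical helper illustrating textual token substitution.
# This is NOT Galaxy's actual macro expansion code.
TOKENS = {
    "@COMMIT_SHA@": "c9b54b9",
    "@TOOL_VERSION@": "0.6.0",
    "@TOOL_DEV_VERSION@": "0",
}


def expand_tokens(text: str, tokens: dict) -> str:
    """Replace every occurrence of each token name with its value."""
    for name, value in tokens.items():
        text = text.replace(name, value)
    return text


# Strings taken from the tool XML in this PR.
version = expand_tokens("@TOOL_VERSION@-@TOOL_DEV_VERSION@+galaxy0", TOKENS)
container = expand_tokens("recetox/spec2vec:@COMMIT_SHA@", TOKENS)

print(version)    # 0.6.0-0+galaxy0
print(container)  # recetox/spec2vec:c9b54b9
```

The container tag uses the abbreviated SHA from `macros.xml`, while the Dockerfile builds from the full 40-character SHA, so the image has to be tagged with the short form when it is published.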
222 changes: 222 additions & 0 deletions tools/spec2vec/spec2vec_training.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
<tool id="spec2vec_training" name="Spec2Vec Model Training" version="@TOOL_VERSION@-@TOOL_DEV_VERSION@+galaxy0" python_template_version="3.5" profile="21.05">
<description>Train a Spec2Vec model for mass spectra similarity scoring</description>

<macros>
<import>macros.xml</import>
</macros>
<expand macro="creator"/>

<requirements>
<container type="docker">recetox/spec2vec:@COMMIT_SHA@</container>
</requirements>

<command detect_errors="exit_code"><![CDATA[
sh ${weights_symlink} &&
sh ${spec2vec_python_cli}
]]></command>

<configfiles>
<configfile name="weights_symlink">
ln -fs '${weights_filename}' '${weights_filename}.npy'
</configfile>
<configfile name="spec2vec_python_cli">
python3 '${__tool_directory__}/spec2vec_training_wrapper.py' \
--spectra_filename '$spectra_filename' \
--spectra_fileformat '$spectra_filename.ext' \
#if $output_parameters.model_checkpoints.is_true
--checkpoints '$output_parameters.model_checkpoints.checkpoints' \
#else
--epochs $output_parameters.model_checkpoints.epochs \
#end if
--vector_size $training_parameters.vector_size \
--alpha $training_parameters.alpha \
--min_alpha $training_parameters.min_alpha \
--window $training_parameters.window \
--min_count $training_parameters.min_count \
--sample $training_parameters.sample \
--seed $training_parameters.seed \
--sg $training_parameters.sg_param.sg \
#if str($training_parameters.sg_param.sg) == "0"
--cbow_mean $training_parameters.sg_param.cbow_mean \
#end if
--hs $training_parameters.hs_param.hs \
#if str($training_parameters.hs_param.hs) == "0"
--negative $training_parameters.hs_param.negative \
--ns_exponent $training_parameters.hs_param.ns_exponent \
#end if
--sorted_vocab $training_parameters.sorted_vocab \
--batch_words $training_parameters.batch_words \
--shrink_windows $training_parameters.shrink_windows \
#if $training_parameters.trim_vocab.max_vocab_size_bool
--max_vocab_size $training_parameters.trim_vocab.max_vocab_size \
#end if
--n_decimals $training_parameters.n_decimals \
--n_workers \${GALAXY_SLOTS:-1} \
#if $output_parameters.as_pickle
--model_filename_pickle '$model_filename_pickle' \
#end if
--model_filename '$model_filename' \
--weights_filename '$weights_filename' \
</configfile>
</configfiles>

<inputs>
<param label="Training spectra" name="spectra_filename" type="data" format="msp,mgf"
help="Spectra file to train a Spec2Vec model."/>

<section title="Output parameters" name="output_parameters" expanded="true">
<param label="Save model as Pickle file" name="as_pickle" type="boolean" checked="false" truevalue="TRUE" falsevalue="FALSE"
help="Also save the model as a Pickle file, in addition to the default JSON output."/>
<conditional name="model_checkpoints">
<param label="Model checkpoints" name="is_true" type="boolean" checked="false" truevalue="TRUE" falsevalue="FALSE"
help="Epochs after which to save a model."/>
<when value="TRUE">
<param label="Number of training epochs with checkpoints" name="checkpoints" type="text" value="10,20,50"
help="Comma-separated epoch numbers after which to save a model. The highest number is used as the total number of training epochs.">
<validator type="empty_field"/>
<validator type="regex"
message="The input has to be a comma-separated sequence of integers without trailing commas. For example: 10,20,50">^[0-9][0-9,]*[0-9]$</validator>
</param>
</when>
<when value="FALSE">
<param label="Number of training epochs" name="epochs" type="integer" value="10"
help="Number of epochs to train the model."/>
</when>
</conditional>
</section>

<section title="Training hyperparameters" name="training_parameters" expanded="true">
<param label="Vector size" name="vector_size" type="integer" value="300"
help="Dimensionality of the feature vectors (i.e., into how many dimensions to encode each m/z and neutral loss peak)."/>
<param label="Alpha" name="alpha" type="float" value="0.025"
help="The initial learning rate."/>
<param label="Minimum Alpha" name="min_alpha" type="float" value="0.00025"
help="Learning rate will linearly drop to this value as training progresses."/>
<param label="Window" name="window" type="integer" value="500"
help="Maximum distance between the current and predicted peak within a spectrum."/>
<param label="Minimum peak count" name="min_count" type="integer" value="1"
help="Ignores all peaks with absolute frequency lower than this."/>
<param label="Sample" name="sample" type="float" value="0.001"
help="The threshold for configuring which higher-frequency peaks are randomly downsampled."/>
<param label="Seed" name="seed" type="integer" value="1"
help="Seed of random number generator for model reproducibility."/>
<conditional name="sg_param">
<param label="Word-Embedding type" name="sg" type="select"
help="Embedding type: Skip-gram or Continuous Bag of Words">
<option value="0">CBOW</option>
<option value="1">Skip-gram</option>
</param>
<when value="0">
<param label="CBOW mean" name="cbow_mean" type="select"
help="Whether to use the sum of the context word vectors or their mean.">
<option value="0">Sum</option>
<option value="1" selected="true">Mean</option>
</param>
</when>
</conditional>
<conditional name="hs_param">
<param label="Last Layer Activation" name="hs" type="select"
help="Activation function of the last layer of the neural network. Negative sampling is more computationally efficient.">
<option value="0">Negative Sampling</option>
<option value="1">Hierarchical Softmax</option>
</param>
<when value="0">
<param label="Negative Samples" name="negative" type="integer" value="5"
help="Specify how many 'negative' examples should be drawn for each peak and neutral loss (usually between 5 and 20).">
<validator type="in_range" min="1" message="The value must be larger than 0."/>
</param>
<param label="Negative Sample Exponent" name="ns_exponent" type="float" value="0.75"
help="The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies,
0.0 samples all peaks and neutral losses equally, while a negative value samples low-frequency peaks more often than high-frequency peaks.">
<validator type="in_range" min="-1.0" max="1.0" message="The value must be between -1.0 and 1.0."/>
</param>
</when>
</conditional>
<param label="Sort the vocabulary of spectra" name="sorted_vocab" type="boolean" checked="true" truevalue="TRUE" falsevalue="FALSE"
help="If true, sort the vocabulary by descending frequency before assigning peak and neutral loss indices."/>
<param label="Batch size" name="batch_words" type="integer" value="10000"
help="Target size (in peaks and neutral losses) for batches of examples passed to worker threads (and thus cython routines).
Larger batches will be passed if individual peak sequences are longer than 10000 words, but the standard cython code truncates to that maximum."/>
<param label="Shrink windows" name="shrink_windows" type="boolean" checked="true" truevalue="TRUE" falsevalue="FALSE"
help="EXPERIMENTAL. If true, the effective window size is uniformly sampled in range [1,Window] for each target peak during training."/>
<conditional name="trim_vocab">
<param label="Maximum unique peaks and neutral losses in the spectral vocabulary" name="max_vocab_size_bool" type="boolean" truevalue="TRUE" falsevalue="FALSE" checked="false"
help="Limits the RAM during vocabulary building; if there are more unique peaks and neutral losses than this, then prune the infrequent ones. Disable for no limit (default)."/>
<when value="TRUE">
<param label="Maximum unique peaks and neutral losses" name="max_vocab_size" type="integer" value="100000"/>
</when>
</conditional>
<param label="Number of decimals to round m/z values" name="n_decimals" type="integer" value="2"
help="Rounds peak position to this number of decimals."/>
</section>
</inputs>

<outputs>
<data label="Spec2Vec model on ${on_string}" name="model_filename" format="json"/>
<data label="Spec2Vec weights on ${on_string}" name="weights_filename" format="binary"/>
<data label="Spec2Vec pickle model on ${on_string}" name="model_filename_pickle" format="binary">
<filter>output_parameters['as_pickle']</filter>
</data>
<collection name="model_checkpoints" type="list" label="Spec2Vec model checkpoints on ${on_string}">
<discover_datasets pattern="__name_and_ext__" />
<filter>output_parameters['model_checkpoints']['is_true']</filter>
</collection>
</outputs>

<tests>
<test expect_num_outputs="2"> <!-- Test 1: with default parameters -->
<param name="spectra_filename" value="RECETOX_Exposome_pesticides_HR_MS_normalized_20220323.msp" ftype="msp"/>
<output name="model_filename" file="model.json" ftype="json"/>
<output name="weights_filename" file="weights.npy" ftype="binary"/>
</test>
<test expect_num_outputs="3"> <!-- Test 2: pickle output -->
<param name="spectra_filename" value="RECETOX_Exposome_pesticides_HR_MS_normalized_20220323.msp" ftype="msp"/>
<param name="as_pickle" value="TRUE"/>
<output name="model_filename" file="model.json" ftype="json"/>
<output name="weights_filename" file="weights.npy" ftype="binary"/>
<output name="model_filename_pickle" file="model.pkl" ftype="binary" compare="sim_size" delta_frac="0.001"/>
</test>
<test expect_num_outputs="3"> <!-- Test 3: model checkpoints -->
<param name="spectra_filename" value="RECETOX_Exposome_pesticides_HR_MS_normalized_20220323.msp" ftype="msp"/>
<conditional name="model_checkpoints">
<param name="is_true" value="TRUE"/>
<param name="checkpoints" value="1,5,8,10"/>
</conditional>
<output name="model_filename" file="model.json" ftype="json"/>
<output name="weights_filename" file="weights.npy" ftype="binary"/>
<output_collection name="model_checkpoints" type="list" count="3">
<element name="spec2vec_iter_1">
<assert_contents>
<has_size value="3468k" delta="1k" />
<has_text text="gensim.models.word2vec" />
<has_text text="peak@" n="1423" />
</assert_contents>
</element>
<element name="spec2vec_iter_5">
<assert_contents>
<has_size value="3468k" delta="1k" />
<has_text text="gensim.models.word2vec" />
<has_text text="peak@" n="1423" />
</assert_contents>
</element>
<element name="spec2vec_iter_8">
<assert_contents>
<has_size value="3468k" delta="1k" />
<has_text text="gensim.models.word2vec" />
<has_text text="peak@" n="1423" />
</assert_contents>
</element>
</output_collection>
</test>
</tests>

<help><![CDATA[
**Spec2vec** is a spectral similarity score inspired by a natural language processing algorithm – Word2Vec.
Where Word2Vec learns relationships between words in sentences, spec2vec does so for mass fragments and neutral losses in MS/MS spectra.
The spectral similarity score is based on spectral embeddings learnt from the fragmental relationships within a large set of spectral data.
]]></help>

<citations>
<citation type="doi">10.1371/journal.pcbi.1008724</citation>
</citations>
</tool>
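
Note on the command-line contract: `spec2vec_training_wrapper.py` is not part of the diff shown here, so the following is only a sketch of how the options emitted by the config file could be parsed on the Python side. It assumes `argparse`; `str2bool`, `parse_checkpoints`, `build_parser`, and `resolve_epochs` are hypothetical names, and only a few of the options are included. It covers the `TRUE`/`FALSE` strings produced by the boolean params and the rule from the help text that the highest checkpoint becomes the total number of epochs. Test 3 expects three checkpoint elements for `1,5,8,10`, which suggests that the last epoch yields the final model rather than a checkpoint; the sketch mirrors that reading. The validator pattern `^[0-9][0-9,]*[0-9]$` also accepts doubled commas such as `1,,2` and rejects a single digit such as `5`, so the sketch validates the list again on its own.

```python
import argparse


def str2bool(value: str) -> bool:
    """Map the TRUE/FALSE strings used by the Galaxy boolean params to bool."""
    normalized = value.strip().upper()
    if normalized == "TRUE":
        return True
    if normalized == "FALSE":
        return False
    raise argparse.ArgumentTypeError(f"expected TRUE or FALSE, got {value!r}")


def parse_checkpoints(value: str) -> list:
    """Parse a string like '10,20,50' into a sorted list of unique epochs."""
    try:
        epochs = sorted({int(part) for part in value.split(",")})
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected comma-separated integers, got {value!r}")
    if epochs[0] < 1:
        raise argparse.ArgumentTypeError("epoch numbers must be positive")
    return epochs


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a small, illustrative subset of the options."""
    parser = argparse.ArgumentParser(description="Sketch of the training CLI")
    # The config file passes either --checkpoints or --epochs, never both.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--checkpoints", type=parse_checkpoints)
    group.add_argument("--epochs", type=int)
    parser.add_argument("--sorted_vocab", type=str2bool, default=True)
    parser.add_argument("--shrink_windows", type=str2bool, default=True)
    return parser


def resolve_epochs(args: argparse.Namespace) -> tuple:
    """Return (total_epochs, checkpoints_before_the_last_epoch)."""
    if args.checkpoints:
        # Assumption: the highest checkpoint is the final training epoch.
        return args.checkpoints[-1], args.checkpoints[:-1]
    return args.epochs, []


args = build_parser().parse_args(
    ["--checkpoints", "1,5,8,10", "--sorted_vocab", "TRUE", "--shrink_windows", "FALSE"])
total_epochs, checkpoints = resolve_epochs(args)
print(total_epochs, checkpoints, args.sorted_vocab, args.shrink_windows)
# 10 [1, 5, 8] True False
```

Using `type=` callables keeps the conversion next to the argument definition, so a malformed value fails at parse time with a readable message instead of surfacing later during training.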