- initial commit
- added ``rb-dual-motifs`` dataset
- added ``tadf`` dataset
- Added module ``visual_graph_datasets.cli``
- Improved installation process. It's now possible to install in non-editable mode
- Added tests
- Added function ``get_dataset_path`` which returns the full dataset path given the string name of a dataset.
- Added the dataset ``movie_reviews``, a natural language classification dataset that was converted into a graph dataset.
- Extended the function ``visual_graph_datasets.data.load_visual_graph_dataset`` to be able to load natural language text based graph datasets as well.
Completely refactored the way in which datasets are managed.
- By default, the datasets are now stored within a folder called ``.visual_graph_datasets/datasets`` within the user's home directory, but the datasets are no longer part of the repository itself. Instead, the datasets have to be downloaded from a remote file share provider first. CLI commands have been added to simplify this process. Assuming the remote provider is correctly configured and accessible, datasets can simply be downloaded by name using the ``download`` CLI command.
- Added ``visual_graph_datasets.config`` which defines a config singleton class. By default this config class only returns default values, but a config file can be created at ``.visual_graph_datasets/config.yaml`` by using the ``config`` CLI command. Inside this config it is possible to change the remote file share provider and the dataset path.
- The CLI command ``list`` can be used to display all the available datasets in the remote file share.
- Somewhat extended the ``AbstractFileShare`` interface to also include a method ``check_dataset`` which retrieves the file share's metadata and then checks if the provided dataset name is available from that file share location.
- Added the sub package ``visual_graph_datasets.generation`` which will contain all the functionality related to the generation of datasets.
- Added the module ``visual_graph_datasets.generation.graph`` and the class ``GraphGenerator`` which presents a generic solution for graph generation purposes.
- Added the sub package ``visual_graph_datasets.visualization`` which will contain all the functionality related to the visualization of various different kinds of graphs.
- Added the module ``visual_graph_datasets.visualization.base``
- Added the module ``visual_graph_datasets.visualization.colors`` and functionality to visualize grayscale graphs which contain a single attribute that represents the grayscale value.
- Added an ``experiments`` folder which will contain ``pycomex`` experiments.
- Added an experiment ``generate_mock.py`` which generates a simple mock dataset which will subsequently be used for testing purposes.
- Extended the dependencies
- Added module ``visual_graph_datasets.visualization.importances`` which implements the visualization of importances on top of graph visualizations.
- Other small fixes, including a problem with the generation of the mock dataset
- Added ``imageio`` to dependencies
- Default config now has the public Nextcloud provider URL
- Fixed a bug with the ``list`` command which crashed due to a non-existent terminal color specification
- Finally finished the implementation of the ``bundle`` command.
- Updated the ``rb_motifs`` dataset for the new structure and also recreated all the visualizations with a transparent background.
- Implemented the visualization of colored graphs
- Changed the config file a bit: It is now possible to define as many custom file share providers as desired under the ``providers`` section. Each new provider, however, needs to have a unique name, which is then required to be supplied to the ``get_file_share`` function to actually construct the corresponding file share provider object instance.
- Added the package ``visual_graph_datasets.processing`` which contains functionality to process source datasets into visual graph datasets.
- Added experiment ``generate_molecule_dataset_from_csv`` which can be used to download the source CSV file for a molecule (SMILES-based) dataset from the file share and then generate a visual graph dataset based on that.
- Fixed a bug in the ``bundle`` command
- Added a module ``visual_graph_datasets.testing`` with testing utils.
- Renamed ``TestingConfig`` to ``IsolatedConfig`` due to a warning in pytest test collection
- Fixed a bug in ``experiments.generate_molecule_dataset_from_csv`` where faulty node positions were saved for the generated visualizations of the molecules
- Added the experiment ``experiments.generate_molecule_multitask_dataset_from_csv`` which generates a molecule-based dataset for a multitask regression learning objective using multiple CSVs and merging them together.
- Fixed a bug in ``experiments.generate_molecule_multitask_dataset_from_csv`` where invalid molecules were causing problems down the line. These are now being filtered.
- Updated ``README.md``
- Added an ``examples`` folder
- Initial implementation of the "dataset metadata" feature: The basic idea is that special metadata files can optionally be added to the various dataset folders to provide useful information about them, such as a description, a version string, a changelog, information about the relevant tensor shapes etc. In the future the idea is to allow arbitrary metadata files which begin with a "." character. For now, the central ``.meta.yaml`` file has been implemented to hold the bulk of the textual metadata in a machine-readable format.
- Added a main logger to the main config singleton, such that it can be used for the command line interface.
- Added the ``gather`` CLI command which can be used to generate/update the metadata information for a single dataset folder. This will create an updated version of the ``.meta.yaml`` file within that folder.
- Changed the ``bundle`` command such that the metadata file is now always updated with the new dataset-specific metadata, regardless of whether it already exists. Additionally, custom fields added to that file which do not interfere with the automatically generated part now persist beyond individual bundle operations.
- Updated the Jinja template for the ``list`` command to be more idiomatic and avoid logic within the template. Additionally, extended it with more metadata information that is now available for datasets.
- Switched to the new version of ``pycomex`` which introduces experiment inheritance.
- Started to implement more specific sub experiments using experiment inheritance.
INTERFACE CHANGES
- The central function ``load_visual_graph_dataset`` now has a backward-incompatible signature: The function still returns a tuple of two elements as before, but the first element of that tuple is now the metadata dict of the dataset as it was loaded by ``load_visual_graph_dataset_metadata``.
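A minimal sketch of the new call pattern (the dataset path is only a placeholder):

.. code-block:: python

    from visual_graph_datasets.data import load_visual_graph_dataset

    # First tuple element is now the metadata dict of the dataset, the second
    # element remains the dataset content itself (the index_data_map).
    metadata_map, index_data_map = load_visual_graph_dataset('/path/to/dataset')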
- Changed dependencies to fit together with ``graph_attention_student``
- Added experiment to generate the ``aggregators_binary`` dataset
Implemented the "preprocessing" feature. Currently a big problem with the visual graph datasets in general is that they are essentially limited to the elements which they already contain. There is no easy way to generate more input elements in the same general graph representation / format as the elements already in a dataset. This is a problem if any model trained based on a VGD is supposed to be actually used on new unseen data: It will be difficult to process a new molecule for example into the appropriate input tensor format required to query the model.
The "preprocessing" feature addresses this problem. During the creation of each VGD, a Python module "process.py" is automatically created from a template and saved into the VGD folder as well. It contains all the necessary code needed to transform a domain-specific representation (such as a SMILES code, for example) into a new input element of that dataset, including the graph representation as well as the visualization. This module can be imported to use the functionality directly in Python code, and it also acts as a command line application.
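A hedged usage sketch of this idea (the class and method names below are assumptions for illustration, not the definitive interface of the generated module):

.. code-block:: python

    # Hypothetical sketch: the generated "process.py" lives inside the dataset
    # folder; the class name "MoleculeProcessing" and the "process"/"visualize"
    # method names are assumptions here.
    import sys
    sys.path.insert(0, '/path/to/visual_graph_dataset')   # placeholder path

    from process import MoleculeProcessing

    processing = MoleculeProcessing()
    smiles = 'C1=CC=CC=C1'                 # benzene, as an example domain representation
    graph = processing.process(smiles)     # graph dict in the dataset's element format
    figure = processing.visualize(smiles)  # visualization of the new element

Because the module also acts as a command line application, the same conversion could be triggered from a shell as well.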
- Added the base class ``processing.base.ProcessingBase``. This class encapsulates the previously described pre-processing functionality. Classes inheriting from this automatically act as a command line interface as well.
- Code for a standalone Python module with the same processing functionality can be generated from an instance using the ``processing.base.create_processing_module`` function.
- Added the class ``processing.molecules.MoleculeProcessing``. This class provides a standard implementation for processing molecular graphs given as SMILES strings.
- Added unittests for base processing functionality
- Added unittests for molecule processing functionality
- Extended the function ``typing.assert_graph_dict`` to do some more in-depth checks for a valid graph dict
- Added module ``generation.color``. It implements utility functions which are needed specifically for the generation of color graph datasets.
- Added the experiment ``experiment.generate_rb_adv_motifs`` which generates the synthetic "red-blue adversarial motifs" classification dataset of color graphs.
- Changed the "config" CLI command to also be usable without actually opening the editor. This can be used, for example, to silently create or overwrite a config file.
- Fixed a bug in ``utils.dynamic_import``
- Fixed a bug in ``data.load_visual_graph_element``
- Changed the version dependency for numpy
- Slightly changed the generation process of the "rb_adv_motifs" dataset.
- Added the class of experiments based on ``experiments.csv_sanchez_lengeling_dataset.py``, which convert the datasets from the paper into a single CSV file, which can then be further processed into a visual graph dataset.
- Added utility function ``util.edge_importances_from_node_importances`` to derive edge explanations from the node explanations in cases where they are not created.
- Started to move towards the new pycomex Functional API with the experiments
- Added more documentation to ``typing``
Model Interfaces and Mixins
- Added the ``visual_graph_datasets.models`` module which will contain all the code which is relevant for models that specifically work with visual graph datasets
- Added the ``models.PredictGraphMixin`` class, which is essentially an interface that can be implemented by a model class to signify that it supports the ``predict_graph`` method, which can be used to query a model prediction directly based on a GraphDict object.
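A minimal sketch of how the mixin could be used (the toy model and the exact ``predict_graph`` signature here are assumptions):

.. code-block:: python

    from visual_graph_datasets.models import PredictGraphMixin

    class ToyGraphModel(PredictGraphMixin):
        """Toy model used only to illustrate the mixin contract."""

        def predict_graph(self, graph: dict) -> float:
            # A real model would run inference on the GraphDict here; this
            # stand-in just returns the number of nodes as a "prediction".
            return float(len(graph['node_indices']))

    model = ToyGraphModel()
    # prediction = model.predict_graph(graph_dict)  # graph_dict from a loaded dataset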
Examples
- Added an ``examples/README.rst``
- Added ``examples/01_explanation_pdf``
- Added a section about dataset conversion to the readme file
- Fixed a bug with the ``create_processing_module`` function where it did not work if the Processing class was not defined at the top-level indentation.
- Changed some dependency versions
- Moved some more experiment modules to the pycomex functional API
Important
- Made some changes to the ``BaseProcessing`` interface, which will be backwards incompatible
- Mainly made the base interface more specific, such as including "output_path" or "value" as concrete positional arguments to the various abstract methods instead of just specifying args and kwargs
- Added the ``Batched`` utility iterator class which will greatly simplify working in batches for predictions etc.
- Made some changes to the base molecule processing file
- Started moving more experiment modules to the new pycomex functional api
- Added an experiment module to process QM9 dataset into a visual graph dataset.
Additions to the ``processing.molecules`` module. Added various new molecular node features based on RDKit computations:
- Partial Gasteiger Charges of atoms
- Crippen LogP contributions of atoms
- EState indices
- TPSA contributions
- LabuteASA contributions
- Changed the default experiment ``generate_molecule_dataset_from_csv.py`` to now use these additional atom/node features for the default Processing implementation.
Overhaul of the dataset writing and reading process. The main difference is that I added support for dataset chunking. Previously, a dataset would consist of a single folder which would directly contain all the files for the individual dataset elements. For large datasets these folders would become very large and thus inefficient for the filesystem to handle. With dataset chunking, the dataset can be split into multiple sub folders that each contain at most a fixed number of elements, thus hopefully increasing the efficiency.
- Added the ``data.DatasetReaderBase`` class, which contains the base implementation of reading a dataset from the persistent folder representation into the index_data_map. This class supports the dataset chunking feature.
- Added ``data.VisualGraphDatasetReader`` which implements this for the basic dataset format that represents each element as a JSON and PNG file.
- Added the ``data.DatasetWriterBase`` class, which contains the base implementation of writing a dataset from a data structure representation into the folder. This class supports the dataset chunking feature.
- Added ``data.VisualGraphDatasetWriter`` which implements this for the basic dataset format where a metadata dict and a matplotlib Figure instance are turned into a JSON and PNG file.
- Changed the ``processing.molecules.MoleculeProcessing`` class to now also support a DatasetWriter instance as an optional argument to make use of the dataset chunking feature during the dataset creation process.
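A sketch of how chunked writing and reading could fit together (apart from the two class names above, the constructor and method names here are assumptions):

.. code-block:: python

    import matplotlib.pyplot as plt
    from visual_graph_datasets.data import VisualGraphDatasetWriter, VisualGraphDatasetReader

    # Hypothetical: split the dataset folder into chunk sub folders of at most
    # 10k elements each ("chunk_size" and "write"/"read" are assumed names).
    writer = VisualGraphDatasetWriter('/tmp/my_dataset', chunk_size=10_000)
    for index in range(3):
        metadata = {'index': index, 'graph': {}}   # stand-in element metadata
        figure, ax = plt.subplots()                # stand-in element visualization
        writer.write(index, metadata, figure)      # JSON + PNG land in the current chunk
        plt.close(figure)

    reader = VisualGraphDatasetReader('/tmp/my_dataset')
    index_data_map = reader.read()                 # chunk folders are handled transparently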
Introduction of COGILES (Color Graph Input Line Entry System) which is a method of specifying colored graphs with a simple human-readable string syntax, which is strongly inspired by SMILES for molecular graphs.
- Added ``generate.colors.graph_from_cogiles``
- Added ``generate.colors.graph_to_cogiles``
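As a hedged illustration of the round trip (only the two function names come from this entry; the import path and the example string are assumptions):

.. code-block:: python

    # Import path and the example COGILES string are assumptions; COGILES is
    # the SMILES-like syntax for color graphs introduced above.
    from visual_graph_datasets.generation.colors import graph_from_cogiles, graph_to_cogiles

    cogiles = 'RRG(B)Y'                     # made-up string in a SMILES-like syntax
    graph = graph_from_cogiles(cogiles)     # decode into a graph dict with color attributes
    roundtrip = graph_to_cogiles(graph)     # encode the graph dict back into a string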
Bugfixes
- I think I finally solved the performance issue in ``generate_molecule_dataset_from_csv.py``. Previously there was an issue where the average write speed would rapidly decline for a large dataset, causing the process to take way too long. I think the problem was the matplotlib cache in the end.
- Also changed ``visualize_graph_from_mol`` and made some optimizations there. It no longer relies on the creation of intermediate files or a temp dir, which shaved off a few ms of computational time.
- Added the new module ``graph.py`` which will contain all GraphDict-related utility functions in the future
- Added a function to copy graph dicts
- Added a function to create node adjacency matrices for graph dicts
- Added a function to add graph edges
- Added a function to remove graph edges
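Since the concrete function names are not listed here, the following sketch uses made-up names purely to indicate the intended kind of usage (check ``graph.py`` for the actual functions):

.. code-block:: python

    import numpy as np
    # All function names below are placeholders, not the real API.
    from visual_graph_datasets import graph as vgd_graph

    graph_dict = {
        'node_indices': np.array([0, 1, 2]),
        'node_attributes': np.array([[1.0], [0.0], [0.0]]),
        'edge_indices': np.array([[0, 1], [1, 0]]),
        'edge_attributes': np.array([[1.0], [1.0]]),
    }

    copied = vgd_graph.copy_graph_dict(graph_dict)           # placeholder name
    adjacency = vgd_graph.node_adjacency_matrix(graph_dict)  # placeholder name
    vgd_graph.add_edge(graph_dict, 0, 2)                     # placeholder name
    vgd_graph.remove_edge(graph_dict, 0, 2)                  # placeholder name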
- Fixed a bug where ``ColorProcessing.create`` would not save the name or the domain representation
- Fixed a bug where the COGILES decoding procedure produced graph dicts with "edge_attributes" arrays of the incorrect data type and shape.
- Fixed a bug where the CogilesEncoder duplicated edges in some very weird edge cases!
- Added the experiment ``profile_molecule_processing.py`` to profile and plot the runtime of the different process components that create a visual graph dataset element, with the aim of identifying the source of the runtime degradation bug.
- Fixed the runtime degradation / memory leak issue in ``generate_molecule_dataset_from_csv.py``. It seems like the problem actually wasn't in the code but in the matplotlib backend! The problem clearly occurs when using the ``TkAgg`` backend but does not appear when using the ``Agg`` backend.
- Modified the generation of the QM9 dataset in ``generate_molecule_dataset_from_csv__qm9.py``
- Added the new experiment file ``generate_molecule_dataset_from_csv__qm9sub.py`` which generates the QM9 sub dataset, a smaller subset of QM9 with only 22k elements and 9 target columns.
- Added the new experiment ``generate_molecule_dataset_from_csv__aggregators_binary_protonated`` which processes the larger version of the aggregators dataset where each individual molecule is replaced by all its protonated variants
- Added the new background flavor of visualizing the attributional graph masks. In this method, a filled light green circle will be painted behind the nodes of the graph.
- Slightly modified the ``ensure_dataset`` function
- Updated the readme file
- Updated the documentation of the standard sub experiments for ``generate_molecule_dataset_from_csv.py``
- Added a utility function to count how often a subgraph motif appears in a larger graph
- Added experiment ``analyze_color_graph_dataset.py`` to analyze the properties of color graph based datasets
- Fixed a minor issue where the datasets folder was not created during the ``config`` initialization, which led to errors when trying to download a dataset.
- Added back in the dictionaries defining the alternative versions for the node and edge importance plotting
- Added some more graph utility functions such as functions to extract sub graphs, add and remove nodes and to identify connected regions of a graph.
- Added documentation for the ``ColorProcessing`` class
- Changed the ``ColorProcessing.visualize_as_figure`` method to now also accept an external graph dict parameter and an external node_positions array.
- Modified the ``generate_molecule_dataset_from_csv.py`` experiment so that it is now possible to optionally define an index blacklist of elements that should be skipped during processing.
- Moved the dependencies to the most recent version of RDKit. This seems to have fixed the issue of the molecule image generation occasionally crashing with a segmentation fault.
- Added the ``generic`` graph type. This is a graph type that can be used to represent any kind of graph that cannot be associated with a specific domain. Added the ``GenericProcessing`` class which can be used to process these generic graphs.
- Modified the ``colors_layout`` function such that it is possible to pass a partially defined list of node_positions as an argument, so that the positions of some nodes can be fixed during the layouting.
- The "load" method of the Config instance now returns the instance itself, which is just a small quality-of-life improvement for the scripts that have to use the config instance.
- Added some additional documentation for the Processing classes
- Added the function ``create_combined_importances_pdf`` to generate a visualization PDF that does not visualize the explanations as separate figures, but renders all the explanation channels into the same figure, encoding the different channels as different colors.
- Changed the version requirement to be compatible with newer python versions
- Fixed the dependency error where utils imported from ``graph_attention_student`` caused a circular import error
- Extended the ``ColorProcessing`` class to also support 3D graph structures.
- Added an experiment module to process the COMPAS dataset of polybenzenes molecular property predictions
Modified ``pyproject.toml``
- The command line interface is now installed as the "vgd" command
- Moved from using ``click`` for the command line interface to using ``rich-click``, which is a fork of ``click`` that adds rich text support to the command line interface
- Added the ``get_num_node_attributes`` and ``get_num_edge_attributes`` functions to the ``Processing`` base interface.
- Fixed the COGILES encoder. There was a bug in the cogiles encoder class which resulted in edges being duplicated in some cases. This is fixed now.
- There is a test case now for the COGILES encoder which tests the encoder and decoder for a large number of randomly generated graphs to check if there are any other edge cases where the encoding or decoding fails.
- Added the ``node_atoms`` and ``edge_bonds`` properties to the ``MoleculeProcessing`` class when returning the graph dict representation of a molecule. These properties return the atom and bond types respectively as human-readable strings.
- Added an additional ``encoder`` class attribute to the ``ColorProcessing`` class which can be used to encode the (r,g,b) color values of the nodes and edges into a human-readable string representation.
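A short sketch of what this could look like when processing a molecule (only the two dict keys come from this entry; the ``process`` entry point is an assumed name):

.. code-block:: python

    from visual_graph_datasets.processing.molecules import MoleculeProcessing

    processing = MoleculeProcessing()
    graph = processing.process('CCO')      # ethanol; "process" is an assumed method name

    # Human-readable atom / bond type strings per this entry (the exact string
    # format is not specified here).
    print(graph['node_atoms'])
    print(graph['edge_bonds'])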