Named graphs architecture #404

MTWhyte · 2021-02-10T16:28:16Z

This PR is a draft of Group 1's progress.

Edit: see comment below for an update on progress. This is no longer a draft.

also added a lowercase bugfix

…o named-graphs

@reiniscirpons

* Rename files and variables Graph6 -> DiSparse6 since this is @reiniscirpons' format for the sporadic graphs * Move all new functions to digraph.gi * Delete tst/standard/named.tst and move tests to digraph.tst

…o named-graphs

tomcontileslie · 2021-03-04T13:44:18Z

Update on this PR

The implementation is now pretty much ready to go, and thanks to Group 2's work (especially @reiniscirpons' filtering of the raw graph data yesterday) we now also have the named digraphs databases stored in binaries in /data/named-ds6.p.gz and /data/named-ds6-test.p.gz. There are over 450 graphs in the database.

Still to do:

Documentation for new functions Digraph and ListNamedDigraphs
Graph-by-graph documentation, maybe (@MTWhyte?)
Potentially extend ListNamedDigraphs to search not only for prefixes but general substrings and perhaps even some kind of fuzzy search (maybe something like ListNamedDigraphs(str, n) where n is in {0,1,2} with higher values of n denoting a looser search)

For anyone reviewing: the main things you can do with this new feature are:

Load a named digraph from the database using Digraph("graphname")
Explore the digraphs currently in the database by searching through names with ListNamedDigraphs("gr"), which will find all completions of the request string

Both of these have entries in the documentation in section 3.1.

There is also more info about the pickling/unpickling, and how to add new named digraphs for future contributors, on this wiki page.

(A string is a list so no need to declare anything new)

…med-graphs

wilfwilson

I've had a bit of a look, and this seems good and reasonable, BUT I do have a rather large worry...

Question

What is the motivation for the database to be stored in a pickled, compressed file?

My thinking...

My (perhaps naive) view is that this technique is used to save on storage space. But the compressed files are only 94 KB and 8 KB respectively, so surely the unpicked uncompressed files wouldn't be that big in the grand scheme of things?

For comparison, I just downloaded digraphs-1.4.0; after uncompressing it for installation, it was about 12 MB, and then after compiling the package and its manual, the digraphs-1.4.0 took up about 20 MB on my computer. So even if the databases would be as big as 1 or 2 MB, I don't think we'd need to worry about the size.

On the other hand, having everything be pickled and compressed comes with several downsides, as opposed to just having them in plain text files with GAP code that is read in with Read.

It becomes much harder to edit the database. That's why you've had to add functions for adding (and overwriting) entries to the database, although, I note that (as far as you can tell) you haven't added anything to remove them, yet. You've also had to write a Wiki page to help explain some of these steps. It's an extra layer of friction, that's irrelevant with plain text files.
It's harder to see what is in the database(s). You have to read the file into GAP and interact with it in that way, you can't just load up the database file in a text editor and have a poke around.
Git can't handle gzipped pickled files well. It will handle them as binary files. That means that every time the database is modified, it becomes an essentially brand new file to git, and the storage size of the repository increases, and there are no nice diffs.
It will be hard to see, in future PRs, exactly what has been added to the database. This is because, again, about git not being able to see into the database files and give us a diff. For example, I can't actually see what has been added to the database in this PR in GitHub. I have to actually check out the PR, and have a play around in GAP, and try to remember what was different from the previous version.

So, unless there's a compelling reason that I'm missing, I think that at the expense of a bit more storage (which I think we're able to pay), there are significant benefits to using plain text files for the databases, as detailed above.

What does anyone/everyone think?

doc/digraph.xml

gap/digraph.gi

tomcontileslie · 2021-03-12T18:51:46Z

Thanks for the review @wilfwilson. Your suggested edits are all very reasonable and I will sort them ASAP.

As for your comments on compression, I think you make a very good point. The original motivation for gzipping, I believe, was that we expected the database to contain thousands, rather than hundreds, of digraphs (although arguably even that size is fine as a .g file). This was the impression when we first looked at @reiniscirpons' database, but it turned out the vast majority of the entries actually filtered into the 20-odd infinite families that have been turned into separate PRs. So with this smaller database size that will remain reasonable even if it gets added to extensively, and all the other benefits of storing in a .g, I am personally very happy to remove the zipping process - depending on what others who were working on this think, @james-d-mitchell @MTWhyte ?

Co-authored-by: Wilf Wilson <wilf@wilf-wilson.net>

…med-graphs

tomcontileslie · 2021-03-15T15:11:39Z

Okay @wilfwilson that's the code review updated - thanks for your suggestions there. I have created a working version of the named digraphs project that stores the records as .g files, that get read when the "database load" function is run. While we're waiting on confirmation from some other contributors about this format change, I've tucked it away at tomcontileslie/named-graphs-no-gz if you want to take a look. The changes w.r.t this branch are:

There are now global variables for the test database as well as the main database (previously I had the test files unpickle the test database directly since it is not needed by the user). This just makes the code a bit cleaner for testing.
For consistency, I have renamed all files and variable names to things reminiscent of "named digraphs main database" and "named digraphs test database". So, the record with all the ds6 strings is now at DIGRAPHS_NamedDigraphsMainDatabase, and can be loaded from the file data/named-digraphs-main-database.g by calling DIGRAPHS_LoadNamedDigraphsMainDatabase(). Exact analogues replacing "main" with "test". I thought this would lead to less confusion for future users, and it makes clear the fact that there are two analogous databases, one for the ds6 strings, and one for the test values.
The "database" files are now just plain .g files that sequentially add components to the global variables DIGRAPHS_NamedDigraphsMainDatabase and DIGRAPHS_NamedDigraphsTestDatabase when run.
I've credited everyone that I know was in Group 1 or Group 2 in the new .g database files.

(if you want to take a look, these modifications are all done in the one commit at the top of the log on my branch).

Having gone through with this, I agree that it's much clearer for potential new contributors and it's an easier way to debug, avoid mistakes, and avoid buildup of a large .git folder if there are multiple updates of the database. Keen to hear what other contributors have to say about efficiency and database size before I merge this into the PR.

wilfwilson · 2021-03-16T10:02:10Z

Thanks for making those specific changes that I asked for.

One big question I've just thought of: why are you using DiSparse6String? Was there a particular reason? Since all your digraphs are symmetric digraphs (and presumably always will be ? maybe it's that you don't want to make that assumption), you could use Sparse6String instead, which should result in significantly shorter strings (and therefore less storage).

As for your other branch: I like the new non-pickled, non-compressed approach, and I'm pleased to see that each .g file is under 200 KB.

You could save some moreKB by using the global variable DIGRAPHS_NamedDigraphs rather than DIGRAPHS_NamedDigraphsMainDatabase - I think the MainDatabase here, and in the loading function, and in the file name is not really necessary.

In the test DB, is there a particular reason you took this format:

temp := rec();
temp.("DigraphNrEdges") := 216;
temp.("DigraphDiameter") := 8;
temp.("DigraphNrVertices") := 72;
temp.("IsRegularDigraph") := true;
temp.("DigraphUndirectedGirth") := 3;
DIGRAPHS_NamedDigraphsTestDatabase.("truncatednauru") := temp;

rather than

DIGRAPHS_NamedDigraphsTestDatabase.("truncatednauru") := rec(
  DigraphNrEdges := 216,
  DigraphDiameter := 8,
  (...etc...)
);

? The second way also has the benefit of using less code (and hence storage) 🙂

tomcontileslie · 2021-03-16T11:47:56Z

@wilfwilson, the answer to the first question is that I don't think we want to exclude the possibility of named non-symmetric digraphs being added to the database later on. It just so happens that the sources we've used are for graphs, but named digraphs could be of interest too. So the question is more between DiSparse6 and Digraph6, and DiSparse6 was the format that @reiniscirpons provided the csv file in. I'm not sure how much work it would be to convert between one and the other, but we could move to non-sparse if preferable. I think the format needs to be consistent across the database so that we can store symmetric and non-symmetric digraphs in the same place (unless there's an easy way of distinguishing between g6/d6/s6/ds6).

It's true that a long variable name adds up when repeated over multiple lines and I'm happy to pick shorter names. However, choosing the substitution DIGRAPHS_NamedDigraphsMainDatabase -> DIGRAPHS_NamedDigraphs removes the main <-> test duality in filenames and variables that I hoped would make things a bit more clear - maybe I'm overthinking and it doesn't really matter. Arguably calling things "named digraphs" and "named digraphs test" is more user-friendly than "named digraphs main" and "named digraphs test", and this is just semantics. But given they're two records with the same component names I thought it would make more sense for developers to see them as two versions of the same database.

I think the other option to save space, because of how memory works for records, is something like this:

_d := DIGRAPHS_NamedDigraphsMainDatabase;
_d.("diamond") := ".Cg@_Qq_E@J";
_d.("fork") := ".DkgC|kgC~";

etc. I guess it just depends to what extent you think the variable names should be changed.

As for the comments on the test DB, no, there is no good reason and I will correct that to save space.

wilfwilson · 2021-03-17T10:39:01Z

Arguably calling things "named digraphs" and "named digraphs test" is more user-friendly than "named digraphs main" and "named digraphs test", and this is just semantics.

This was my thinking. But I really don't mind what the things are named, it was just an idea.

As for which format to use: I think it's fine to keep using DiSparse6 for now, I think it's going to be the best general-purpose choice. But making this more flexible could be a nice future improvement for someone who is interested.

Notes on this topic: I believe that you can tell the difference between the formats by looking at the first characters of the strings. @mtorpey was the mastermind behind implementing this stuff, so he will know better. For example, I think DiSparse6Strings start with a ., Sparse6Strings start with a :, Digraph6Strings start with an &, and Graph6Strings don't.

Moreover, there exist functions in Digraphs (in particular String(D) for a digraph D, see also WriteDigraphs) that choose the most appropriate encoder to use for a given digraph. This function, or at least the logic behind it, could be helpful when adding new entries to the database, and choosing which format to use.

Also for naming consistency, change all variables, functions and filenames to "named digraphs" and "named digraphs tests".

tomcontileslie · 2021-03-17T16:43:03Z

After discussion at today's meeting I have updated the PR to store the database in .g files referred to as "named digraphs main database" and "main digraphs test database". The new DIGRAPHS global variables are:

DIGRAPHS_LoadNamedDigraphs()         # Generates the variable below
DIGRAPHS_NamedDigraphs               # Record of assignments name -> ds6 string
DIGRAPHS_LoadNamedDigraphsTests()    # Generates the variable below
DIGRAPHS_NamedDigraphsTests          # Record of assignments name -> test properties

I think that's all the comments from @wilfwilson's review addressed.

…med-graphs

wilfwilson

Great feature!

…med-graphs

tomcontileslie · 2021-03-18T10:45:45Z

@wilfwilson sorry about that, not sure why your review was dismissed. I just pushed a minor change to the comments in one of the test files to bump the testing which had frozen.

wilfwilson · 2021-03-18T11:14:41Z

That's because of the settings that we've got applied to the Digraphs repo, which means that an approving review on a PR automatically gets dismissed when further changes are made. Don't worry, I'll approve again.

Note that you can also re-run the tests by closing and reopening the PR, which takes about 2 seconds so is probably the quickest route. Also, if it's the GitHub Actions jobs that are hanging, hopefully you're able to restart them by going to the relevant page in the Actions tab of this repository, or going to the 'Checks' tab at the top of this PR, and choosing to 'Re-run jobs' on the relevant bit:

wilfwilson · 2021-03-18T11:22:05Z

Note to @james-d-mitchell: I've just turned off the branch protection rule for master which dismissed PR approvals when further changes are made to a PR. This was setting was already turned off for the stable branches, so I assume it wasn't a deeply considered policy choice, but let me know if you're unhappy.

My thinking is that we we're probably careful enough already with approving and merging that we're probably not going to miss problems causing by people making further changes between approval and merge (especially as the tests are still required to pass).

MTWhyte added the WIP Label of PRs that are a Work In Progress (WIP) label Feb 10, 2021

MTWhyte force-pushed the named-graphs branch from 2a9e327 to cc28b11 Compare February 17, 2021 15:31

Intermediate progress

86f8056

MTWhyte force-pushed the named-graphs branch from cc28b11 to 86f8056 Compare February 17, 2021 15:37

tomcontileslie added 6 commits February 17, 2021 16:57

Named Graphs: change tst file copyright date

45b65f5

Outsource unpickler to DIGRAPHS_LoadNamedGraph6Strings

366341a

I forgot to declare local variables

b6cea6c

also added a lowercase bugfix

Named Digraphs: Implement ListNamedDigraphs as search function

ed0249b

Merge branch 'master' of https://github.com/gap-packages/Digraphs int…

1f80e09

…o named-graphs

Named Graphs: lint

d918fb1

tomcontileslie force-pushed the named-graphs branch from a4be617 to d918fb1 Compare February 24, 2021 14:42

tomcontileslie mentioned this pull request Feb 24, 2021

Implement further "named" graphs #383

Open

8 tasks

tomcontileslie added 7 commits February 24, 2021 16:10

Named Graphs: choose better name for test database

69a0b81

Named Graphs: add standard test and change error message

99df611

Named Graphs: add placeholder databases

faefc69

More standard tests and ListNamed bugfix

2bcbac2

Create final database

a4c5836

* Rename files and variables Graph6 -> DiSparse6 since this is @reiniscirpons' format for the sporadic graphs * Move all new functions to digraph.gi * Delete tst/standard/named.tst and move tests to digraph.tst

Merge branch 'master' of https://github.com/gap-packages/Digraphs int…

ae742c2

…o named-graphs

Merge branch 'master' of https://github.com/gap-packages/Digraphs int…

8ddf7ec

…o named-graphs

digraphs deleted a comment from codecov bot Mar 10, 2021

wilfwilson self-requested a review March 10, 2021 20:01

tomcontileslie added 7 commits March 10, 2021 23:13

Remove some DeclareOperation lines

9c64a11

(A string is a list so no need to declare anything new)

Named Digraphs: add flexibility level to ListNamedDigraphs

f9d3603

Named Graphs: add test

e82453b

Merge branch 'master' of https://github.com/digraphs/Digraphs into na…

ec54531

…med-graphs

Named Digraphs: update error message and improve search speed

3a519e5

Named Digraphs: documentation

15a4d70

Merge branch 'master' of https://github.com/digraphs/Digraphs into na…

a334dae

…med-graphs

tomcontileslie removed the WIP Label of PRs that are a Work In Progress (WIP) label Mar 11, 2021

tomcontileslie added the new-feature A label for new features. label Mar 11, 2021

Merge branch 'master' into named-graphs

47e5c95

wilfwilson removed their request for review March 11, 2021 21:59

wilfwilson reviewed Mar 12, 2021

View reviewed changes

doc/digraph.xml Outdated Show resolved Hide resolved

gap/digraph.gi Outdated Show resolved Hide resolved

gap/digraph.gi Outdated Show resolved Hide resolved

gap/digraph.gi Outdated Show resolved Hide resolved

gap/digraph.gi Outdated Show resolved Hide resolved

Apply suggestions from code review

d3745bb

Co-authored-by: Wilf Wilson <wilf@wilf-wilson.net>

tomcontileslie force-pushed the named-graphs branch from 6479a9b to d3745bb Compare March 15, 2021 13:38

Merge branch 'master' of https://github.com/digraphs/Digraphs into na…

936d4df

…med-graphs

Named Digraphs: store records in uncompressed .g files

2bf78e2

Also for naming consistency, change all variables, functions and filenames to "named digraphs" and "named digraphs tests".

tomcontileslie force-pushed the named-graphs branch from 3b2a97a to 2bf78e2 Compare March 17, 2021 16:39

Merge branch 'master' of https://github.com/digraphs/Digraphs into na…

17c1b4f

…med-graphs

wilfwilson previously approved these changes Mar 17, 2021

View reviewed changes

tomcontileslie added 2 commits March 17, 2021 23:22

Merge branch 'master' of https://github.com/digraphs/Digraphs into na…

d70b4df

…med-graphs

Named Digraphs: rename graph -> digraph in comments

ef45cb9

tomcontileslie dismissed wilfwilson’s stale review via ef45cb9 March 18, 2021 10:44

wilfwilson approved these changes Mar 18, 2021

View reviewed changes

Merge branch 'master' into named-graphs

b69c144

wilfwilson merged commit 3026304 into digraphs:master Mar 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Named graphs architecture #404

Named graphs architecture #404

MTWhyte commented Feb 10, 2021 •

edited by tomcontileslie

Loading

tomcontileslie commented Mar 4, 2021 •

edited

Loading

wilfwilson left a comment

tomcontileslie commented Mar 12, 2021

tomcontileslie commented Mar 15, 2021 •

edited

Loading

wilfwilson commented Mar 16, 2021

tomcontileslie commented Mar 16, 2021

wilfwilson commented Mar 17, 2021

tomcontileslie commented Mar 17, 2021

wilfwilson left a comment

tomcontileslie commented Mar 18, 2021

wilfwilson commented Mar 18, 2021 •

edited

Loading

wilfwilson commented Mar 18, 2021 •

edited

Loading

Named graphs architecture #404

Named graphs architecture #404

Conversation

MTWhyte commented Feb 10, 2021 • edited by tomcontileslie Loading

tomcontileslie commented Mar 4, 2021 • edited Loading

Update on this PR

wilfwilson left a comment

Choose a reason for hiding this comment

Question

My thinking...

tomcontileslie commented Mar 12, 2021

tomcontileslie commented Mar 15, 2021 • edited Loading

wilfwilson commented Mar 16, 2021

tomcontileslie commented Mar 16, 2021

wilfwilson commented Mar 17, 2021

tomcontileslie commented Mar 17, 2021

wilfwilson left a comment

Choose a reason for hiding this comment

tomcontileslie commented Mar 18, 2021

wilfwilson commented Mar 18, 2021 • edited Loading

wilfwilson commented Mar 18, 2021 • edited Loading

MTWhyte commented Feb 10, 2021 •

edited by tomcontileslie

Loading

tomcontileslie commented Mar 4, 2021 •

edited

Loading

tomcontileslie commented Mar 15, 2021 •

edited

Loading

wilfwilson commented Mar 18, 2021 •

edited

Loading

wilfwilson commented Mar 18, 2021 •

edited

Loading