-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Named graphs architecture #404
Conversation
2a9e327
to
cc28b11
Compare
cc28b11
to
86f8056
Compare
also added a lowercase bugfix
…o named-graphs
a4be617
to
d918fb1
Compare
* Rename files and variables Graph6 -> DiSparse6 since this is @reiniscirpons' format for the sporadic graphs * Move all new functions to digraph.gi * Delete tst/standard/named.tst and move tests to digraph.tst
…o named-graphs
…o named-graphs
Update on this PRThe implementation is now pretty much ready to go, and thanks to Group 2's work (especially @reiniscirpons' filtering of the raw graph data yesterday) we now also have the named digraphs databases stored in binaries in Still to do:
For anyone reviewing: the main things you can do with this new feature are:
Both of these have entries in the documentation in section 3.1. There is also more info about the pickling/unpickling, and how to add new named digraphs for future contributors, on this wiki page. |
(A string is a list so no need to declare anything new)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a bit of a look, and this seems good and reasonable, BUT I do have a rather large worry...
Question
What is the motivation for the database to be stored in a pickled, compressed file?
My thinking...
My (perhaps naive) view is that this technique is used to save on storage space. But the compressed files are only 94 KB and 8 KB respectively, so surely the unpicked uncompressed files wouldn't be that big in the grand scheme of things?
For comparison, I just downloaded digraphs-1.4.0
; after uncompressing it for installation, it was about 12 MB, and then after compiling the package and its manual, the digraphs-1.4.0
took up about 20 MB on my computer. So even if the databases would be as big as 1 or 2 MB, I don't think we'd need to worry about the size.
On the other hand, having everything be pickled and compressed comes with several downsides, as opposed to just having them in plain text files with GAP code that is read in with Read
.
- It becomes much harder to edit the database. That's why you've had to add functions for adding (and overwriting) entries to the database, although, I note that (as far as you can tell) you haven't added anything to remove them, yet. You've also had to write a Wiki page to help explain some of these steps. It's an extra layer of friction, that's irrelevant with plain text files.
- It's harder to see what is in the database(s). You have to read the file into GAP and interact with it in that way, you can't just load up the database file in a text editor and have a poke around.
- Git can't handle gzipped pickled files well. It will handle them as binary files. That means that every time the database is modified, it becomes an essentially brand new file to
git
, and the storage size of the repository increases, and there are no nice diffs. - It will be hard to see, in future PRs, exactly what has been added to the database. This is because, again, about
git
not being able to see into the database files and give us a diff. For example, I can't actually see what has been added to the database in this PR in GitHub. I have to actually check out the PR, and have a play around in GAP, and try to remember what was different from the previous version.
So, unless there's a compelling reason that I'm missing, I think that at the expense of a bit more storage (which I think we're able to pay), there are significant benefits to using plain text files for the databases, as detailed above.
What does anyone/everyone think?
Thanks for the review @wilfwilson. Your suggested edits are all very reasonable and I will sort them ASAP. As for your comments on compression, I think you make a very good point. The original motivation for gzipping, I believe, was that we expected the database to contain thousands, rather than hundreds, of digraphs (although arguably even that size is fine as a .g file). This was the impression when we first looked at @reiniscirpons' database, but it turned out the vast majority of the entries actually filtered into the 20-odd infinite families that have been turned into separate PRs. So with this smaller database size that will remain reasonable even if it gets added to extensively, and all the other benefits of storing in a .g, I am personally very happy to remove the zipping process - depending on what others who were working on this think, @james-d-mitchell @MTWhyte ? |
Co-authored-by: Wilf Wilson <wilf@wilf-wilson.net>
6479a9b
to
d3745bb
Compare
Okay @wilfwilson that's the code review updated - thanks for your suggestions there. I have created a working version of the named digraphs project that stores the records as .g files, that get read when the "database load" function is run. While we're waiting on confirmation from some other contributors about this format change, I've tucked it away at tomcontileslie/named-graphs-no-gz if you want to take a look. The changes w.r.t this branch are:
(if you want to take a look, these modifications are all done in the one commit at the top of the log on my branch). Having gone through with this, I agree that it's much clearer for potential new contributors and it's an easier way to debug, avoid mistakes, and avoid buildup of a large .git folder if there are multiple updates of the database. Keen to hear what other contributors have to say about efficiency and database size before I merge this into the PR. |
Thanks for making those specific changes that I asked for. One big question I've just thought of: why are you using DiSparse6String? Was there a particular reason? Since all your digraphs are symmetric digraphs (and presumably always will be ? maybe it's that you don't want to make that assumption), you could use As for your other branch: I like the new non-pickled, non-compressed approach, and I'm pleased to see that each You could save some moreKB by using the global variable In the test DB, is there a particular reason you took this format:
rather than
? The second way also has the benefit of using less code (and hence storage) 🙂 |
@wilfwilson, the answer to the first question is that I don't think we want to exclude the possibility of named non-symmetric digraphs being added to the database later on. It just so happens that the sources we've used are for graphs, but named digraphs could be of interest too. So the question is more between DiSparse6 and Digraph6, and DiSparse6 was the format that @reiniscirpons provided the csv file in. I'm not sure how much work it would be to convert between one and the other, but we could move to non-sparse if preferable. I think the format needs to be consistent across the database so that we can store symmetric and non-symmetric digraphs in the same place (unless there's an easy way of distinguishing between g6/d6/s6/ds6). It's true that a long variable name adds up when repeated over multiple lines and I'm happy to pick shorter names. However, choosing the substitution I think the other option to save space, because of how memory works for records, is something like this:
etc. I guess it just depends to what extent you think the variable names should be changed. As for the comments on the test DB, no, there is no good reason and I will correct that to save space. |
This was my thinking. But I really don't mind what the things are named, it was just an idea. As for which format to use: I think it's fine to keep using Notes on this topic: I believe that you can tell the difference between the formats by looking at the first characters of the strings. @mtorpey was the mastermind behind implementing this stuff, so he will know better. For example, I think Moreover, there exist functions in Digraphs (in particular |
Also for naming consistency, change all variables, functions and filenames to "named digraphs" and "named digraphs tests".
3b2a97a
to
2bf78e2
Compare
After discussion at today's meeting I have updated the PR to store the database in .g files referred to as "named digraphs main database" and "main digraphs test database". The new DIGRAPHS global variables are:
I think that's all the comments from @wilfwilson's review addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great feature!
@wilfwilson sorry about that, not sure why your review was dismissed. I just pushed a minor change to the comments in one of the test files to bump the testing which had frozen. |
That's because of the settings that we've got applied to the Digraphs repo, which means that an approving review on a PR automatically gets dismissed when further changes are made. Don't worry, I'll approve again. Note that you can also re-run the tests by closing and reopening the PR, which takes about 2 seconds so is probably the quickest route. Also, if it's the GitHub Actions jobs that are hanging, hopefully you're able to restart them by going to the relevant page in the Actions tab of this repository, or going to the 'Checks' tab at the top of this PR, and choosing to 'Re-run jobs' on the relevant bit: |
Note to @james-d-mitchell: I've just turned off the branch protection rule for My thinking is that we we're probably careful enough already with approving and merging that we're probably not going to miss problems causing by people making further changes between approval and merge (especially as the tests are still required to pass). |
This PR is a draft of Group 1's progress.
Edit: see comment below for an update on progress. This is no longer a draft.