local scanned images #10

Closed
funderburkjim opened this issue Oct 23, 2019 · 32 comments

@funderburkjim
Contributor

This issue deals with an enhancement to the local dictionary installation process (as described in the readme.md at csl-pywork/v02).
The feature concerns installing local copies of the scanned images for each dictionary; it was mentioned in the comments of #6.

@funderburkjim
Contributor Author

size estimations

Since the scanned images take a lot of disk space, let's get some statistics on the current actual disk space at Cologne devoted to scanned images.

Here is a listing of the 34 publicly available dictionaries, along with the space taken up by the scanned images used in the displays, and the number of image files.

dict     size  files
acc     110MB   1216
ae      146MB    518 
ap90    334MB   1211 
ben      66MB   1127 
bhs      73MB    634 
bop      29MB    421 
bor      67MB    808 
bur      94MB    394 
cae      89MB    677 
ccs     141MB    541 
gra     110MB    893 
gst      39MB    334 
ieg      22MB    580 
inm     161MB    852 
krm     536MB   1489 
mci     399MB   1024 
md      141MB    395 
mw      488MB   1370 
mw72    184MB   1212 
mwe     331MB    860 
pe       52MB    929 
pgn      19MB    420 
pui      83MB   2232 
pw      612MB   2141 
pwg    1546MB   4737 
sch      62MB    406 
shs     602MB    842 
skd     503MB   3164 
snp      29MB    135 
stc      90MB    904 
vcp     543MB   5447 
vei      73MB   1155 
wil     336MB    988 
yat      91MB    928 
TOT    8217MB  40984

So, all in all there is about 8.2GB of space used and about 41,000 individual scanned images.

Notes

  • This listing was made by program size_pdfpages.py in scans/awork/misc/misc folder on Cologne server.
  • The Cologne directories were taken from the _cologne_pdfpages_url method in
    csl-websanlexicon/v02/makotemplates/web/webtc/dictinfo.php
  • For each dictionary's directory, only the files with an image suffix ('.pdf','jpg','png') were considered.
  • The size of each image was obtained via the Python standard library function os.path.getsize.
  • Another method, the bash command du -s <directory-name>, was also used as a check. The two
    size estimates were the same or almost the same for each dictionary.
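For illustration, the counting procedure in the notes above could be sketched roughly as follows. This is not the actual size_pdfpages.py; the function names, the dotted suffix list, and the choice of 1MB = 1,000,000 bytes are assumptions made for the example.

```python
import os

# Suffixes treated as scanned images. The leading dots are an assumption.
IMAGE_SUFFIXES = (".pdf", ".jpg", ".png")


def image_stats(directory):
    """Return (total_bytes, file_count) for image files directly in `directory`."""
    total_bytes = 0
    file_count = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(IMAGE_SUFFIXES):
            total_bytes += os.path.getsize(path)
            file_count += 1
    return total_bytes, file_count


def format_row(dictcode, total_bytes, file_count):
    """Format one row in the style of the listing (size in whole MB)."""
    megabytes = round(total_bytes / 1e6)  # assumes 1MB = 1,000,000 bytes
    return f"{dictcode:<6}{megabytes:>5}MB {file_count:>6}"
```

Calling image_stats on each dictionary's image directory and passing the result to format_row would produce one line per dictionary, which can then be compared against du -s.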

@funderburkjim
Contributor Author

Github is a viable location for keeping the images

Github has this to say about repository size limits (reference):

File and repository size limitations
We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down.
In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."

None of the image files exceeds 100MB; the average size is 0.2MB per image (8217MB / 40984 files).
So we're ok per file.

Based on the 'hard limit', all of the images could be kept in one repository (8.2GB < 100GB).
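The arithmetic behind these two statements can be checked with a few lines of Python. The totals come from the listing above; treating 1GB as 1000MB is an assumption of this sketch. Note that the average only summarizes the files; the per-file claim rests on the separate observation that no single image exceeds 100MB.

```python
TOTAL_MB = 8217       # total size of all scanned images, from the listing
TOTAL_FILES = 40984   # total number of image files, from the listing
REPO_HARD_LIMIT_MB = 100 * 1000  # 100GB hard limit, assuming 1GB = 1000MB

# Average size per image, in MB.
average_mb = TOTAL_MB / TOTAL_FILES
# Would a single repository holding everything stay under the hard limit?
fits_in_one_repo = TOTAL_MB < REPO_HARD_LIMIT_MB

print(f"average image size: {average_mb:.1f}MB")
print(f"fits under the hard limit: {fits_in_one_repo}")
```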

@funderburkjim
Contributor Author

Proposed 34 repository solution

I think there should be one repository for the images for each dictionary. This would allow
user flexibility in installation; each user could clone the repositories of just those dictionaries of
interest. [Some suggested installation instructions will be provided in comments below.]

If the images for each dictionary were kept in a separate repository, there would be 34 new repositories, and all but one (pwg) would be smaller than the 1GB repository size that Github recommends.
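A quick way to confirm which dictionaries would exceed the recommended size is to filter the listing from the size estimation comment. The sizes below are copied from that listing; the variable names and the 1000MB threshold (1GB taken as 1000MB) are assumptions of this sketch.

```python
# Image sizes in MB per dictionary, copied from the size listing above.
SIZES_MB = {
    "acc": 110, "ae": 146, "ap90": 334, "ben": 66, "bhs": 73, "bop": 29,
    "bor": 67, "bur": 94, "cae": 89, "ccs": 141, "gra": 110, "gst": 39,
    "ieg": 22, "inm": 161, "krm": 536, "mci": 399, "md": 141, "mw": 488,
    "mw72": 184, "mwe": 331, "pe": 52, "pgn": 19, "pui": 83, "pw": 612,
    "pwg": 1546, "sch": 62, "shs": 602, "skd": 503, "snp": 29, "stc": 90,
    "vcp": 543, "vei": 73, "wil": 336, "yat": 91,
}

RECOMMENDED_MAX_MB = 1000  # Github's recommended 1GB, assuming 1GB = 1000MB

# Dictionaries whose images alone would exceed the recommended size.
over_limit = sorted(d for d, mb in SIZES_MB.items() if mb > RECOMMENDED_MAX_MB)
print(len(SIZES_MB), "dictionaries; over the recommended size:", over_limit)
```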

Proposed repository naming convention

The repository names could be 'scans-xxx', where 'xxx' is one of the 34 (lower-case) dictionary
abbreviations; so 'scans-acc', 'scans-mw', 'scans-ap90', etc.

Proposed github project to contain the scan repositories

It might be simplest to add the 34 new repositories to the sanskrit-lexicon Github organization.
Currently there are 36 repositories in sanskrit-lexicon, so there would be 70 repositories after adding the 34 new ones.
Since Github imposes no limits on the number of repositories per organization (ref: About repositories), there would be no problem in having 70 or more repositories under sanskrit-lexicon.

The naming convention 'scans-xxx' would allow easy filtering of the image repositories from among all the sanskrit-lexicon repositories.
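As a small illustration of that filtering, a common prefix lets the image repositories be separated from the rest with a simple string test. The helper names and the sample repository list are made up for this example.

```python
SCAN_PREFIX = "scans-"


def scan_repo_name(dictcode):
    """Build the proposed repository name for a dictionary abbreviation."""
    return SCAN_PREFIX + dictcode.lower()


def split_scan_repos(repo_names):
    """Split repository names into (image repositories, all others)."""
    scans = [name for name in repo_names if name.startswith(SCAN_PREFIX)]
    others = [name for name in repo_names if not name.startswith(SCAN_PREFIX)]
    return scans, others


# Hypothetical sample of repository names in one organization.
sample = ["csl-pywork", scan_repo_name("acc"), "csl-orig", scan_repo_name("MW")]
scans, others = split_scan_repos(sample)
```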

@funderburkjim
Contributor Author

request feedback

Please provide feedback regarding the above suggestion!

In the meantime, I'll set up procedures along the lines indicated above, using one or two dictionaries for the prototyping.

@drdhaval2785
Collaborator

The images are not going to change frequently.

So instead of one repository per dictionary,
I propose only one repository for all dictionaries. We can keep one archive file (.zip / .tar / .tar.gz) per dictionary in the repository, such as acc.zip, ae.zip, etc.

The installation instructions could give a prompt: "Do you want to download dictionary page images for local use? It will take roughly XXX MB of download and YYY MB of disk space."

If the user says no, we don't download the images. If the user says yes, we download them.
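A prompt of this kind might look like the sketch below. The function name, the [y/N] convention, and the injectable `ask` parameter (which makes the function testable without a terminal) are assumptions; no such script exists in the repositories discussed here as far as this thread shows.

```python
def confirm_image_download(download_mb, disk_mb, ask=input):
    """Ask whether to download page images; return True only on an explicit yes."""
    question = (
        "Do you want to download dictionary page images for local use? "
        f"It will take roughly {download_mb} MB of download "
        f"and {disk_mb} MB of disk space. [y/N] "
    )
    answer = ask(question).strip().lower()
    # Anything other than an explicit yes is treated as "no".
    return answer in ("y", "yes")
```

An installer could call confirm_image_download(110, 110) before fetching the acc images and skip the download when it returns False.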

@funderburkjim
Contributor Author

I agree images will almost never change.

My experience with zip is that images compress very little.

If there is one repository for all dictionaries, then cloning that repository will require a user to download
8GB. 8GB is a lot! It's roughly equivalent to 4 copies of Windows 10 or Mac OS X. The download would take several hours on lower-bandwidth connections.

If the user only wants the images of, say, MW dictionary, then he would have to download an additional 7.5GB of unneeded stuff just to get the 500MB of images that he wants.

If the user actually wants the images for all dictionaries, it will still take a long time -- about the same amount of time/space whether the images are in one repository or 34 repositories.
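To put rough numbers on the download times, here is a back-of-the-envelope estimate. The 5 megabit/s connection speed is an assumed example of a lower-bandwidth link, and the calculation ignores protocol overhead; the sizes are the total and the mw figure from the listing above.

```python
def download_hours(size_mb, megabits_per_second):
    """Estimate download time in hours, taking 1 MB as 8 megabits."""
    seconds = size_mb * 8 / megabits_per_second
    return seconds / 3600


ASSUMED_SPEED = 5  # megabits per second; a hypothetical slow connection

all_hours = download_hours(8217, ASSUMED_SPEED)  # every dictionary
mw_hours = download_hours(488, ASSUMED_SPEED)    # only the mw images
print(f"all dictionaries: {all_hours:.1f} h, mw only: {mw_hours * 60:.0f} min")
```

Under these assumptions the full set takes hours while a single dictionary takes minutes, which is the gap that separate repositories would let a user avoid.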

What are the downsides of separate repositories?

@drdhaval2785
Collaborator

There is no downside to separate repositories, except that there would be too many repositories.
If we are OK with it, we can keep the images in separate repos.

@funderburkjim
Contributor Author

sanskrit-lexicon-scans organization

To deal with the 'too many repositories' issue, we could put all the image repositories in another Github organization. As an experiment to this end, I've made a 'sanskrit-lexicon-scans' organization.
Currently it is owned by me (@funderburkjim). (How can ownership be transferred or shared?)
@drdhaval2785 , @gasyoun , and @YevgenJohn have been invited to be on the 'team' of the new organization.

I am currently working to automate the process of initializing the sanskrit-lexicon-scans/xxx repositories.

@funderburkjim
Contributor Author

sanskrit-lexicon-scans/acc

This repository now exists, and is populated with the images.

  • There is a README.md file and a LICENSE file
    • Request feedback on the wording of the readme, and the choice of license.
  • The images are in the 'pdfpages' directory.
  • It has only one branch, gh-pages
    • This means that the images could be served from here in a web application. For example,
      page 1 of acc

Also request feedback on the choice of sanskrit-lexicon-scans organization

sanskrit-lexicon-scans/ae

Also populated.

Next steps

  • create scripts (in csl-pywork ?) for user downloads of scans
  • put the local images for dictionary xxx in cologne/scans/xxx/pdfpages
  • modify code in csl-websanlexicon so local web-app displays will know where images reside locally
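One possible shape for the download step is sketched below: clone the gh-pages branch of sanskrit-lexicon-scans/xxx into cologne/scans/xxx, so that the repository's pdfpages directory ends up at cologne/scans/xxx/pdfpages. The function names, the https clone URL form, and the use of a shallow clone are assumptions of this sketch, not an existing csl-pywork script.

```python
import os
import subprocess

# Assumed clone URL prefix for the sanskrit-lexicon-scans organization.
SCANS_ORG_URL = "https://github.com/sanskrit-lexicon-scans"


def clone_command(dictcode, cologne_dir):
    """Build the git command that would fetch the scans for one dictionary."""
    target = os.path.join(cologne_dir, "scans", dictcode)
    url = f"{SCANS_ORG_URL}/{dictcode}.git"
    # --depth 1 skips history, which is of little use for static images.
    return ["git", "clone", "--depth", "1", "--branch", "gh-pages", url, target]


def download_scans(dictcode, cologne_dir):
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(clone_command(dictcode, cologne_dir), check=True)
```

For example, download_scans("acc", "cologne") would run the clone for acc; it needs git and network access, so it is kept separate from the command builder.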

@YevgenJohn
Contributor

Fantastic!! Please let me look through it, and I will see whether I can do some of the listed next steps. I might try csl-websanlexicon as well, to see if I understand it well enough to make working changes.
This is really important for a local VM to be self-sufficient when it works offline or when the main server is in a disaster-recovery (DR) situation, so that a user can still refer to the scanned pages to make sure the digitized version is in sync with them.

@funderburkjim
Contributor Author

csl-websanlexicon modified

The change is very brief. Just in dictinfo.php.

Here's how to see the change in action.

I'm assuming you already have a local machine or a server set up and populated with
the acc or ae dictionary installed (these are currently the only ones with scans on Github).
So you have 'cologne/acc', 'cologne/ae', 'cologne/csl-websanlexicon', 'cologne/csl-pywork',
'cologne/csl-orig'.

Before updating to local images

update local csl-websanlexicon

  • change to local version, (cd ... csl-websanlexicon)
  • git pull origin master

regenerate cologne/xxx/web

Install the new code at least for xxx=acc and ae.

  • change directory to csl-websanlexicon/v02
  • if you have local copies of all dictionaries:
    • sh redo_xampp_all.sh

set up for local scanned images

  • change to cologne directory
  • mkdir scans

get local images for acc and ae

test that local images are being used for acc, ae

Same steps as under 'Before updating to local images' above. But now, for example,
src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf"
which proves you are using local scanned images.

If you were to use your local copy of mw, it would still show images from cologne, since
there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).
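The behaviour described here (use local images when scans/xxx/pdfpages exists, otherwise fall back to Cologne) can be illustrated in Python as below. The real change is in the PHP file dictinfo.php and may be structured differently; the function name and all parameters are hypothetical.

```python
import os


def scan_base_url(dictcode, cologne_dir, local_base_url, cologne_base_url):
    """Pick the base URL for a dictionary's scanned pages.

    Local images are used only if <cologne_dir>/scans/<dictcode>/pdfpages
    exists on disk; otherwise the Cologne URL is returned unchanged.
    """
    local_dir = os.path.join(cologne_dir, "scans", dictcode, "pdfpages")
    if os.path.isdir(local_dir):
        return f"{local_base_url}/scans/{dictcode}/pdfpages"
    return cologne_base_url
```

With local_base_url set to "http://localhost/cologne", an installed ae would resolve to http://localhost/cologne/scans/ae/pdfpages, while mw would keep its Cologne URL until its images are installed.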

@funderburkjim
Contributor Author

csl-apidev needs similar modification

csl-apidev is another piece that can be run locally. We haven't discussed it yet.
In order for local installations to use local images, it needs a modification similar to that of csl-websanlexicon.
I'll open an issue related to this.

@YevgenJohn
Contributor

test that local images are being used for acc, ae

Same steps as under 'Before updating to local images' above. But now, for example,
src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf"
which proves you are using local scanned images.

Great! It works on my local VM:
<embed id="plugin" type="application/x-google-chrome-pdf" src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" stream-url="chrome-

If you were to use your local copy of mw, it would still show images from cologne, since
there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).

I searched for 'karma' in MW, it points to:
<embed id="plugin" type="application/x-google-chrome-pdf" src="http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/MWScanpdf/mw0258-kartRguptaka.pdf" stream-url="chrome-

Do we plan to have them all in Git so they can be pulled locally? Thank you!

@funderburkjim
Contributor Author

Do we plan to have them all in Git ?

Yes. I wanted to get some feedback on the wording of the readme and the choice of license before
installing github repositories in sanskrit-lexicon-scans for all the dictionaries.

@drdhaval2785
Collaborator

@funderburkjim

Regarding licence, I prefer GPLv3.

@drdhaval2785
Collaborator

Readme should give installation instructions for local images.

@gasyoun
Member

gasyoun commented Oct 29, 2019

@funderburkjim

Regarding licence, I prefer GPLv3.

Makes sense to me.

@funderburkjim
Contributor Author

comparison between gplv3 and cc-by-sa.

There are several comparisons between these two licenses.

From this comparison,

GPL v3 and BY-SA 4.0 are similar licenses with similar aims. But because GPLv3 was written specifically for licensing software, it does have some differences from BY-SA ...

The main reasons I suggested the CC-BY-SA license for these scanned image repositories:

  • The content is not software, but data. CC-BY-SA seems more commonly used for data. For example, CC-BY-SA is said to be commonly used for Wikisource (ref).
  • The license for the Cologne digitizations is CC-NC-BY-SA 3.0. See the license for MW as an example.
    • Note. NC=non-commercial, is used in the digitization licenses
    • I dropped the 'NC' clause for these scan repositories on purpose, after experimenting with the Creative Commons License Picker. Answering 'no' to 'Allow commercial uses of your work?' , changes 'This is a Free Culture License' to 'This is not a Free Culture License'. I thought it sounded friendly to be a Free Culture License, so dropped the NC.

Given the above, I still have a slight preference for CC-BY-SA license for these repositories. Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license; if we add a license, GPLv3 might be a good choice. Another option would be MIT license.

@drdhaval2785 and @gasyoun : In light of these comments, do you have any further thoughts on the choice of license? Do you have a strong preference for the GPLv3 license for these scanned image repositories?

@drdhaval2785
Collaborator

Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon.

@funderburkjim
Contributor Author

Peter's suggestion

I asked Thomas Malten and Peter Scharf for their opinions regarding the license.

Thomas is fine with CC BY-SA.

Peter prefers CC BY-NC-SA. His reason:

...the scanning work was included under grants from the NEH and DFG, the license should include non-commercial as well. ... Otherwise, if someone collects money from the use of the images, these granting institutions may take offence if they don’t get a cut. The same is my feeling about the text and XML.

Here is a link to cc by-nc-sa

Here is an excerpt from https://wiki.creativecommons.org/wiki/NonCommercial_interpretation;
the sentence marked off by double-asterisks is part of what I think Peter has in mind.

Like all CC licenses, the NC licenses are non-exclusive. This means that an NC licensor is free to offer the material under other terms, including on commercial terms.
**A frequently discussed use case for the NC licenses is a creator who wishes to allow NonCommercial use but also authorizes commercial uses in exchange for payment.**
(Additional permissions such as this may always be offered; licensors may also use our CC+ protocol to offer these in a standardized manner.) Also, licensees are always free to contact licensors to ask permission to use the work for commercial purposes.

My own opinion is that it doesn't matter much. I'm fine to go with cc by-nc-sa.

What do others think?

@drdhaval2785
Collaborator

CC BY-NC-SA is fine with me too.

@funderburkjim
Contributor Author

funderburkjim commented Nov 1, 2019

Will proceed with the scanned image installations under CC BY-NC-SA

Thomas also concurs with NC.
I will revise the acc and ae licenses to BY-NC-SA, and then continue with the installation
of the rest of the images.

The other thing that needs to be done (@drdhaval2785 requested above) is installation instructions (i.e. how to use the scanned images in a local installation).

I'll make a 'sanskrit-lexicon-scans/documentation' repository, and add a link in each dictionary's README.md to the README.md in the documentation repository.

@funderburkjim
Contributor Author

Scanned images for all dictionaries

All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.

sanskrit-lexicon-scans/documentation/README.md exists, but is currently incomplete.

Maybe someone else could work on this README.md. If needed, I'll provide some content next week.

@gasyoun
Member

gasyoun commented Nov 4, 2019

I thought it sounded friendly to be a Free Culture License, so dropped the NC.

Exactly.

Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license

Time to add.

Do you have a strong preference for the GPLv3 license for these scanned image repositories?

No, no strong preferences. MIT is good as well.

Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon

So am I.

sanskrit-lexicon-scans/xxx

It's owned only by you, Jim, right? I ask because I'm thinking about what happens in an emergency.

Maybe someone else could work on this README.md.

@YevgenJohn give it a try?

@funderburkjim
Contributor Author

It's owned only by you, Jim, right?

I think I 'invited' @drdhaval2785 , @YevgenJohn , and you (@gasyoun ) to the 'team' for the 'sanskrit-lexicon-scans' organization. Did you receive the invitation?

Although I created the organization, my intent was to have it jointly 'owned' by all 4.

Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?

@YevgenJohn
Contributor

Scanned images for all dictionaries

All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.

Thank you very much! I'm trying to make a standalone VM with images, disconnect its network interfaces and see if links to the pictures work (as it won't be able to reach out to Cologne server).

Apologies for not contributing to the licenses discussion, as I don't know that subject well enough.
Thank you!

@gasyoun
Member

gasyoun commented Nov 6, 2019

Did you receive invitation?

view

I only saw it now, by accident. The others have joined by now.

owner

Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?

Yes, for each person you set their role to owner rather than member.

https://github.com/orgs/sanskrit-lexicon-scans/people here?

@funderburkjim
Contributor Author

How do I change @drdhaval2785 (and others) from Member to Owner?

@gasyoun
Member

gasyoun commented Nov 7, 2019

@funderburkjim
Contributor Author

make a standalone VM

@YevgenJohn Why don't you start an issue regarding this standalone VM? It would be interesting to better understand what is meant by a standalone VM, and how it would be used.

@YevgenJohn
Contributor

Absolutely, very good idea! I wonder how much space the VM image would take with all scanned pages uploaded. I just added another disk to the VM to accommodate it. My goal is to provide a ready product that linguists can plug in and use (when offline, or when they want to run heavy queries that would otherwise slow the shared server down, so we can remove the upper limit on the number of results), as asking them to run Linux commands to set it up locally seems a bit impractical to me. Thank you!
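On the disk space question, a quick check before copying the scans into a VM could look like this sketch. It uses the 8217MB total from the size listing earlier in the thread as a lower bound for the images alone; the function name and the 1MB = 1,000,000 bytes conversion are assumptions, and the rest of the VM image needs space on top of this.

```python
import shutil

ALL_SCANS_MB = 8217  # total size of all scanned images, from the listing


def has_room_for_scans(path, required_mb=ALL_SCANS_MB):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_mb * 1_000_000  # assumes 1MB = 1,000,000 bytes
```

Running has_room_for_scans on the mount point of the added disk would show whether all 34 image repositories can fit there.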

@drdhaval2785
Collaborator

Local scanned images have stabilized. Closing the issue.

