
[Epic] Importing of arXiv data to Wikibase/MediaWiki #338

Open · 12 tasks
Hyper-Node (Contributor) opened this issue Nov 29, 2022 · 0 comments

Hyper-Node commented Nov 29, 2022

Epic description in words:

Importing and parsing of arXiv metadata, full texts, and source texts.
The metadata will be imported into the MaRDI Portal knowledge graph in Wikibase and/or to MediaWiki pages.
Full texts and source texts will be parsed, and formulas will be extracted into the KG or into the MathSearch index.
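The formula-extraction step could be sketched roughly as below. This is only an illustration under assumptions: the math environments and regex patterns are examples, not the project's actual parser (a robust parser would handle nested environments, comments, and verbatim blocks).

```python
import re

# Illustrative patterns for common LaTeX display-math forms.
# This list is an assumption for the sketch, not exhaustive.
MATH_PATTERNS = [
    re.compile(r"\$\$(.+?)\$\$", re.DOTALL),                          # $$ ... $$
    re.compile(r"\\begin\{equation\}(.+?)\\end\{equation\}", re.DOTALL),
    re.compile(r"\\\[(.+?)\\\]", re.DOTALL),                          # \[ ... \]
]

def extract_formulas(tex: str) -> list[str]:
    """Return the raw LaTeX bodies of all display formulas found."""
    found = []
    for pattern in MATH_PATTERNS:
        found.extend(m.strip() for m in pattern.findall(tex))
    return found

sample = r"""
The energy is \begin{equation}E = mc^2\end{equation} and also
$$a^2 + b^2 = c^2$$ holds.
"""
print(extract_formulas(sample))
```

Results are grouped by pattern, not by position in the source; a real extractor feeding the KG or MathSearch index would also record each formula's offset and surrounding context.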

The current plan is to harvest the metadata with OAI-PMH and to obtain the full texts and source texts through arXiv's S3 buckets:
https://arxiv.org/help/bulk_data

Most of this is defined; some steps are still drafts.
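As an illustration of the OAI-PMH harvesting step, the following minimal sketch parses a ListRecords response in the Dublin Core (oai_dc) format using only the standard library. The embedded sample response stands in for a live call to the arXiv endpoint (http://export.arxiv.org/oai2 with `verb=ListRecords&metadataPrefix=oai_dc`); a real harvester would also follow resumption tokens.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_records(xml_text: str) -> list[dict]:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        header = rec.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    return records

# Trimmed sample response; a live harvest would fetch this over HTTP.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:2211.00001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(parse_records(SAMPLE))
```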

Epic issues:

  • Create an OAI-PMH prototype in Python which harvests arXiv metadata (Johannes)
  • Create an S3 client in Python on Mardi0X which is able to obtain full-text and source-text data. This involves synchronization with the ZIB admins. (Eloi)
  • Check the metadata, full-text, and source-text data to define our data model and decide whether caching is necessary (Eloi, Johannes)
  • (if caching is necessary) Create a cache-like database and deploy it to our ecosystem (Johannes)
  • (if caching is necessary) Upgrade the OAI-PMH component to write to the cache (Johannes)
  • (if caching is necessary) Upgrade the full-/source-text component to write to the cache (Eloi)
  • (if caching is necessary) Write a reader component for the cache in Wikibase-Integrator (Eloi)
  • (to be defined more accurately) Write a document parser for full texts and source texts in Python (Johannes)
  • (to be defined more accurately) Write an importer component for arXiv metadata to Wikibase (Eloi)
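The S3 client step could look roughly like the sketch below. Assumptions to verify against https://arxiv.org/help/bulk_data before use: the bucket name (`arxiv`), the requester-pays access mode, and the monthly tarball key layout (`src/arXiv_src_YYMM_NNN.tar`) are taken from that page as of writing, and `boto3` is assumed to be available on Mardi0X.

```python
# Hypothetical sketch of the bulk-download step; bucket name and key
# layout are assumptions per https://arxiv.org/help/bulk_data.

def source_key(yymm: str, seq: int) -> str:
    """Build the S3 key for one monthly source tarball."""
    return f"src/arXiv_src_{yymm}_{seq:03d}.tar"

def download_tarball(yymm: str, seq: int, dest: str) -> None:
    """Download one tarball; the requester pays the transfer costs."""
    import boto3  # third-party; assumed installed on Mardi0X
    s3 = boto3.client("s3")
    s3.download_file(
        "arxiv",
        source_key(yymm, seq),
        dest,
        ExtraArgs={"RequestPayer": "requester"},
    )

print(source_key("2211", 1))
```

Because access is requester-pays, AWS credentials with billing enabled are needed even though the data itself is public; this is one reason the step involves synchronization with the ZIB admins.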

Initial questions:

Additional Info:

Corresponding Milestones:

  • A4b, A4a, F1, F2, C3, D2, D3

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created
Hyper-Node self-assigned this Nov 29, 2022
Hyper-Node changed the title [Epic][Draft] Importing of ArXiVe Data → [Epic][Draft] Importing of ArXiv Data Nov 29, 2022
Hyper-Node changed the title [Epic][Draft] Importing of ArXiv Data → [Epic][Draft] Importing of ArXiv Document-Data to Wikibase/MediaWiki Jan 27, 2023
Hyper-Node changed the title [Epic][Draft] Importing of ArXiv Document-Data to Wikibase/MediaWiki → [Epic] Importing of arXiv data to Wikibase/MediaWiki Jan 31, 2023