Adapt scraper to new Phets Web site #150

Merged 1 commit on Dec 1, 2022


pavel-karatsiuba
Collaborator

@pavel-karatsiuba pavel-karatsiuba commented Nov 28, 2022

This PR fixes the download of each simulation's start page. Because the PhET site now uses an async loader to load page content, the scraper needs a library that can wait until the content has loaded, so this PR uses the puppeteer library to load simulations. It downloads 5 pages concurrently. With more than 5 concurrent pages, the PhET site responds with a 404 or the request times out.
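For illustration, the 5-page limit described above could be enforced with a small worker pool. This is only a sketch, not the code from this PR: `mapWithConcurrency` and its parameters are hypothetical names, and the `worker` callback stands in for whatever loads one simulation page (in the scraper, that would be the puppeteer page load).

```typescript
// Hypothetical helper (not taken from the PR): run `worker` over `items`
// with at most `limit` tasks in flight at any time.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  // Claiming is safe because JavaScript runs this code on a single thread.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  // Start at most `limit` runners (at least one, so a bad limit cannot stall).
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

Called as `mapWithConcurrency(urls, 5, loadPage)`, where `loadPage` is an assumed function that opens one URL and waits for its content, this keeps no more than 5 pages loading at once. Results come back in the same order as the input list.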

Fixes #148

@pavel-karatsiuba pavel-karatsiuba marked this pull request as draft November 29, 2022 20:02
@pavel-karatsiuba pavel-karatsiuba marked this pull request as ready for review November 29, 2022 21:02
@kelson42 kelson42 changed the title issue#148 Adapt scraper to new Phets Web site Nov 30, 2022
@kelson42
Contributor

kelson42 commented Nov 30, 2022

@pavel-karatsiuba It's important that you describe here what the impact is from the user's perspective. You told me separately, but not here, that at least the topics and descriptions are no longer scraped. But the topics are listed here, for example: https://phet.colorado.edu/en/simulations/build-a-nucleus/about. The description is available on the very same about page in <meta name="description">.
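As a rough sketch of the suggestion above, the description could be read from the `<meta name="description">` tag of the about page once its HTML is available. The function name `extractMetaDescription` is hypothetical, and the regex approach is an assumption made to keep the example self-contained. A real scraper would more likely query the DOM, and this version does not handle a `>` inside an attribute value.

```typescript
// Hypothetical helper (not from the PR): return the content of the first
// <meta name="description"> tag in an HTML string, or null if there is none.
function extractMetaDescription(html: string): string | null {
  // Collect every <meta ...> tag, then inspect its attributes one by one.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name && name[1].toLowerCase() === "description") {
      // Accept either double-quoted or single-quoted content values.
      const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
      if (content) {
        return content[1] !== undefined ? content[1] : content[2];
      }
    }
  }
  return null;
}
```

The same idea would apply to the topics list on the about page, although the markup for that list is not shown in this conversation.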

Are there other impacts? Otherwise, do the ZIM files look the same as the latest successful scrape available at https://library.kiwix.org?

Contributor

@kelson42 kelson42 left a comment

Please fix with proper "Description"

@kelson42 kelson42 force-pushed the issue-148-not-generating-zim branch from 84b5e30 to 7551886 Compare December 1, 2022 18:38
@kelson42 kelson42 merged commit be2dcf3 into master Dec 1, 2022
@kelson42 kelson42 deleted the issue-148-not-generating-zim branch December 1, 2022 18:56
@kelson42 kelson42 mentioned this pull request Dec 1, 2022
Development

Successfully merging this pull request may close these issues.

Scraper does not generate proper ZIM files
2 participants