Introduce Web-scraping inside JabRef #11093
Comments
Works now, was probably a temporary glitch.
I checked the BibDesk code:
When it comes to scraping, I have seen JSoup mentioned a lot: https://jsoup.org/
At JabRef#695, I tried out HtmlUnit, JCEF, and jbrowserdriver; nothing really worked.
Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have fetchers that use an API key and fetchers that rely on screen scraping. Most of the screen scrapers do not work anymore. We should switch to browser-based screen scraping, mostly because of Cloudflare.
JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.
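To make "scraping the citations directly from the page" a bit more concrete, here is a minimal sketch that only uses the JDK. It assumes the displayed page embeds `citation_*` meta tags (the Highwire Press style tags that Google Scholar's inclusion guidelines describe), which not every publisher page does. `CitationMetaScraper` and `extract` are hypothetical names, not existing JabRef code, and the regex is a stand-in for a real HTML parser such as JSoup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of JabRef): collects citation_* meta tags from a page.
final class CitationMetaScraper {

    // Assumption: the name attribute comes before content and both use double quotes.
    // A real implementation should use an HTML parser instead of a regex.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"(citation_[a-z_]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns tag name -> all values, in document order (authors may repeat).
    static Map<String, List<String>> extract(String html) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher matcher = META.matcher(html);
        while (matcher.find()) {
            String name = matcher.group(1).toLowerCase(Locale.ROOT);
            fields.computeIfAbsent(name, key -> new ArrayList<>()).add(matcher.group(2));
        }
        return fields;
    }
}
```

The HTML string passed to `extract` would come from whatever component renders the page, so the same code could serve both the in-app browser and a headless fetcher.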
Maybe the Java Chromium Embedded Framework (JCEF) helps. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how it can be used.
The PR #7075 attempted to display the Google Scholar captchas in JabRef. The PR was not completed.

This issue says: Rewrite the fetchers to use JCEF instead of `URLDownload`. Note that this is different from #11093. There, a new UI is demanded. Here, the fetchers should be able to run stand-alone, without user interaction.
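One possible way to get there, sketched under assumptions: let each fetcher depend on a small page-loading interface instead of on a concrete download class, so that a plain HTTP loader and a browser-backed (e.g. JCEF) loader become interchangeable. `PageSource`, `PdfLinkFinder`, and `findPdfLink` are hypothetical names for illustration only. They are not JabRef's or JCEF's API, and the JCEF-backed implementation is deliberately left out.

```java
import java.io.IOException;
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical abstraction: how a fetcher obtains the HTML of a page.
// One implementation could wrap plain HTTP, another an embedded browser.
interface PageSource {
    String load(URI uri) throws IOException;
}

// Hypothetical fetcher that only sees PageSource, so it needs no UI.
final class PdfLinkFinder {

    // Assumption: the full text is linked via a double-quoted href ending in .pdf.
    private static final Pattern PDF_LINK =
            Pattern.compile("href=\"([^\"]+\\.pdf)\"", Pattern.CASE_INSENSITIVE);

    private final PageSource source;

    PdfLinkFinder(PageSource source) {
        this.source = source;
    }

    // Loads the landing page and returns the first PDF link, resolved against it.
    Optional<URI> findPdfLink(URI landingPage) throws IOException {
        Matcher matcher = PDF_LINK.matcher(source.load(landingPage));
        if (matcher.find()) {
            return Optional.of(landingPage.resolve(matcher.group(1)));
        }
        return Optional.empty();
    }
}
```

With this shape, tests can pass a lambda that returns canned HTML, and switching a fetcher to a browser-based loader would not touch its parsing code.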
Affected fetchers:
Sometimes, the API is used. In that case, `findFullText` is the only method handling HTML.