Introduce Web-scraping inside JabRef #11093

koppor · 2024-03-25T12:21:51Z

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have fetchers that use an API key and fetchers that rely on screen scraping. The screen scrapers mostly don't work, largely because of Cloudflare. We should switch to browser-based screen scraping.

JabRef should display the HTML page inside JabRef and offer scraping the citations directly from the page, similar to what BibDesk does.

Maybe the Java Chromium Embedded Framework (JCEF) helps. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to guide one to the usage.

PR #7075 attempted to display the Google Scholar captchas in JabRef, but it was not completed. This issue proposes rewriting the fetchers to use JCEF instead of URLDownload.

Note that this is different from #11093. There, a new UI is demanded.

Here, it should be allowed that the fetchers run stand-alone without user interaction.

Affected fetchers:

ACS: org.jabref.logic.importer.fetcher.ACS
Google Scholar: org.jabref.logic.importer.fetcher.GoogleScholar
IACR: org.jabref.logic.importer.fetcher.IacrEprintFetcher
JStor: org.jabref.logic.importer.fetcher.JstorFetcher
ResearchGate: org.jabref.logic.importer.fetcher.ResearchGate
ScienceDirect: org.jabref.logic.importer.fetcher.ScienceDirect
SpringerLink: org.jabref.logic.importer.fetcher.SpringerLink

Some of these fetchers use the API for searching. In those cases, findFullText is the only method that handles HTML.
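
To illustrate the "rewrite the fetchers not to use URLDownload" idea, here is a minimal sketch of how page loading could be separated from parsing, so that a browser-backed loader (JCEF or otherwise) can be swapped in later. `PageSource` and `ScrapingFetcherSketch` are hypothetical names, not existing JabRef classes, and the DOI regex is a deliberately naive stand-in for real parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical abstraction: anything that can turn a URL into (rendered) HTML. */
interface PageSource {
    String load(String url);
}

/** Hypothetical fetcher that only parses HTML and never downloads on its own. */
class ScrapingFetcherSketch {
    // Naive DOI pattern for illustration only; a real fetcher would use an HTML parser.
    private static final Pattern DOI = Pattern.compile("10\\.\\d{4,9}/[^\\s\"<>]+");

    private final PageSource pageSource;

    ScrapingFetcherSketch(PageSource pageSource) {
        this.pageSource = pageSource;
    }

    /** Loads the page through the injected source and collects DOI-like strings. */
    List<String> findDois(String url) {
        String html = pageSource.load(url);
        List<String> dois = new ArrayList<>();
        Matcher matcher = DOI.matcher(html);
        while (matcher.find()) {
            dois.add(matcher.group());
        }
        return dois;
    }

    public static void main(String[] args) {
        // A canned page stands in for a plain HTTP download or an embedded browser.
        PageSource canned = url -> "<a href=\"https://doi.org/10.1000/example.1\">Paper</a>";
        ScrapingFetcherSketch fetcher = new ScrapingFetcherSketch(canned);
        System.out.println(fetcher.findDois("https://example.org/search?q=test"));
    }
}
```

With this split, the parsing code of each affected fetcher could stay as it is, and only the `PageSource` implementation would need to change if a browser engine turns out to get past the bot protection.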

Siedlerchr · 2024-03-25T12:49:41Z

Works now, was probably a temporary glitch

Siedlerchr · 2024-05-17T22:56:14Z

I checked the BibDesk code:
They basically use a Safari-based view control and a simple XPath query to check for matching links in the document's DOM. The parsing itself is very similar to our existing fetcher infrastructure.
I experimented a bit with JavaFX's WebView. While it can display websites and even captchas, e.g. on Google Scholar,
I was not yet able to get the correct DOM after clicking on some page. This would require some further testing.
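
The XPath step described for BibDesk could look roughly like this in Java, using only the JDK's built-in XML and XPath APIs. This is a sketch under assumptions: `LinkXPathSketch` and the `scholar.bib` link fragment are made up for illustration, and the input must be well-formed XML, which real pages usually are not (the DOM would have to come from a browser engine or a lenient HTML parser instead).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkXPathSketch {

    /**
     * Returns the href values of all anchors whose href contains the given fragment.
     * The fragment is concatenated into the query, so it must not contain a single quote.
     */
    static List<String> findLinks(String xhtml, String hrefFragment) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            String query = "//a[contains(@href, '" + hrefFragment + "')]/@href";
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(query, dom, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                links.add(hits.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Malformed input or an invalid query; wrapped to keep the sketch short.
            throw new IllegalStateException("Could not query document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"/scholar.bib?q=info:abc\">Import into BibTeX</a>"
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(findLinks(page, "scholar.bib"));
    }
}
```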

koppor · 2024-05-27T21:10:48Z

Related work: https://github.com/HtmlUnit/htmlunit?tab=readme-ov-file#getting-started

ThiloteE · 2024-05-27T21:17:51Z

When it comes to scraping, I have seen jsoup being mentioned a lot: https://jsoup.org/
See also https://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program

koppor · 2024-12-05T07:20:17Z

At JabRef#695, I tried out HtmlUnit, JCEF, and jbrowserdriver - nothing really worked.

koppor added component: ui component: fetcher labels Mar 25, 2024

koppor added this to Candidates for University Projects Mar 25, 2024

github-project-automation bot moved this to Free to take in Candidates for University Projects Mar 25, 2024

koppor added this to JabRef UI Improvements Mar 25, 2024

github-project-automation bot moved this to Normal priority in JabRef UI Improvements Mar 25, 2024

koppor mentioned this issue Mar 25, 2024

Make fetchers web-based JabRef/jabref-koppor#683

Closed

This was referenced Jun 13, 2024

feature request: add new websites to web search (Google Scholar, Nature, Science) #10263

Open

"Download linked file" option creates an html file instead of downloading the pdf on Windows 10. #10149

Closed

Siedlerchr mentioned this issue Aug 27, 2024

ACS Jsoup fetch runs into 403: Forbidden #11682

Open

2 tasks

koppor mentioned this issue Sep 7, 2024

Enable JCEF JabRef/jabref-koppor#695

Draft

calixtus added this to Prioritization Nov 13, 2024

github-project-automation bot moved this to Normal priority in Prioritization Nov 13, 2024

calixtus removed this from JabRef UI Improvements Nov 13, 2024
