Fix Google Scholar fetcher for downloading a single entry #7075

Closed
koppor wants to merge 16 commits

koppor
Member

@koppor koppor commented Nov 4, 2020

Google Scholar changed their page (IMHO).

First page:

[screenshot]

Second "page":

Clicking on "Cite" loads the content of this second "page":

[screenshot]

There, BibTeX can be downloaded as usual.

[screenshot]

Example URL: https://scholar.google.ch/scholar?q=info:RExzBa3OlkQJ:scholar.google.com/&output=cite&scirp=0&hl=en

Third page

Example URL: https://scholar.googleusercontent.com/scholar.bib?q=info:RExzBa3OlkQJ:scholar.google.com/&output=citation&scisdr=CgVYZoPbEOvJ014vy3E:AAGBfm0AAAAAX6Mq03ED_BBuflXyRuQujflFTqExM8uU&scisig=AAGBfm0AAAAAX6Mq0_wSs1k5gywcNDtaUBn0PeTKsRGQ&scisf=4&ct=citation&cd=-1&hl=en
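
The three steps above could be sketched roughly as follows. This is only an illustration, not the code of this PR: the base URL is copied from the example URL above, the entry id is the one from the example, and the regular expression assumes that the second "page" contains a plain `href` pointing to `scholar.bib`, which may not match the real markup. The class and method names are made up.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScholarCiteFlowSketch {

    // Assumption: base URL taken from the example URL in this description.
    static final String BASIC_SEARCH_URL = "https://scholar.google.ch/scholar?";

    // Assumption: the BibTeX link shows up as an href to scholar.bib in the cite page.
    static final Pattern BIBTEX_LINK = Pattern.compile(
            "href=\"(https://scholar\\.googleusercontent\\.com/scholar\\.bib\\?[^\"]+)\"");

    /** Step 2: build the URL of the cite "page" for a given entry id. */
    public static String citePageUrl(String entryId) {
        return BASIC_SEARCH_URL + "q=info:" + entryId
                + ":scholar.google.com/&output=cite&scirp=0&hl=en";
    }

    /** Step 3: pull the BibTeX download URL out of the cite page HTML, if present. */
    public static Optional<String> extractBibtexUrl(String citePageHtml) {
        Matcher matcher = BIBTEX_LINK.matcher(citePageHtml);
        if (!matcher.find()) {
            // Also the case when a block notice is returned instead of the cite page.
            return Optional.empty();
        }
        // HTML attributes escape "&" as "&amp;"; undo that for the real URL.
        return Optional.of(matcher.group(1).replace("&amp;", "&"));
    }

    public static void main(String[] args) {
        System.out.println(citePageUrl("RExzBa3OlkQJ"));

        // Hand-written HTML fragment for illustration only.
        String html = "<a href=\"https://scholar.googleusercontent.com/scholar.bib"
                + "?q=info:RExzBa3OlkQJ:scholar.google.com/&amp;output=citation&amp;hl=en\">BibTeX</a>";
        System.out.println(extractBibtexUrl(html).orElse("no BibTeX link found"));
    }
}
```

Returning an empty `Optional` when no link is found leaves room to detect the block notice separately instead of failing with a parse error.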

Block notice:

[screenshot]

Summary

While implementing this, I got as far as the last step. The problem is that after 10 tries, I get blocked and cannot continue.

  • Change in CHANGELOG.md described (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not, created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

@koppor
Member Author

koppor commented Nov 5, 2020

  • I think the underlying HTTP connection should be reused so that the cookies are sent with the subsequent requests
  • Try it with a proxy so that one does not get blocked
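
A minimal sketch of the cookie idea, assuming Java 11's `java.net.http.HttpClient` rather than the `URLDownload` class used in this PR. It does not talk to Google Scholar: a throwaway local server (an assumption made purely for the demo) sets a cookie on the first request and echoes the `Cookie` header on the second, to show that a single client with a `CookieManager` carries cookies across requests. All class and method names are made up.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CookieReuseSketch {

    /** Demo server: "/first" sets a cookie, "/second" echoes the received Cookie header. */
    static HttpServer startDemoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/first", exchange -> {
            exchange.getResponseHeaders().add("Set-Cookie", "session=abc123; Path=/");
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.createContext("/second", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            byte[] body = String.valueOf(cookie).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    static String get(HttpClient client, String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Returns the Cookie header the demo server saw on the second request. */
    public static String cookieSeenOnSecondRequest() {
        HttpServer server = null;
        try {
            server = startDemoServer();
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            // One client for all requests; the CookieManager stores and resends cookies.
            HttpClient client = HttpClient.newBuilder()
                                          .cookieHandler(new CookieManager())
                                          .build();
            get(client, base + "/first");
            return get(client, base + "/second");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(cookieSeenOnSecondRequest());
    }
}
```

For the proxy idea, the same builder accepts `.proxy(ProxySelector.of(new InetSocketAddress(host, port)))`. Whether either measure is enough to avoid the block has not been verified here.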

@@ -14,6 +14,17 @@ Fetchers are the implementation of the [search using online services](https://do

On Windows, you have to log-off and log-on to let IntelliJ know about the environment variable change. Execute the gradle task "processResources" in the group "others" within IntelliJ to ensure the values have been correctly written. Now, the fetcher tests should run without issues.

## Change the log levels to enable debugging
Member

You can also just start JabRef with -debug as program argument.


String infoPageUrl = BASIC_SEARCH_URL + "q=info:" + matcher.group(1) + ":scholar.google.com/&output=cite&scirp=0&hl=en";
LOGGER.debug("Using infoPageUrl {}", infoPageUrl);
URLDownload infoPageUrlDownload = new URLDownload(infoPageUrl);
Member

If you want to reuse the connection, you should use Unirest or jsoup.

@koppor
Member Author

koppor commented Dec 7, 2020

Refs #6369

koppor and others added 6 commits December 13, 2020 12:41
# Conflicts:
#	src/main/java/org/jabref/logic/importer/fetcher/GoogleScholar.java
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
…(and log URL)

Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
…new one)

Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
@koppor
Member Author

koppor commented Dec 13, 2020

Dear diary,

with Google Scholar, the order of the words matters.

[screenshot]

And now I move "In" to the front:

[screenshot]

@Siedlerchr
Member

You can search either by author or by title; for a title search, you need to put the title in quotes. I don't understand your search query above.
https://scholar.google.com/intl/de/scholar/help.html#searching

@koppor
Member Author

koppor commented Dec 13, 2020

Google Scholar also works when not using quotes. #convenience. Not sure whether our Google Scholar implementation should behave differently from their web page.
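
To make the two variants under discussion concrete, here is a small sketch of building the search URL with and without quotes. The base URL, the `hl=en` parameter, and the class and method names are assumptions for illustration; the fetcher in this PR may build its query differently.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ScholarQuerySketch {

    // Assumption: a plausible value for the search base URL.
    static final String BASIC_SEARCH_URL = "https://scholar.google.com/scholar?";

    /**
     * Builds a search URL. With exactPhrase set, the query is wrapped in quotes
     * (title search as described in the help page); otherwise it is passed as typed.
     */
    public static String searchUrl(String query, boolean exactPhrase) {
        String q = exactPhrase ? "\"" + query + "\"" : query;
        return BASIC_SEARCH_URL + "hl=en&q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("some paper title", true));
        System.out.println(searchUrl("some paper title", false));
    }
}
```

Keeping the quoting behind a flag would let the fetcher mirror the web page by default while still offering an exact title search.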

koppor and others added 5 commits December 13, 2020 14:46
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
@koppor koppor marked this pull request as draft December 13, 2020 15:35
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
entry.setField(StandardField.YEAR, "2013");
entry.setField(StandardField.PAGES, "41--44");
BibEntry entry = new BibEntry(StandardEntryType.InProceedings)
.withCitationKey("geiger2013detecting")
Member

@tobiasdiez tobiasdiez Dec 15, 2020

There is nothing wrong with the set... methods. The with... methods were added to quickly add one or two field values, mostly in lambda expressions, e.g. map(entry -> entry.withField(...))
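
The difference can be illustrated with a tiny stand-in class. This is not JabRef's `BibEntry`; the `Entry` class below and its string-keyed fields are hypothetical and only mirror the `set...` / `with...` naming.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FluentEntrySketch {

    /** Hypothetical stand-in for an entry class; not JabRef's BibEntry. */
    public static class Entry {
        private final Map<String, String> fields = new LinkedHashMap<>();

        /** Classic setter: changes the entry and returns nothing. */
        public void setField(String name, String value) {
            fields.put(name, value);
        }

        /** Fluent variant: changes the entry and returns it, so it fits into expression lambdas. */
        public Entry withField(String name, String value) {
            setField(name, value);
            return this;
        }

        public String getField(String name) {
            return fields.get(name);
        }
    }

    /** Sets the year on every entry using the fluent method inside map(...). */
    public static List<Entry> withYear(List<Entry> entries, String year) {
        return entries.stream()
                      .map(entry -> entry.withField("year", year))
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Entry entry = new Entry();
        entry.setField("pages", "41--44"); // plain setter in statement position
        List<Entry> result = withYear(List.of(entry), "2013");
        System.out.println(result.get(0).getField("year") + " " + result.get(0).getField("pages"));
    }
}
```

With only a `void` setter, the lambda in `withYear` would need a block body with an explicit `return entry;`.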

koppor and others added 3 commits December 22, 2020 16:36
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
Co-authored-by: Dominik Voigt <dominik.ingo.voigt@gmail.com>
@koppor koppor added this to the v5.3 milestone Dec 22, 2020
@Override
public String solve(String queryURL) {
// slim implementation of https://news.kynosarges.org/2014/05/01/simulating-platform-runandwait/
final CountDownLatch doneLatch = new CountDownLatch(1);
Member

You need to listen for the web engine ready event; see the preview tab viewer, where we add the highlighting JS stuff.

Member

    previewView.getEngine().getLoadWorker().stateProperty().addListener((observable, oldValue, newValue) -> {
        if (newValue != Worker.State.SUCCEEDED) {
            return;
        }
        // The page has finished loading at this point.
    });

See https://openjfx.io/javadoc/11/javafx.web/javafx/scene/web/WebEngine.html

Member Author

> You need to listen for the web engine ready event, see the preview Tab viewer where we add this highlight JS stuff

Does this happen synchronously? The interface for the CAPTCHA solver is designed to be synchronous. Otherwise, all fetchers would need to be changed.

I'll be away anyway for the next days. Thus, you are free to experiment 😅
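
One way to keep a synchronous `solve` while the page load is reported through a listener is to block on a latch until the callback fires, similar to the `CountDownLatch` already used in the diff. The sketch below uses no JavaFX: `loadAsync` is a hypothetical stand-in for "start loading and notify a listener when done", and the timeout parameter is an addition made for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class BlockingBridgeSketch {

    /** Hypothetical asynchronous API: invokes onDone from another thread when "loading" is finished. */
    static void loadAsync(String url, Consumer<String> onDone) {
        CompletableFuture.runAsync(() -> onDone.accept("content of " + url));
    }

    /** Synchronous facade: blocks the caller until the callback fires or the timeout expires. */
    public static String solve(String queryUrl, long timeoutMillis) {
        CountDownLatch doneLatch = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();

        loadAsync(queryUrl, content -> {
            result.set(content);   // hand the value over to the waiting thread
            doneLatch.countDown(); // release the waiting thread
        });

        try {
            if (!doneLatch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Timed out waiting for " + queryUrl);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(solve("https://example.org", 2000));
    }
}
```

The caveat is that `solve` must not be called from the thread that delivers the callback. With a JavaFX `WebEngine`, the listener runs on the JavaFX Application Thread, so blocking that thread on the latch would deadlock.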

@Siedlerchr Siedlerchr marked this pull request as ready for review December 22, 2020 20:58
@koppor koppor mentioned this pull request Dec 22, 2020
5 tasks
@koppor koppor self-assigned this Jan 18, 2021
@koppor koppor marked this pull request as draft January 21, 2021 23:27
@koppor
Member Author

koppor commented Jan 22, 2021

This directly competes with #5943, where the browser is used to communicate with Google Scholar. We should write an ADR ^^.

@tobiasdiez tobiasdiez added the status: changes required Pull requests that are not yet complete label Mar 10, 2021
@k3KAW8Pnf7mkmdSMPHz27
Sponsor Member

Since this is on the 5.3 milestone, are we, at least for now, taking this approach?

@koppor
Member Author

koppor commented Jun 6, 2021

> Since this is on the 5.3 milestone, are we, at least for now, taking this approach?

Yes, we take this approach here. The other one heavily relies on JabRef's internal save handling. This is currently handled by @Siedlerchr in #6694. Hopefully, we will make progress somehow.

@koppor koppor removed this from the v5.3 milestone Jun 7, 2021
@koppor
Member Author

koppor commented Jun 7, 2021

DevCall decision: A workaround using our browser extension exists. Thus, this is no longer a high priority.

@koppor
Member Author

koppor commented Sep 21, 2023

Other implementation hint: https://github.com/JetBrains/jcef

Labels

  • fetcher
  • status: changes required (Pull requests that are not yet complete)
  • status: freeze (Issues postponed to a (much) later future)
6 participants