Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/enable lucene query parsing #6799

Merged
merged 25 commits into from
Aug 31, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
c8ebc91
Squashed 'src/main/resources/csl-styles/' changes from bf698acec7..82…
Aug 1, 2020
f5f02cf
Merge commit 'c8ebc91d4799e59518ac887f69e584040cf29994'
Aug 1, 2020
37c2d64
Squashed 'src/main/resources/csl-styles/' changes from 827b986621..eb…
Aug 15, 2020
b85ebe3
Merge commit '37c2d64e449ef617bb98c0ad3b39a903d8f3d886' into master
Aug 15, 2020
f10ea5f
Merge remote-tracking branch 'upstream/master'
DominikVoigt Aug 25, 2020
172b113
Enable Query String into ComplexQuery parsing using lucene.
DominikVoigt Aug 26, 2020
90a629e
Add ADR
DominikVoigt Aug 27, 2020
3af3781
Add Changelog entry
DominikVoigt Aug 27, 2020
26851a7
Merge remote-tracking branch 'upstream/master' into feature/enable-lu…
DominikVoigt Aug 27, 2020
8468ad4
Move Changelog entry
DominikVoigt Aug 27, 2020
d592a47
Increment ADR number
DominikVoigt Aug 27, 2020
4610451
Remove Optional from Lists
DominikVoigt Aug 27, 2020
86c6576
Rename slipped occurences of converter
DominikVoigt Aug 27, 2020
e2447f6
Merge remote-tracking branch 'upstream/master' into feature/enable-lu…
DominikVoigt Aug 28, 2020
0642d49
Mapping of unknown fields onto default field
DominikVoigt Aug 30, 2020
11afbe9
Ignores unknown fields
DominikVoigt Aug 30, 2020
6fde881
Merge branch 'master' into feature/enable-lucene-query-parsing
koppor Aug 30, 2020
e27ebcc
Merge remote-tracking branch 'upstream/master' into feature/enable-lu…
DominikVoigt Aug 31, 2020
239095e
Fix checkstyle issues
DominikVoigt Aug 31, 2020
467cbfc
Merge branch 'master' into feature/enable-lucene-query-parsing
koppor Aug 31, 2020
ff6a7d3
Merge branch 'feature/enable-lucene-query-parsing' of https://github.…
DominikVoigt Aug 31, 2020
542d3a7
Apply suggestions from code review
DominikVoigt Aug 31, 2020
4b2ab98
Handle year-range parsing in a more robust way
DominikVoigt Aug 31, 2020
8b3b7cd
Move extracted method into builder
DominikVoigt Aug 31, 2020
ec60734
Merge branch 'feature/enable-lucene-query-parsing' of https://github.…
DominikVoigt Aug 31, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,8 @@ Note that this project **does not** adhere to [Semantic Versioning](http://semve

### Added

- We added a query parser and mapping layer to enable conversion of queries formulated in simplified lucene syntax by the user into api queries. [#6799](https://github.com/JabRef/jabref/pull/6799)

### Changed

### Fixed
Expand Down
4 changes: 4 additions & 0 deletions build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,10 @@ dependencies {
antlr4 'org.antlr:antlr4:4.8-1'
implementation 'org.antlr:antlr4-runtime:4.8-1'

implementation (group: 'org.apache.lucene', name: 'lucene-queryparser', version: '8.6.1') {
exclude group: 'org.apache.lucene', module: 'lucene-sandbox'
}

implementation group: 'org.mariadb.jdbc', name: 'mariadb-java-client', version: '2.6.2'

implementation 'org.postgresql:postgresql:42.2.16'
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Query syntax design

## Context and Problem Statement

All libraries use their own query syntax for advanced search options. To increase usability, users should be able to formulate their (abstract) search queries in a query syntax that can be mapped to the library specific search queries. To achieve this, the query has to be parsed into an AST.

Which query syntax should be used for the abstract queries?
Which features should the syntax support?

## Considered Options

* Use a simplified syntax that is derived of the [lucene](https://lucene.apache.org/core/8_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html) query syntax
* Formulate a own query syntax

## Decision Outcome

Chosen option: "Use a syntax that is derived of the lucene query syntax", because only option that is already known, and easy to implemenent.
Furthermore parsers for lucene already exist and are tested.
For simplicitly, and lack of universal capabilities across fetchers, only basic query features and therefor syntax is supported:

* All terms in the query are whitespace separated and will be ANDed
* Default and certain fielded terms are supported
* Fielded Terms:
* `author`
* `title`
* `journal`
* `year` (for single year)
* `year-range` (for range e.g. `year-range:2012-2015`)
* The `journal`, `year`, and `year-range` fields should only be populated once in each query
* Example:
* `author:"Igor Steinmacher" author:"Christoph Treude" year:2017` will be converted to
* `author:"Igor Steinmacher" AND author:"Christoph Treude" AND year:2017`

### Positive Consequences

* Already tested
* Well known
* Easy to implement
* Can use an existing parser

## Pros and Cons of the Options

### Use a syntax that is derived of the lucene query syntax

* Good, because already exists
* Good, because already well known
* Good, because there already exists a [parser for lucene syntax](https://lucene.apache.org/core/8_0_0/queryparser/org/apache/lucene/queryparser/flexible/standard/StandardQueryParser.html)
* Good, because capabilities of query conversion can easily be extended using the [flexible lucene framework](https://lucene.apache.org/core/8_0_0/queryparser/org/apache/lucene/queryparser/flexible/core/package-summary.html)

### Formulate a own query syntax

* Good, because allows for flexibility
* Bad, because needs a new parser (has to be decided whether to use [ANTLR](https://www.antlr.org/), [JavaCC](https://javacc.github.io/javacc/), or [LogicNG](https://github.com/logic-ng/LogicNG))
* Bad, because has to be tested
* Bad, because syntax is not well known
* Bad, because the design should be easily extensible, requires an appropriate design (high effort)
2 changes: 2 additions & 0 deletions src/main/java/module-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -89,4 +89,6 @@
requires flexmark.util.ast;
requires flexmark.util.data;
requires com.h2database.mvstore;
requires lucene.queryparser;
requires lucene.core;
}
53 changes: 53 additions & 0 deletions src/main/java/org/jabref/logic/importer/QueryParser.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
package org.jabref.logic.importer;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

import org.jabref.logic.importer.fetcher.ComplexSearchQuery;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

/**
* This class converts a query string written in lucene syntax into a complex search query.
*
* For simplicity this is limited to fielded data and the boolean AND operator.
*/
public class QueryParser {

/**
* Parses the given query string into a complex query using lucene.
* Note: For unique fields, the alphabetically first instance in the query string is used in the complex query.
*
* @param queryString The given query string
* @return A complex query containing all fields of the query string
* @throws QueryNodeException Error during parsing
*/
public Optional<ComplexSearchQuery> parseQueryStringIntoComplexQuery(String queryString) {
try {
ComplexSearchQuery.ComplexSearchQueryBuilder builder = ComplexSearchQuery.builder();

StandardQueryParser parser = new StandardQueryParser();
Query luceneQuery = parser.parse(queryString, "default");
Set<Term> terms = new HashSet<>();
// This implementation collects all terms from the leaves of the query tree independent of the internal boolean structure
// If further capabilities are required in the future the visitor and ComplexSearchQuery has to be adapted accordingly.
QueryVisitor visitor = QueryVisitor.termCollector(terms);
luceneQuery.visit(visitor);

List<Term> sortedTerms = new ArrayList<>(terms);
sortedTerms.sort(Comparator.comparing(Term::text));
builder.terms(sortedTerms);
return Optional.of(builder.build());
} catch (QueryNodeException | IllegalStateException ex) {
return Optional.empty();
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ public interface SearchBasedFetcher extends WebFetcher {
* @return a list of {@link BibEntry}, which are matched by the query (may be empty)
*/
default List<BibEntry> performComplexSearch(ComplexSearchQuery complexSearchQuery) throws FetcherException {
// Default Implementation behaves like perform search using the default field as query
return performSearch(complexSearchQuery.getDefaultField().orElse(""));
// Default implementation behaves as performSearch using the default field phrases as query
List<String> defaultPhrases = complexSearchQuery.getDefaultFieldPhrases();
return performSearch(String.join(" ", defaultPhrases));
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -83,8 +83,9 @@ private List<BibEntry> getBibEntries(URL urlForQuery) throws FetcherException {
}

default URL getComplexQueryURL(ComplexSearchQuery complexSearchQuery) throws URISyntaxException, MalformedURLException, FetcherException {
// Default Implementation behaves like getURLForQuery using the default field as query
return this.getURLForQuery(complexSearchQuery.getDefaultField().orElse(""));
// Default implementation behaves as getURLForQuery using the default field phrases as query
List<String> defaultPhrases = complexSearchQuery.getDefaultFieldPhrases();
return this.getURLForQuery(String.join(" ", defaultPhrases));
}

/**
Expand Down
6 changes: 3 additions & 3 deletions src/main/java/org/jabref/logic/importer/fetcher/ArXiv.java
Original file line number Diff line number Diff line change
Expand Up @@ -264,12 +264,12 @@ public List<BibEntry> performSearch(String query) throws FetcherException {
@Override
public List<BibEntry> performComplexSearch(ComplexSearchQuery complexSearchQuery) throws FetcherException {
List<String> searchTerms = new ArrayList<>();
complexSearchQuery.getAuthors().ifPresent(authors -> authors.forEach(author -> searchTerms.add("au:" + author)));
complexSearchQuery.getTitlePhrases().ifPresent(title -> searchTerms.add("ti:" + title));
complexSearchQuery.getAuthors().forEach(author -> searchTerms.add("au:" + author));
complexSearchQuery.getTitlePhrases().forEach(title -> searchTerms.add("ti:" + title));
complexSearchQuery.getJournal().ifPresent(journal -> searchTerms.add("jr:" + journal));
// Since ArXiv API does not support year search, we ignore the year related terms
complexSearchQuery.getToYear().ifPresent(year -> searchTerms.add(year.toString()));
complexSearchQuery.getDefaultField().ifPresent(defaultField -> searchTerms.add(defaultField));
searchTerms.addAll(complexSearchQuery.getDefaultFieldPhrases());
String complexQueryString = String.join(" AND ", searchTerms);
return performSearch(complexQueryString);
}
Expand Down
139 changes: 116 additions & 23 deletions src/main/java/org/jabref/logic/importer/fetcher/ComplexSearchQuery.java
Original file line number Diff line number Diff line change
@@ -1,23 +1,26 @@
package org.jabref.logic.importer.fetcher;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

import org.jabref.model.strings.StringUtil;

import org.apache.lucene.index.Term;

public class ComplexSearchQuery {
// Field for non-fielded search
private final String defaultField;
private final List<String> defaultField;
private final List<String> authors;
private final List<String> titlePhrases;
private final Integer fromYear;
private final Integer toYear;
private final Integer singleYear;
private final String journal;

private ComplexSearchQuery(String defaultField, List<String> authors, List<String> titlePhrases, Integer fromYear, Integer toYear, Integer singleYear, String journal) {
private ComplexSearchQuery(List<String> defaultField, List<String> authors, List<String> titlePhrases, Integer fromYear, Integer toYear, Integer singleYear, String journal) {
this.defaultField = defaultField;
this.authors = authors;
this.titlePhrases = titlePhrases;
Expand All @@ -28,16 +31,32 @@ private ComplexSearchQuery(String defaultField, List<String> authors, List<Strin
this.singleYear = singleYear;
}

public Optional<String> getDefaultField() {
return Optional.ofNullable(defaultField);
public static ComplexSearchQuery fromTerms(Collection<Term> terms) {
ComplexSearchQueryBuilder builder = ComplexSearchQuery.builder();
terms.forEach(term -> {
String termText = term.text();
switch (term.field().toLowerCase()) {
case "author" -> builder.author(termText);
case "title" -> builder.titlePhrase(termText);
case "journal" -> builder.journal(termText);
case "year" -> builder.singleYear(Integer.valueOf(termText));
case "year-range" -> builder.parseYearRange(termText);
case "default" -> builder.defaultFieldPhrase(termText);
}
});
return builder.build();
}

public List<String> getDefaultFieldPhrases() {
return defaultField;
}

public Optional<List<String>> getAuthors() {
return Optional.ofNullable(authors);
public List<String> getAuthors() {
return authors;
}

public Optional<List<String>> getTitlePhrases() {
return Optional.ofNullable(titlePhrases);
public List<String> getTitlePhrases() {
return titlePhrases;
}

public Optional<Integer> getFromYear() {
Expand All @@ -60,23 +79,69 @@ public static ComplexSearchQueryBuilder builder() {
return new ComplexSearchQueryBuilder();
}

@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}

ComplexSearchQuery that = (ComplexSearchQuery) o;

// Just check for set equality, order does not matter
if (!(getDefaultFieldPhrases().containsAll(that.getDefaultFieldPhrases()) && that.getDefaultFieldPhrases().containsAll(getDefaultFieldPhrases()))) {
return false;
}
if (!(getAuthors().containsAll(that.getAuthors()) && that.getAuthors().containsAll(getAuthors()))) {
return false;
}
if (!(getTitlePhrases().containsAll(that.getTitlePhrases()) && that.getTitlePhrases().containsAll(getTitlePhrases()))) {
return false;
}
if (getFromYear().isPresent() ? !getFromYear().equals(that.getFromYear()) : that.getFromYear().isPresent()) {
return false;
}
if (getToYear().isPresent() ? !getToYear().equals(that.getToYear()) : that.getToYear().isPresent()) {
return false;
}
if (getSingleYear().isPresent() ? !getSingleYear().equals(that.getSingleYear()) : that.getSingleYear().isPresent()) {
return false;
}
return getJournal().isPresent() ? getJournal().equals(that.getJournal()) : !that.getJournal().isPresent();
}

@Override
public int hashCode() {
int result = defaultField != null ? defaultField.hashCode() : 0;
result = 31 * result + (getAuthors() != null ? getAuthors().hashCode() : 0);
result = 31 * result + (getTitlePhrases() != null ? getTitlePhrases().hashCode() : 0);
result = 31 * result + (getFromYear().isPresent() ? getFromYear().hashCode() : 0);
result = 31 * result + (getToYear().isPresent() ? getToYear().hashCode() : 0);
result = 31 * result + (getSingleYear().isPresent() ? getSingleYear().hashCode() : 0);
result = 31 * result + (getJournal().isPresent() ? getJournal().hashCode() : 0);
return result;
}

public static class ComplexSearchQueryBuilder {
private String defaultField;
private List<String> authors;
private List<String> titlePhrases;
private List<String> defaultFieldPhrases = new ArrayList<>();
private List<String> authors = new ArrayList<>();
private List<String> titlePhrases = new ArrayList<>();
private String journal;
private Integer fromYear;
private Integer toYear;
private Integer singleYear;

public ComplexSearchQueryBuilder() {
private ComplexSearchQueryBuilder() {
}

public ComplexSearchQueryBuilder defaultField(String defaultField) {
if (Objects.requireNonNull(defaultField).isBlank()) {
public ComplexSearchQueryBuilder defaultFieldPhrase(String defaultFieldPhrase) {
if (Objects.requireNonNull(defaultFieldPhrase).isBlank()) {
throw new IllegalArgumentException("Parameter must not be blank");
}
this.defaultField = defaultField;
// Strip all quotes before wrapping
this.defaultFieldPhrases.add(String.format("\"%s\"", defaultFieldPhrase.replace("\"", "")));
return this;
}

Expand All @@ -87,9 +152,6 @@ public ComplexSearchQueryBuilder author(String author) {
if (Objects.requireNonNull(author).isBlank()) {
throw new IllegalArgumentException("Parameter must not be blank");
}
if (Objects.isNull(authors)) {
this.authors = new ArrayList<>();
}
// Strip all quotes before wrapping
this.authors.add(String.format("\"%s\"", author.replace("\"", "")));
return this;
Expand All @@ -102,9 +164,6 @@ public ComplexSearchQueryBuilder titlePhrase(String titlePhrase) {
if (Objects.requireNonNull(titlePhrase).isBlank()) {
throw new IllegalArgumentException("Parameter must not be blank");
}
if (Objects.isNull(titlePhrases)) {
this.titlePhrases = new ArrayList<>();
}
// Strip all quotes before wrapping
this.titlePhrases.add(String.format("\"%s\"", titlePhrase.replace("\"", "")));
return this;
Expand Down Expand Up @@ -135,6 +194,21 @@ public ComplexSearchQueryBuilder journal(String journal) {
return this;
}

public ComplexSearchQueryBuilder terms(Collection<Term> terms) {
DominikVoigt marked this conversation as resolved.
Show resolved Hide resolved
terms.forEach(term -> {
String termText = term.text();
switch (term.field().toLowerCase()) {
case "author" -> this.author(termText);
case "title" -> this.titlePhrase(termText);
case "journal" -> this.journal(termText);
case "year" -> this.singleYear(Integer.valueOf(termText));
case "year-range" -> this.parseYearRange(termText);
case "default" -> this.defaultFieldPhrase(termText);
}
});
return this;
}

/**
* Instantiates the AdvancesSearchConfig from the provided Builder parameters
* If all text fields are empty an empty optional is returned
Expand All @@ -147,11 +221,30 @@ public ComplexSearchQuery build() throws IllegalStateException {
if (textSearchFieldsAndYearFieldsAreEmpty()) {
throw new IllegalStateException("At least one text field has to be set");
}
return new ComplexSearchQuery(defaultField, authors, titlePhrases, fromYear, toYear, singleYear, journal);
return new ComplexSearchQuery(defaultFieldPhrases, authors, titlePhrases, fromYear, toYear, singleYear, journal);
}

void parseYearRange(String termText) {
String[] split = termText.split("-");
int fromYear = 0;
int toYear = 9999;
try {
fromYear = Integer.parseInt(split[0]);
} catch (NumberFormatException e) {
// default value already set
}
if (split.length > 1) {
try {
toYear = Integer.parseInt(split[1]);
} catch (NumberFormatException e) {
// default value already set
}
}
this.fromYearAndToYear(fromYear, toYear);
}

private boolean textSearchFieldsAndYearFieldsAreEmpty() {
return StringUtil.isBlank(defaultField) && this.stringListIsBlank(titlePhrases) &&
return this.stringListIsBlank(defaultFieldPhrases) && this.stringListIsBlank(titlePhrases) &&
this.stringListIsBlank(authors) && StringUtil.isBlank(journal) && yearFieldsAreEmpty();
}

Expand Down
Loading