CVOC : Indexed field accuracy (Ontoportal integration) #10505

Merged
Changes from 2 commits
8 changes: 8 additions & 0 deletions doc/release-notes/9276-doc-cvoc-index-in.md
@@ -0,0 +1,8 @@
## Release Highlights

### Updates on Support for External Vocabulary Services

#### Indexed field accuracy
Member

@qqmyers qqmyers Apr 23, 2024

Suggested change:

Updates on Support for External Vocabulary Services

Multiple extensions of the External Vocabulary mechanism have been added. These extensions allow interaction with services based on the Ontoportal software and are expected to be generally useful for other service types.

These changes include:

Improved Indexing with Compound Fields

When using an external vocabulary service with compound fields, you can now specify which field(s) will include additional indexed information, such as translations of an entry into other languages. This is done by adding the `indexIn` key in `retrieval-filtering`.
For more information, please check the GDCC/dataverse-external-vocab-support documentation.

Broader Support for Indexing Service Responses

Indexing of the results from 'retrieval-filtering' responses can now handle additional formats, including JSON arrays of strings and values from arbitrary keys within a JSON object.

(+ entries from the other Ontoportal PRs)


For more relevant indexing, you can now map external vocabulary values to one of the `managed-fields` of a [:CVocConf setting](https://guides.dataverse.org/en/6.3/installation/config.html#cvocconf) by adding the key `indexIn` in `retrieval-filtering`.
For more information, please check [GDCC/dataverse-external-vocab-support documentation](https://github.com/gdcc/dataverse-external-vocab-support/tree/main/docs).
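
To make the `indexIn` rule concrete, here is a minimal Java sketch of the decision it drives at indexing time. It models the `retrieval-filtering` block with plain `java.util` maps instead of `jakarta.json` objects so that it runs without dependencies. The class `IndexInSketch`, the method `contributes`, the filter key `synonyms`, and the field names `keywordTermURL` and `keywordValue` are assumptions chosen for illustration, not names confirmed by this PR.

```java
import java.util.HashMap;
import java.util.Map;

final class IndexInSketch {

    /**
     * Decides whether the value stored under filterKey should be indexed into indexingField.
     * filtering maps each retrieval-filtering key to its settings, which may contain "indexIn".
     */
    static boolean contributes(String filterKey, Map<String, Map<String, String>> filtering,
                               String termUriField, String indexingField) {
        Map<String, String> settings = filtering.get(filterKey);
        String indexIn = (settings == null) ? null : settings.get("indexIn");
        if (indexIn != null) {
            // Mapping mode: the key feeds only the field named by indexIn.
            return indexIn.equals(indexingField);
        }
        // Default mode: only termName and personName feed the term-uri-field.
        boolean defaultKey = filterKey.equals("termName") || filterKey.equals("personName");
        return defaultKey && termUriField.equals(indexingField);
    }

    public static void main(String[] args) {
        // Hypothetical configuration: termName is mapped, synonyms is not.
        Map<String, Map<String, String>> filtering = new HashMap<>();
        filtering.put("termName", Map.of("indexIn", "keywordValue"));
        filtering.put("synonyms", new HashMap<>());

        System.out.println(contributes("termName", filtering, "keywordTermURL", "keywordValue"));
        System.out.println(contributes("synonyms", filtering, "keywordTermURL", "keywordTermURL"));
    }
}
```

Under this reading, a key that declares `indexIn` never falls back to the term-uri-field; only `termName` and `personName` do so when `indexIn` is absent.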
93 changes: 62 additions & 31 deletions src/main/java/edu/harvard/iq/dataverse/DatasetFieldServiceBean.java
@@ -41,6 +41,7 @@

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpResponse;
import org.apache.http.HttpResponseInterceptor;
import org.apache.http.client.methods.HttpGet;
@@ -321,14 +322,15 @@ public Map<Long, JsonObject> getCVocConf(boolean byTermUriField){
+ jo.getString("term-uri-field"));
}
}
if (jo.containsKey("child-fields")) {
JsonArray childFields = jo.getJsonArray("child-fields");
for (JsonString elm : childFields.getValuesAs(JsonString.class)) {
dft = findByNameOpt(elm.getString());
logger.info("Found: " + dft.getName());
if (jo.containsKey("managed-fields")) {
Member

Just confirming - this was a bug? - the schema requires managed-fields these days

Contributor Author

Not a bug; I would say it is old code. `child-fields` does not exist anymore, so I replaced it with `managed-fields`. I think this code is not useful, but at least the key is no longer obsolete.

Member

Right - looks like it is just for logging purposes/to warn about incorrect field choices.

JsonObject managedFields = jo.getJsonObject("managed-fields");
for (String s : managedFields.keySet()) {
dft = findByNameOpt(managedFields.getString(s));
if (dft == null) {
logger.warning("Ignoring External Vocabulary setting for non-existent child field: "
+ elm.getString());
+ managedFields.getString(s));
} else {
logger.info("Found: " + dft.getName());
Member

change logger.info() to fine()

}
}
}
@@ -345,7 +347,7 @@ public Map<Long, JsonObject> getCVocConf(boolean byTermUriField){
* @param df - the primitive/parent compound field containing a newly saved value
*/
public void registerExternalVocabValues(DatasetField df) {
DatasetFieldType dft =df.getDatasetFieldType();
DatasetFieldType dft = df.getDatasetFieldType();
logger.fine("Registering for field: " + dft.getName());
JsonObject cvocEntry = getCVocConf(true).get(dft.getId());
if (dft.isPrimitive()) {
@@ -366,38 +368,48 @@ public void registerExternalVocabValues(DatasetField df) {
}
}
}

/**
* Retrieves indexable strings from a cached externalvocabularyvalue entry.
*
* This method assumes externalvocabularyvalue entries have been filtered and
* the externalvocabularyvalue entry contain a single JsonObject whose "personName" or "termName" values
* are either Strings or an array of objects with "lang" and ("value" or "content") keys. The
* string, or the "value/content"s for each language are added to the set.
*
* Retrieves indexable strings from a cached externalvocabularyvalue entry filtered through retrieval-filtering configuration.
* <p>
* This method externalvocabularyvalue entries have been filtered and contains a single JsonObject.
Member

@qqmyers qqmyers Apr 23, 2024

Suggested change:
This method assumes externalvocabularyvalue entries have been filtered and that they contain a single JsonObject.

* Is handled : Strings, Array of Objects with "lang" and ("value" or "content") keys, Object with Strings as value or Object with Array of Strings as value.
Member

@qqmyers qqmyers Apr 23, 2024

Suggested change: "Is handled :" -> "Cases Handled: A String, an Array of Strings, an Array of Objects with "value" or "content" keys, an Object with one or more entries that have String values or Array values with a set of String values."

(Trying to wordsmith as well as removing the "lang" requirement which isn't in the code.)

* The string, or the "value/content"s for each language are added to the set.
Member

Suggested change:
The string(s), or ...

* This method can retrieve string values to be indexed in term-uri-field (parameter defined in CVOC configuration) or in "indexIn" field (optional parameter of retrieval-filtering defined in CVOC configuration).
Member

Suggested change:
Retrieved string values are indexed in the term-uri-field (parameter defined in CVOC configuration) by default, or in the field specified by an optional "indexIn" parameter in the retrieval-filtering defined in the CVOC configuration.

* <p>
* Any parsing error results in no entries (there can be unfiltered entries with
* unknown structure - getting some strings from such an entry could give fairly
* random info that would be bad to add for searches, etc.)
*
* @param termUri
*
* @param termUri unique identifier to search in database
* @param cvocEntry related cvoc configuration
* @param indexingField name of solr field that will be filled with getStringsFor while indexing
* @return - a set of indexable strings
*/
public Set<String> getStringsFor(String termUri) {
Set<String> strings = new HashSet<String>();
public Set<String> getIndexableStringsByTermUri(String termUri, JsonObject cvocEntry, String indexingField) {
Set<String> strings = new HashSet<>();
JsonObject jo = getExternalVocabularyValue(termUri);
JsonObject filtering = cvocEntry.getJsonObject("retrieval-filtering");
String termUriField = cvocEntry.getJsonString("term-uri-field").getString();

if (jo != null) {
try {
for (String key : jo.keySet()) {
if (key.equals("termName") || key.equals("personName")) {
String indexIn = filtering.getJsonObject(key).getString("indexIn", null);
// Either we are in mapping mode so indexingField (solr field) equals indexIn (cvoc config)
// Or we are in default mode indexingField is termUriField, indexIn is not defined then only termName and personName keys are used
if (indexingField.equals(indexIn) ||
(indexIn == null && termUriField.equals(indexingField) && (key.equals("termName") || key.equals("personName")))) {
JsonValue jv = jo.get(key);
if (jv.getValueType().equals(JsonValue.ValueType.STRING)) {
logger.fine("adding " + jo.getString(key) + " for " + termUri);
strings.add(jo.getString(key));
} else {
if (jv.getValueType().equals(JsonValue.ValueType.ARRAY)) {
JsonArray jarr = jv.asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
} else if (jv.getValueType().equals(JsonValue.ValueType.ARRAY)) {
JsonArray jarr = jv.asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
if (jarr.get(i).getValueType().equals(JsonValue.ValueType.STRING)) {
strings.add(jarr.getString(i));
} else if (jarr.get(i).getValueType().equals(ValueType.OBJECT)) { // This condition handles Skosmos format like [{"lang": "en","value": "non-apis bee"},{"lang": "fr","value": "abeille non apis"}]
JsonObject entry = jarr.getJsonObject(i);
if (entry.containsKey("value")) {
logger.fine("adding " + entry.getString("value") + " for " + termUri);
@@ -409,6 +421,22 @@ public Set<String> getStringsFor(String termUri) {
}
}
}
} else if (jv.getValueType().equals(JsonValue.ValueType.OBJECT)) {
JsonObject joo = jv.asJsonObject();
for (Map.Entry<String, JsonValue> entry : joo.entrySet()) {
if (entry.getValue().getValueType().equals(JsonValue.ValueType.STRING)) { // This condition handles format like { "fr": "association de quartier", "en": "neighborhood associations"}
logger.fine("adding " + joo.getString(entry.getKey()) + " for " + termUri);
strings.add(joo.getString(entry.getKey()));
} else if (entry.getValue().getValueType().equals(ValueType.ARRAY)) { // This condition handles format like {"en": ["neighbourhood societies"]}
JsonArray jarr = entry.getValue().asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
if (jarr.get(i).getValueType().equals(JsonValue.ValueType.STRING)) {
logger.fine("adding " + jarr.getString(i) + " for " + termUri);
strings.add(jarr.getString(i));
}
}
}
}
}
}
}
@@ -420,7 +448,7 @@ public Set<String> getStringsFor(String termUri) {
}
logger.fine("Returning " + String.join(",", strings) + " for " + termUri);
return strings;
}
}
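
The value shapes discussed in the Javadoc review can be summarized in a small standalone sketch. It uses `String`, `List`, and `Map` from `java.util` as stand-ins for the `jakarta.json` value types, and `IndexableStringsSketch.collect` is a hypothetical helper, not a method from `DatasetFieldServiceBean`. Shapes it does not recognize contribute nothing, in line with the stated intent that entries of unknown structure should not leak arbitrary text into the index.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IndexableStringsSketch {

    /** Collects indexable strings from one filtered value; unknown shapes yield an empty set. */
    static Set<String> collect(Object value) {
        Set<String> out = new LinkedHashSet<>();
        if (value instanceof String) {
            // Case 1: a plain string.
            out.add((String) value);
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (item instanceof String) {
                    // Case 2: an array of strings.
                    out.add((String) item);
                } else if (item instanceof Map) {
                    // Case 3: an array of objects carrying "value" or "content".
                    Map<?, ?> entry = (Map<?, ?>) item;
                    Object text = entry.containsKey("value") ? entry.get("value") : entry.get("content");
                    if (text instanceof String) {
                        out.add((String) text);
                    }
                }
            }
        } else if (value instanceof Map) {
            for (Object inner : ((Map<?, ?>) value).values()) {
                if (inner instanceof String) {
                    // Case 4: an object whose entries are strings, e.g. one per language.
                    out.add((String) inner);
                } else if (inner instanceof List) {
                    // Case 5: an object whose entries are arrays of strings.
                    for (Object item : (List<?>) inner) {
                        if (item instanceof String) {
                            out.add((String) item);
                        }
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of(
                Map.of("lang", "en", "value", "non-apis bee"),
                Map.of("lang", "fr", "value", "abeille non apis"))));
    }
}
```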

/**
* Perform a query to retrieve a cached value from the externalvocabularyvalue table
@@ -454,10 +482,11 @@ public JsonObject getExternalVocabularyValue(String termUri) {
public void registerExternalTerm(JsonObject cvocEntry, String term) {
String retrievalUri = cvocEntry.getString("retrieval-uri");
String prefix = cvocEntry.getString("prefix", null);
if(term.isBlank()) {
logger.fine("Ingoring blank term");
if(StringUtils.isBlank(term)) {
logger.fine("Ignoring blank term");
return;
}

boolean isExternal = false;
JsonObject vocabs = cvocEntry.getJsonObject("vocabs");
for (String key: vocabs.keySet()) {
@@ -512,7 +541,7 @@ public void process(HttpResponse response, HttpContext context) throws HttpExcep
if (statusCode == 200) {
logger.fine("Returned data: " + data);
try (JsonReader jsonReader = Json.createReader(new StringReader(data))) {
String dataObj =filterResponse(cvocEntry, jsonReader.readObject(), term).toString();
String dataObj = filterResponse(cvocEntry, jsonReader.readObject(), term).toString();
evv.setValue(dataObj);
evv.setLastUpdateDate(Timestamp.from(Instant.now()));
logger.fine("JsonObject: " + dataObj);
@@ -543,7 +572,7 @@ public void process(HttpResponse response, HttpContext context) throws HttpExcep
* Parse the raw value returned by an external service for a given term uri and
* filter it according to the 'retrieval-filtering' configuration for this
* DatasetFieldType, creating a Json value with the specified structure
*
*
* @param cvocEntry - the config for this DatasetFieldType
* @param readObject - the raw response from the service
* @param termUri - the term uri
@@ -602,6 +631,8 @@ private JsonObject filterResponse(JsonObject cvocEntry, JsonObject readObject, S
if (pattern.equals("{0}")) {
if (vals.get(0) instanceof JsonArray) {
job.add(filterKey, (JsonArray) vals.get(0));
} else if (vals.get(0) instanceof JsonObject) {
job.add(filterKey, (JsonObject) vals.get(0));
} else {
job.add(filterKey, (String) vals.get(0));
}
@@ -639,7 +670,7 @@ Object processPathSegment(int index, String[] pathParts, JsonValue curPath, Stri
String[] keyVal = pathParts[index].split("=");
logger.fine("Looking for object where " + keyVal[0] + " is " + keyVal[1]);
String expected = keyVal[1];

if (!expected.equals("*")) {
if (expected.equals("@id")) {
expected = termUri;
@@ -668,7 +699,7 @@ Object processPathSegment(int index, String[] pathParts, JsonValue curPath, Stri
}
return parts.build();
}

} else {
curPath = ((JsonObject) curPath).get(pathParts[index]);
logger.fine("Found next Path object " + curPath.toString());
Original file line number Diff line number Diff line change
@@ -922,6 +922,20 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long

Set<String> langs = settingsService.getConfiguredLanguages();
Map<Long, JsonObject> cvocMap = datasetFieldService.getCVocConf(true);
Map<Long, Set<String>> cvocManagedFieldMap = new HashMap<>();
for (Map.Entry<Long, JsonObject> cvocEntry : cvocMap.entrySet()) {
if(cvocEntry.getValue().containsKey("managed-fields")) {
JsonObject managedFields = cvocEntry.getValue().getJsonObject("managed-fields");
Set<String> managedFieldValues = new HashSet<>();
for (String s : managedFields.keySet()) {
managedFieldValues.add(managedFields.getString(s));
}
cvocManagedFieldMap.put(cvocEntry.getKey(), managedFieldValues);
}
}



Set<String> metadataBlocksWithValue = new HashSet<>();
for (DatasetField dsf : datasetVersion.getFlatDatasetFields()) {

@@ -996,19 +1010,39 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
}
solrInputDocument.addField(SearchFields.NAME_SORT, dsf.getValues());
}


// If there is a CVOCConf for the field
if(cvocMap.containsKey(dsfType.getId())) {
List<String> vals = dsf.getValues_nondisplay();
Set<String> searchStrings = new HashSet<String>();
Set<String> searchStrings = new HashSet<>();
for (String val: vals) {
searchStrings.add(val);
searchStrings.addAll(datasetFieldService.getStringsFor(val));
// Try to get string values from externalvocabularyvalue using val as termUri
searchStrings.addAll(datasetFieldService.getIndexableStringsByTermUri(val, cvocMap.get(dsfType.getId()), dsfType.getName()));

if(dsfType.getParentDatasetFieldType()!=null) {
List<DatasetField> childDatasetFields = dsf.getParentDatasetFieldCompoundValue().getChildDatasetFields();
for (DatasetField df : childDatasetFields) {
if(cvocManagedFieldMap.get(dsfType.getId()).contains(df.getDatasetFieldType().getName())) {
String solrManagedFieldSearchable = df.getDatasetFieldType().getSolrField().getNameSearchable();
// Try to get string values from externalvocabularyvalue but for a managed fields of the CVOCConf
Set<String> stringsForManagedField = datasetFieldService.getIndexableStringsByTermUri(val, cvocMap.get(dsfType.getId()), df.getDatasetFieldType().getName());
logger.fine(solrManagedFieldSearchable + " filled with externalvocabularyvalue : " + stringsForManagedField);
//.addField works as addition of value not a replace of value
// it allows to add mapped values by CVOCConf before or after indexing real DatasetField value(s) of solrManagedFieldSearchable
solrInputDocument.addField(solrManagedFieldSearchable, stringsForManagedField);
}
}
}
}
logger.fine(solrFieldSearchable + " filled with externalvocabularyvalue : " + searchStrings);
solrInputDocument.addField(solrFieldSearchable, searchStrings);
if (dsfType.getSolrField().isFacetable()) {
logger.fine(solrFieldFacetable + " gets " + vals);
solrInputDocument.addField(solrFieldFacetable, vals);
}
}
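
The lookup `cvocManagedFieldMap.get(dsfType.getId()).contains(...)` relies on every CVOC entry having `managed-fields`, which the review notes the schema requires. If that guarantee were ever relaxed, a defensive variant might look like the sketch below. It uses plain maps in place of the JSON configuration; `ManagedFieldLookupSketch`, `managedFieldNames`, and `isManaged` are hypothetical names, and the sample field names are illustrative only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ManagedFieldLookupSketch {

    /** Maps each CVOC-configured field id to the set of field names listed under its managed-fields. */
    static Map<Long, Set<String>> managedFieldNames(Map<Long, Map<String, String>> managedFieldsById) {
        Map<Long, Set<String>> result = new HashMap<>();
        for (Map.Entry<Long, Map<String, String>> e : managedFieldsById.entrySet()) {
            result.put(e.getKey(), new HashSet<>(e.getValue().values()));
        }
        return result;
    }

    /** Null-safe membership test: a field id without managed-fields simply manages nothing. */
    static boolean isManaged(Map<Long, Set<String>> managed, Long fieldTypeId, String childName) {
        return managed.getOrDefault(fieldTypeId, Collections.emptySet()).contains(childName);
    }

    public static void main(String[] args) {
        Map<Long, Map<String, String>> conf = new HashMap<>();
        conf.put(1L, Map.of("termName", "keywordValue", "vocabularyName", "keywordVocabulary"));
        Map<Long, Set<String>> managed = managedFieldNames(conf);
        System.out.println(isManaged(managed, 1L, "keywordValue"));
        System.out.println(isManaged(managed, 2L, "keywordValue"));
    }
}
```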

if (dsfType.isControlledVocabulary()) {
/** If the cvv list is empty but the dfv list is not then it is assumed this was harvested
* from an installation that had controlled vocabulary entries that don't exist in our this db