Solr cloud results "Collapse/Expand" bug #1049

Closed
syefimov opened this issue Apr 5, 2023 · 1 comment

@syefimov
Contributor

syefimov commented Apr 5, 2023

[ ] Bug report
storm-crawler-solr 2.8
Class: com.digitalpebble.stormcrawler.solr.persistence.SolrSpout
Method: populateBuffer()
Solr: Solr 8,8.2 (cloud mode)

Issue: with Collapse and Expand results in cloud mode, the same host appears in the results multiple times (once per shard). SolrJ then builds an invalid expandedResults object, and a ClassCastException (SolrDocumentList cannot be cast to SolrDocument) is thrown at line 161: docs.addAll(expandedResults.get(key));
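The failure mode can be reproduced in miniature. Below is a plain-Java sketch using stand-in classes (not the real SolrJ types, and no Solr server needed): the expanded section holds a whole document list under the collapse key, but the consuming code treats the value as a single document, which throws the same kind of ClassCastException as reported.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class ExpandCastRepro {
    // Stand-ins for the SolrJ types, used only to illustrate the bug.
    static class SolrDocument {}

    static class SolrDocumentList extends ArrayList<SolrDocument> {}

    // Mimics the failing read: the value stored under the collapse key
    // is a SolrDocumentList, but it gets cast to a single SolrDocument.
    static SolrDocument readAsSingleDocument(Map<String, Object> expanded, String key) {
        return (SolrDocument) expanded.get(key); // ClassCastException here
    }

    public static void main(String[] args) {
        Map<String, Object> expanded = new HashMap<>();
        // In cloud mode the same host can come back once per shard, so
        // the expanded results end up holding lists rather than single docs.
        expanded.put("example.com", new SolrDocumentList());
        try {
            readAsSingleDocument(expanded, "example.com");
        } catch (ClassCastException e) {
            System.out.println("Reproduced: " + e);
        }
    }
}
```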

Solution: use a grouping query instead, which fixes both the selection logic and the ClassCastException:

    protected void populateBuffer() {

        SolrQuery query = new SolrQuery();

        if (lastNextFetchDate == null) {
            lastNextFetchDate = Instant.now();
            lastStartOffset = 0;
            lastTimeResetToNOW = Instant.now();
        }
        // reset the value for next fetch date if the previous one is too
        // old
        else if (resetFetchDateAfterNSecs != -1) {
            Instant changeNeededOn =
                    Instant.ofEpochMilli(
                            lastTimeResetToNOW.toEpochMilli() + (resetFetchDateAfterNSecs * 1000));
            if (Instant.now().isAfter(changeNeededOn)) {
                LOG.info(
                        "lastDate reset based on resetFetchDateAfterNSecs {}",
                        resetFetchDateAfterNSecs);
                lastNextFetchDate = Instant.now();
                lastStartOffset = 0;
            }
        }

        query.setQuery("*:*")
                .addFilterQuery("nextFetchDate:[* TO " + lastNextFetchDate + "]")
                .setStart(lastStartOffset)
                .setRows(this.maxNumResults);

        if (StringUtils.isNotBlank(diversityField) && diversityBucketSize > 0) {
            query.set("indent", "true")
                    .set("group", "true")
                    .set("group.field", diversityField)
                    .set("group.limit", diversityBucketSize)
                    .set("group.sort", "nextFetchDate asc");
        }

        LOG.debug("QUERY => {}", query.toString());

        try {
            long startQuery = System.currentTimeMillis();
            QueryResponse response = connection.getClient().query(query);
            long endQuery = System.currentTimeMillis();

            queryTimes.addMeasurement(endQuery - startQuery);

            SolrDocumentList docs = new SolrDocumentList();

            LOG.debug("Response : {}", response.toString());

            // add the main results
            if (response.getResults() != null) {
                docs.addAll(response.getResults());
            }

            int groupsTotal = 0;
            // get groups
            if (response.getGroupResponse() != null) {
                for (GroupCommand groupCommand : response.getGroupResponse().getValues()) {
                    for (Group group : groupCommand.getValues()) {
                        groupsTotal++;
                        LOG.debug("Group : {}", group);
                        docs.addAll(group.getResult());
                    }
                }
            }

            int numhits =
                    (response.getResults() != null) ? response.getResults().size() : groupsTotal;

            // no more results?
            if (numhits == 0) {
                lastStartOffset = 0;
                lastNextFetchDate = null;
            } else {
                lastStartOffset += numhits;
            }

            String prefix = mdPrefix.concat(".");

            int alreadyProcessed = 0;
            int docReturned = 0;
            
            for (SolrDocument doc : docs) {
                String url = (String) doc.get("url");

                docReturned++;

                // is already being processed - skip it!
                if (beingProcessed.containsKey(url)) {
                    alreadyProcessed++;
                    continue;
                }

                Metadata metadata = new Metadata();

                Iterator<String> keyIterators = doc.getFieldNames().iterator();
                while (keyIterators.hasNext()) {
                    String key = keyIterators.next();

                    if (key.startsWith(prefix)) {
                        Collection<Object> values = doc.getFieldValues(key);

                        key = key.substring(prefix.length());
                        Iterator<Object> valueIterator = values.iterator();
                        while (valueIterator.hasNext()) {
                            String value = (String) valueIterator.next();
                            metadata.addValue(key, value);
                        }
                    }
                }

                buffer.add(url, metadata);
            }

            LOG.info(
                    "SOLR returned {} results from {} buckets in {} msec including {} already being processed",
                    docReturned,
                    numhits,
                    (endQuery - startQuery),
                    alreadyProcessed);

        } catch (Exception e) {
            LOG.error("Exception while querying Solr", e);
        }
    }
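For reference, the grouping branch in the method above amounts to the raw Solr request parameters built below. This is a minimal plain-Java sketch with no SolrJ dependency; "host" and 5 are illustrative stand-ins for the configured diversityField and diversityBucketSize.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupingParamsSketch {
    // Builds the same parameters that populateBuffer() sets on the
    // SolrQuery when a diversityField is configured.
    static Map<String, String> build(String diversityField, int bucketSize) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("group", "true");
        params.put("group.field", diversityField);
        params.put("group.limit", String.valueOf(bucketSize));
        params.put("group.sort", "nextFetchDate asc");
        return params;
    }

    public static void main(String[] args) {
        // e.g. group on "host", returning up to 5 URLs per host bucket
        build("host", 5).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```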
@jnioche
Contributor

jnioche commented Apr 6, 2023

Could you please contribute a PR instead and link it to this issue? It will make it easier to see the difference in code and comment on your suggestions. Thanks!

syefimov added a commit to syefimov/storm-crawler that referenced this issue Apr 6, 2023
Solr cloud results "Collapse/Expand" bug apache#1049
jnioche pushed a commit that referenced this issue May 2, 2023
…rouping" query. (#1053)

* Update SolrSpout.java

Solr cloud results "Collapse/Expand" bug #1049

* Create DeletionBolt.java

storm-crawler-solr bug. Missing DeletionBolt bolt code. #1050

* ynch

* Change to results grouping in solr query

* Update SolrSpout.java
@jnioche jnioche closed this as completed May 2, 2023