Fixes error indexing (very) large records from PDC Describe #724
Conversation
pdc_describe_json_node = doc.at('pdc_describe_json')
pdc_describe_json_node.add_child(cdata)
doc.to_s
parsed.to_xml
This is the biggest change in this PR. Instead of adding a CDATA XML element with the PDC record as JSON here, we use the JSON that is already available in Traject as-is (see below).
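For context, here is a minimal runnable sketch of the CDATA approach being removed (the document shape and the record_json value are assumed stand-ins; only doc, cdata, and pdc_describe_json come from the diff above):

```ruby
require "nokogiri"
require "json"

# Sketch of the old approach: serialize the record to JSON and append it as a
# CDATA node inside the same XML document, so the data travels twice.
doc = Nokogiri::XML("<hash><pdc_describe_json/></hash>") # assumed shape
record_json = { "title" => "Example dataset" }.to_json   # hypothetical record

cdata = doc.create_cdata(record_json)
doc.at('pdc_describe_json').add_child(cdata)
doc.to_s # XML now carries the record both as elements and as a giant JSON string
```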
# only once per record and save it to the context so that we can re-use it.
each_record do |record, context|
  xml = record.xpath("/hash").first.to_xml
  context.clipboard[:record_json] = Hash.from_xml(xml).to_json
end
Here is where we create the JSON string with the PDC data (we take it from the XML just as the code did before), but this prevents having the data twice in the XML (once as XML elements and once as a giant CDATA string that represents the JSON).
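As a quick illustration of that conversion (a sketch with a made-up record; Hash.from_xml is the ActiveSupport helper):

```ruby
require "active_support/core_ext/hash/conversions"
require "json"

xml = "<hash><title>Example dataset</title><doi>10.34770/example</doi></hash>"

# Hash.from_xml keeps the root element as a hash key, so this produces
# {"hash":{...}} — a later tweak in this PR strips that root key.
Hash.from_xml(xml).to_json
# => '{"hash":{"title":"Example dataset","doi":"10.34770/example"}}'
```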
datacite = record.xpath("/hash/pdc_describe_json/text()").first.content
accumulator.concat [datacite]
to_field 'pdc_describe_json_ss' do |_record, accumulator, context|
  accumulator.concat [context.clipboard[:record_json]]
Use the value that we set in the each_record block.
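Put together, the pattern looks roughly like this (a sketch assembled from the hunks above, not the exact config file):

```ruby
# Build the record's JSON once per record and stash it in the Traject
# context clipboard...
each_record do |record, context|
  xml = record.xpath("/hash").first.to_xml
  context.clipboard[:record_json] = Hash.from_xml(xml).to_json
end

# ...then any field that needs the JSON reads the cached value instead of
# re-serializing the record.
to_field 'pdc_describe_json_ss' do |_record, accumulator, context|
  accumulator.concat [context.clipboard[:record_json]]
end
```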
# only once per record and save it to the context so that we can re-use it.
each_record do |record, context|
  xml = record.xpath("/hash").first.to_xml
  context.clipboard[:record_json] = Hash.from_xml(xml)["hash"].to_json
end
This is the same data we were adding to the XML CDATA element above.
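The only difference from the earlier hunk is the ["hash"] lookup, which drops the root key that Hash.from_xml keeps, so the JSON matches the shape of the old CDATA string. A small sketch of the difference:

```ruby
require "active_support/core_ext/hash/conversions"
require "json"

xml = "<hash><title>Example dataset</title></hash>"

Hash.from_xml(xml).to_json          # => '{"hash":{"title":"Example dataset"}}'
Hash.from_xml(xml)["hash"].to_json  # => '{"title":"Example dataset"}'
```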
Thanks @hectorcorrea! I'm surprised this change didn't impact any tests. Looks like maybe we have a testing hole. I'm not asking for this PR to do that, though...
@carolyncole No tests were impacted because I made the new data completely compatible with what was there before. There were several broken tests in the process while I tweaked the data to match 100%.
Got the indexing to work even for very large records.
This is an example record indexed in staging with 60,000 files: https://pdc-discovery-staging.princeton.edu/discovery/catalog/doi-10-34770-n42z-hb72
There is still one record that causes problems, but that will be handled in a separate PR when we stop saving the file list to files_ss - we should be able to pick up that data from the pdc_describe_json_ss field instead.

The indexing process was taking almost half an hour to run, so it was also colliding with the scheduled job that runs every half hour. Therefore I updated the cron job to run once an hour instead.
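For reference, a minimal sketch of that schedule change, assuming the cron entry is managed with the whenever gem (the rake task name is hypothetical):

```ruby
# config/schedule.rb — run the indexing hourly instead of every 30 minutes.
# every 30.minutes do            # previous schedule
every 1.hour do
  rake "index:update_records"    # hypothetical task name
end
```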
Closes #711