Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes error indexing (very) large records from PDC Describe #724

Merged
merged 8 commits into from
Jan 13, 2025

Conversation

hectorcorrea
Copy link
Member

@hectorcorrea hectorcorrea commented Nov 21, 2024

Got the indexing to work even for very large records.

This is an example record indexed in staging with 60,000 files: https://pdc-discovery-staging.princeton.edu/discovery/catalog/doi-10-34770-n42z-hb72

There is still one record that causes problems but that would be handled in a separate PR when we stop saving the file list to files_ss - we should be able to pick up that data from the pdc_describe_json_ss field instead.

The indexing process was taking almost 1/2 hr to run so it was also colliding with the scheduled job that runs every half an hour. Therefore I updated the cron job to run once an hour instead.

Closes #711

pdc_describe_json_node = doc.at('pdc_describe_json')
pdc_describe_json_node.add_child(cdata)
doc.to_s
parsed.to_xml
Copy link
Member Author

@hectorcorrea hectorcorrea Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the biggest change in this PR. Instead of adding a CDATA XML element with the PDC record as JSON here...we use the JSON that is available in Traject as-is (see below)

# only once per record and save it to the context so that we can re-use it.
each_record do |record, context|
xml = record.xpath("/hash").first.to_xml
context.clipboard[:record_json] = Hash.from_xml(xml).to_json
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where we create the JSON string with the PDC data (we are taking it from the XML just as the code did before) but this prevents from having the data twice in the XML (one as XML elements and one as giant CDATA with a string that represents the JSON)

datacite = record.xpath("/hash/pdc_describe_json/text()").first.content
accumulator.concat [datacite]
to_field 'pdc_describe_json_ss' do |_record, accumulator, context|
accumulator.concat [context.clipboard[:record_json]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the value that we set in the each_record block.

# only once per record and save it to the context so that we can re-use it.
each_record do |record, context|
xml = record.xpath("/hash").first.to_xml
context.clipboard[:record_json] = Hash.from_xml(xml)["hash"].to_json
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same data we were adding to the XML CDATA element above.

@hectorcorrea hectorcorrea marked this pull request as ready for review January 13, 2025 20:06
Copy link
Member

@carolyncole carolyncole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hectorcorrea! I'm surprised this change didn't impact any tests. Looks like maybe we have a testing hole. I'm not asking fo this PR to do that though...

@carolyncole carolyncole merged commit f6f00d6 into main Jan 13, 2025
5 checks passed
@carolyncole carolyncole deleted the 711-import-error branch January 13, 2025 20:39
@hectorcorrea
Copy link
Member Author

@carolyncole No test were impacted because I made the new data be completely compatible with what was there before. There were several broken tests in the process while I tweaked the data to match 100%.

@hectorcorrea hectorcorrea restored the 711-import-error branch January 13, 2025 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import process from PDC Describe is crashing
3 participants