Fixes error indexing (very) large records from PDC Describe #724

hectorcorrea · 2024-11-21T23:09:42Z

Got the indexing to work even for very large records.

This is an example record indexed in staging with 60,000 files: https://pdc-discovery-staging.princeton.edu/discovery/catalog/doi-10-34770-n42z-hb72

One record still causes problems, but that will be handled in a separate PR, when we stop saving the file list to files_ss. We should be able to pick up that data from the pdc_describe_json_ss field instead.

The indexing process was taking almost half an hour to run, so it was colliding with the scheduled job that runs every half hour. I therefore updated the cron job to run once an hour instead.

Closes #711

hectorcorrea · 2025-01-10T22:07:11Z

app/lib/describe_indexer.rb

-    pdc_describe_json_node = doc.at('pdc_describe_json')
-    pdc_describe_json_node.add_child(cdata)
-    doc.to_s
+    parsed.to_xml


This is the biggest change in this PR. Instead of adding a CDATA XML element with the PDC record as JSON here, we use the JSON that is available in Traject as-is (see below).

hectorcorrea · 2025-01-10T22:09:17Z

config/traject/pdc_describe_indexing_config.rb

+# only once per record and save it to the context so that we can re-use it.
+each_record do |record, context|
+  xml = record.xpath("/hash").first.to_xml
+  context.clipboard[:record_json] = Hash.from_xml(xml).to_json


Here is where we create the JSON string with the PDC data (we take it from the XML, just as the code did before). This avoids having the data twice in the XML: once as XML elements and once as a giant CDATA section holding a string that represents the JSON.

hectorcorrea · 2025-01-13T17:22:33Z

config/traject/pdc_describe_indexing_config.rb

-  datacite = record.xpath("/hash/pdc_describe_json/text()").first.content
-  accumulator.concat [datacite]
+to_field 'pdc_describe_json_ss' do |_record, accumulator, context|
+  accumulator.concat [context.clipboard[:record_json]]


Use the value that we set in the each_record block.
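For readers unfamiliar with the pattern, here is a minimal plain-Ruby sketch of "serialize once per record, reuse from the clipboard". It does not use Traject itself: SketchContext and SketchPipeline are hypothetical stand-ins for the context object and the each_record / to_field blocks, and the record is made-up sample data.

```ruby
require "json"

# Hypothetical stand-in for the per-record context; only the clipboard
# (a plain Hash) is modeled here.
class SketchContext
  attr_reader :clipboard

  def initialize
    @clipboard = {}
  end
end

# Hypothetical stand-in for the indexing configuration.
class SketchPipeline
  attr_reader :serialize_calls

  def initialize
    @serialize_calls = 0
  end

  # Plays the role of the each_record block: serialize once, stash the string.
  def prepare(record, context)
    @serialize_calls += 1
    context.clipboard[:record_json] = record.to_json
  end

  # Plays the role of a to_field block: read the stashed string back
  # without serializing the record again.
  def field_value(context)
    [context.clipboard[:record_json]]
  end
end

# Made-up sample record.
record = {
  "title" => "Example dataset",
  "files" => [{ "name" => "a.csv" }, { "name" => "b.csv" }]
}

context = SketchContext.new
pipeline = SketchPipeline.new

pipeline.prepare(record, context)
first = pipeline.field_value(context)
second = pipeline.field_value(context)
```

Both reads return the same string and the record is serialized only once, which is the property that matters when a record lists tens of thousands of files.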

hectorcorrea · 2025-01-13T19:58:44Z

config/traject/pdc_describe_indexing_config.rb

+# only once per record and save it to the context so that we can re-use it.
+each_record do |record, context|
+  xml = record.xpath("/hash").first.to_xml
+  context.clipboard[:record_json] = Hash.from_xml(xml)["hash"].to_json


This is the same data we were adding to the XML CDATA element above.
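The difference from the earlier revision of this block is the ["hash"] lookup. Hash.from_xml keeps the XML root element as the single top-level key, so without the lookup the JSON would be nested one level deeper than before. The sketch below uses a hand-written Hash with that shape (made-up sample data, not a real PDC record) rather than calling Hash.from_xml, so it runs with the standard library only.

```ruby
require "json"

# Illustrative shape of what Hash.from_xml returns for a document rooted
# at <hash>: the root element name becomes the only top-level key.
parsed = {
  "hash" => {
    "title" => "Example dataset",
    "files" => [{ "name" => "a.csv" }]
  }
}

wrapped = parsed.to_json            # root key ends up in the JSON
unwrapped = parsed["hash"].to_json  # record fields sit at the top level

puts JSON.parse(wrapped).keys.inspect    # ["hash"]
puts JSON.parse(unwrapped).keys.inspect  # ["title", "files"]
```

Code that reads the stored JSON and expects title at the top level only works with the unwrapped form.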

carolyncole

Thanks @hectorcorrea! I'm surprised this change didn't impact any tests. Looks like maybe we have a testing hole. I'm not asking for this PR to do that though...

hectorcorrea · 2025-01-13T20:43:07Z

@carolyncole No tests were impacted because I made the new data completely compatible with what was there before. Several tests broke along the way while I tweaked the data to match 100%.

Started to diagnose the issue with importing very large records from …

af1bd84

…PDC Describe

kelynch mentioned this pull request Jan 8, 2025

Indexing strategy for large datasets #738

Open

hectorcorrea added 2 commits January 10, 2025 15:23

Got a working version. Still testing but it looks promising

1e3fe59

Use production describe as the source for indexing in staging

14f8382

pulbot temporarily deployed to staging January 10, 2025 20:28 Inactive

hectorcorrea commented Jan 10, 2025

Testing a few configuration options for indexing into Solr

4ddac75

pulbot temporarily deployed to staging January 13, 2025 15:50 Inactive

More tuning and tweaking

cb2e6c2

pulbot temporarily deployed to staging January 13, 2025 16:44 Inactive

Restore the indexing of files. Minor logging changes

6ce3111

pulbot temporarily deployed to staging January 13, 2025 17:21 Inactive

hectorcorrea commented Jan 13, 2025

Renable cronjob

9b712e1

pulbot temporarily deployed to staging January 13, 2025 19:06 Inactive

Made PDC JSON compatible with previous version to prevent changes thr…

683f020

…oughout the code base

hectorcorrea commented Jan 13, 2025

hectorcorrea marked this pull request as ready for review January 13, 2025 20:06

carolyncole approved these changes Jan 13, 2025

carolyncole merged commit f6f00d6 into main Jan 13, 2025
5 checks passed

carolyncole deleted the 711-import-error branch January 13, 2025 20:39

hectorcorrea restored the 711-import-error branch January 13, 2025 20:52
