Community Batch Imports endpoint #8122

@scottbarnes scottbarnes commented Jul 24, 2023

Closes #7705

Feature.

This adds a (currently admin-only) endpoint at /import/batch/new for community batch imports.

Technical

This:

  • creates an endpoint at /import/batch/new that takes JSONL data via a multipart POST;
  • allows # as a comment in JSONL, just to make life a bit easier;
  • validates incoming data and requires the same things required to create a rec for import via load();
  • uses source_records[0] for the ia_id;
  • uses the Batch class for imports;
  • relies on there being a 'submitter' column in the import_item table;
  • only works for people in the /usergroup/admin group; and
  • generates a hash based on the incoming JSON byte string to use as the name of the import batch (so that an attempted import of the same items will get the same name). Note: this isn't super sophisticated as the hash would change if the order changed.
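The parsing and validation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `parse_jsonl` and `REQUIRED_FIELDS` are hypothetical names, and the required-field set here is a simplified assumption (the real requirements are whatever load() needs).

```python
import json

# Hypothetical, simplified field set; the real requirements come from load().
REQUIRED_FIELDS = ("title", "source_records")


def parse_jsonl(raw: bytes):
    """Return (records, errors) for a JSONL payload.

    Blank lines and lines starting with '#' are skipped. Errors are
    (line_number, message) tuples so they can be shown to the uploader.
    """
    records, errors = [], []
    lines = raw.decode("utf-8").splitlines()
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment or blank line
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError as e:
            errors.append((line_number, f"invalid JSON: {e.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_number, "expected a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            errors.append((line_number, "missing: " + ", ".join(missing)))
            continue
        if not isinstance(record["source_records"], list):
            errors.append((line_number, "source_records must be a list"))
            continue
        records.append(record)
    return records, errors
```

With a parser shaped like this, a valid record's `source_records[0]` would then serve as its ia_id.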

Some outstanding questions:

  • We've mentioned using both /import/batch/new and /api/import/bulk; which do we want to use, if either?
  • If there should be a limit on the number of items in the batch, what should that limit be?
  • If there should be a limit on the batch size that's given to Batch.add_items(), what should that limit be?
  • Is the hash strategy for generating a batch name acceptable? I was trying to find something that would:
    • be unique for any given set of items (in a given order...); and
    • allow for different batch names to be generated within a few seconds of one another.
  • If the hash strategy is acceptable, is there any need to truncate the hash? I was just doing that to try to keep the table readable.
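For discussion, the hash-naming idea can be sketched like this. The choice of SHA-256, the `batch-` prefix, and the 12-character truncation are all assumptions for illustration, not necessarily what this PR does; `order_insensitive_name` is a hypothetical variant that sidesteps the ordering caveat by sorting lines before hashing.

```python
import hashlib


def batch_name_for(raw: bytes, length: int = 12) -> str:
    """Derive a batch name from the raw upload bytes.

    Identical bytes always give the same name; any change, including
    reordering lines, gives a different one.
    """
    digest = hashlib.sha256(raw).hexdigest()
    return f"batch-{digest[:length]}"  # truncated only for readability


def order_insensitive_name(raw: bytes, length: int = 12) -> str:
    """Hypothetical variant: sort non-empty lines first so order is ignored."""
    lines = sorted(line for line in raw.splitlines() if line.strip())
    digest = hashlib.sha256(b"\n".join(lines)).hexdigest()
    return f"batch-{digest[:length]}"
```

Truncating the digest trades collision resistance for a more readable table; 12 hex characters (48 bits) is likely plenty for batch names, but that is a judgment call.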

This was also tested against some of the data from @Billa05's work in #8551, found here: #8551 (comment)

Testing

Visit http://localhost:8080/import/batch/new
image

Try to upload a JSONL file with some errors:

{"title": "Blob Book 1", "source_records": "blob_source", "authors": [{"name": "Blob Author 1"}], "publishers": "Fail Publishers", "publish_date": "January 1, 2000", "isbn_10": "1111111111"}
{'blah': True}
{"title": "Blob Book 2", "source_records": "blob_source", "authors": [{"name": "Blob Author 2"}], "publishers": "Fail Publishers", "publish_date": "January 2, 2000", "isbn_10": "2222222222"}
{"source_records": ["blob_source"], "authors": [{"name": "Blob Author 2"}], "publishers": ["Not Fail Publishers"], "publish_date": "January 2, 2000", "isbn_10": "2222222222"}
{"title": "Blob Book 3", "source_records": ["blob_source"], "authors": [{"name": "Blob Author 2"}], "publishers": ["Not Fail Publishers"], "publish_date": "January 2, 2040", "isbn_10": "2222222222"}
{"identifiers": {"open_textbook_library": ["1581"]}, "source_records": ["open_textbook_library:1581"], "title": "Legal Fundamentals of Healthcare Law", "languages": ["eng"], "description": "Healthcare, a field dedicated to the well-being of individuals and communities, operates within an intricate web of legal principles. Understanding these laws is not simply a professional necessity for doctors, nurses, administrators, and researchers; it\u2019s also an ethical imperative for anyone who interacts with the healthcare system. This book is your compass, guiding you through the labyrinth of legal fundamentals that shape the landscape of healthcare.", "subjects": ["Medicine", "Law"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2024", "authors": [{"name": "Tiffany Jackman"}], "lc_classifications": ["RA440", "KF385.A4"]}
{"identifiers": {"open_textbook_library": ["1580"]}, "source_records": ["open_textbook_library:1580"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us", "languages": ["eng"], "description": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us introduces college freshmen to the study of literature through a focus on texts that, generally, they already know, or think they know, and how those texts aim to shape audiences to be compliant cultural objects. The book is organized around several prominent story groups, including various genres and forms, meant to promote discussion and discovery leading to students\u2019 understanding that these texts function as cultural sculptors of readers\u2019 principles and behaviors. Students develop the skill of analyzing texts and creating sound arguments about them through class discussions and a series of writing assignments. Ideally, they leave the course understanding how to create a sound argument and, more pointedly, that there is no such thing as \u201cjust a story.\u201d", "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2023", "authors": [{"name": "Judy Young"}], "lc_classifications": ["PE1408"]}

See that the validation errors are displayed to the patrons:
image

Fix the errors by... commenting out the problematic lines:

# {"title": "Blob Book 1", "source_records": "blob_source", "authors": [{"name": "Blob Author 1"}], "publishers": "Fail Publishers", "publish_date": "January 1, 2000", "isbn_10": "1111111111"}
# {'blah': True}
# {"title": "Blob Book 2", "source_records": "blob_source", "authors": [{"name": "Blob Author 2"}], "publishers": "Fail Publishers", "publish_date": "January 2, 2000", "isbn_10": "2222222222"}
# {"source_records": ["blob_source"], "authors": [{"name": "Blob Author 2"}], "publishers": ["Not Fail Publishers"], "publish_date": "January 2, 2000", "isbn_10": "2222222222"}
# {"title": "Blob Book 3", "source_records": ["blob_source"], "authors": [{"name": "Blob Author 2"}], "publishers": ["Not Fail Publishers"], "publish_date": "January 2, 2040", "isbn_10": "2222222222"}
{"identifiers": {"open_textbook_library": ["1581"]}, "source_records": ["open_textbook_library:1581"], "title": "Legal Fundamentals of Healthcare Law", "languages": ["eng"], "description": "Healthcare, a field dedicated to the well-being of individuals and communities, operates within an intricate web of legal principles. Understanding these laws is not simply a professional necessity for doctors, nurses, administrators, and researchers; it\u2019s also an ethical imperative for anyone who interacts with the healthcare system. This book is your compass, guiding you through the labyrinth of legal fundamentals that shape the landscape of healthcare.", "subjects": ["Medicine", "Law"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2024", "authors": [{"name": "Tiffany Jackman"}], "lc_classifications": ["RA440", "KF385.A4"]}
{"identifiers": {"open_textbook_library": ["1580"]}, "source_records": ["open_textbook_library:1580"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us", "languages": ["eng"], "description": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us introduces college freshmen to the study of literature through a focus on texts that, generally, they already know, or think they know, and how those texts aim to shape audiences to be compliant cultural objects. The book is organized around several prominent story groups, including various genres and forms, meant to promote discussion and discovery leading to students\u2019 understanding that these texts function as cultural sculptors of readers\u2019 principles and behaviors. Students develop the skill of analyzing texts and creating sound arguments about them through class discussions and a series of writing assignments. Ideally, they leave the course understanding how to create a sound argument and, more pointedly, that there is no such thing as \u201cjust a story.\u201d", "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2023", "authors": [{"name": "Judy Young"}], "lc_classifications": ["PE1408"]}

Verify the records are not in the import_item table:

openlibrary=# SELECT * FROM import_item;
 id | batch_id | added_time | import_time | status | error | ia_id | data | ol_key | comments | submitter 
----+----------+------------+-------------+--------+-------+-------+------+--------+----------+-----------
(0 rows)

Then try to upload again:
image

See that the items ended up in import_item:

openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]----------------------------------------------
id          | 12
batch_id    | 3
added_time  | 2024-06-13 04:22:33.119869
import_time | 
status      | pending
error       | 
ia_id       | open_textbook_library:1581
data        | {"authors": [{"name": "Tiffany Jackman"}], "description": "Healthcare, a field dedicated to the well-being of individuals and communities, operates within an intricate web of legal principles. Understanding these laws is not simply a professional necessity for doctors, nurses, administrators, and researchers; it\u2019s also an ethical imperative for anyone who interacts with the healthcare system. This book is your compass, guiding you through the labyrinth of legal fundamentals that shape the landscape of healthcare.", "identifiers": {"open_textbook_library": ["1581"]}, "languages": ["eng"], "lc_classifications": ["RA440", "KF385.A4"], "publish_date": "2024", "publishers": ["University of West Florida Pressbooks"], "source_records": ["open_textbook_library:1581"], "subjects": ["Medicine", "Law"], "title": "Legal Fundamentals of Healthcare Law"}
ol_key      | 
comments    | 
submitter   | openlibrary
-[ RECORD 2 ]----------------------------------------------
id          | 13
batch_id    | 3
added_time  | 2024-06-13 04:22:33.119869
import_time | 
status      | pending
error       | 
ia_id       | open_textbook_library:1580
data        | {"authors": [{"name": "Judy Young"}], "description": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us introduces college freshmen to the study of literature through a focus on texts that, generally, they already know, or think they know, and how those texts aim to shape audiences to be compliant cultural objects. The book is organized around several prominent story groups, including various genres and forms, meant to promote discussion and discovery leading to students\u2019 understanding that these texts function as cultural sculptors of readers\u2019 principles and behaviors. Students develop the skill of analyzing texts and creating sound arguments about them through class discussions and a series of writing assignments. Ideally, they leave the course understanding how to create a sound argument and, more pointedly, that there is no such thing as \u201cjust a story.\u201d", "identifiers": {"open_textbook_library": ["1580"]}, "languages": ["eng"], "lc_classifications": ["PE1408"], "publish_date": "2023", "publishers": ["University of West Florida Pressbooks"], "source_records": ["open_textbook_library:1580"], "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us"}
ol_key      | 
comments    | 
submitter   | openlibrary

Try to upload the same file again. The duplicates should be caught and reported, and the totals for tried, added, and not queued should be correct, along with the list of duplicate items:
image
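The counts reported on a re-upload can be thought of along these lines. This is only a sketch: an in-memory set stands in for the ia_id values already in import_item, and `queue_items` is a hypothetical helper, not the behavior of Batch.add_items().

```python
def queue_items(existing_ia_ids: set, records: list) -> dict:
    """Queue records whose ia_id is new and report the duplicates.

    `existing_ia_ids` is a stand-in for what is already in import_item
    and is updated in place as items are "queued".
    """
    added, duplicates = [], []
    for record in records:
        ia_id = record["source_records"][0]
        if ia_id in existing_ia_ids:
            duplicates.append(ia_id)
        else:
            existing_ia_ids.add(ia_id)
            added.append(ia_id)
    return {
        "total_tried": len(records),
        "added": len(added),
        "not_queued": len(duplicates),
        "duplicates": duplicates,
    }
```

Uploading the same two records twice against the same set would report two added on the first pass and two not queued on the second.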

@scottbarnes scottbarnes force-pushed the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch from c721508 to fb0f567 Compare July 24, 2023 05:43
@mekarpeles mekarpeles added the Needs: Community Discussion This issue is to be brought up in the next community call. [managed] label Jul 24, 2023
@mekarpeles mekarpeles self-assigned this Jul 24, 2023
cdrini commented Nov 16, 2023

Converting this to a draft for now since it's kind of in limbo!

@cdrini cdrini marked this pull request as draft November 16, 2023 22:45
@scottbarnes scottbarnes force-pushed the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch from fb0f567 to f0ce028 Compare June 12, 2024 02:54
This:
- still has debugging lines that need to be removed;
- creates an endpoint at /api/import?batch=true that reads JSON data;
- uses an ISBN (preferring ISBN 13) for the ia_id;
- uses the `Batch` class for imports;
- relies on there being a 'submitter' column in the import_item table;
- only works for people with `can_write()` privileges; and
- generates a hash based on the incoming JSON byte string to use as the
  name of the import batch.

To use:
curl -X POST http://localhost:8080/api/import\?batch\=true -H \
"Content-Type: application/json" -H "Cookie: $OL_COOKIE" -d \
'[{"title": "test book 1", "isbn_10": "test_1"}, {"title": "test book 2", "isbn_13": "test_2"}]'
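As a rough illustration of the "preferring ISBN 13" rule in this commit message, the selection might look like the sketch below. `pick_isbn` is a hypothetical helper, and accepting either a string or a list for each field is an assumption; how the chosen ISBN is turned into the stored ia_id is not shown here.

```python
from typing import Optional


def pick_isbn(record: dict) -> Optional[str]:
    """Return the record's ISBN, preferring isbn_13 over isbn_10.

    Each field may hold a single string or a list of strings (assumed).
    """
    for field in ("isbn_13", "isbn_10"):
        value = record.get(field)
        if isinstance(value, str):
            value = [value]
        if value:
            return value[0]
    return None  # no ISBN available
```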
@scottbarnes scottbarnes force-pushed the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch from f0ce028 to 23a7dd8 Compare June 13, 2024 04:27
@scottbarnes scottbarnes marked this pull request as ready for review June 13, 2024 04:30
@cdrini cdrini removed the Needs: Community Discussion This issue is to be brought up in the next community call. [managed] label Jun 13, 2024
@cdrini cdrini assigned cdrini and unassigned mekarpeles Jun 13, 2024
@cdrini cdrini added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Needs: Special Deploy This PR will need a non-standard deploy to production labels Jun 13, 2024
@cdrini cdrini left a comment

Nice! This looks great! A few code fixes, and one annoying rename :P

]

# Create the batch
batch = Batch.find(batch_name) or Batch.new(name=batch_name, submitter=username)

I think in the future we should move away from treating batch_name like a unique string; it should just be effectively a comment. The batch id is what should be... the id :P But future PR problem, I think there were some good reasons why we did this.

@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Jun 13, 2024
@github-actions github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Jun 13, 2024
@scottbarnes scottbarnes added Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] labels Jun 14, 2024
@@ -31,17 +30,28 @@


class Batch(web.storage):

Todo in a future PR: make this not extend web.storage

@cdrini cdrini added Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] On testing.openlibrary.org and removed Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] labels Jun 18, 2024
@github-actions github-actions bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Jun 18, 2024
Adding feedback from CR

Co-authored-by: Drini Cami <cdrini@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
@scottbarnes scottbarnes force-pushed the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch 2 times, most recently from 042d27f to ce52c90 Compare June 18, 2024 18:57
@scottbarnes scottbarnes force-pushed the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch from ec84040 to fdcf2c4 Compare June 18, 2024 18:58
@cdrini cdrini removed the Needs: Special Deploy This PR will need a non-standard deploy to production label Jun 18, 2024
@cdrini cdrini left a comment

Nice lgtm! Tested it erroring correctly; and tested it actually queueing items 🥳 Note I didn't see it actually get imported, because around this time of the month we have a big import that goes through, so it'll take a whiiile for it to actually go through (note not ideal UX :/ but problem for another time and should only affect ~15-20th of every month).

@cdrini cdrini merged commit 88d903e into internetarchive:master Jun 18, 2024
4 checks passed
@scottbarnes scottbarnes deleted the 7705/feature/create-endpoint-for-community-members-to-submit-import-batches branch June 18, 2024 20:03
@scottbarnes scottbarnes changed the title Community batch import endpoint Batch import endpoint Jun 27, 2024
@scottbarnes scottbarnes mentioned this pull request Jun 27, 2024
@mekarpeles mekarpeles changed the title Batch import endpoint Community Batch Imports endpoint Nov 23, 2024
Development

Successfully merging this pull request may close these issues.

Create endpoint for community members to submit import batches
3 participants