Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI importers are broken #1238

Open
jcoyne opened this issue Jun 13, 2017 · 13 comments
Open

CLI importers are broken #1238

jcoyne opened this issue Jun 13, 2017 · 13 comments
Labels

Comments

@jcoyne
Copy link
Member

jcoyne commented Jun 13, 2017

This commit removes the actor that handled the :files attribute: samvera/hyrax@3f1b581

The ObjectFactory and its descendants expect the :files attribute to be handled by the work actor.

https://github.com/samvera-labs/hyku/blob/master/lib/importer/factory/object_factory.rb#L133

This probably affects the CSVImporter and MODSImporter but not the PURLImporter since the latter uses remote files: https://github.com/samvera-labs/hyku/blob/master/app/jobs/import_work_from_purl_job.rb#L47

@jcoyne jcoyne added the bug label Jun 13, 2017
@jcoyne
Copy link
Member Author

jcoyne commented Jun 13, 2017

@dunn I think you'll have this problem if you upgrade to Hyrax.

@dunn
Copy link

dunn commented Jun 13, 2017

Thanks for the heads-up!

@atz atz self-assigned this Aug 16, 2017
@atz
Copy link
Contributor

atz commented Aug 17, 2017

Not sure I quite follow this.

The files attrib from object_factory.rb#L136 goes through:

def file_attributes
  files_directory.present? && files.present? ? { files: file_paths } : {}
end
...
def transform_attributes
  StringLiteralProcessor.process(attributes.slice(*permitted_attributes)).merge(file_attributes)
end

to end up used in create and update. update always uses work_actor, create does unless the thing being created is a Collection.

work_actor is Hyrax::CurationConcern.actor, defaulting to:

Hyrax::DefaultMiddlewareStack.build_stack.build(Actors::Terminator.new)

which has to be one of the least intelligible pieces of indirection in our stack, layering on 17 other middleware actors. @jcoyne, your contention is that none of these actually does what we need anymore? Not even CreateWithFilesActor?

@mjgiarlo What is the right direction to fix this:

  • Revert the Hyrax commit?
  • Add another actor similar to the old one in Hyku and append to the @work_middleware_stack?
  • Convert raw files to Hyrax::UploadedFile objects and rely on an existing actor?
  • Amend existing actor to catch :files attribute again?

@mjgiarlo
Copy link
Member

@atz I'd be most inclined in favor of Convert raw files to Hyrax::UploadedFile objects and rely on an existing actor. You?

@atz
Copy link
Contributor

atz commented Aug 17, 2017

@mjgiarlo That would be fine for me (assuming it would work). We would need to decide how to handle the "identity" of files that start as just local filesystem files. Like, do we look for a matching UploadedFile first or just create new UploadedFiles each run? If we look first, do we consider the path an identifier and update the binary content?

Note: if the point of the CLI is "slurp these local files into fedora", but the CarrierWave config is backed remotely (like S3), then we are introducing cost. I don't see a way around that if ingest is going to trigger async jobs anyway though. I think anybody who cares about that would have to use shared mounts and configure CarrierWave to use them.

@atz
Copy link
Contributor

atz commented Aug 17, 2017

@mjgiarlo, @jcoyne: trying to use bin/import_from_csv suggests all manner of limitations ahead of the Hyrax integration issues forecast here. First, there is no documentation of what the CSV should include.

Importer::CSVImporter has:

# Import a csv file with one work per row. The first row of the csv should be a
# header row. The model for each row can either be specified in a column called
# 'type' or globally by passing the model attribute

This is almost useless as a governing instruction to end users. If the sum total total of the instructions are "Give us some CSV - with a header row, maybe include type", then I'm afraid I have to wash my hands of this as an unsupportable proto-feature. It isn't a bug to be fixed.

What is the relationship between the data and the images in the files_directory, for example? What data is supported? What quote syntaxes are supported? Excel-style exports? What delimiters are supported? How do I use a different delimiter? What encodings are supported? And so on. I have implemented CSV imports before, and the informality of the existing code around it suggests it never actually saw much use.

There is an example CSV spec/fixtures/csv/gse_metadata.csv with header row:

id,type,title,description,subject,subject,subject,subject,subject,subject,resource_type,contributor,contributor,contributor,contributor,contributor,contributor,contributor,contributor,contributor,contributor,contributor,date_created,file

So from that, we might be able to deduce some answers (field names, repeatability!), but also raise other questions: how is id used?

Anwyay, when I attempt even a test run on the fixture CSV, I get:

bin/import_from_csv foobar.localhost spec/fixtures/csv/gse_metadata.csv path/to/dir
...
Starting import...
bundler: failed to load command: bin/import_from_csv (bin/import_from_csv)
KeyError: key not found: :collection
  /Users/atz/repos/hyku/lib/importer/factory/with_associated_collection.rb:21:in `fetch'
  /Users/atz/repos/hyku/lib/importer/factory/with_associated_collection.rb:21:in `collection'
  /Users/atz/repos/hyku/lib/importer/factory/with_associated_collection.rb:9:in `create_attributes'
  /Users/atz/repos/hyku/lib/importer/factory/object_factory.rb:65:in `create'
  /Users/atz/repos/hyku/lib/importer/factory/object_factory.rb:22:in `block in run'
  /Users/atz/.rvm/gems/ruby-2.4.1/gems/activesupport-5.1.3/lib/active_support/notifications.rb:166:in `block in instrument'
  /Users/atz/.rvm/gems/ruby-2.4.1/gems/activesupport-5.1.3/lib/active_support/notifications/instrumenter.rb:21:in `instrument'
  /Users/atz/.rvm/gems/ruby-2.4.1/gems/activesupport-5.1.3/lib/active_support/notifications.rb:166:in `instrument'
  /Users/atz/repos/hyku/lib/importer/factory/object_factory.rb:22:in `run'
  /Users/atz/repos/hyku/lib/importer/csv_importer.rb:48:in `create_fedora_objects'
  /Users/atz/repos/hyku/lib/importer/csv_importer.rb:20:in `block in import_all'
  /Users/atz/repos/hyku/lib/importer/csv_parser.rb:16:in `block in each'
  /Users/atz/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/csv.rb:1771:in `each'
  /Users/atz/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/csv.rb:1148:in `block in foreach'
  /Users/atz/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/csv.rb:1299:in `open'
  /Users/atz/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/csv.rb:1147:in `foreach'
  /Users/atz/repos/hyku/lib/importer/csv_parser.rb:13:in `each'
  /Users/atz/repos/hyku/lib/importer/csv_importer.rb:19:in `import_all'
  bin/import_from_csv:36:in `main'
  bin/import_from_csv:45:in `<top (required)>'

My interpretation is that even our fixture CSV is not "valid enough" to actually be processed. And we expect admins/users to do better than that?

@atz
Copy link
Contributor

atz commented Aug 17, 2017

If I add a collection field, I get:

RuntimeError: Invalid headers: collection

None of this even gets into the Hyrax changes.

@atz
Copy link
Contributor

atz commented Aug 17, 2017

OK, cherry-picked 6e4ccf3 to get some of Mike's fixes. Now I get:

ActiveFedora::ActiveFedoraError: Model mismatch. Expected GenericWork. Got: Collection

This is after reverting back to the unaltered fixture CSV.... OR from using @mjgiarlo's simple.csv from that commit. I get this error even if I checkout his branch.

@atz
Copy link
Contributor

atz commented Aug 18, 2017

OK, I isolated the exception to be:

> ActiveFedora::Base.exists? "wg827ks1643"
true
> Collection.exists? "wg827ks1643"
true
> GenericWork.exists? "wg827ks1643"
*** ActiveFedora::ActiveFedoraError Exception: Model mismatch. Expected GenericWork. Got: Collection

To me, this doesn't fit the method sig/behavior of exists? at all. I'm not asking if the ID is taken (in which case I would ask ActiveFedora::Base). I'm asking: "Does an object of this class exist with the given ID?" If the object does not match the class, then the response should be false. The answer should never be an Exception.

If you can't safely ask whether a thing exists, the rest of your program logic is going to get ornate or fall apart.

@atz
Copy link
Contributor

atz commented Aug 22, 2017

Filed a PR on AF: samvera/active_fedora#1277

@atz
Copy link
Contributor

atz commented Aug 22, 2017

Reassigned to another project, so this will fall to somebody else to pick up.

@atz atz removed their assignment Aug 22, 2017
@dunn
Copy link

dunn commented Aug 23, 2017

I haven't really been following this and #1164 so I may not have the full picture, but I've been working on some things that may be relevant.

Looks like the Hyku CLI was based partly on work DCE did for UCSB; we've rewritten much of it and now have a single bin/ingest script that works pretty well: https://github.com/ucsblibrary/alexandria/blob/master/bin/ingest

As for the problems around CSV format that @atz mentioned, I've been talking with @mcritchlow about working toward some sort of standardized schema that would allow us to do validations (and would also make code more shareable). The beginnings of that is here: https://github.com/ucsblibrary/metadata-ci

@ghost
Copy link

ghost commented Sep 12, 2017

Related issue logged in Hyrax (for 2.0.0.beta): samvera/hyrax#1654

I've been using the ObjectFactory for my own eprints importer in Hyku, bypassing :files and using :remote_files to download files from a url.

I'm having problems passing :remote_files using latest Hyrax and sidekiq bc the temporary file has been cleaned up before the job is processed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants