How to ingest track files for video service files? #373
In create or add_media tasks, you should be able to use the option described at https://mjordan.github.io/islandora_workbench_docs/adding_multiple_media/ to do this. |
Thank you for the quick response. Please note that this is not an additional media pointing to the repository item. In this case, it is a file field on the video media (https://github.com/Islandora/islandora_defaults/blob/2.x/config/install/field.field.media.audio.field_track.yml). It is not clear how to add field_track to the video media. |
Does putting that field's values in the CSV not work? I know they are not simple strings, but if they are field values, they should be added via CSV as any other value is. It would be possible to add a feature to Workbench that reads a value from a file instead of from a CSV field. If that sounds like it might be more useful in this case, let me know. |
MediaTrackItem is a custom field provided by Islandora: https://github.com/Islandora/islandora/blob/2.x/src/Plugin/Field/FieldType/MediaTrackItem.php. In addition to the file, there are 5 additional pieces of information that can be provided for MediaTrackItem. We should develop something general to support multi-file media. MediaTrackItem can be a plugin :). Not sure how best to represent/support multi-file related values (i.e., track language, track label) in a single spreadsheet, as there can be multiple track items as well. |
@Natkeeran I had forgotten that MediaTrackItem was an Islandora-specific field type. Workbench strives to accommodate as many field types as possible, so I'll add this one to the list.
Yes, this is the challenge of using CSV as an input format: how to represent structured data within CSV so that humans and common software used by humans can express those structures. So far Workbench has resorted to using colons to create ordered subfields within CSV, but that is not optimal. |
@Natkeeran can you provide the raw JSON returned for one of those media from a |
@mjordan
|
Thanks, that's useful. Assuming that we want the input Workbench data to contain values for 'label', 'kind', 'srclang', and 'url' (that is, they are user-defined), we could express a single field's values in CSV like this:
Presumably the .vtt file is being ingested during a Does that make sense? |
Another question - I don't see a cardinality in the YAML that creates this file. Is it possible that the user can configure the cardinality between single, max occurrences, unlimited? |
For MediaTrackItem, there are no required fields other than the file. The cardinality is unlimited. |
So far, the convention for "structured" values in CSV that Workbench has used is based on the order of the subparts/subfields. If we introduced the names, that would be a new convention. I'm not necessarily saying it's a bad idea, just that it breaks an existing pattern, FWIW. Having non-required subfields complicates things a bit, and would be an argument for including the subfield names. If we do stick with order vs. naming them, we might need to make the optional subfields required in order to preserve the order. |
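To make the order-based convention concrete, here is a minimal Python sketch of splitting a colon-delimited value into the subfields named earlier in this thread ('label', 'kind', 'srclang', 'url'). The subfield order, the `parse_track_value` helper, and the sample value are assumptions for illustration only, not Workbench's actual implementation.

```python
# Hypothetical subfield order; the thread has not settled on one.
TRACK_SUBFIELDS = ["label", "kind", "srclang", "url"]


def parse_track_value(value, delimiter=":"):
    """Split an order-based CSV value into named subfields.

    The last subfield (the file path) may itself contain the delimiter,
    so only the first len(TRACK_SUBFIELDS) - 1 delimiters are split on.
    """
    parts = value.split(delimiter, len(TRACK_SUBFIELDS) - 1)
    if len(parts) != len(TRACK_SUBFIELDS):
        raise ValueError(
            f"Expected {len(TRACK_SUBFIELDS)} subfields, got {len(parts)}: {value!r}"
        )
    return dict(zip(TRACK_SUBFIELDS, (part.strip() for part in parts)))


# Illustrative value only; the path is just a path.
track = parse_track_value("English:subtitles:en:/tmp/video001.vtt")
```

With an order-based scheme like this, an optional subfield that is left out still has to hold its position as an empty string (for example `::en:/tmp/video001.vtt`), which is the trade-off against naming the subfields.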
@Natkeeran I've built a Vagrant that has this media type preconfigured and can start looking into this in more depth. Any chance you can point me to some sample vtt files I can use? |
@mjordan We are hoping to use this for the new sandbox: https://test.islandora.ca/ |
Hi @mjordan - we have a client who will need this as well. I'm happy to test anything you might have available. |
Hi @mjordan |
I will proceed to add a new Workbench field type "media track". |
Most of the prep work to support the new (to Workbench) I am thinking that the CSV headers need to look like this: |
Another approach could be to add field_track as another column, and put the file path as the value. Then, check if that column exists, and if so, put the file, and patch the media. An extra check could be to see if field_track exists for the given media. This could be generalized, so any additional field belonging to a media could be added this way. |
Hi @Natkeeran sorry I wasn't clear. What you describe is also what I was trying to describe. But by prepending Also, the check to see if field_track exists on the target media bundle should not be extra; Workbench currently checks for the existence of every field in the incoming CSV, and also validates the structure of the data in the corresponding CSV column. We need to do that for every field. I'll be moving on to do some work on the required Sorry if I'm not clear on where I'm going with this. |
Examples illustrating how the track file information will be expressed in the input CSV. In the first one, we are ingesting nodes and their accompanying video and image media in a "create" task. (In the examples below, "video001", etc. in the VTT file path is for illustration purposes only; the path is just a path.)
In the second example, we are ingesting nodes and their accompanying video, audio, and image media. We include separate columns indicating the track files for each of the video and audio media:
In both cases, Workbench will get the field configuration for the indicated field in the target media bundle (e.g., "field_track" in the video media bundle), and if the structure in the CSV values for those fields validates against the configuration, everything is good. If the values don't validate, Workbench will tell you. The media bundle type could probably be inferred from the value of the |
Hold off on testing for a bit, I need to do some additional work. |
OK, it's ready to test now. Left to do (once smoke testing is done): add integration tests for new |
Testing this against the sandbox endpoint. It does seem to have uploaded it properly: https://sandbox.islandora.ca/media/263/ https://sandbox.islandora.ca/islandora/video-my-cat-0
Some things to note, though they may have more to do with the sandbox than this branch.
Maybe related to this WARNING in workbench:
|
@Natkeeran thanks very much for the thorough testing. Responses below.
Excellent!
I assumed that the language codes needed to be valid Drupal language codes. Sorry about that. I can make Workbench parse the first line of the file, to get the language and kind. Is there a use case for allowing the user to override what's in the file by including the language and kind in the CSV data, as is currently the case? Or should we assume that we'll never need to do that?
Yes, that's in the draft documentation.
That's a Drupal config issue. The
Yes, that explains the workbench warning.
I'll note that in the documentation. Thanks again! Let me know what you think about needing to override the language and kind. |
As far as I can tell from https://www.w3.org/TR/webvtt1/#file-structure, the first line in a VTT file that defines the kind and language is not required. So if we wanted to avoid the complex CSV data I've concocted so far, and we didn't need to allow per-CSV record overriding of the kind and language, we could provide a configuration setting that allowed global definition of the kind and language to use in case that info wasn't in the VTT file. The middle ground (parse file first, then use per-CSV record data, then use configured values) sounds increasingly complex and I am concerned about the accompanying UX and code complexity. How about Workbench only parses the VTT files for the kind and language, and if it doesn't find them, issues a warning and doesn't ingest them? In that case, having configurable defaults would still be an option, but we would not provide in-CSV-record overrides. This would allow for a much simpler CSV data structure. |
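A rough Python sketch of the "parse the VTT file, otherwise fall back or warn" idea follows. It assumes the optional metadata appears as `Kind: ...` and `Language: ...` lines in the header block before the first blank line; that layout, the `read_vtt_kind_and_language` name, and the `defaults` argument are assumptions for illustration, not anything Workbench or the WebVTT spec defines.

```python
import io


def read_vtt_kind_and_language(lines, defaults=None):
    """Return (kind, language) read from a WebVTT header block.

    Assumes optional "Kind: ..." and "Language: ..." lines appear before
    the first blank line. Values not found fall back to `defaults`
    (hypothetical configured values), or to None if there are none.
    """
    found = {}
    for line in lines:
        line = line.strip()
        if not line:
            break  # a blank line ends the header block
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("kind", "language"):
            found[key.strip().lower()] = value.strip()
    defaults = defaults or {}
    return (
        found.get("kind", defaults.get("kind")),
        found.get("language", defaults.get("language")),
    )


with_header = "WEBVTT\nKind: captions\nLanguage: en\n\n00:00:01.000 --> 00:00:02.000\nHello\n"
without_header = "WEBVTT\n\n00:00:01.000 --> 00:00:02.000\nHello\n"

parsed = read_vtt_kind_and_language(io.StringIO(with_header))
fallback = read_vtt_kind_and_language(
    io.StringIO(without_header), defaults={"kind": "subtitles", "language": "en"}
)
missing = read_vtt_kind_and_language(io.StringIO(without_header))
```

If either returned value is `None`, the caller would be the place to issue the warning and skip the track file, as proposed above.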
The common use case I can see is that one would have different language transcripts for a given media. If it is easy to implement, configurable defaults seem acceptable. We can extend the functionality later to per-CSV record overrides. However, we need to be able to ingest languages not enabled in Drupal. |
Thanks. The per-record indication of language and kind is already in place, so let's leave it for now. Validating language codes against Drupal's was a misunderstanding on my part; I can simply remove the validation, which should allow other language codes. However, that raises the question of whether the language codes should be validated against some other list. The W3C spec doesn't mention anything about this, but since WebVTT is an HTML5 technology, I assume the authoritative list of languages is the same as the one for HTML itself. Any thoughts? Edit: Further scouring the web suggests that ISO 639-1 is the authoritative list of language codes for HTML. Maybe that's the list to use. You'd think it would be easier to find which is the authoritative list. Second edit: Looks like ISO 639-1 is derived from the IANA list linked above. Let's go with that one. |
The AblePlayer (https://www.drupal.org/project/ableplayer, https://tamil.digital.utsc.utoronto.ca/61220/utsc34374) makes use of the first line in VTT for language. But for kind, it uses another field itself. AblePlayer has good accessibility support, and has rich features such as a transcription UI. For the current scope (i.e., the sandbox), the dropdown for language seems to be powered by languages enabled in Drupal. Thus, it may be good to leave the validation in. However, having multiple tracks does not seem to work as expected in the sandbox: https://sandbox.islandora.ca/islandora/video-my-cat-3-test. It only has one transcript option. I don't think the language code matters from the video player point of view. If needed, IANA seems good. |
sure, sounds good, thank you |
@mjordan Thanks! I've just come back from my holiday vacation. We'll look at this next week. I guess you've merged the branch (providing ability to ingest track files for video) into main, haven't you? So to test this, I only need to get the latest version of main branch? |
Yes, it's now in main. Thanks for taking a look. |
@mjordan "Workbench cannot add track files to service files generated by Islandora. Since track files are part of the media's service file, Workbench will only add track files to media that are tagged as "Service File" in your Workbench CSV" |
Not in its initial implementation. We'll need to add the ability to ingest track files on their own, using the service files' media IDs in the CSV, or as "additional files". At the moment I'm not sure how this will work since track files are not a "media type" in Islandora, they are an additional file attached to a media type that is generated asynchronously. |
Agree! Yeah, thinking about that, I'm not sure either :) |
Yes, that will work as long as they are tagged as "Service file" when you ingest them (could be tagged as both "Original file" and "Service file" if that applies in your case). |
@mjordan After converting original files to service files on the command line, we loaded them with VTT track files. |
Closing, we can open new issues as necessary. |
Hey @mjordan I'm running into an error that seems related to track files - I can put this in another (or new) issue if you prefer.
and the rollback CSV has a list of about 30 node IDs. This is a test set that I'm deleting so I can re-ingest with some changes. The only thing I did with these objects in Islandora was manually add a thumbnail to one object. When I run the job I get this output:
Edit: noting that WB was updated fairly recently - when I do git log the most recent commit is from Aug. 3rd |
@dmer yes, please, I'd prefer a new issue. |
Islandora provides a track field in video media to support captions (Islandora/documentation#1003). How can we ingest those files using Workbench, assuming we are ingesting service files? Thanks.