Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to ingest track files for video service files? #373

Closed
Natkeeran opened this issue Jan 18, 2022 · 45 comments
Closed

How to ingest track files for video service files? #373

Natkeeran opened this issue Jan 18, 2022 · 45 comments
Assignees
Labels
enhancement New feature or request

Comments

@Natkeeran
Copy link

Islandora provides track field in video to support captions (
Islandora/documentation#1003). How can we ingest those files using workbench, assuming we are ingesting service files. Thanks.

@mjordan
Copy link
Owner

mjordan commented Jan 18, 2022

In create or add_media tasks, you should be able to use the option described at https://mjordan.github.io/islandora_workbench_docs/adding_multiple_media/ to do this.

@Natkeeran
Copy link
Author

Natkeeran commented Jan 18, 2022

Thank you for the quick response.

Please note that this is not an additional media, pointing to the repository item.

In this case, it is a file field to the video media (https://github.com/Islandora/islandora_defaults/blob/2.x/config/install/field.field.media.audio.field_track.yml). It is not clear, how to add the field_track to the video media.

@mjordan
Copy link
Owner

mjordan commented Jan 18, 2022

Does putting that field's values in the CSV not work? I know they are not simple strings, but if they are field values, they should be added via CSV as any other value is.

It would be possible to add a feature to Workbench that reads a value from a file instead of from a CSV field. If that sounds like it might be more useful in this case, let me know.

@Natkeeran
Copy link
Author

MediaTrackItem is a custom field provided by islandora: https://github.com/Islandora/islandora/blob/2.x/src/Plugin/Field/FieldType/MediaTrackItem.php. In addition the file, there are 5 additional pieces of information that can be provided for MediaTrackItem.

We should develop something general to support multi file media. MediaTrackItem can be a plugin :).

Not sure how best to represent/support multi-file related values in (i.e track language, track label) in a single spreadsheet, as there can be multiple track items as well.

@mjordan
Copy link
Owner

mjordan commented Jan 18, 2022

@Natkeeran I had forgotten that MediaTrackItem was an Islandora-specific field type. Workbench strives to accommodate as many field types as possible, so I'll add this one to the list.

Not sure how best to represent/support multi-file related values in (i.e track language, track label) in a single spreadsheet, as there can be multiple track items as well.

Yes, this is the challenge of using CSV as an input format: how to represent structured data within CSV so that humans and common software used by humans can express those structures. So far Workbench has resorted to using colons to create ordered subfields within CSV, but that is not optimal.

@mjordan
Copy link
Owner

mjordan commented Jan 19, 2022

@Natkeeran can you provide the raw JSON returned for one of those media from a /media/x?_format=json request so I can see the serialization of the MediaTrackItem field? That is the format that Workbench will need to push up to Islandora.

@Natkeeran
Copy link
Author

@mjordan
Pls see an example below:

{
  "mid": [
    {
      "value": 5
    }
  ],
  "uuid": [
    {
      "value": "1eb55245-b55c-49e9-8457-ad18c98ceb58"
    }
  ],
  "vid": [
    {
      "value": 7
    }
  ],
  "langcode": [
    {
      "value": "en"
    }
  ],
  "bundle": [
    {
      "target_id": "video",
      "target_type": "media_type",
      "target_uuid": "0563b80f-e6b3-4e80-a893-6871213049b2"
    }
  ],
  "revision_created": [
    {
      "value": "2022-01-19T18:16:49+00:00",
      "format": "Y-m-d\\TH:i:sP"
    }
  ],
  "revision_user": [
    {
      "target_id": 1,
      "target_type": "user",
      "target_uuid": "923b1205-7b19-48a8-9bb8-1f87219635df",
      "url": "/user/1"
    }
  ],
  "revision_log_message": [],
  "status": [
    {
      "value": true
    }
  ],
  "uid": [
    {
      "target_id": 1,
      "target_type": "user",
      "target_uuid": "923b1205-7b19-48a8-9bb8-1f87219635df",
      "url": "/user/1"
    }
  ],
  "name": [
    {
      "value": "video"
    }
  ],
  "thumbnail": [
    {
      "target_id": 7,
      "alt": "",
      "title": null,
      "width": 180,
      "height": 180,
      "target_type": "file",
      "target_uuid": "7eb0336f-717a-4c77-bfe2-cf7db15a6ea7",
      "url": "https://islandora.traefik.me/sites/default/files/media-icons/generic/video.png"
    }
  ],
  "created": [
    {
      "value": "2022-01-19T17:28:34+00:00",
      "format": "Y-m-d\\TH:i:sP"
    }
  ],
  "changed": [
    {
      "value": "2022-01-19T18:16:49+00:00",
      "format": "Y-m-d\\TH:i:sP"
    }
  ],
  "default_langcode": [
    {
      "value": true
    }
  ],
  "revision_translation_affected": [
    {
      "value": true
    }
  ],
  "metatag": {
    "value": {
      "canonical_url": "https://islandora.traefik.me/media/5",
      "title": "| Default"
    }
  },
  "content_translation_source": [
    {
      "value": "und"
    }
  ],
  "content_translation_outdated": [
    {
      "value": false
    }
  ],
  "field_access_terms": [],
  "field_file_size": [
    {
      "value": 85648900
    }
  ],
  "field_media_of": [
    {
      "target_id": 4,
      "target_type": "node",
      "target_uuid": "4e4963d6-884f-4162-ba63-a2419fca77ed",
      "url": "/islandora/video"
    }
  ],
  "field_media_use": [
    {
      "target_id": 18,
      "target_type": "taxonomy_term",
      "target_uuid": "25ec2453-f500-476d-b994-380b63f88127",
      "url": "/taxonomy/term/18"
    }
  ],
  "field_media_video_file": [
    {
      "target_id": 6,
      "display": true,
      "description": "",
      "target_type": "file",
      "target_uuid": "8009c371-db23-4008-a0a4-6fb8d21786b0",
      "url": "https://islandora.traefik.me/_flysystem/fedora/2022-01/Noah%20Smith%27s%20Docker%20Session.mp4"
    }
  ],
  "field_mime_type": [
    {
      "value": "video/mp4"
    }
  ],
  "field_original_name": [],
  "field_track": [
    {
      "target_id": 8,
      "label": "en",
      "kind": "subtitles",
      "srclang": "en",
      "default": false,
      "target_type": "file",
      "target_uuid": "1a74732f-fc96-42c4-823d-c1912e7768e6",
      "url": "https://islandora.traefik.me/_flysystem/fedora/2022-01/Noah%20Smith%27s%20Docker%20Session_AutoGeneratedCaption.vtt"
    }
  ]
}

@mjordan
Copy link
Owner

mjordan commented Jan 20, 2022

Thanks, that's useful. Assuming that we want the input Workbench data to contain values for 'label', 'kind', 'srclang', and 'url' (that is, they are user-defined), we could express a single field's values in CSV like this:

en:subtitles:en:/path/to/the/vtt/file

Presumably the .vtt file is being ingested during a create or add_media task, and if so, Workbench would need to ingest it as a file, get the resulting URL, and then PATCH the media with the JSON-structured field_track data incorporating the 'label', 'kind', 'scrlang', and resulting 'url' values. If any of those values are configured as "required" in the field definition, we'd need to pass in dummy values when we initially populate the field, then update them with the real values during the PATCH.

Does that make sense?

@mjordan
Copy link
Owner

mjordan commented Jan 20, 2022

Another question - I don't see a cardinality in the YAML that creates this file. Is it possible that the user can configure the cardinality between single, max occurances, unlimited?

@Natkeeran
Copy link
Author

en:subtitles:en:/path/to/the/vtt/file is a good idea. Wondering if we should generalize this at this time by appending the field name: label:en::kind::subtitles:srclang::en:filepath::/path/to/the/vtt/file?

For MediaTrackItem, there is not required fields other than the file. The cardinality is unlimited.

@mjordan
Copy link
Owner

mjordan commented Jan 20, 2022

So far, the convention for "structured" values in CSV that Workbench has used is based on the order of the subparts/subfields. If we introduced the names, that would be a new convention. I'm not necessarily saying it's a bad idea, just that it breaks an existing pattern, FWIW.

Having non-required subfields complicates things a bit, and would be an argument for including the subfield names. If we do stick with order vs. naming them, we might need to make the optional subfields required in order to preserve the order.

@mjordan
Copy link
Owner

mjordan commented Jan 26, 2022

@Natkeeran I've built a Vagrant that has this media type preconfigured and can start looking into this in more depth. Any chance you can point me to some sample vtt files I can use?

@Natkeeran
Copy link
Author

@mjordan
Yes:) We have sample objects here: https://github.com/digitalutsc/repo_islandora_demo_objects. The csv (also linked in the readme) https://docs.google.com/spreadsheets/d/1DV0Ka0ZafZq3RgCn0x_AetFtlmnMuNFltzck2QL0ApY/edit#gid=994664908

We are hoping to use this for the new sandbox: https://test.islandora.ca/

@dara2
Copy link

dara2 commented Mar 1, 2022

Hi @mjordan - we have a client who will need this as well. I'm happy to test anything you might have available.

@simonhm
Copy link

simonhm commented Dec 12, 2022

Hi @mjordan
We're also having customers asking for this. I look forward to testing this ASAP.
Thanks!
Simon.

@mjordan
Copy link
Owner

mjordan commented Dec 13, 2022

I will proceed to add a new Workbench field type "media track".

@mjordan
Copy link
Owner

mjordan commented Dec 13, 2022

Most of the prep work to support the new (to Workbench) media_track field type, other than completion of tests in "field_tests.py", is done. Now on to the --check and media population work.

I am thinking that the CSV headers need to look like this: media:video:field_track and the CSV values like en:subtitles:en:/path/to/the/vtt/file.vtt. This is consistent with the column header format for Paragraphs proposed over in #292. With the introduction of paragraph entities and now media entities in one CSV file, we need to distinguish between fields on nodes and fields on other entity types.

@Natkeeran
Copy link
Author

Another approach could be to add field_track as another column, and put the file path as the value. Then, check if that column exists, and if so, put the file, and patch the media. An extra check could be to see if field_track exists for the given media.

This could be generalized, so any additional field belonging to a media could added this way.

@mjordan
Copy link
Owner

mjordan commented Dec 14, 2022

Hi @Natkeeran sorry I wasn't clear. What you describe is also what I wast trying to describe. But by prepending media:video: I am telling Workbench what media bundle the CSV values should be applied to. I am also anticipating that we'll want to be able to include other values in the CSV that apply to media, so eventually we could have fields like media:video:name, media:video:somecustom_field, etc.

Also, the check to see if field_track exists on the target media bundle should not be extra, Workbench currently checks for the existence of every field in the incoming CSV, and also validates the structure of the data in the corresponding CSV column. We need to do that for every field. I'll be moving on to do some work on the required --checking next.

Sorry if I'm not clear on clear on where I'm going with this.

@mjordan mjordan mentioned this issue Dec 14, 2022
mjordan added a commit that referenced this issue Dec 28, 2022
@mjordan
Copy link
Owner

mjordan commented Dec 29, 2022

Examples illustrating how the track file information will be expressed in the input CSV. In the first one, we are ingesting nodes and their accompanying video and image media in a "create" task. (In the examples below, "video001", etc. in the the VTT file path is for illustration purposes only, the path is just a path.)

id,field_model,title,file,media:video:field_track
001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt
002,Image,An image,test.png,
003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt

In the second example, we are ingesting nodes and their accompanying video, audio, and image media. We include separate columns indicating the track files for each of the video and audio media:

id,field_model,title,file,media:video:field_track,media:audio:field_track
001,Video,First video,first.mp4,Transcript:subtitles:en:/path/to/video001/vtt/file.vtt,
002,Image,An image,test.png,,
003,Video,Second video,second.mp4,Transcript:subtitles:en:/path/to/video003/vtt/file.vtt,
004,Audio,An audio,,Transcript:subtitles:en:/path/to/audio004/vtt/file.vtt

In both cases, Workbench will get the field configuration for the indicated field in the target media bundle (e.g., "field_track" in the video media bundle), and if the structure in the CSV values for those fields validates against the configuration, everything is good. If the values don't validate, Workbench will tell you.

The media bundle type could probably be inferred from the value of the file column, but using the column naming convention above will be consistent with the column naming convention Workbench will be using for Paragraphs, e.g. paragraph:myparagraphbundlemachinename:field_example.

mjordan added a commit that referenced this issue Jan 1, 2023
mjordan added a commit that referenced this issue Jan 1, 2023
mjordan added a commit that referenced this issue Jan 2, 2023
mjordan added a commit that referenced this issue Jan 2, 2023
mjordan added a commit that referenced this issue Jan 2, 2023
mjordan added a commit that referenced this issue Jan 2, 2023
@mjordan
Copy link
Owner

mjordan commented Jan 3, 2023

Hold off on testing for a bit, I need to do some additional work.

mjordan added a commit that referenced this issue Jan 4, 2023
@mjordan
Copy link
Owner

mjordan commented Jan 4, 2023

OK, It's ready to test now.

Left to do (once smoke testing is done): add integration tests for new get_all_representations_of_term() function.

@Natkeeran
Copy link
Author

Natkeeran commented Jan 7, 2023

@mjordan

Testing this against the sandbox endpoint.

It does seem to have uploaded it properly: https://sandbox.islandora.ca/media/263/

https://sandbox.islandora.ca/islandora/video-my-cat-0

  • How are we handling language right now? It seems that it can only process languages enabled in the Drupal. Further, not sure how to set the kind. Does it take this info from the webvtt itself (example: https://en.wikipedia.org/wiki/WebVTT: WEBVTT Kind: captions; Language: en)? It seems to be, it is best to define those info in the webvtt itself. That does mean, we would have to parse the file (first line) to get that info.

Some things to note, though maybe more to do with the the sandbox than this branch.

  • need to add vtt to the extensions allowed for media type file

  • I did get the following exception, which I am not sure why it is throwing.

Symfony\Component\HttpKernel\Exception\MethodNotAllowedHttpException: No route found for "PATCH /media/263/edit": Method Not Allowed (Allow: GET, POST, HEAD) in Symfony\Component\HttpKernel\EventListener\RouterListener->onKernelRequest() (line 140 of /var/www/drupal/vendor/symfony/http-kernel/EventListener/RouterListener.php).

Maybe related to this WARNING in workbench:

07-Jan-23 08:21:19 - WARNING - Media https://sandbox.islandora.ca/media/263/edit?_format=json Islandora Media Use terms not updated.
  • To see the closed captions, needs to change the display field formatter.

@mjordan
Copy link
Owner

mjordan commented Jan 7, 2023

@Natkeeran thanks very much for the thorough testing. Responses below.

Testing this against the sandbox endpoint.

It does seem to have uploaded it properly: https://sandbox.islandora.ca/media/263/

Excellent!

How are we handling language right now? It seems that it can only process languages enabled in the Drupal. Further, not sure how to set the kind. Does it take this info from the webvtt itself (example: https://en.wikipedia.org/wiki/WebVTT: WEBVTT Kind: captions; Language: en)? It seems to be, it is best to define those info in the webvtt itself. That does mean, we would have to parse the file (first line) to get that info.

I assumed that the language codes needed to be valid Drupal language codes. Sorry about that. I can make Workbench parse the first line of the file, to get the language and kind. Is there a use case for allowing the user to override what's in the file by including the language and kind in the CSV data, as currently in the case? Or should we assume that we'll never need to do that?

Some things to note, though maybe more to do with the the sandbox than this branch.

need to add vtt to the extensions allowed for media type file

Yes, that's in the draft documentation.

I did get the following exception, which I am not sure why it is throwing.

Symfony\Component\HttpKernel\Exception\MethodNotAllowedHttpException: No route found for "PATCH /media/263/edit": Method Not Allowed (Allow: GET, POST, HEAD) in Symfony\Component\HttpKernel\EventListener\RouterListener->onKernelRequest() (line 140 of /var/www/drupal/vendor/symfony/http-kernel/EventListener/RouterListener.php).

That's a Drupal config issue. The PATCH method will need to be enabled on the Sandbox. I'll need to update the Integration module to do that. I'll open an issue on that repo.

Maybe related to this WARNING in workbench:

07-Jan-23 08:21:19 - WARNING - Media https://sandbox.islandora.ca/media/263/edit?_format=json Islandora Media Use terms not updated.

Yes, that explains the workbench warning.

To see the closed captions, needs to change the display field formatter.

I'll note that in the documentation.

Thanks again! Let me know what you think about needing to override the language and kind.

@mjordan
Copy link
Owner

mjordan commented Jan 8, 2023

A far as I can tell from https://www.w3.org/TR/webvtt1/#file-structure, the first line in a VTT files that defines the kind and language is not required. So if we wanted to avoid the complex CSV data I've concocted so far, and we didn't need to allow per-CSV record overriding of the kind and language, we could provide a configuration setting that allowed global definition of the kind and language to use in case that info wasn't in the VTT file. The middle ground (parse file first, then use per-CSV record data, then use configured values) sounds increasingly complex and I am concerned about the accompanying UX and code complexity.

How about Workbench only parses the VTT files for the kind and language, and if it doesn't find them, issues a warning and doesn't ingest them? In that case, having configurable defaults would still be an option, but we would not provide in-CSV-record overrides. This would allow for a much simpler CSV data structure.

@Natkeeran
Copy link
Author

The common use case I can see is that one would have different language transcripts for a given media. If it is easy to implement, configurable defaults seems acceptable. Can extend the functionality later to per-csv record overwrite. However, need to able to ingest languages not enabled in Drupal.

@mjordan
Copy link
Owner

mjordan commented Jan 9, 2023

Thanks. The per-record indication of language and kind is already in place, so let's leave it for now. Validating language codes against Drupal's was a misunderstanding on my part; I can simply remove the validation, which should allow other lanaguage codes. However, that begs the question about whether the language codes should be validated against some other list. The W3C spec does't mention anything about this, but since WebVTT is an HTML5 technology, I assume the authoritative list of languages is the same as the one for HTML itself. Any thoughts?

Edit: Further scouring the web suggests that ISO 639-1 is the authoritative list of language codes for HTML. Maybe that's the list to use. You'd think it would be easier to find which is the authoritative list.

Second edit: Looks like ISO 639-1 is derived from the IANA list linked above. Let's go with that one.

@Natkeeran
Copy link
Author

The AblePlayer (https://www.drupal.org/project/ableplayer, https://tamil.digital.utsc.utoronto.ca/61220/utsc34374) makes use of the first line in VTT for language. But, for kind it uses another field itself. AblePlayer has good accessibility support, and has rich features such as transcription UI.

For the current scope (i.e sandbox), the dropdown for language seems to be powered by languages enabled in Drupal. Thus, maybe good to leave the validation in. However, having multiple tracks does not seem to work as expected in the sandbox: https://sandbox.islandora.ca/islandora/video-my-cat-3-test. It only has one transcript option.

I don't think the language code matters from the video player point of view. If needed, IANA seems good.

@mjordan
Copy link
Owner

mjordan commented Jan 9, 2023

OK, thanks. I'd like to propose that we leave everything as is for now and open separate issues for the unresolved parts of this. I'd like to merge soon since I included some work that another issue (#126) depends on. I've updated the Workbench docs as you suggested.

@Natkeeran
Copy link
Author

sure, sounds good, thank you

@simonhm
Copy link

simonhm commented Jan 12, 2023

@mjordan Thanks! I've just come back from my holiday vacation. We'll look at this next week. I guess you've merged the branch (providing ability to ingest track files for video) into main, haven't you? So to test this, I only need to get the latest version of main branch?
Simon.

@mjordan
Copy link
Owner

mjordan commented Jan 12, 2023

Yes, it's now in main. Thanks for taking a look.

@simonhm
Copy link

simonhm commented Jan 19, 2023

@mjordan "Workbench cannot add track files to service files generated by Islandora. Since track files are part of the media's service file, Workbench will only add track files to media that are tagged as "Service File" in your Workbench CSV"
<<< Oh no, in our libraries, we load the original files. We need Islandora to help us to generate the service files for original videos. Does that mean we can't use this "Creating media track files" function of Workbench?

@mjordan
Copy link
Owner

mjordan commented Jan 19, 2023

Not in its initial implementation. We'll need to add the ability to ingest track files on their own, using the service files' media IDs in the CSV, or as "additional files". At the moment I'm not sure how this will work since track files are not a "media type" in Islandora, they are an additional file attached to a media type that is generated asynchronously.

@simonhm
Copy link

simonhm commented Jan 19, 2023

Agree! Yeah, thinking about that, I'm not sure either :)
That's fine, we got a workaround for this. We can convert the original video files into service files by command line. And then ...

@mjordan
Copy link
Owner

mjordan commented Jan 19, 2023

Yes, that will work as long as they are tagged as "Service file" when you ingest them (could be tagged as both "Original file" and "Service file" if that applies in your case).

@simonhm
Copy link

simonhm commented Feb 2, 2023

@mjordan After converting original files to service files by command line, we loaded them within VTT track files.
Working like a charm, Mark!
Thank you!
Simon.

@mjordan
Copy link
Owner

mjordan commented Feb 12, 2023

Closing, we can open new issues as necessary.

@dmer
Copy link

dmer commented Sep 7, 2023

Hey @mjordan I'm running into an error that seems related to track files - I can put this in another (or new) issue if you prefer.
I'm just running a very basic rollback - the rollback.yml:

task: delete
host: https://site.edu
username: admin
password: pwd
input_dir: /mnt/ingest/
standalone_media_url: true
input_csv: rollback.csv

and the rollback csv has a list of about 30 node id's

This is a test set that I'm deleting so I can re-ingest w/ some changes - the only thing I did w/ these objects in Islandora was manually add a thumbnail to one object.

when I run the job I get this output:


"Delete" task started using config file rollback.yml.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/ingest/islandora_workbench/./workbench:2172 in <module>                                     │
│                                                                                                  │
│   2169 │   if config['task'] == 'update':                                                        │
│   2170 │   │   update()                                                                          │
│   2171 │   if config['task'] == 'delete':                                                        │
│ ❱ 2172 │   │   delete()                                                                          │
│   2173 │   if config['task'] == 'add_media':                                                     │
│   2174 │   │   add_media()                                                                       │
│   2175 │   if config['task'] == 'delete_media':                                                  │
│                                                                                                  │
│ /mnt/ingest/islandora_workbench/./workbench:621 in delete                                        │
│                                                                                                  │
│    618 │   │   │   for media in media_response_body:                                             │
│    619 │   │   │   │   if 'mid' in media:                                                        │
│    620 │   │   │   │   │   media_id = media['mid'][0]['value']                                   │
│ ❱  621 │   │   │   │   │   media_delete_status_code = remove_media_and_file(config, media_id)    │
│    622 │   │   │   │   │   if media_delete_status_code == 204:                                   │
│    623 │   │   │   │   │   │   media_messages.append("+ Media " + config['host'] + '/media/' +   │
│    624                                                                                           │
│                                                                                                  │
│ /mnt/ingest/islandora_workbench/workbench_utils.py:3601 in remove_media_and_file                 │
│                                                                                                  │
│   3598 │   media_bundle_name = get_media_response_body['bundle'][0]['target_id']                 │
│   3599 │   if media_bundle_name in config['media_track_file_fields']:                            │
│   3600 │   │   track_file_field = config['media_track_file_fields'][media_bundle_name]           │
│ ❱ 3601 │   │   for track_file in get_media_response_body[track_file_field]:                      │
│   3602 │   │   │   track_file_id = track_file['target_id']                                       │
│   3603 │   │   │   track_file_endpoint = config['host'] + '/entity/file/' + str(track_file_id)   │
│   3604 │   │   │   track_file_response = issue_request(config, 'DELETE', track_file_endpoint)    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'field_track'

Edit: noting that WB was updated fairly recently - when I do git log the most recent commit is from Aug. 3rd

@mjordan
Copy link
Owner

mjordan commented Sep 7, 2023

@dmer yes, please, I'd prefer a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants