Data Umbrella Audio Transcription Project - README

Preview: The EDIT page:

EDIT page

This project's goal is to automate part of the work needed to convert the captions of presentation videos on the Data Umbrella YouTube channel into formatted transcripts.
The pytube library is used to access each video and its metadata, e.g. the audio track and the auto-generated captions. The captions constitute the raw text for the formatted transcripts published by year in the Events Transcripts folder of the Data Umbrella (DU) repo. This automation is implemented via a notebook-based GUI built with the ipywidgets library, which can be found in ManageGUI.ipynb in the project notebooks folder (EventManagement/notebooks). To help with the most tedious task, i.e. editing the transcript, the audio of the video (or the video itself) is saved locally so that it can be played, stopped, and restarted at will in a Jupyter notebook while the transcript is reviewed. The GUI EDIT page combines all this functionality in a nifty header. Additionally, this page now allows editors to update the csv files used for propercasing and corrections and to re-process the transcript text.
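As a rough illustration of what re-processing text with csv-driven corrections and propercasing could look like, here is a minimal sketch. It is not the project's implementation: the two-column `wrong,right` csv layout, the `PROPER_TERMS` list, and the `load_corrections` and `reprocess` helpers are all assumptions made for this example.

```python
import csv
import io
import re

# Hypothetical corrections table: the layout of the project's csv files is not
# described in this README, so a two-column "wrong,right" format is assumed.
CORRECTIONS_CSV = """wrong,right
jupiter notebook,Jupyter notebook
pie torch,PyTorch
"""

# Hypothetical propercasing list: each term written in its proper case.
PROPER_TERMS = ["Python", "GitHub", "Data Umbrella"]


def load_corrections(csv_text):
    """Read (wrong, right) pairs from csv text into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["wrong"]: row["right"] for row in reader}


def reprocess(text, corrections, proper_terms):
    """Apply the corrections first, then restore the case of known terms."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        # A function replacement avoids backslash interpretation in `right`.
        text = re.sub(pattern, lambda m, r=right: r, text, flags=re.IGNORECASE)
    for term in proper_terms:
        pattern = rf"\b{re.escape(term)}\b"
        text = re.sub(pattern, lambda m, t=term: t, text, flags=re.IGNORECASE)
    return text


raw = "welcome to data umbrella today we use python in a jupiter notebook"
cleaned = reprocess(raw, load_corrections(CORRECTIONS_CSV), PROPER_TERMS)
print(cleaned)
```

Matching on word boundaries and ignoring case keeps the sketch tolerant of the all-lowercase auto-generated captions; the real csv files may of course be structured differently.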

Transcription by DU repo contributors

Volunteer transcribers indicate their intention to produce or edit a transcript (in part or in whole) by:

  • Opening an issue indicating which presentation they chose. Unprocessed presentation transcripts are indicated in the main Transcripts README file.
  • Forking the repo and creating a branch for the presentation identifier.

Here is a brief 'git contribution' commands workflow (amended from Astropy's documentation):

  1. If not done: fork (copy) the GitHub repository into your GitHub space, then clone the forked repository to your local system.

  2. git fetch

     gets the latest Repo version, which you will use as the basis for making your changes.

  3. git branch

     makes a logically separate copy of Repo to keep track of your changes.

  4. git checkout -b

     creates a branch for each issue/feature you are working on, e.g. 'transcript_'

  5. git status

     shows a list of files that have been modified or created.

  6. git add

     stages files you have changed or created for addition to git.

  7. git commit -m "<recommended: include a Status from the Status.Enum, e.g. "Partial (w.i.p.)"?>"

     adds your staged changes to the repository.

  8. git push

     copies the changes you committed to GitHub.

Functional implementation

The project implements functions related to two roles, which are functional, not authorization-related:

  1. Admin
  2. Editor

1. Admin's tasks:

ADD: Add a new entry to the README main table and set up the starter transcript Markdown file.

MODIFY: Update (change, correct, amend, etc.) an entry in the README main table, or update any part of the header in the transcript Markdown file.

=> This 'Admin.MODIFY' function DOES NOT modify any existing transcript text: the Editor-related functions do that.

2. Editor's tasks: EDIT

The task of the transcriber Editor is to further 'clean up' the initial transcript auto-generated by this project:

  • Provide appropriate headers, e.g. "Introduction [by host name] ".
  • Apply appropriate capitalization of words (for those not covered in the text cleaning function).
  • Add punctuation. The audio file was not processed at all, so the initial transcript inherits the 'punctuation-free' status of the Google auto-generated captions.
  • Modify paragraph breaks. The initial transcript is chunked into paragraphs corresponding to 4-minute intervals; the interval length is a user parameter, as is the wrapping width.
  • Provide additional sections relevant to the presentation, e.g. extra references.
  • Update the presentation transcript Markdown file via a pull request (PR).
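The paragraph chunking mentioned in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: the `chunk_captions` function, its parameter names, and the `(start_seconds, text)` segment format are hypothetical, and the sample segments are made up.

```python
import textwrap


def chunk_captions(segments, minutes_per_paragraph=4, wrap_width=80):
    """Group (start_seconds, text) caption segments into wrapped paragraphs.

    Each paragraph collects the segments whose start time falls in the same
    interval of `minutes_per_paragraph` minutes.
    """
    interval = minutes_per_paragraph * 60
    buckets = {}
    for start, text in segments:
        buckets.setdefault(int(start // interval), []).append(text.strip())
    return [
        textwrap.fill(" ".join(buckets[key]), width=wrap_width)
        for key in sorted(buckets)
    ]


# Made-up caption segments for illustration: (start time in seconds, text).
segments = [
    (0.0, "hello everyone and welcome"),
    (125.5, "today we will talk about transcripts"),
    (241.0, "let us start with the first slide"),
    (490.0, "any questions so far"),
]
paragraphs = chunk_captions(segments, minutes_per_paragraph=4, wrap_width=40)
print("\n\n".join(paragraphs))
```

Because the breaks follow the clock rather than the content, a paragraph can start mid-sentence, which is why the Editor is asked to adjust them.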

EventManagement Summary:

You can:

  • Add a new event in the README table & create its associated / with a 'starter transcript'
  • Modify some data in the README table or header of the event /
  • Edit the starter transcript of the event /

Technical Note: To test whether your editing has introduced non-unicode characters, open the file in a notebook cell:

from IPython.display import Markdown

Markdown(filename=<filepath>)

The current implementation does not decode these characters because the files are opened directly with Markdown (as in the above example), either to render the file or to obtain the Markdown code after rendering (e.g. Markdown().data). See UnicodeEncodeError.
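A complementary check that does not depend on rendering is to read the file's raw bytes and try to decode them explicitly. The sketch below is not part of the project: the `find_decode_error` helper is hypothetical, and UTF-8 is assumed as the expected encoding.

```python
import os
import tempfile


def find_decode_error(filepath, encoding="utf-8"):
    """Return None if the file decodes cleanly, else a short description."""
    with open(filepath, "rb") as handle:
        data = handle.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        bad = data[err.start:err.end]
        return f"byte {bad!r} at offset {err.start}: {err.reason}"
    return None


# Self-contained demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.md")
    bad = os.path.join(tmp, "bad.md")
    with open(good, "wb") as handle:
        handle.write("## Café notes\n".encode("utf-8"))
    with open(bad, "wb") as handle:
        handle.write(b"## Caf\xe9 notes\n")  # Latin-1 byte, invalid in UTF-8
    good_result = find_decode_error(good)
    bad_result = find_decode_error(bad)

print(good_result)
print(bad_result)
```

Reporting the byte offset makes it easier to locate the offending character in an editor before opening a pull request.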