Vimeo videos not replayable in SWAP #236

peterchanws · 2024-01-24T23:27:47Z

Here is the druid with the WACZ file:
https://argo.stanford.edu/view/druid:bc725wm6775

The seed in SWAP:
https://swap.stanford.edu/was/20240118154547/https://eastwindezine.com/

You can find Vimeo videos on this page:
https://swap.stanford.edu/was/20240108161642/https://eastwindezine.com/utom-vibrant-new-music-from-florante-aguilar/

edsu · 2024-01-24T23:37:37Z

After some investigation, it looks like the Vimeo videos weren't archived because they were linked to rather than embedded, and the crawl scope didn't include them.

While it would be possible to include vimeo.com/\d+ in the crawl scope, this could also pull in many unrelated videos that get picked up when crawling vimeo.com.
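To illustrate the concern, here is a minimal sketch of how a scope rule like vimeo.com/\d+ behaves. The `in_scope` helper is hypothetical and only stands in for whatever scope matching the crawler actually does; the second video ID is made up.

```python
import re

# The scope pattern discussed above, with the dot escaped so it matches a literal "."
VIMEO_VIDEO = re.compile(r"https://vimeo\.com/\d+")


def in_scope(url: str) -> bool:
    """Return True if the URL would be admitted by this (hypothetical) scope rule."""
    return VIMEO_VIDEO.match(url) is not None


# A video that is linked from eastwindezine.com (ID appears later in this thread)
linked = "https://vimeo.com/15034444"
# A made-up ID standing in for an unrelated video found while crawling vimeo.com
unrelated = "https://vimeo.com/123456789"

# The rule admits both URLs, so it cannot tell them apart
print(in_scope(linked), in_scope(unrelated))
```

Because the pattern only looks at the URL shape, it has no way of knowing which videos the seed site actually links to, which is why an explicit seed list is the safer option.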

One thing we could do is write a program to read the existing 20GB WACZ and discover vimeo.com links to build a seed list with.

edsu · 2024-01-29T23:21:29Z

There didn't appear to be that many Vimeo URLs in the archive. Here's the code I ran to read the WARCs that were extracted by was-robots:

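import csv
import re

from pathlib import Path
from warcio.archiveiterator import ArchiveIterator

warc_dir = Path('/web-archiving-stacks/data/collections/cr827qv5481/bc/725/wm/6775/')

# newline='' lets the csv module control line endings
with open('vimeo.csv', 'w', newline='') as csv_file:
    out = csv.writer(csv_file)
    out.writerow(['site_url', 'vimeo_url'])

    for warc_file in warc_dir.iterdir():
        with open(warc_file, 'rb') as stream:
            for record in ArchiveIterator(stream):

                # only HTTP responses carry page content
                if record.rec_type != 'response' or record.http_headers is None:
                    continue

                # skip anything that isn't an HTML page
                if 'text/html' not in record.http_headers.get_header('content-type', ''):
                    continue

                # only look at pages from the seed site
                site_url = record.rec_headers.get_header('WARC-Target-URI', '')
                if not site_url.startswith('https://eastwindezine.com'):
                    continue

                # content_stream() already undoes any Content-Encoding (e.g. gzip),
                # so decode the bytes as text instead of using that header as a codec
                html = record.content_stream().read().decode('utf-8', errors='replace')
                for vimeo_url in re.findall(r'https://vimeo\.com/\d+', html):
                    print(site_url, vimeo_url)
                    out.writerow([site_url, vimeo_url])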

Attached is a CSV listing each page URL and the Vimeo URL found on it:

vimeo.csv

edsu · 2024-01-29T23:23:49Z

@peterchanws I think if you run a crawl of the following Vimeo URLs they should get archived?

Do you want to try that with browsertrix-crawler or browsertrix-cloud or should I?

edsu · 2024-02-01T12:41:50Z

I went ahead and did a browsertrix crawl for these, using the following configuration:

collection: eastwindezine-vimeo
workers: 1
generateWACZ: true
screencastPort: 9037
logging: stats,pywb,behaviors,behaviors-debug
seeds:
  - scopeType: page
    url: https://vimeo.com/15034444
  - scopeType: page
    url: https://vimeo.com/15615041
  - scopeType: page
    url: https://vimeo.com/27365808
  - scopeType: page
    url: https://vimeo.com/338308691
  - scopeType: page
    url: https://vimeo.com/447874997
  - scopeType: page
    url: https://vimeo.com/453850086
  - scopeType: page
    url: https://vimeo.com/72016814
  - scopeType: page
    url: https://vimeo.com/728134716
  - scopeType: page
    url: https://vimeo.com/804956307

I put the resulting WACZ in the Google Drive Web Archiving / Browsertrix crawls / Easwind if you want to test and/or accession it: https://drive.google.com/file/d/1spfluNDjjn9X16ihhTSHes51JW_zcsn9/view?usp=drive_link
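A config like the one above doesn't have to be typed by hand. As a rough sketch, it could be generated from vimeo.csv, assuming the CSV has the `site_url` and `vimeo_url` columns written by the earlier script. The function names and the sample rows are made up for illustration, and only some of the config keys are emitted.

```python
import csv
import io


def seeds_from_csv(csv_text: str) -> list[str]:
    """Return the unique Vimeo URLs from the vimeo_url column, sorted as strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["vimeo_url"] for row in reader if row.get("vimeo_url")})


def crawl_config(collection: str, urls: list[str]) -> str:
    """Build a crawl config as YAML text with one page-scoped seed per URL."""
    lines = [
        f"collection: {collection}",
        "workers: 1",
        "generateWACZ: true",
        "seeds:",
    ]
    for url in urls:
        lines.append("  - scopeType: page")
        lines.append(f"    url: {url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up rows: the page paths are placeholders, not real pages on the site
    sample = (
        "site_url,vimeo_url\n"
        "https://eastwindezine.com/example-page-1/,https://vimeo.com/15034444\n"
        "https://eastwindezine.com/example-page-2/,https://vimeo.com/15034444\n"
        "https://eastwindezine.com/example-page-2/,https://vimeo.com/72016814\n"
    )
    print(crawl_config("eastwindezine-vimeo", seeds_from_csv(sample)))
```

Deduplicating matters here because the same video can be linked from several pages, and each one only needs to be crawled once.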

peterchanws · 2024-02-06T19:51:07Z

Hi Ed, I accessioned the WARC file in stage, but I got a "Player error" in SWAP:
https://swap-stage.stanford.edu/was/20240108161642/https://vimeo.com/15034444

edsu · 2024-02-07T01:14:32Z

If that Vimeo URL works in archiveweb.page, this is most likely a bug in pywb.
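Before testing replay, it may help to confirm the page made it into the WACZ at all. This is a minimal sketch that assumes the WACZ follows the usual layout, with a page list at pages/pages.jsonl inside the ZIP; the helper name is made up. It only shows that the page was crawled, not that the video streams themselves were captured.

```python
import json
import sys
import zipfile


def wacz_page_urls(wacz_path: str) -> list[str]:
    """Return the page URLs listed in pages/pages.jsonl inside a WACZ file."""
    urls = []
    with zipfile.ZipFile(wacz_path) as wacz:
        # Assumption: the page list lives at pages/pages.jsonl
        with wacz.open("pages/pages.jsonl") as pages:
            for line in pages:
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                # The first line is a header describing the list; it has no "url"
                if "url" in entry:
                    urls.append(entry["url"])
    return urls


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_wacz.py path/to/crawl.wacz
    for url in wacz_page_urls(sys.argv[1]):
        print(url)
```

If the URL is listed and the video plays in archiveweb.page but not in SWAP, that points at replay rather than capture.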

edsu self-assigned this Jan 24, 2024
