Vimeo videos not replayable in SWAP #236

Open
peterchanws opened this issue Jan 24, 2024 · 6 comments

@peterchanws
Collaborator

Here is the druid with wacz file:
https://argo.stanford.edu/view/druid:bc725wm6775

The seed in SWAP
https://swap.stanford.edu/was/20240118154547/https://eastwindezine.com/

You can find Vimeo videos in here
https://swap.stanford.edu/was/20240108161642/https://eastwindezine.com/utom-vibrant-new-music-from-florante-aguilar/

@edsu
Contributor

edsu commented Jan 24, 2024

After some investigation, it looks like the Vimeo videos weren't archived because they were linked to rather than embedded, and the crawl scope didn't include them.

While it would be possible to include vimeo.com/\d+ in the crawl scope, this could also pull in many unrelated videos that get picked up when crawling vimeo.com.

One thing we could do is write a program to read the existing 20GB WACZ and discover vimeo.com links to build a seed list with.
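
As a rough sketch of the link-discovery step, the function below pulls numeric Vimeo video URLs out of already-decoded page HTML and de-duplicates them into a seed list. The names `VIMEO_VIDEO` and `vimeo_seeds`, and the `{page_url: html}` input shape, are assumptions made for illustration; reading the HTML out of the WACZ itself is not shown here.

```python
import re

# Match only numeric video pages such as https://vimeo.com/15034444.
# The dot is escaped so it cannot match an arbitrary character.
VIMEO_VIDEO = re.compile(r"https?://(?:www\.)?vimeo\.com/\d+")


def vimeo_seeds(pages):
    """Return sorted, de-duplicated Vimeo video URLs found in {page_url: html}."""
    found = set()
    for html in pages.values():
        found.update(VIMEO_VIDEO.findall(html))
    return sorted(found)
```

Restricting the pattern to `vimeo.com/<digits>` keeps the seed list limited to videos that the site actually links to, which avoids widening the scope of the whole crawl.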

@edsu edsu self-assigned this Jan 24, 2024
@edsu
Contributor

edsu commented Jan 29, 2024

There didn't appear to be that many Vimeo URLs in the archive. Here's the code I ran to read the WARCs that were extracted by was-robots:

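import csv
import re
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

# Directory of WARCs extracted by was-robots for druid bc725wm6775.
warc_dir = Path('/web-archiving-stacks/data/collections/cr827qv5481/bc/725/wm/6775/')

# Raw string so \d is a regex class; the dot is escaped to match literally.
vimeo_re = re.compile(r'https://vimeo\.com/\d+')

with open('vimeo.csv', 'w', newline='') as csv_file:
    out = csv.writer(csv_file)
    out.writerow(['site_url', 'vimeo_url'])

    for warc_file in warc_dir.iterdir():
        with open(warc_file, 'rb') as stream:
            for record in ArchiveIterator(stream):
                # Only HTTP responses that carry HTML are of interest.
                if record.rec_type != 'response' or record.http_headers is None:
                    continue
                if 'text/html' not in record.http_headers.get_header('content-type', ''):
                    continue

                site_url = record.rec_headers.get_header('WARC-Target-URI', '')
                if not site_url.startswith('https://eastwindezine.com'):
                    continue

                # content_stream() already undoes Content-Encoding (e.g. gzip), so that
                # header is not a charset; assume UTF-8 and tolerate undecodable bytes.
                html = record.content_stream().read().decode('utf-8', errors='replace')
                for vimeo_url in vimeo_re.findall(html):
                    print(site_url, vimeo_url)
                    out.writerow([site_url, vimeo_url])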

Attached is a CSV pairing each page URL with the Vimeo URL found on it:

vimeo.csv

@edsu
Contributor

edsu commented Jan 29, 2024

@peterchanws I think if you run a crawl of the following Vimeo URLs they should get archived?

Do you want to try that with browsertrix-crawler or browsertrix-cloud or should I?

@edsu
Contributor

edsu commented Feb 1, 2024

I went ahead and did a browsertrix crawl for these, using the following configuration:

collection: eastwindezine-vimeo
workers: 1
generateWACZ: true
screencastPort: 9037
logging: stats,pywb,behaviors,behaviors-debug
seeds:
  - scopeType: page
    url: https://vimeo.com/15034444
  - scopeType: page
    url: https://vimeo.com/15615041
  - scopeType: page
    url: https://vimeo.com/27365808
  - scopeType: page
    url: https://vimeo.com/338308691
  - scopeType: page
    url: https://vimeo.com/447874997
  - scopeType: page
    url: https://vimeo.com/453850086
  - scopeType: page
    url: https://vimeo.com/72016814
  - scopeType: page
    url: https://vimeo.com/728134716
  - scopeType: page
    url: https://vimeo.com/804956307
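
A seeds block like the one above could be generated from the CSV rather than typed by hand. The following is a minimal sketch, assuming the CSV has the `site_url` and `vimeo_url` columns written by the earlier script; the function name `seeds_yaml` is hypothetical, and the other config keys (collection, workers, and so on) would still need to be added separately.

```python
import csv
import io


def seeds_yaml(csv_text):
    """Build a `seeds:` block with one page-scoped seed per unique vimeo_url."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # A video linked from several pages should only be crawled once.
    urls = sorted({row["vimeo_url"] for row in reader})
    lines = ["seeds:"]
    for url in urls:
        lines.append("  - scopeType: page")
        lines.append(f"    url: {url}")
    return "\n".join(lines) + "\n"
```

For example, `print(seeds_yaml(open("vimeo.csv", newline="").read()))` would emit the block for the attached file. The URLs are sorted as strings, which is the same ordering the seed list above uses.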

I put the resulting WACZ in the Google Drive Web Archiving / Browsertrix crawls / Easwind if you want to test and/or accession it: https://drive.google.com/file/d/1spfluNDjjn9X16ihhTSHes51JW_zcsn9/view?usp=drive_link

@peterchanws
Collaborator Author

Hi Ed, I accessioned the WARC file in stage, but I get a "Player error" in SWAP:
https://swap-stage.stanford.edu/was/20240108161642/https://vimeo.com/15034444

@edsu
Contributor

edsu commented Feb 7, 2024

If that Vimeo URL works in archiveweb.page, this is most likely a bug in pywb.
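
Before comparing replay tools, it may also help to confirm that the Vimeo pages are listed in the WACZ at all. The sketch below assumes the WACZ keeps its page list at `pages/pages.jsonl` with one JSON object per line and a `url` key on each page entry; the function name `wacz_page_urls` is hypothetical. It only shows that the pages were captured, not that the video streams themselves replay.

```python
import json
import zipfile


def wacz_page_urls(wacz_path):
    """Return the set of page URLs listed in a WACZ's pages/pages.jsonl."""
    urls = set()
    with zipfile.ZipFile(wacz_path) as wacz:
        with wacz.open("pages/pages.jsonl") as pages:
            for raw in pages:
                line = raw.strip()
                if not line:
                    continue
                entry = json.loads(line)
                # The first line is a header record without a "url" key.
                if "url" in entry:
                    urls.add(entry["url"])
    return urls
```

A check such as `"https://vimeo.com/15034444" in wacz_page_urls("eastwindezine-vimeo.wacz")` would then indicate whether that page made it into the crawl; the file name here is a guess based on the collection name in the configuration.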
