Vimeo videos not replayable in SWAP #236
Comments
After some investigation, it looks like the Vimeo videos weren't archived because they were linked to rather than embedded, and the crawl scope didn't include them. One thing we could do is write a program to read the existing 20GB WACZ and discover vimeo.com links to build a seed list with. |
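A WACZ is a ZIP file with its WARC data stored under `archive/`, so a first step toward reading it could be to list those members. This is only a minimal sketch: the function name `list_warcs_in_wacz` is made up for illustration, and it assumes the WACZ follows the standard layout.

```python
import zipfile


def list_warcs_in_wacz(wacz_path):
    """Return the names of the WARC members stored under archive/ in a WACZ.

    Hypothetical helper; assumes the usual WACZ layout where WARC files
    live in the archive/ directory of the ZIP.
    """
    with zipfile.ZipFile(wacz_path) as wacz:
        return sorted(
            name
            for name in wacz.namelist()
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        )
```

Each member can then be opened with `ZipFile.open(name)`, and the resulting stream should be readable by `warcio.ArchiveIterator` without unpacking the whole 20GB file first.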
There didn't appear to be that many Vimeo URLs in the archive. Here's the code I ran to read the WARCs that were extracted by was-robots:

import csv
import re
from pathlib import Path

import warcio

warc_dir = Path('/web-archiving-stacks/data/collections/cr827qv5481/bc/725/wm/6775/')

with open('vimeo.csv', 'w', newline='') as csv_file:
    out = csv.writer(csv_file)
    out.writerow(['site_url', 'vimeo_url'])
    for warc_file in warc_dir.iterdir():
        with open(warc_file, 'rb') as stream:
            for record in warcio.ArchiveIterator(stream):
                # only look at HTML responses from the site itself
                if record.rec_type != 'response':
                    continue
                if 'text/html' not in record.http_headers.get('content-type', ''):
                    continue
                site_url = record.rec_headers.get('WARC-Target-URI', '')
                if not site_url.startswith('https://eastwindezine.com'):
                    continue
                # content_stream() already undoes any Content-Encoding (e.g. gzip),
                # so decode the bytes as text rather than by the Content-Encoding value
                html = record.content_stream().read().decode('utf-8', errors='replace')
                for vimeo_url in re.findall(r'https://vimeo\.com/\d+', html):
                    print(site_url, vimeo_url)
                    out.writerow([site_url, vimeo_url])

Attached is a CSV of the page URL and the Vimeo URL that was on it: |
@peterchanws I think if you run a crawl of the following Vimeo URLs they should get archived?
Do you want to try that with browsertrix-crawler or browsertrix-cloud or should I? |
I went ahead and did a browsertrix crawl for these, using the following configuration:

collection: eastwindezine-vimeo
workers: 1
generateWACZ: true
screencastPort: 9037
logging: stats,pywb,behaviors,behaviors-debug
seeds:
  - scopeType: page
    url: https://vimeo.com/15034444
  - scopeType: page
    url: https://vimeo.com/15615041
  - scopeType: page
    url: https://vimeo.com/27365808
  - scopeType: page
    url: https://vimeo.com/338308691
  - scopeType: page
    url: https://vimeo.com/447874997
  - scopeType: page
    url: https://vimeo.com/453850086
  - scopeType: page
    url: https://vimeo.com/72016814
  - scopeType: page
    url: https://vimeo.com/728134716
  - scopeType: page
    url: https://vimeo.com/804956307

I put the resulting WACZ in the Google Drive Web Archiving / Browsertrix crawls / Easwind if you want to test and/or accession it: https://drive.google.com/file/d/1spfluNDjjn9X16ihhTSHes51JW_zcsn9/view?usp=drive_link |
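For future crawls, a seed list like this could be generated from the CSV instead of being typed by hand. The following is only a sketch: `seed_config_from_csv` is a hypothetical helper, it assumes the `site_url` and `vimeo_url` column names used in vimeo.csv, and it writes the YAML by hand to avoid a dependency on a YAML library.

```python
import csv
import io


def seed_config_from_csv(csv_file, collection="eastwindezine-vimeo"):
    """Build browsertrix-style seed config text from an open CSV file.

    Hypothetical helper: expects a 'vimeo_url' column and emits one
    page-scoped seed per distinct URL, sorted as strings.
    """
    reader = csv.DictReader(csv_file)
    # a set drops Vimeo URLs that appear on more than one page
    urls = sorted({row["vimeo_url"] for row in reader})

    lines = [
        f"collection: {collection}",
        "workers: 1",
        "generateWACZ: true",
        "seeds:",
    ]
    for url in urls:
        lines.append("  - scopeType: page")
        lines.append(f"    url: {url}")
    return "\n".join(lines) + "\n"
```

Calling it with `open('vimeo.csv', newline='')` and writing the returned text to a file would give a config that can be passed to the crawler. The other settings (screencastPort, logging) would still need to be added by hand.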
Hi Ed, I accessioned the warc file in stage. I got a "Player error" in SWAP: |
If that Vimeo URL works in archiveweb.page, this is most likely a bug in pywb. |
Here is the druid with wacz file:
https://argo.stanford.edu/view/druid:bc725wm6775
The seed in SWAP:
https://swap.stanford.edu/was/20240118154547/https://eastwindezine.com/
You can find Vimeo videos here:
https://swap.stanford.edu/was/20240108161642/https://eastwindezine.com/utom-vibrant-new-music-from-florante-aguilar/