CLI docs
You are currently reading the waybackpy docs for using it as a CLI tool. If you want to use waybackpy as a Python library by importing it in a Python module or file, visit the Python package docs.
- Installation
- Saving webpage
- Oldest archive URL
- Newest archive URL
- Archive near specified time
- Fetch all the known URLs for a host/domain
- CDX Server API
webpage: https://pypi.python.org/project/waybackpy/
pip install waybackpy -U
webpage: https://snapcraft.io/waybackpy
Use the containerized snap package of waybackpy to run it as a CLI tool across many different Linux distributions.
webpage: https://aur.archlinux.org/packages/waybackpy
This feature uses Wayback Machine's Save API.
Often, when saving a link on the Wayback Machine, the link returned is cached rather than freshly saved. If Cached save is False, a new archive was created because of our save request; if Cached save is True, the Wayback Machine returned an older archive that was saved before we made the request.
Waybackpy checks the timestamp of the returned archive to determine the cache status.
The archive URL is either parsed from the response headers of the SavePageNow API or taken from the response URL itself; we employ three pattern-matching checks to find the archive.
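As a rough illustration of that timestamp check (this is a sketch, not waybackpy's actual implementation; the helper name `is_cached_save` and the 180-second tolerance are assumptions), the idea can be expressed in Python:

```python
from datetime import datetime, timezone
import re


def is_cached_save(archive_url: str, request_time: datetime,
                   tolerance_seconds: int = 180) -> bool:
    """Guess whether the Wayback Machine returned a cached (older) archive.

    Parses the 14-digit timestamp embedded in the archive URL and treats
    the save as cached if it predates the request by more than
    tolerance_seconds. Hypothetical helper for illustration only.
    """
    match = re.search(r"/web/(\d{14})", archive_url)
    if match is None:
        raise ValueError("No 14-digit timestamp found in archive URL")
    archived_at = datetime.strptime(
        match.group(1), "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    return (request_time - archived_at).total_seconds() > tolerance_seconds


url = ("https://web.archive.org/web/20220101114012/"
       "https://en.wikipedia.org/wiki/Social_media")
req_time = datetime(2022, 1, 2, 10, 54, 9, tzinfo=timezone.utc)
print(is_cached_save(url, req_time))  # archive from the previous day -> True
```

An archive that predates the save request by more than a few minutes is treated as a cached save rather than a fresh one.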
The following example does not print the Save API response headers; to output the headers, use the --headers flag.
waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
Archive URL:
https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media
Cached save:
False
The --headers flag in action:
waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save --headers
Archive URL:
https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media
Cached save:
True
Save API headers:
{'Server': 'nginx/1.19.10', 'Date': 'Sun, 02 Jan 2022 10:54:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Sun, 02 Jan 2022 10:46:06 GMT', 'x-archive-orig-server': 'mw1385.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Sun, 02 Jan 2022 09:30:45 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '2', 'x-archive-orig-x-cache': 'cp4030 miss, cp4027 hit/1', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4027"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-x-client-ip': '207.241.232.35', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '164995', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sun, 02 Jan 2022 10:46:08 GMT', 'link': '<https://en.wikipedia.org/wiki/Social_media>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Social_media>; rel="timemap"; type="application/link-format", 
<https://web.archive.org/web/https://en.wikipedia.org/wiki/Social_media>; rel="timegate", <https://web.archive.org/web/20051215000000/http://en.wikipedia.org/wiki/Social_media>; rel="first memento"; datetime="Thu, 15 Dec 2005 00:00:00 GMT", <https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media>; rel="prev memento"; datetime="Sat, 01 Jan 2022 11:40:12 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="last memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220102093111-wwwb-spn10.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=275.334598, exclusion.robots;dur=0.096415, exclusion.robots.policy;dur=0.088356, RedisCDXSource;dur=1.634125, esindex;dur=0.008082, LoadShardBlock;dur=81.607259, PetaboxLoader3.datanode;dur=51.631773, CDXLines.iter;dur=18.885269, load_resource;dur=19.971806', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '910', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_mediaIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashSave
This feature uses Wayback Machine's Availability API.
The oldest archive for a webpage can be very useful; to get the oldest archive, use the --oldest flag.
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashOldest
This feature uses Wayback Machine's Availability API.
Get the latest (most recent) archive for a URL. Flag: --newest
waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
Archive URL:
https://web.archive.org/web/20220101184323/https://en.wikipedia.org/wiki/YouTube
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNewest
This feature uses Wayback Machine's Availability API.
Timestamps used by the Internet Archive's Wayback Machine are in UTC.
waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNear
- You can add the '--subdomain' flag to include subdomains.
- All links will be saved in a file, and the file will be created in the current working directory.
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including subdomains
Try this out in your browser @ https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh
This CDX Server API doc is derived from https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.
The following command should print all archives with the prefix https://github.com/akamhy/, as we are using the wildcard "*".
waybackpy --url "https://github.com/akamhy/*" --user-agent "Your-user-agent" --cdx
com,github)/akamhy/akamhy/waybackpy 20220210225324 https://github.com/akamhy/akamhy/waybackpy text/html 404 7NTMXPAOO2NTAH3EDOYQOGQBBS7YTZVM 113680
com,github)/akamhy/antispam 20210113054521 https://github.com/akamhy/antispam text/html 404 DOVRV3NM56PCPIQ2IH2RUINLRDDFXXZO 17318
com,github)/akamhy/dhashpy 20211001180207 https://github.com/akamhy/dhashpy text/html 200 56W6EQISXHZ4PXBCRN7G7ZGWPV2YEMQG 37087
.
. # Many URLs redacted for readability
.
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220310220909 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg image/svg+xml 200 YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 2459
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220315150044 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg warc/revisit - YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 1375
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220315194257 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg warc/revisit - YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 1374
Try this out in your browser @ https://replit.com/@akamhy/Waybackpy-CDX-BASIC#main.sh
The default behavior is to return matches for an exact URL. However, the CDX server can also return results matching a certain prefix, a certain host, or all sub-hosts by using the --match-type param.
- --match-type exact (default if omitted) will return results matching exactly archive.org/about/
- --match-type prefix will return results for all results under the path archive.org/about/
- --match-type host will return results from host archive.org
- --match-type domain will return results from host archive.org and all sub-hosts *.archive.org
waybackpy --url "archive.org/about/" --user-agent "your-user-agent" --cdx --match-type "prefix" --cdx-print "archiveurl"
Try this out in your browser @ https://replit.com/@akamhy/Waybackpy-CDX-Url-Match-Scope#main.sh
Date Range: Results may be filtered by timestamp using --to and --from params. The ranges are inclusive and are specified in the same 1 to 14 digit format used for Wayback captures: yyyyMMddhhmmss
waybackpy --url google.com --user-agent Your-apps-user-agent --cdx --from 1998 --to 2000 --cdx-print archiveurl
Try this out in your browser @ https://replit.com/@akamhy/Waybackpy-CDX-Filtering-Date-Range#main.sh
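The inclusive behavior of partial timestamps can be sketched in Python (the pad_timestamp helper and its padding rule are illustrative assumptions, not the CDX server's code):

```python
def pad_timestamp(ts: str, end: bool = False) -> str:
    """Expand a 1-14 digit Wayback timestamp prefix to full yyyyMMddhhmmss.

    For a range end, pad with the latest values so the range stays
    inclusive ("2000" -> "20001231235959"); for a range start, pad with
    the earliest values ("1998" -> "19980101000000"). Sketch only.
    """
    start_filler = "19700101000000"  # earliest month/day/time digits
    end_filler = "19701231235959"    # latest month/day/time digits
    filler = end_filler if end else start_filler
    return ts + filler[len(ts):]


print(pad_timestamp("1998"))            # 19980101000000
print(pad_timestamp("2000", end=True))  # 20001231235959
```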
It is possible to filter on a specific field or the entire CDX line (which is space-delimited). Filtering by a specific field is often simpler. Any number of --filter params of the following form may be specified: [!]field:regex
- field is one of the named CDX fields or an index of the field. It is often useful to filter by mimetype or statuscode.
- Optional: a ! before the query inverts the match, that is, it will return results that do NOT match the regex.
- regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html).
- Ex: Query for capture results with a non-200 status code:
waybackpy --url archive.org --user-agent user-agent-example --cdx --filter \!statuscode:200 --cdx-print archiveurl --cdx-print statuscode
Try this out in your browser @ https://repl.it/@akamhy/filtering1#main.py
- Ex: Query for capture results with non text/html mime type matching a specific digest:
waybackpy --url archive.org --user-agent user-agent-example --cdx --filter \!mimetype:text/html --filter digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV --cdx-print archiveurl --cdx-print mimetype --cdx-print digest
Try this out in your browser @ https://replit.com/@akamhy/WaybackPy-Cdx-filtering2#main.sh
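For illustration, the [!]field:regex filter semantics can be sketched in Python (apply_filter is a hypothetical helper, not waybackpy or server code; the real CDX server evaluates Java regexes server-side, here approximated with Python's re):

```python
import re

# Default CDX field order as returned by the server.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]


def apply_filter(cdx_lines, filter_expr):
    """Apply a single "[!]field:regex" filter to space-delimited CDX lines.

    A leading "!" inverts the match; field may be a name or a numeric
    index. The regex is matched against the whole field value.
    """
    invert = filter_expr.startswith("!")
    field, _, pattern = filter_expr.lstrip("!").partition(":")
    idx = int(field) if field.isdigit() else FIELDS.index(field)
    regex = re.compile(pattern)
    for line in cdx_lines:
        value = line.split(" ")[idx]
        if bool(regex.fullmatch(value)) != invert:
            yield line


lines = [
    "org,archive)/ 20220101000000 https://archive.org/ text/html 200 AAA 100",
    "org,archive)/x 20220101000001 https://archive.org/x text/html 404 BBB 200",
]
for kept in apply_filter(lines, "!statuscode:200"):
    print(kept)  # prints only the 404 capture
```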
A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: all captures after the first that duplicate the collapse value are filtered out. This is useful for filtering out captures that are 'too dense' or when looking for unique captures.
To use collapsing, pass one or more field or field:N values to --collapse, where field is one of (urlkey, timestamp, original, mimetype, statuscode, digest, and length) and N is the number of leading characters of the field to test.
- Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since the first 10 digits 2013022601 match, the 2nd capture will be filtered out.
waybackpy --url "google.com" --user-agent "Your-apps-user-agent" --cdx --collapse "timestamp:10"
Try this out in your browser @ https://replit.com/@akamhy/WaybackPy-Cdx-collapsing-first#main.sh
- Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the CDX are not affected):
waybackpy --url "google.com" --user-agent "Your-apps-user-agent" --cdx --collapse "digest" --cdx-print "archiveurl"
Try this out in your browser @ https://replit.com/@akamhy/WaybackPy-Cdx-collapsing-second#main.sh
- Ex: Only show unique URLs in a prefix query (filtering out captures except for the first capture of a given URL). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
waybackpy --url archive.org --user-agent "i'm-user-agent" --cdx --match-type prefix --collapse urlkey --cdx-print archiveurl
Try this out in your browser @ https://replit.com/@akamhy/WaybackPy-Cdx-collapsing-last#main.sh
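The adjacency rule behind collapsing can be sketched in Python (collapse here is a hypothetical helper mirroring the server-side behavior, not waybackpy's implementation):

```python
def collapse(cdx_lines, field_index, n=None):
    """Collapse adjacent CDX lines whose field (or its first n chars) repeats.

    Only adjacent duplicates are dropped; the first capture in each run
    is kept, matching the CDX server's collapsing rule. Sketch only.
    """
    previous = object()  # sentinel that never equals a real field value
    for line in cdx_lines:
        value = line.split(" ")[field_index]
        if n is not None:
            value = value[:n]
        if value != previous:
            yield line
        previous = value


lines = [
    "a 20130226010000 u1",
    "b 20130226010800 u2",
    "c 20130226020000 u3",
]
# Collapsing on the first 10 digits of the timestamp (field index 1)
# keeps one capture per hour: the second line is dropped.
for kept in collapse(lines, 1, 10):
    print(kept)
```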