Skip to content

Latest commit

 

History

History
125 lines (84 loc) · 5 KB

README.md

File metadata and controls

125 lines (84 loc) · 5 KB

xA-Scraper

This is a automated tool for scraping content from a number of art sites:

  • DeviantArt
  • Patreon
  • FurAffinity
  • HentaFoundry
  • Pixiv
  • InkBunny
  • SoFurry
  • Weaslyl
  • Newgrounds art galleries

To Add:

Decrepit:

  • Tumblr art blogs

Checked so far:

  • hf, df, wy, ng, ib, fa

Todo:

  • da, pat, px

It also has grown a lot of other functions over time. It has a fairly complex, interactive web-interface for browsing the local gallery mirrors.

Dependencies:

  • Linux
  • Postgres >= 9.3 or Sqlite
  • CherryPy
  • Pyramid
  • Mako
  • BeautifulSoup 4
  • others
  • google-chrome (for da)

The backend can either use a local sqlite database (which has poor performance, particularly when cold, but is very easy to set up), or a full postgresql instance.

Configuration is done via a file named settings.py which must be placed in the repository root. settings.base.py is an example config to work from. In general, you will probably want to copy settings.base.py to settings.py, and then add your various usernames/password/database-config.

DB Backend is selected via the USE_POSTGRESQL parameter in settings.py.

If using postgre, DB setup is left to the user. xA-Scraper requires it's own database, and the ability to make IP-based connections to the hosting PG instance. The connection information, DB name, and client name must be set in settings.py.

When using sqlite, you just have to specify the path to where you want the sqlite db to be located (or you can use the default, which is ./sqlite_db.db).

settings.py is also where the login information for the various plugins goes.

Disabling of select plugins can be accomplished by commenting out the appropriate line in main.py. The JOBS list dictates the various scheduled scraper tasks that are placed into the scheduling system.

The preferred bootstrap method is to use run_scraper.sh from the repository root. It will ensure the required packages are available (build-essential, libxml2 libxslt1-dev python3-dev libz-dev), and then install all the required python modules in a local virtualenv. Additonally, it checks if the virtualenv is present, so once it's created, ./run_scraper.sh will just source the venv, and run the scraper witout any reinstallation.

To run the web UI (which handles adding names to scrape, viewing fetched files, etc...), run run_web.sh. The expected use is to have both run_scraper.sh and run_web.sh executed as daemons.

Currently, there are some aspects that need work. The artist selection system is currently a bit broken. Currently, there isn't a clean way to remove artists from the scrape list, though you can add or modify them.

Notes:

  • There have been reports that things are actively broken on non-linux platforms. Realistically, all development is done on a Ubuntu 18.04 LTS install, and running on anything else is at your own risk.

  • The Yiff-Party scraper requires significant external infrastructure, as it currently depends on threading it's fetch requests through the autotriever project. This depends on having both a publically available RabbitMQ instance, and an executing instance of the FetchAgent components of the ReadableWebProxy fetch-agent RPC service on your local LAN.

  • FurAffinity has a login captcha. This requires you either manually log the FA scraper in (via the "Manual FA Login" facility in the web-interface), or you can use a automated captcha service. Currently, the only solver service supported is the 2Captcha service.

  • This is my oldest "maintained" project, and the codebase is commensuarately horrible. Portions of it were designed and written while I was still learning python, so there are a bunch of really terrible design decisons baked into the class structure, and much of the code just does stupid things.


Anyways, Pictures!

These are a few DeviantArt Artists culled from the Reddit ImaginaryLandscapes subreddit.

The web-interface has a lot of fancy mouseover preview stuff. Since this is primarily intended to run off a local network, bandwidth concerns are not too relevant, and I went a bit nuts with jQuery.

Basic Popups

There is also a somewhat experimental "gallery slice" viewing system, where horizontal mouse movement seeks through a spaced sub-set of each artist's images. The artist is determined by the row, and each horizontal 10 pixels is a different image.

Fancy Popups

Lastly, there is also a basic, chronological view of each artist's work, though it does support infinite-scrolling for their entire gallery. The scraper also preserves the description that preserves each item, and it is presented with the corresponding image.