Slidobot

Slidobot is a tiny Scrapy project for scraping information about presentations from SlideShare.

This project is only meant for educational purposes.

Spiders

The project contains 2 spiders:

slideshare - a simple spider that scrapes data from the selection grid.
slideshare_full - a more advanced spider that does the same but also parses each presentation page separately, so the expanded description is available for every entry.

How-To

The project is based on this Scrapy tutorial. You can specify the URLs of the pages to crawl in the spider source file (the start_urls variable).

By default it is the http://www.slideshare.net/popular/media/presentations/category/technology/all-time page (you can get such a URL by using the Explore section of the service and applying different filters).

In fact, any Explore filter URL can be used for crawling.

The page number is set in the same place (the start_urls variable).
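
As a rough illustration of the two settings above, the sketch below builds a start_urls list from an Explore filter URL and a range of page numbers. The helper name build_start_urls and the `page` query parameter are assumptions made for this example and are not taken from the project; check how SlideShare actually encodes the page number in the URL before relying on it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default Explore filter URL from this README.
BASE_URL = (
    "http://www.slideshare.net/popular/media/presentations"
    "/category/technology/all-time"
)


def build_start_urls(base_url, first_page, last_page, page_param="page"):
    """Return one URL per page number in [first_page, last_page].

    ASSUMPTION: the page number is passed as a query parameter named
    `page_param`; the real site may encode it differently.
    """
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    urls = []
    for page in range(first_page, last_page + 1):
        # Keep any existing filter parameters and set the page number.
        params = dict(parse_qsl(query))
        params[page_param] = str(page)
        urls.append(
            urlunsplit((scheme, netloc, path, urlencode(params), fragment))
        )
    return urls


# In the spider source file, a list like this would be assigned to start_urls.
start_urls = build_start_urls(BASE_URL, 1, 3)
```

Generating the list this way keeps the filter URL and the page range in one place, which may be easier to edit than a hand-written list of URLs.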

To run a spider, call scrapy crawl slideshare or scrapy crawl slideshare_full.
