⛰️ caibg
Using scrapy's web crawling capabilities, two custom spiders crawled the subdomains rifugi-bivacchi and sentieri (prerequisite for running the commands below: the scrapy Python package)
pushd scrapy
scrapy crawl caibg_rifugi --nolog -O ../data/caibg-rifugi.json
scrapy crawl caibg_sentieri --nolog -O ../data/caibg-sentieri.json
popd
jq -s '.[0] + .[1]' data/caibg-rifugi.json data/caibg-sentieri.json \
| jq '{"objects": .}' \
> data/caibg.json
The jq step glues the two results together into caibg.json, a graph of the links between these two subdomains, ready to be parsed by pgrank
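For readers less familiar with jq, here is a minimal Python sketch of what the merge does: it concatenates the two JSON arrays written by the spiders and wraps the result under an "objects" key. The function name `merge_crawls`, the `url` field, and the example.org URLs are hypothetical and only illustrate the shape of the data.

```python
import json


def merge_crawls(rifugi, sentieri):
    """Concatenate two lists of crawled items and wrap them as {"objects": [...]}."""
    return {"objects": rifugi + sentieri}


# Hypothetical items: the real fields depend on what the spiders emit.
rifugi = [{"url": "https://example.org/rifugi-bivacchi/a"}]
sentieri = [{"url": "https://example.org/sentieri/b"}]

merged = merge_crawls(rifugi, sentieri)
print(json.dumps(merged, indent=2))
```

The order of the items is preserved: everything from the first file, then everything from the second.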
pgrank data/caibg.json data/caibg.csv
The result is a csv file containing the pageranks in the same order as the input url nodes from the json
... That's it. The pgrank app only handles the compute-intensive task; further analysis can be delegated to a friendlier framework, such as pandas
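Because the csv rows follow the order of the input nodes, pairing each rank with its url is a positional operation. The sketch below shows one way to do it with pandas. It assumes a csv with one value per row and no header, and nodes carrying a `url` field; both are assumptions about the file layout, and the inline strings stand in for data/caibg.json and data/caibg.csv.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for data/caibg.json and data/caibg.csv.
graph_json = (
    '{"objects": [{"url": "https://example.org/a"},'
    ' {"url": "https://example.org/b"},'
    ' {"url": "https://example.org/c"}]}'
)
ranks_csv = "0.2\n0.5\n0.3\n"

nodes = json.loads(graph_json)["objects"]
ranks = pd.read_csv(io.StringIO(ranks_csv), header=None, names=["pagerank"])

# Row i of the csv belongs to node i of the json, so assign urls by position.
ranks["url"] = [node["url"] for node in nodes]

# Highest pagerank first.
top = ranks.sort_values("pagerank", ascending=False).reset_index(drop=True)
print(top)
```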
To make the results more readable, caibg.py creates a summary markdown table, caibg.md
Note
requirements: pandas, tabulate
An interactive cytoscape.js network version is available at https://andros21.github.io/pgrank/caibg/
Note
cyto/data.json can be created using cyto.py
requirements: pandas, networkx
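One plausible way to get from a link graph to cytoscape.js input is networkx's `cytoscape_data` helper, sketched below. This is an illustration, not the content of cyto.py: the `url` and `links` field names and the tiny three-node graph are assumptions.

```python
import json

import networkx as nx

# Hypothetical crawl items: "url" and "links" are assumed field names.
objects = [
    {"url": "a", "links": ["b", "c"]},
    {"url": "b", "links": ["a"]},
    {"url": "c", "links": []},
]

# Build a directed graph with one edge per outgoing link.
graph = nx.DiGraph()
for item in objects:
    graph.add_node(item["url"])
    for target in item["links"]:
        graph.add_edge(item["url"], target)

# Convert to the dict layout used by cytoscape.js ("elements" -> nodes/edges).
cyto = nx.cytoscape_data(graph)
print(json.dumps(cyto["elements"], indent=2))
```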