⛰️ caibg

Input data

The subdomains rifugi-bivacchi and sentieri were crawled with scrapy, using two custom spiders (prerequisite for running the commands below: the scrapy Python package)

pushd scrapy
scrapy crawl caibg_rifugi   --nolog -O ../data/caibg-rifugi.json
scrapy crawl caibg_sentieri --nolog -O ../data/caibg-sentieri.json
popd
jq -s '.[0] + .[1]' data/caibg-rifugi.json data/caibg-sentieri.json \
   | jq '{"objects": .}' \
   > data/caibg.json

The final jq step glues the two results together into caibg.json, a graph of the links between these two subdomains, ready to be parsed by pgrank
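For readers without jq, the merge step can be sketched in Python. This is a minimal sketch, assuming each crawl output is a JSON array (the format scrapy writes for a `.json` feed); the function name `merge_crawls` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def merge_crawls(paths, out_path):
    """Concatenate several JSON arrays and wrap them as {"objects": [...]}."""
    objects = []
    for path in paths:
        # Each input is expected to be a JSON array of scraped items.
        items = json.loads(Path(path).read_text(encoding="utf-8"))
        if not isinstance(items, list):
            raise ValueError(f"{path} does not contain a JSON array")
        objects.extend(items)
    # Same shape as the jq pipeline: a single object with an "objects" key.
    Path(out_path).write_text(
        json.dumps({"objects": objects}, ensure_ascii=False), encoding="utf-8"
    )
    return len(objects)
```

Calling `merge_crawls` with the two crawl outputs and `data/caibg.json` as the destination should produce the same structure as the jq commands, with the items kept in crawl order.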

Output data

pgrank data/caibg.json data/caibg.csv

The result is a CSV file containing the PageRank values in the same order as the input URL nodes in the JSON file. That's it: the pgrank app only handles the compute-intensive task, and further analysis can be delegated to a friendlier framework such as pandas
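Because the CSV keeps the input node order, ranks can be matched back to nodes by position. The sketch below illustrates that idea with the standard library only. It rests on two assumptions that the text above does not confirm: that every node object has a `url` field, and that the CSV holds one PageRank value per row in its first column with no header. The function name `top_pages` is hypothetical.

```python
import csv
import json


def top_pages(json_path, csv_path, n=10):
    """Return the n highest-ranked (url, pagerank) pairs."""
    with open(json_path, encoding="utf-8") as fh:
        # ASSUMPTION: every node object carries a "url" field.
        urls = [obj["url"] for obj in json.load(fh)["objects"]]
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # ASSUMPTION: one PageRank value per row, first column, no header.
        ranks = [float(row[0]) for row in csv.reader(fh) if row]
    if len(urls) != len(ranks):
        raise ValueError("node count and rank count differ")
    # Rows are matched by position, since the CSV keeps the input node order.
    pairs = zip(urls, ranks)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
```

If the real files use a different field name or CSV layout, only the two lines marked as assumptions need to change.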

To make the results more readable, caibg.py creates a summary Markdown table, caibg.md

Note

requirements: pandas tabulate

An interactive cytoscape.js network version is available at https://andros21.github.io/pgrank/caibg/

Note

cyto/data.json can be created using cyto.py
requirements: pandas networkx