Crawling pages that load via AJAX #1

Open
goldenYI opened this issue Jan 21, 2017 · 11 comments

Comments

@goldenYI
Owner

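The site loads its pages via AJAX. When the page returns 200 it is actually still loading asynchronously, so the crawler cannot get the complete page.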

@goldenYI
Owner Author

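Approaches I have thought of:

  1. Sleep for a while, then read the page (this requires fully loading the whole page, including rendering CSS, images, etc., which is resource-intensive)
  2. A JS engine plus a simulated browser environment, i.e. scrapy-splash plus a customized WebKit (with the unneeded features stripped out)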
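To make the second approach a bit more concrete: Splash exposes an HTTP endpoint, `render.html`, that returns the page HTML after JavaScript has run. The sketch below only builds the request URL for that endpoint. It assumes a Splash instance is listening on `localhost:8050` (Splash's default port); the helper name `splash_render_url` and the target URL are placeholders for illustration, not something from this project.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance is reachable here (8050 is Splash's default port).
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, load_images=False, base=SPLASH_BASE):
    """Build a URL for Splash's render.html endpoint (hypothetical helper).

    wait        -- seconds Splash waits after the initial load, giving
                   asynchronous requests time to finish
    load_images -- when False, ask Splash not to download images
    """
    params = {
        "url": page_url,
        "wait": wait,
        "images": 1 if load_images else 0,
    }
    return f"{base}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder target URL, for illustration only.
    print(splash_render_url("http://example.com/page", wait=1.5))
```

Fetching the resulting URL with any HTTP client should return the rendered HTML, provided Splash is actually running. The right `wait` value depends on the page and has to be found by trial.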

@goldenYI
Owner Author

Scrapy has a dedicated integration for this: scrapy-splash

@goldenYI
Owner Author

goldenYI commented Jan 22, 2017

How to install Docker on macOS:

install virtualbox
brew install docker
brew install docker-machine
docker-machine create --driver virtualbox default

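In practice this is still a virtual machine running boot2docker.
The boot2docker download was extremely slow for me, a bit over two hours, even on a 50M connection with a Hong Kong ss proxy. Baffling.
(Manually downloading boot2docker from GitHub and placing it at /Users/*/.docker/machine/cache/boot2docker.iso is faster.)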

@goldenYI
Owner Author

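Even with images disabled during loading, the page has so much content that, without any pruning, one full crawl takes about 10 seconds. That is too long.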

@goldenYI
Owner Author

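Once the video page is loaded, the following still needs to be done:

  • Fetch the video
  • Fetch the danmaku (bullet comments)
  • Fetch the comments
  • Work out whether users can be allowed to send danmaku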

@goldenYI
Owner Author

Address of Bilibili's HTML5 player

@goldenYI
Owner Author

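Video traffic goes through the ChinaNetCenter (网宿) CDN. There are a few ways to get the video:

  • Cache it on the server, then download it
  • Delegate the download to a third-party server
  • Parse the original page's JS to get the cid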

@goldenYI
Owner Author

goldenYI commented Jan 27, 2017

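This afternoon I read through the JS source. The official site already hosts its player with a third party.
Another approach exists as well: cache the data and retrieve it with WebKitBlobBuilder.
But one problem remains: how to obtain the video resource.

  • Compute the cid from the aid. Judging from the API, each cid should correspond to exactly one aid, so getting the video this way is easy. For the second link format, the roles of sign, ts and player still need to be worked out.
  • Crawl the original page and parse its JS. The page contains a line like <script type='text/javascript'>EmbedPlayer('player', "//static.hdslb.com/play.swf", "cid=13433643&aid=8167928&pre_ad=0");</script>, from which the video is easy to get. The open question is how to reduce the parsing latency.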
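The second option can be sketched with nothing more than a regular expression, which avoids running a JS engine for this step. The example below is a minimal sketch that assumes the page always embeds the player with an `EmbedPlayer('player', <swf url>, <query string>)` call shaped like the one quoted above; the function name `extract_player_params` is made up for illustration.

```python
import re
from urllib.parse import parse_qs

# Matches EmbedPlayer('player', "<swf url>", "<query string>") and captures
# the third argument. Assumption: the call always has this three-argument shape.
EMBED_RE = re.compile(
    r"""EmbedPlayer\(\s*['"]player['"]\s*,\s*['"][^'"]*['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)


def extract_player_params(html):
    """Return the EmbedPlayer query parameters as a dict, or None if absent."""
    match = EMBED_RE.search(html)
    if match is None:
        return None
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(match.group(1)).items()}


if __name__ == "__main__":
    sample = (
        "<script type='text/javascript'>"
        "EmbedPlayer('player', \"//static.hdslb.com/play.swf\", "
        "\"cid=13433643&aid=8167928&pre_ad=0\");</script>"
    )
    print(extract_player_params(sample))
```

Because this works on the raw HTML response, it does not depend on the page finishing its asynchronous loading, as long as the `EmbedPlayer` call is present in the initial document.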

@goldenYI
Owner Author

Got lazy: I put it on the server and just crawled the whole site.

@goldenYI
Owner Author

The proxy IPs I grabbed at random from the web are almost all dead...
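One cheap way to weed out dead entries before crawling is to test whether each proxy's port accepts a TCP connection at all. The sketch below does only that; both function names are hypothetical, and an open port does not prove the proxy actually forwards requests, so this is a first-pass filter rather than a full health check.

```python
import socket


def proxy_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def filter_reachable(proxies, timeout=3.0):
    """Keep only the (host, port) pairs that accept a TCP connection."""
    return [(host, port) for host, port in proxies
            if proxy_is_reachable(host, port, timeout)]
```

Checks run one after another here, so a long list with a multi-second timeout will be slow; running them concurrently would be the obvious next step.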

@goldenYI
Owner Author

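The data volume is roughly 5 to 6 million records. The most efficient and sensible option would be to hand proxying over to a third-party service, but this is only a project I wrote casually. If something similar comes up at work I will just buy proxies; for now I will crawl as much as I can.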
