Design a web crawler: translation #28
@sqrthree I'll take the proofreading.
OK.
* Search analytics
* Personalized search results
* Page rank
* 搜素分析
Typo: this should be 搜索 ("search"), not 搜素.
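* 用户很快就能看到搜索结果 [Users quickly see the search results]
* 网页爬虫不应该陷入死循环 [The web crawler should not get stuck in an infinite loop]
* 当爬虫路径包含环的时候,将会陷入死循环 [It gets stuck in an infinite loop when the crawl path contains a cycle]
* 抓取 100 万个链接 [Crawl 1 million links]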
Should be 10 亿 (1 billion); 100 万 is 1 million.
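The infinite-loop constraint quoted in this hunk is easy to illustrate. Below is a minimal sketch, assuming a plain visited set is enough to break cycles; the `crawl` function, the `get_links` callback, and the three-page `toy_web` are hypothetical names made up for this note, not part of the translated document.

```python
from collections import deque


def crawl(seed, get_links):
    """Breadth-first crawl that visits each URL at most once."""
    crawled = set()          # URLs already processed
    queue = deque([seed])    # URLs waiting to be processed
    order = []               # visit order, kept only for inspection
    while queue:
        url = queue.popleft()
        if url in crawled:
            continue         # already seen: this is what breaks a cycle
        crawled.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in crawled:
                queue.append(link)
    return order


# Hypothetical three-page site whose links form a cycle: a -> b -> c -> a
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"]}
visited = crawl("a", lambda url: toy_web.get(url, []))
print(visited)
```

Without the `crawled` check the queue would never drain on this input, which is the failure the requirement is warning about.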
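* 抓取 100 万个链接 [Crawl 1 million links]
* 要定期重新抓取页面以确保新鲜度 [Pages must be re-crawled regularly to ensure freshness]
* 平均每周重新抓取一次,网站越热门,那么重新抓取的频率越高 [Re-crawl about once a week on average; the more popular the site, the more often it is re-crawled]
* 每月抓取 400 万个链接 [Crawl 4 million links each month]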
Should be 40 亿 (4 billion); 400 万 is 4 million.
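* 要定期重新抓取页面以确保新鲜度 [Pages must be re-crawled regularly to ensure freshness]
* 平均每周重新抓取一次,网站越热门,那么重新抓取的频率越高 [Re-crawl about once a week on average; the more popular the site, the more often it is re-crawled]
* 每月抓取 400 万个链接 [Crawl 4 million links each month]
* 每个页面的平均存储大小: 500 KB [Average stored size per page: 500 KB]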
Drop the space after the full-width colon:
`每个页面的平均存储大小: 500 KB` -> `每个页面的平均存储大小:500 KB`
Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/).
用更传统的系统来练习 —— 不要使用现成的系统,比如: [solr](http://lucene.apache.org/solr/) 或者 [nutch](http://nutch.apache.org/)。
* 1,600 write requests per second
* 40,000 search requests per second
* 每月存储 2 PB 页面 [Store 2 PB of pages per month]
* 每月抓取 400 万个页面,每个页面 500 KB [4 million pages crawled per month, 500 KB per page]
Should be 40 亿 (4 billion); 400 万 is 4 million.
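A quick arithmetic check shows why the billion/million distinction matters in this hunk: only 4 billion pages per month is consistent with the 2 PB and 1,600 writes-per-second figures quoted next to it. The sketch below uses decimal units and assumes a rounded 2.5 million seconds per month; that rounding and all names in the code are assumptions made for this note.

```python
KB_PER_TB = 10**9    # decimal units: 1 TB = 10^9 KB
KB_PER_PB = 10**12   # decimal units: 1 PB = 10^12 KB
SECONDS_PER_MONTH = 2_500_000  # assumption: rounded back-of-the-envelope value


def monthly_storage_kb(pages_per_month, kb_per_page=500):
    """Storage written per month, in KB."""
    return pages_per_month * kb_per_page


def writes_per_second(pages_per_month):
    """Average write rate if each crawled page is one write."""
    return pages_per_month / SECONDS_PER_MONTH


billion_kb = monthly_storage_kb(4_000_000_000)  # 40 亿 pages
million_kb = monthly_storage_kb(4_000_000)      # 400 万 pages

print(billion_kb / KB_PER_PB, "PB per month with 4 billion pages")
print(million_kb / KB_PER_TB, "TB per month with 4 million pages")
print(writes_per_second(4_000_000_000), "writes per second with 4 billion pages")
```

With 4 million pages the monthly volume is three orders of magnitude short of the 2 PB stated in the line above, so the translated figure has to be 40 亿.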
We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc
假设我们有一个初始列表 `links_to_crawl`(待抓取链接),它最初基于网站整体的知名度来排序。当然如果这个假设不合理,我们可以使用知名门户网站作为种子链接来进行扩散,例如: [Yahoo](https://www.yahoo.com/)、 [DMOZ](http://www.dmoz.org/),等等。
We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
我们可以将 `links_to_crawl` 和 `crawled_links` 记录在键-值型 **NoSQL 数据库**。对于 `crawled_links` 中已排序的链接,我们可以使用 [Redis](https://redis.io/) 的有序集合来维护网页链接的排名。我们应当在 [选择 SQL 还是 NoSQL 的问题上,讨论有关使用场景以及利弊 ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。
Use the full construction 记录在...中 ("stored in ..."); the closing 中 is missing.
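For readers unfamiliar with what "a ranking of page links" means in the English source sentence, here is a minimal in-memory sketch. It is a stand-in for a sorted set built on Python's `heapq`, not Redis itself (where the `ZADD` command adds a member with a score); the `RankedLinks` class, its method names, and the example URLs and scores are all hypothetical.

```python
import heapq


class RankedLinks:
    """In-memory stand-in for a sorted set of links ordered by score."""

    def __init__(self):
        self._scores = {}  # url -> current score
        self._heap = []    # (-score, url); may hold stale entries

    def add(self, url, score):
        """Insert a link or update its score."""
        self._scores[url] = score
        heapq.heappush(self._heap, (-score, url))

    def pop_highest(self):
        """Remove and return (url, score) for the best-ranked link, or None."""
        while self._heap:
            neg_score, url = heapq.heappop(self._heap)
            # Skip entries whose score was later updated or already popped.
            if self._scores.get(url) == -neg_score:
                del self._scores[url]
                return url, -neg_score
        return None


links = RankedLinks()
links.add("https://example.com/a", 0.9)
links.add("https://example.com/b", 0.4)
links.add("https://example.com/b", 0.95)  # score update for the same link
first = links.pop_highest()
print(first)
```

The crawler would repeatedly take the highest-ranked entry from `links_to_crawl`; a real deployment would delegate ordering and persistence to the datastore rather than to process memory.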
* For smaller lists we could use something like `sort | unique`
* With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1
* 假设数据量较小,我们可以用类似于 `sort | unique` 的方法。(译注: 先排序,后去重)
* 假设有 100 万条数据,我们应该使用 **MapReduce** 来输出只出现 1 次的记录。
Should be 10 亿 (1 billion), matching "1 billion links" in the source; 100 万 is 1 million.
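The "frequency of 1" wording in this hunk can be made concrete with a small single-process sketch of the map and reduce steps. It assumes the job is simply "count each URL, keep those seen exactly once"; the function names and sample URLs are hypothetical, and a real MapReduce job would shuffle the pairs across machines instead of using one dictionary. On the command line, the small-list equivalent is `sort | uniq -u`.

```python
from collections import defaultdict


def map_phase(urls):
    """Emit a (url, 1) pair for every URL seen."""
    for url in urls:
        yield url, 1


def reduce_phase(pairs):
    """Sum counts per URL and keep only URLs that occurred exactly once."""
    counts = defaultdict(int)
    for url, count in pairs:
        counts[url] += count
    return sorted(url for url, total in counts.items() if total == 1)


sample = ["example.com/a", "example.com/b", "example.com/a", "example.com/c"]
unique_urls = reduce_phase(map_phase(sample))
print(unique_urls)
```

Note that this drops every copy of a repeated URL rather than keeping one, which is what "output only entries that have a frequency of 1" says literally.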
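Basically perfect =w=. The main problem is that "billion" was sometimes read as "million".
Also, there is no need to add a space between full-width punctuation and English words; otherwise the text looks too sparse. See the related reference material.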
LGTM
Translation finished... This is my first translation... Any guidance from the experts is very welcome~
Many thanks!
@sqrthree