-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issue108: Priority does not flow down to child nodes #109
issue108: Priority does not flow down to child nodes #109
Conversation
classes/robot/crawler.php
Outdated
@@ -902,7 +902,7 @@ private function link_from_node_to_url($from, $url, $text, $idattr) { | |||
global $DB; | |||
|
|||
// Add the node URL to the queue. | |||
$to = $this->mark_for_crawl($from->url, $url); | |||
$to = $this->mark_for_crawl($from->url, $url, null, $from->priority); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So lets say I mark A as high priority
A then links to, B,C,D,E,F
By the looks of it, this will mean all B-F will be high priority. And then when we scrape those, every link on those pages will end up high priority.And so on. I haven't tested this but it looks like priority will infect everything. But we only want it to affect this page, and the links on this page.
ie I think conceptually we should be 'this node' as high priority, as well as marking the outgoing 'edge's of this node as high priority, not the nodes.
Implementation wise, I'm not sure we actually need to add a priority column to the edge table, an easier solution might be to have 2 levels of priority: one which inherits and one which doesn't. We mark the starting node as priority 2 = inherit, and then if it's 2 then we mark it's children as 1
Add parent priority to child nodes when marking for crawl
This enhancement aims to flow priority down to direct child nodes only. Through the implementation of node levels and a level check when marking a node to be crawled, we only assign a parent priority to a child node if it is a direct ancestor of the original node. This will prevent passing priority recursively and if, for example, a child node is a top level node, filtering the priority to effectively all nodes, which is undesirable behaviour.
b49ffe6
to
54026eb
Compare
Hi @brendanheywood, You can see from the unit tests that now there is logic checking the level of each node when parsing the html and linking to a new url, so that the level is decremented each time, and parent nodes at level 2 keep their priority, while direct children at level 1 do as well, any subsequent children are at level 0 and receive the default priority level. Could you please review and let me know if you are happy with this approach? Take it easy |
Add parent priority to child nodes when marking for crawl