Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue108: Priority does not flow down to child nodes #109

Merged

Conversation

tomdickman
Copy link
Contributor

Add parent priority to child nodes when marking for crawl

@tomdickman tomdickman self-assigned this Jan 20, 2020
@@ -902,7 +902,7 @@ private function link_from_node_to_url($from, $url, $text, $idattr) {
global $DB;

// Add the node URL to the queue.
$to = $this->mark_for_crawl($from->url, $url);
$to = $this->mark_for_crawl($from->url, $url, null, $from->priority);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So lets say I mark A as high priority

A then links to, B,C,D,E,F

By the looks of it, this will mean all B-F will be high priority. And then when we scrape those, every link on those pages will end up high priority.And so on. I haven't tested this but it looks like priority will infect everything. But we only want it to affect this page, and the links on this page.

ie I think conceptually we should be 'this node' as high priority, as well as marking the outgoing 'edge's of this node as high priority, not the nodes.

Implementation wise, I'm not sure we actually need to add a priority column to the edge table, an easier solution might be to have 2 levels of priority: one which inherits and one which doesn't. We mark the starting node as priority 2 = inherit, and then if it's 2 then we mark it's children as 1

Tom Dickman added 2 commits January 23, 2020 13:31
Add parent priority to child nodes when marking for crawl
This enhancement aims to flow priority down to direct child nodes only.
Through the implementation of node levels and a level check when
marking a node to be crawled, we only assign a parent priority to a
child node if it is a direct ancestor of the original node.
This will prevent passing priority recursively and if, for example, a child
node is a top level node, filtering the priority to effectively all nodes,
which is undesirable behaviour.
@tomdickman tomdickman force-pushed the issue108_priority_does_not_flow_down_to_child_nodes branch from b49ffe6 to 54026eb Compare January 23, 2020 02:37
@tomdickman
Copy link
Contributor Author

Hi @brendanheywood,
As per you comment in #109 (comment), to tackle the recursion issue, I have added a level field to each node in the {tool_crawler_url} table.

You can see from the unit tests that now there is logic checking the level of each node when parsing the html and linking to a new url, so that the level is decremented each time, and parent nodes at level 2 keep their priority, while direct children at level 1 do as well, any subsequent children are at level 0 and receive the default priority level.

Could you please review and let me know if you are happy with this approach?

Take it easy

@brendanheywood brendanheywood merged commit e7f557e into master Jan 28, 2020
@brendanheywood brendanheywood deleted the issue108_priority_does_not_flow_down_to_child_nodes branch January 28, 2020 01:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants