Royce branch #109

Open
wants to merge 93 commits into base: master

Conversation

xKimChip

No description provided.

Eucalyptus5 and others added 30 commits October 24, 2024 16:01
…website token sets; the same pattern can be modeled for other global data structures. Changes made are non-destructive in theory, as everything that should have been implemented has been commented out and is currently behind an if True block
…nst whether the URL is similar and whether to evaluate a URL based on its path similarity
Added changes, mostly for checking URL similarity and assigning it a score
…tion that can be used anywhere and should likely be extracted into its own file for testing
…firm that it works; raised the global URL similarity threshold from 0.8 to 0.85
Merging changes from link_similarity.py and the new test_suite.py into master branch; non-destructive change
put the ngrams in their own module to allow for easier code readability
… make the code base easier and more logical to read; will have to add changes in ngrams.py to allow reading and writing the ngrams globals there instead of in globals.py. Only ngrams alters and accesses these variables, so it's more logical to include them there
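
The commits above mention scoring URL similarity by path and comparing the score against a global threshold of 0.85. The PR's actual scoring code is not shown here, so the following is only a minimal sketch of one way such a check could look. The names path_similarity, is_similar, and SIMILARITY_THRESHOLD are hypothetical, and difflib.SequenceMatcher is a stand-in technique, not necessarily what link_similarity.py uses.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Threshold value taken from the commit message; the constant name is hypothetical.
SIMILARITY_THRESHOLD = 0.85


def path_similarity(url_a: str, url_b: str) -> float:
    """Return a score in [0, 1] for how alike the two URL paths are.

    Assumption: URLs on different hosts are never treated as similar.
    """
    a, b = urlparse(url_a), urlparse(url_b)
    if a.netloc.lower() != b.netloc.lower():
        return 0.0
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T is the combined length of both paths.
    return SequenceMatcher(None, a.path, b.path).ratio()


def is_similar(url_a: str, url_b: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True when the path similarity score meets or exceeds the threshold."""
    return path_similarity(url_a, url_b) >= threshold
```

With this sketch, two calendar-style URLs that differ only in the last digit of the path score above 0.85 and would be flagged as similar, while unrelated paths on the same host would not.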
dillonct and others added 30 commits November 4, 2024 18:31
…ons and the like. The testing was done in main and can be uncommented; two additional files are needed for testing. There are two hash functions to choose between: my own string hashing function (extremely basic; does not take order into account) and Python's hash. The choice is declared as a global variable that is intended only to be read and never changed (treat as const)
fixed ngrams; turns out I was doing some wonky stuff with additi…
added some extra safety checks and made everything accessible via mul…
added some basic changes to the scraper so that it works through more regex
added one extra safety check inside of the worker
…y checking to make sure that the URLs it gets are not already being looked at by another worker at the same time
…e of readability. Also found the < operator that was breaking the code and provided the fix
…es a function with the same body but uses a lock for thread safety
Several changes to globals.py, ngrams.py, scraper.py
…requires creating the main function, but almost everything is already in place for it to work
…ed by sets and the like; no functional changes made to the code of the class. Also makes the pickled file a global variable
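
Several commits above describe guarding shared state with a lock, checking that a URL is not already being processed by another worker, and choosing between a basic order-insensitive string hash and Python's built-in hash through a read-only global. The PR's code is not visible here, so this is only a sketch of those ideas under assumptions. Every name in it (basic_hash, HASH_FUNCTION, try_claim, release) is hypothetical.

```python
import threading


def basic_hash(text: str) -> int:
    """Very basic hash: the sum of code points, so character order is ignored."""
    return sum(ord(ch) for ch in text)


# Read-only choice between basic_hash and the built-in hash (treat as const).
HASH_FUNCTION = basic_hash

# URLs currently being processed by some worker, guarded by a lock.
_in_flight = set()
_in_flight_lock = threading.Lock()


def try_claim(url: str) -> bool:
    """Atomically mark a URL as in progress; return False if already claimed."""
    with _in_flight_lock:
        if url in _in_flight:
            return False
        _in_flight.add(url)
        return True


def release(url: str) -> None:
    """Mark a URL as no longer in progress."""
    with _in_flight_lock:
        _in_flight.discard(url)
```

A worker would call try_claim(url) before fetching and release(url) afterwards, so the membership check and the insert happen under the same lock. Note that Python's built-in hash for strings is randomized per process unless PYTHONHASHSEED is set, which matters if hashes are written to a pickled file and read back in a later run.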
4 participants