Commit

[[ crawler ]] switch to the new bs4 scriptElement.string API (#4)
SoheilKhodayari committed May 20, 2022
1 parent 57761d7 commit 7c0e71d
Showing 2 changed files with 5 additions and 6 deletions.
3 changes: 1 addition & 2 deletions docs/setup.md
@@ -14,8 +14,7 @@ Below is the quick installation guide. Please see [here](installation.md) for `d
### Step 1: Installing Python/NodeJS Dependencies
In the project root directory, run:
```sh
-$ cd installation
-$ ./install_dependencies.sh
+$ ./installation/install_dependencies.sh
```

### Step 2: Setup Neo4j
8 changes: 4 additions & 4 deletions hpg_crawler/dom_collector.py
@@ -218,14 +218,14 @@ def get_dynamic_data(siteId, url, driver= None, close_conn= True, internal_only=
 if not i.get('src'):
     if not i.get('type'):
         # script contains JS if type is absent
-        scripts.append(['internal_script', i.text])
-        internals.append(['internal_script', i.text])
+        scripts.append(['internal_script', i.string])
+        internals.append(['internal_script', i.string])
     else:
         script_type = i.get('type')
         # filter out text/json, etc
         if is_valid_script_type(script_type):
-            scripts.append(['internal_script', i.text])
-            internals.append(['internal_script', i.text])
+            scripts.append(['internal_script', i.string])
+            internals.append(['internal_script', i.string])

 else:
     relative_link = i.get('src').lstrip('/')
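
Note on the `.text` to `.string` switch: in Beautiful Soup, `Tag.string` returns `None` when a tag has no children (for example an empty `<script></script>`) or more than one child, so the appended value may be `None` rather than an empty string. The following is a minimal sketch of a guard, not part of this commit; the helper name `inline_script_source` and the `demo` function are hypothetical, and `i` in the diff is assumed to be a bs4 `Tag`.

```python
def inline_script_source(tag):
    """Return the source of an inline <script> tag, or '' if it has none.

    `tag` is assumed to be a bs4 Tag; any object exposing `.string` works.
    `.string` is None when the tag has zero children or more than one child.
    """
    source = tag.string
    return str(source) if source is not None else ""


def demo():
    # bs4 is imported lazily so the helper above stays dependency-free.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    html = "<script>var a = 1;</script><script></script>"
    soup = BeautifulSoup(html, "html.parser")
    # One entry per <script> tag; the empty tag yields '' instead of None.
    return [inline_script_source(s) for s in soup.find_all("script")]
```

With such a helper, the calls in the hunk above could pass `inline_script_source(i)` instead of `i.string` if downstream code expects a string in every entry. Whether that is needed depends on how `scripts` and `internals` are consumed, which this diff does not show.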