Add API and caching #4

Merged 3 commits on Oct 3, 2024

4 changes: 3 additions & 1 deletion Dockerfile
@@ -44,6 +44,8 @@ RUN pip install --no-cache-dir /src/scraper \
# Copy zimui build output
COPY --from=zimui /src/dist /src/zimui

ENV LIBRETEXTS_ZIMUI_DIST=/src/zimui
ENV LIBRETEXTS_ZIMUI_DIST=/src/zimui \
LIBRETEXTS_OUTPUT=/output \
LIBRETEXTS_TMP=/tmp

CMD ["libretexts2zim", "--help"]
1 change: 0 additions & 1 deletion scraper/pyproject.toml
@@ -47,7 +47,6 @@ dev = [

[project.scripts]
libretexts2zim = "libretexts2zim.__main__:main"
libretexts2zim-playlists = "libretexts2zim.playlists.__main__:main"

[tool.hatch.version]
path = "src/libretexts2zim/__about__.py"
13 changes: 8 additions & 5 deletions scraper/src/libretexts2zim/__main__.py
@@ -1,9 +1,12 @@
#!/usr/bin/env python3
# vim: ai ts=4 sts=4 et sw=4 nu
import tempfile


import sys
from libretexts2zim.entrypoint import main as entrypoint



def main():

with tempfile.TemporaryDirectory() as tmpdir:
entrypoint(tmpdir)


from libretexts2zim.entrypoint import main

if __name__ == "__main__":
sys.exit(main())
main()
144 changes: 136 additions & 8 deletions scraper/src/libretexts2zim/client.py
@@ -1,14 +1,18 @@
import datetime
import json
import re
from collections.abc import Callable
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup, NavigableString
from pydantic import BaseModel

from libretexts2zim.constants import logger

HTTP_TIMEOUT_SECONDS = 15
HTTP_TIMEOUT_NORMAL_SECONDS = 15
HTTP_TIMEOUT_LONG_SECONDS = 30


class LibreTextsParsingError(Exception):
@@ -50,48 +54,152 @@
class LibreTextsClient:
"""Utility functions to read data from libretexts."""

def __init__(self, library_slug: str) -> None:
def __init__(self, library_slug: str, cache_folder: Path) -> None:
"""Initializes LibreTextsClient.

Parameters:
library_slug: Slug of the LibreTexts library,
e.g. `geo` for `https://geo.libretexts.org/`.
"""
self.library_slug = library_slug
self.deki_token = None
self.cache_folder = cache_folder


@property
def library_url(self) -> str:
return f"https://{self.library_slug}.libretexts.org/"
return f"https://{self.library_slug}.libretexts.org"


def _get_text(self, url: str) -> str:
@property
def api_url(self) -> str:
return f"{self.library_url}/@api/deki"


def _get_cache_file(self, url_subpath_and_query: str) -> Path:
"""Get location where HTTP result should be cached"""
url_subpath_and_query = re.sub(r"^/", "", url_subpath_and_query)

if url_subpath_and_query.endswith("/"):
url_subpath_and_query += "index"
return self.cache_folder / url_subpath_and_query


def _get_text(self, url_subpath_and_query: str) -> str:
"""Perform a GET request and return the response as decoded text."""

logger.debug(f"Fetching {url}")
cache_file = self._get_cache_file(f"text{url_subpath_and_query}")

if cache_file.exists():
return cache_file.read_text()
cache_file.parent.mkdir(parents=True, exist_ok=True)


full_url = f"{self.library_url}{url_subpath_and_query}"
logger.debug(f"Fetching {full_url}")


resp = requests.get(
url=url,
url=full_url,
allow_redirects=True,
timeout=HTTP_TIMEOUT_SECONDS,
timeout=HTTP_TIMEOUT_NORMAL_SECONDS,
)
resp.raise_for_status()

cache_file.write_text(resp.text)

return resp.text

def _get_api_resp(
self, api_sub_path_and_query: str, timeout: float
) -> requests.Response:
api_url = f"{self.api_url}{api_sub_path_and_query}"
logger.debug(f"Calling API at {api_url}")
resp = requests.get(

url=api_url,
headers={"x-deki-token": self.deki_token},
timeout=timeout,
)
resp.raise_for_status()
return resp


def _get_api_json(
self, api_sub_path: str, timeout: float = HTTP_TIMEOUT_NORMAL_SECONDS
) -> Any:
cache_file = self._get_cache_file(f"api_json{api_sub_path}")

if cache_file.exists():
return json.loads(cache_file.read_text())
cache_file.parent.mkdir(parents=True, exist_ok=True)
resp = self._get_api_resp(

f"{api_sub_path}?dream.out.format=json", timeout=timeout
)
result = resp.json()
cache_file.write_text(json.dumps(result))
return result


def _get_api_content(
self, api_sub_path: str, timeout: float = HTTP_TIMEOUT_NORMAL_SECONDS
) -> bytes | Any:
cache_file = self._get_cache_file(f"api_content{api_sub_path}")

if cache_file.exists():
return cache_file.read_bytes()
cache_file.parent.mkdir(parents=True, exist_ok=True)
resp = self._get_api_resp(api_sub_path, timeout=timeout)
result = resp.content
cache_file.write_bytes(result)
return result


def get_home(self) -> LibreTextsHome:
home_content = self._get_text(self.library_url)
"""Retrieves data about home page by crawling home page"""
home_content = self._get_text("/")


soup = _get_soup(home_content)
self.deki_token = _get_deki_token_from_home(soup)

return LibreTextsHome(
welcome_text_paragraphs=_get_welcome_text_from_home(soup),
welcome_image_url=_get_welcome_image_url_from_home(soup),
)

def get_deki_token(self) -> str:
"""Retrieves the API token to use to query the website API"""
if self.deki_token:
return self.deki_token


home_content = self._get_text("/")


soup = _get_soup(home_content)
self.deki_token = _get_deki_token_from_home(soup)
return self.deki_token


def get_all_pages_ids(self):
"""Returns the IDs of all pages on current website, exploring the whole tree"""

tree = self._get_api_json("/pages/home/tree", timeout=HTTP_TIMEOUT_LONG_SECONDS)


page_ids: list[str] = []


def _get_page_ids(page_node: Any) -> None:
page_ids.append(page_node["@id"])

if not page_node["subpages"]:
return

if "@id" in page_node["subpages"]["page"]:
_get_page_ids(page_node["subpages"]["page"])

else:
for page in page_node["subpages"]["page"]:
_get_page_ids(page)


_get_page_ids(tree["page"])


return page_ids


def get_root_page_id(self) -> str:
"""Returns the ID the root of the tree of pages"""

tree = self._get_api_json("/pages/home/tree", timeout=HTTP_TIMEOUT_LONG_SECONDS)
return tree["page"]["@id"]



def _get_soup(content: str) -> BeautifulSoup:
"""Return a BeautifulSoup soup from textual content

This is a utility function to ensure the same parser is used across the whole codebase
"""
return BeautifulSoup(content, "lxml")


def _get_welcome_image_url_from_home(soup: BeautifulSoup) -> str:
"""Return the URL of the image found on home header"""
branding_div = soup.find("div", class_="LTBranding")
if not branding_div:
raise LibreTextsParsingError("<div> with class 'LTBranding' not found")
Expand All @@ -111,6 +219,7 @@


def _get_welcome_text_from_home(soup: BeautifulSoup) -> list[str]:
"""Returns the text found on home page"""
content_section = soup.find("section", class_="mt-content-container")
if not content_section or isinstance(content_section, NavigableString):
raise LibreTextsParsingError(
Expand All @@ -121,3 +230,22 @@
if paragraph_text := paragraph.text:
welcome_text.append(paragraph_text)
return welcome_text


def _get_deki_token_from_home(soup: BeautifulSoup) -> str:
global_settings = soup.find("script", id="mt-global-settings")

if not global_settings:
logger.debug("home content:")
logger.debug(soup)
raise Exception(

"Failed to retrieve API token to query website API, missing "
"mt-global-settings script"
)
x_deki_token = json.loads(global_settings.text).get("apiToken", None)

if not x_deki_token:
logger.debug("mt-global-settings script content:")
logger.debug(global_settings.text)
raise Exception(

"Failed to retrieve API token to query website API, missing apiToken."
)
return x_deki_token

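For context, a minimal usage sketch of the caching client added in this PR (the slug and cache path below are illustrative values, not taken from the diff):

from pathlib import Path

from libretexts2zim.client import LibreTextsClient

# Illustrative slug and cache location; any LibreTexts library and writable folder works.
client = LibreTextsClient(library_slug="geo", cache_folder=Path("/tmp/cache"))
client.get_deki_token()  # crawls the home page once to obtain the x-deki-token
root_page_id = client.get_root_page_id()  # tree response is cached under /tmp/cache/api_json/...
page_ids = client.get_all_pages_ids()  # subsequent calls/runs read the cached JSON instead of the network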
34 changes: 29 additions & 5 deletions scraper/src/libretexts2zim/entrypoint.py
@@ -8,6 +8,7 @@
MAXIMUM_LONG_DESCRIPTION_METADATA_LENGTH,
RECOMMENDED_MAX_TITLE_LENGTH,
)
from zimscraperlib.zim.filesystem import validate_zimfile_creatable

from libretexts2zim.client import LibreTextsClient
from libretexts2zim.constants import (
@@ -137,7 +138,7 @@
)


def main() -> None:
def main(tmpdir: str) -> None:
parser = argparse.ArgumentParser(
prog=NAME,
)
@@ -177,6 +178,13 @@
dest="output_folder",
)

parser.add_argument(

"--tmp",
help="Temporary folder for cache, intermediate files, ... Default: tmp",
default=os.getenv("LIBRETEXTS_TMP", tmpdir),
dest="tmp_folder",
)

parser.add_argument(
"--debug", help="Enable verbose output", action="store_true", default=False
)
@@ -191,15 +199,35 @@
default=os.getenv("LIBRETEXTS_ZIMUI_DIST", "../zimui/dist"),
)

parser.add_argument(

"--keep-cache",
help="Keep cache of website responses",
action="store_true",
default=False,
)

args = parser.parse_args()

logger.setLevel(level=logging.DEBUG if args.debug else logging.INFO)

output_folder = Path(args.output_folder)
output_folder.mkdir(exist_ok=True)
validate_zimfile_creatable(output_folder, "test.txt")


tmp_folder = Path(args.tmp_folder)
tmp_folder.mkdir(exist_ok=True)
validate_zimfile_creatable(tmp_folder, "test.txt")


try:
zim_config = ZimConfig.of(args)
doc_filter = ContentFilter.of(args)

cache_folder = tmp_folder / "cache"
cache_folder.mkdir()


libretexts_client = LibreTextsClient(
library_slug=args.library_slug,
cache_folder=cache_folder,
)

Processor(
@@ -217,7 +245,3 @@
logger.exception(exc)
logger.error(f"Generation failed with the following error: {exc}")
raise SystemExit(1) from exc


if __name__ == "__main__":
main()
17 changes: 11 additions & 6 deletions scraper/src/libretexts2zim/processor.py
@@ -9,6 +9,7 @@
)
from zimscraperlib.image import resize_image
from zimscraperlib.zim import Creator
from zimscraperlib.zim.filesystem import validate_zimfile_creatable
from zimscraperlib.zim.indexing import IndexData

from libretexts2zim.client import LibreTextsClient, LibreTextsMetadata
@@ -117,8 +118,6 @@
self.zimui_dist = zimui_dist
self.overwrite_existing_zim = overwrite_existing_zim

self.output_folder.mkdir(exist_ok=True)

self.zim_illustration_path = self.libretexts_newsite_path(
"header_logo_mini.png"
)
@@ -145,11 +144,17 @@
name=self.zim_config.library_name, slug=self.libretexts_client.library_slug
)
formatted_config = self.zim_config.format(metadata.placeholders())
zim_path = Path(self.output_folder, f"{formatted_config.file_name_format}.zim")
zim_file_name = f"{formatted_config.file_name_format}.zim"
zim_path = self.output_folder / zim_file_name


if zim_path.exists():
if self.overwrite_existing_zim:
zim_path.unlink()

else:
logger.error(f" {zim_path} already exists, aborting.")
raise SystemExit(2)


if zim_path.exists() and not self.overwrite_existing_zim:
logger.error(f" {zim_path} already exists, aborting.")
raise SystemExit(2)
validate_zimfile_creatable(self.output_folder, zim_file_name)


logger.info(f" Writing to: {zim_path}")

6 changes: 6 additions & 0 deletions scraper/tests-integration/README.md
@@ -0,0 +1,6 @@
This folder contains integration tests checking how the scraper behaves:

- with a real LibreTexts website
- from end to end

They are meant to be run from the scraper Docker image in GitHub workflow(s).
11 changes: 11 additions & 0 deletions scraper/tests-integration/conftest.py
@@ -1,3 +1,8 @@
import tempfile
from collections.abc import Generator
from pathlib import Path
from typing import Any

import pytest


@@ -6,6 +11,12 @@ def libretexts_slug() -> str:
return "geo"


@pytest.fixture(scope="module")
def cache_folder() -> Generator[Path, Any, Any]:
with tempfile.TemporaryDirectory() as tmpdir:
yield Path(tmpdir)


@pytest.fixture(scope="module")
def libretexts_url(libretexts_slug: str) -> str:
return f"https://{libretexts_slug}.libretexts.org"