Skip to content

Commit

Permalink
Update readme
Browse files Browse the repository at this point in the history
  • Loading branch information
RuedigerVoigt committed Jul 16, 2021
1 parent 016d275 commit a852557
Showing 1 changed file with 103 additions and 51 deletions.
154 changes: 103 additions & 51 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,36 +4,43 @@
![Last commit](https://img.shields.io/github/last-commit/RuedigerVoigt/salted)
![pypi version](https://img.shields.io/pypi/v/salted)
[![Downloads](https://pepy.tech/badge/salted)](https://pepy.tech/project/salted)
[![Coverage](https://img.shields.io/badge/coverage-65%25-yellow)](https://www.ruediger-voigt.eu/coverage/salted/index.html)

Broken hyperlinks are bad for user experience and may hurt SEO.
Salted checks if external links in HTML, Markdown, or TeX files are valid.
Key advantages of this application compared to other linkcheckers are:
* *It is smart.*
* Salted uses a configurable cache. So if your check found some broken links and you fixed them, then the next run will only check the changed links.
* Salted uses a configurable cache. If your check found some broken links and you fixed them within the cache lifetime (default: 24h), then the next run will only check the changed links.
* It normalizes URLs. Because `https://www.example.com/index.html#one` and `https://www.example.com/index.html#two` point to the same page, only one check is performed.
* *It is fast.*
* Many linkcheckers work in a linear way - one link after another. This program spawns many asynchronous worker threats that work in parallel and free up resources while waiting on a server's response.
* Many linkcheckers work in a linear way - one link after another. This applications spawns many asynchronous worker threats that work in parallel and free up resources while waiting on a server's response.
* Salted is much faster and can check dozens of links *per second* (depending on your connection).
* *Salted can be used stand-alone or in a CI pipeline.*
* The result can be written to standard out / the command line or to a file.
* It can raise an exception in case it found broken links.
* You can use salted as a library or as a command line script.
* *The result can be styled using Jinja2 templates.*
* Two default templates (for the command line and for Markdown) are available.
* You can use your own templates.

## Example

1. All files you want to check have to be in one directory. Subdirectories will be crawled.
2. Open a Python shell:
All files you want to check have to be in one directory. Subdirectories will be crawled.

Assuming the files you want to check are located in the "homepage" folder.

Open the command line:
```
cd /folder_above_homepage
salted -i ./homepage/
```
*Alternatively* open a Python shell:
```python
import logging
import salted
logging.basicConfig(level=logging.INFO)

# if there is no cache file, a new one will be created:
linkcheck = salted.Salted(cache_file='./salted-cache.sqlite3')

# Assuming there is a folder 'homepage' within your current working directory:
linkcheck = salted.Salted()
linkcheck.check('./homepage/')
```
Two runs in a row (i.e. one full check and one using the cache):
Expand All @@ -43,7 +50,7 @@ Salted automatically recognizes supported file formats by their extension (i.e.

## Installation

This application uses `aiohttp` to realize its high network speed. Python 3.7 - which is as of November 2020 still used by Debian 10 'Buster' - seems to cause an [issue](https://github.com/aio-libs/aiohttp/issues/3535). As Ubuntu and others have moved on to newer versions of Python, that will be not looked into. **So you need Python 3.8. or newer.**
Python 3.7 causes an [issue](https://github.com/aio-libs/aiohttp/issues/3535). **You need Python 3.8. or newer.**

You can check your Python version this way:
```bash
Expand All @@ -55,75 +62,120 @@ python -v
`Salted` is built with `aoihttp`.
If installing `aiohttp` fails because `multidict` does not install you need a C-Compiler present or need to install the binary. Please look at [multidict's documentation](https://github.com/aio-libs/multidict).

Aside from those issues installation is easy:
Aside from this installation is easy:

```bash
sudo pip3 install salted
```

## Using salted
*The installation via pip / pip3 install the library salted AND registers it as a command line script in the path. So you can just call salted in the terminal.*

## Salted: Supported File Formats

You need a very small Python script to run salted:
*Salted does not yet check relative links in any file-format.*

```python
import logging
import salted
* **HTML** : Standard hyperlinks / anchors are checked. Salted does not yet check relative links or `src` attributes of pictures. Mailto links are not yet checked.
* **Markdown** : The pandoc version as well as GitHub flavored markdown are supported.
* **TeX** : salted recognizes `\url{url}` as well as `\href{url}{text}`, but the hyperref option `baseurl` is ignored.

# This displays all messages of level info or above on your screen. You could write the log output to a file.
logging.basicConfig(level=logging.INFO)
## Running salted from the command line

# If there is no cache file, a new one will be created. You need to provide a path / name:
linkcheck = salted.Salted(cache_file='./choose-a-name.sqlite3')
Once salted is installed with pip, it registers itself as a command line script and is available in your path. So open a command line, switch into the directory you want to check and try:
```bash
# Check all supported files within this directory and it subdirectories.
# Output result to the command line.
salted -i ./
```
On the command line salted supports all parameters. To get an overview, simply type `salted -h` and it will display this help message with all availbale options.

# Salted assumes all your files are in one folder or subfolders of that:
linkcheck.check('./path_to_your_files/')
```
usage: salted [-h] [-i <path>] [--file_types {supported,html,tex,markdown}] [-w <num>] [--timeout <seconds>]
[--raise_for_dead_links <True/False>] [--user_agent <str>] [--cache_file <path>]
[--dont_check_again_within_hours <hours>] [--template_searchpath <path to folder>] [--template_name <filename>]
[--write_to <path>] [--base_url https://www.example.com]
Salted is an extremly fast link checker. It works with HTML, Markdown and TeX files. Currently it only checks external links.
You are using version 0.7.0.
optional arguments:
-h, --help show this help message and exit
-i <path>, --searchpath <path>
File or Folder to check (default: current working directory)
--file_types {supported,html,tex,markdown}
Choose which kind of files will be checked.
-w <num>, --num_workers <num>
The number of workers to use in parallel (default: automatic)
--timeout <seconds> Number of seconds to wait for an answer of a server (default: 5).
--raise_for_dead_links <True/False>
True if dead links shall rise an exception (default: False).
--user_agent <str> User agent to identify itself. (Default: salted / version)
--cache_file <path> Path to the cache file (default: salted-cache.sqlite3 in the current working directory)
--dont_check_again_within_hours <hours>
Number of hours an already verified URL is considered valid (default: 24).
--template_searchpath <path to folder>
Path to *folder* in which the template file can be found.
--template_name <filename>
Name of the template file.
--write_to <path> Either 'cli' to write to standard out or a path (default: cli)
--base_url https://www.example.com
The file system path to the checked folder is replaced with this URL in template outputs.
For more information see: https://github.com/RuedigerVoigt/salted
Salted uses the `logging` module instead of the `print()` command, because:
* You may want to sent the output to a file or another destination.
* By adapting the log level you can decide which information to see.
```

### Supported File Formats
## Using salted as a Python library

*Salted does not yet check relative links in any file-format.*
If you want to use `salted` not as a CLI script but as a Python library, a small script is all you need:

* **HTML** : Standard hyperlinks / anchors are checked. Salted does not yet check relative links or `src` attributes of pictures.
* **Markdown** : The pandoc version as well as GitHub flavored markdown are supported.
* **TeX** : salted recognizes `\url{url}` as well as `\href{url}{text}`, but the hyperref option `baseurl` is ignored.
```python
import logging
import salted

# This displays all messages of level info or above on your screen.
# You could write the log output to a file.
logging.basicConfig(level=logging.INFO)

### Initializing
# Initializing salted by creating an object
linkcheck = salted.Salted()

The first thing to do after importing the libraries is initializing salted by creating an object. As above:
```python
linkcheck = salted.Salted(cache_file='./choose-a-name.sqlite3')
# Now you can set parameters to specifc values, for example:
linkcheck.timeout = 10

# Salted assumes all your files are in one folder or subfolders of that.
# Simply call the check function of the instance just created:
linkcheck.check('path_to_your_files/')
```
The only necessary parameter is to provide a path at which to store your cache file.

There are some optional parameters:
* `workers` defaults to automatic, which lets salted choose how many workers to start. You can set a specific number of workers. *This is not depended on the number of cores your system has, but more so dependent on the number of URLs to check!* Once a worker has sent a request it awaits the answer and meanwhile other workers can check other URLs. For example: A machine with 4 cores on a standard home connection should work fine with 32 or more workers.
* `timeout_sec`: The number of seconds to wait for a server to answer the request. This is necessary as some servers do not answer and a single one of those would block the check. This defaults to 5 seconds.
* `dont_check_again_within_hours`: the cache lifetime in full hours. If a link was valid this number of hours ago, salted assumes it is still valid and will not check it again. This defaults to 24 hours.
* `raise_for_dead_links`: if set to `True` salted will raise an exception in case it finds obviously dead links that yield a HTTP status code like 404 ('Not found) or 410 ('Gone'). That behavior is useful for a publication workflow. It will *not* raise an exception for links it could not check as some servers block requests.
* `user_agent`: sets the 'User-Agent' field of the HTTP header. This defaults to 'salted' if not set.
This starts the check. By default the results will be displayed on the command line interface you are using.

### Running the check
## Using a Configuration File

Once you initialized salted you can use the object to call the `check_links' function.
## Parameters / Initializing

As above:
```python
# linkcheck is the object we just initialized
linkcheck.check('./path_to_your_files/')
```
All versions of salted use the same parameters. Their categories are only important for config files:

The only necessary parameter is the path to a specific file or folder which contains your files. This is enough for salted to find all files within that and its subfolders whose filename ends on `.htm` or `.html`.
* Category "FILES":
* `searchpath`: Path to file or folder to check (default: current working directory)
* `file_types`: Choose which types of files to check. Values can be 'supported' (all formats known to salted), 'html', 'tex', or 'markdown'.
* Category "BEHAVIOR":
* `num_workers` defaults to automatic, which lets salted choose how many workers to start. You can set a specific number of workers. *This is not depended on the number of cores your system has, but more so dependent on the number of URLs to check!* Once a worker has sent a request it awaits the answer and meanwhile other workers can check other URLs. For example: A machine with 4 cores on a standard home connection should work fine with 32 or more workers.
* `timeout`: The number of seconds to wait for a server to answer the request. This is necessary as some servers do not answer and a single one of those would block the check. This defaults to 5 seconds.
* `raise_for_dead_links`: if set to `True` salted will raise an exception in case it finds obviously dead links that yield a HTTP status code like 404 ('Not found) or 410 ('Gone'). That behavior is useful for a publication workflow. It will *not* raise an exception for links it could not check as some servers block requests.
* `user_agent`: sets the 'User-Agent' field of the HTTP header. This defaults to 'salted' if not set.
* Category "CACHE":
* `cache_file`: Path to the cache file. Default is `salted-cache.sqlite3` in the current working directory.
* `dont_check_again_within_hours`: The cache lifetime in full hours. If a link was valid this number of hours ago, salted assumes it is still valid and will not check it again. This defaults to 24 hours.
* Category "TEMPLATE":
* `template_searchpath`: In case you want to use a custom template, this has to be the path to the *folder* in which the template file can be found.
* `template_name`: The name of the template file. Built-In templates are `default.md.jinja` (for markdown output) and `default.cli.jinja` (for text output on the command line).
* `write_to`: Default is 'cli' to write to standard out. Alternatively this accepts a file path.
* `base_url`: The file system path to the checked folder is replaced with this URL in template outputs. So for example if you check the folder `/home/username/homepage/` and in the file `index.html` has a broken link, then the path could be changed to `https://www.example.com/index.html`.

This starts the check. By default the results will be displayed on the command line interface you are using.

### Style the Output and Write to File
## Style the Output and Write to File

However you can style the output using templates and can choose to write it to a file.
You can style the output using templates and can choose to write it to a file.

There are two builtin templates for salted:
* `default.cli.jinja`: formats the output for display on the command line interface.
Expand Down

0 comments on commit a852557

Please sign in to comment.