Update FAQ with two common questions I see
jhiemstrawisc committed Jan 6, 2025
1 parent 80c2701 commit f3b6eff
Showing 1 changed file with 106 additions and 0 deletions.
106 changes: 106 additions & 0 deletions docs/pages/faq.mdx
@@ -0,0 +1,106 @@
# Frequently Asked Questions & Troubleshooting

## Why do `osdf:///` URLs use 3 slashes instead of two slashes like regular URLs?

URLs, or **Uniform Resource Locators**, play a crucial role in the way computers are able to discover, locate and access digital resources. Since
their broad adoption in the mid 1990s, their structure has become a well-defined internet standard<sup>[1](#triple-slash-fn1)</sup>. To quote their
definition:
> "Uniform Resource Locators" (URLs), in addition to identifying
a resource, provide a means of locating the resource by describing its
primary access mechanism (e.g., its network "location").<sup>[2](#triple-slash-fn2)</sup>

Whether you're trying to access a remote PDF or watch your favorite playlist of YouTube cats, the URL you give your browser conveys important
information about _what_ it's supposed to find, _where_ it can look, and _how_ it should be accessed.

Understanding a URL's basic components will help answer why `osdf:///` URLs typically require triple slashes.

For our purposes, URLs contain three main parts -- a _scheme_, a _hostname_ and a _path_, where these pieces can be loosely defined as
follows<sup>[3](#triple-slash-fn3)</sup>:
1. **scheme**: A URL's _scheme_ tells the computer _how_ something should be accessed. In most cases, this specifies a protocol like `https`, `ftp`,
or in our case `pelican` and `osdf`. Essentially, this tells your computer what "language" it needs to speak to interact with the resource.
2. **hostname**: The URL's _hostname_ tells the computer which remote server might be able to fulfill your request.
3. **path**: The _path_ component of a URL specifies the name of a requested resource from the requested hostname. Typically this is something like
a specific web page or file.

These components are stitched together in a predictable fashion:
```
<scheme>://<hostname>/<path>
```

For example, when you visit `https://docs.pelicanplatform.org/parameters`, you've defined the URL scheme as `https`, the hostname as
`docs.pelicanplatform.org` and the path as `parameters`. Together, these pieces tell your computer to use HTTPS to access the `parameters` page from
Pelican's `docs.pelicanplatform.org` documentation website.
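
If you'd like to see this breakdown for yourself, here is a minimal sketch using Python's standard `urllib.parse` module. It is only an illustration of the three parts described above, not how Pelican itself parses URLs. Note that `urlsplit` keeps the leading slash on the path.

```python
from urllib.parse import urlsplit

# Split the example URL from the text into its components.
parts = urlsplit("https://docs.pelicanplatform.org/parameters")

scheme = parts.scheme      # "https"
hostname = parts.hostname  # "docs.pelicanplatform.org"
path = parts.path          # "/parameters" (urlsplit keeps the leading slash)

print(scheme, hostname, path)
```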

The `pelican`-schemed URLs you use to access objects from Pelican federations follow the same setup, leading to URLs like:
```
pelican://osg-htc.org/some/object
```
Here, you've indicated you want to use the `pelican` protocol to interact with `some/object` from the `osg-htc.org` federation.

However, some URL schemes are inherent to a specific location and don't need a hostname. If you've ever used a browser to open a PDF on your personal
computer, you've likely seen a URL like `file:///some/path/to/file.pdf`. The triple slash after the `file` scheme happens because `file` already presupposes
that the browser needs to get a file from the local machine, so the hostname information isn't needed. It is equally valid to use the URL
`file://localhost/some/path/to/file.pdf`, but that's more to type! Instead, we wind up cutting out the redundant information to yield
> file://~~localhost~~/some/path/to/file.pdf --> file:///some/path/to/file.pdf

Similarly, the `osdf` URL scheme already encodes two pieces of information -- that you're speaking Pelican _and_ you're talking to the OSDF, whose
hostname is `osg-htc.org`. By using `osdf` URLs, you've indicated the object you're interacting with is part of a specific networked system that should
already be understood.

The hostname in the previous `pelican`-schemed URL matches the OSDF's hostname, so it can be rewritten using an `osdf` URL:
> pelican://osg-htc.org/some/object --> osdf://~~osg-htc.org~~/some/object = osdf:///some/object

On the other hand, a URL like `osdf://some/object` has the potential to confuse many clients, because part of the object's name, `some`,
will be interpreted as the federation's hostname.
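
The difference between two and three slashes is easy to check with a generic URL parser. The sketch below uses Python's `urllib.parse`; the `to_osdf_url` helper is hypothetical (it is not part of any Pelican client) and, for brevity, ignores query parameters and fragments.

```python
from urllib.parse import urlsplit

OSDF_HOSTNAME = "osg-htc.org"  # the OSDF hostname named in the text above


def to_osdf_url(pelican_url: str) -> str:
    """Hypothetical helper: rewrite pelican://osg-htc.org/<path> as osdf:///<path>."""
    parts = urlsplit(pelican_url)
    if parts.scheme != "pelican" or parts.hostname != OSDF_HOSTNAME:
        raise ValueError(f"not a pelican URL for the OSDF: {pelican_url}")
    # parts.path starts with "/", so leaving the hostname empty
    # is what produces the third slash.
    return "osdf://" + parts.path


three_slashes = urlsplit("osdf:///some/object")
two_slashes = urlsplit("osdf://some/object")

print(to_osdf_url("pelican://osg-htc.org/some/object"))  # osdf:///some/object
print(repr(three_slashes.netloc), three_slashes.path)     # '' /some/object
print(repr(two_slashes.netloc), two_slashes.path)         # 'some' /object
```

With only two slashes, the parser treats `some` as the hostname and is left with `/object` as the path, which is exactly the confusion described above.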

More information about `pelican` and `osdf` URLs can be found in our [client usage docs](./getting-data-with-pelican/client#the-different-pelican-url-schemes).

> <a id="triple-slash-fn1">**1**</a>: For more information about the structure of URLs, see [RFC 1738](https://www.rfc-editor.org/rfc/rfc1738).<br />
<a id="triple-slash-fn2">**2**</a>: For more information on the difference between URIs, URLs and URNs, see [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986#section-1.1.3).<br />
<a id="triple-slash-fn3">**3**</a>: URLs can also contain things like ports, query parameters and "fragments," and while Pelican makes use of these, they aren't as crucial to understanding
the question at hand.

## Why isn't Pelican using the closest cache(s) when I download objects?
Whenever a Pelican client tries to download an object, one of its first steps is to talk to the appropriate federation's Director, whose job is to
match the client's request to some service(s) that can best fulfill the request. This usually means giving the client an ordered list of caches that the Director
thinks either have the object or that are capable of delivering the object quickly.

By default, Directors order this list by trying to determine the physical distance between the client and any caches in the federation, with closer caches
being assigned higher priority<sup>[1](#geoloc-fn1)</sup>. This troubleshooting guide assumes the Director is configured for distance-based sorting.
There are several ways this process can break.

#### Client Resolution
First, the Director uses the IP address of the incoming client request to generate a lat/long pair and confidence range for the client. It does this by running
the client's IP address through a local database<sup>[2](#geoloc-fn2)</sup>. Issues that can occur at this stage include:
- The IP address reported by the client is invalid, or in a private range (e.g. 192.168.0.12 for IPv4).
- The IP address is valid, but the database doesn't have an entry for it.
- The IP address is valid and has an entry, but the database reports a confidence range greater than 900km.

In any of these cases, the Director will decide it can't reasonably determine where the client is, and it will assign a temporary lat/long pair by picking
a coordinate somewhere in the continental US. This coordinate is cached for a short time (~20 minutes), so subsequent requests from the same client will resolve to the
same spot.
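
To make the fallback behavior concrete, here is a rough sketch of the decision flow described above. It is not the Director's actual implementation: `FAKE_GEO_DB` is a hypothetical stand-in for the geolocation database (its one entry uses made-up coordinates), the continental US bounding box is an assumed approximation, and the function names are invented for illustration. Only the 900km threshold and the roughly 20 minute cache lifetime come from the text.

```python
import ipaddress
import random
import time

MAX_ACCURACY_KM = 900            # confidence threshold from the text
FALLBACK_TTL_SECONDS = 20 * 60   # "~20 minutes" from the text

# Assumption: a rough bounding box for the continental US.
US_LAT_RANGE = (25.0, 49.0)
US_LON_RANGE = (-124.0, -67.0)

# Hypothetical stand-in for the geolocation database:
# ip -> (lat, lon, accuracy_km). The coordinates are made up.
FAKE_GEO_DB = {"8.8.8.8": (40.0, -100.0, 50)}

_fallback_cache = {}  # ip text -> (time assigned, lat, lon)


def is_usable_ip(ip_text):
    """Return False for invalid addresses and for private ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return not ip.is_private


def resolve_client(ip_text, now=None):
    """Return (lat, lon, source), where source is "database" or "fallback"."""
    now = time.time() if now is None else now
    if is_usable_ip(ip_text):
        entry = FAKE_GEO_DB.get(ip_text)
        if entry is not None and entry[2] <= MAX_ACCURACY_KM:
            return entry[0], entry[1], "database"
    # Any failure lands here: reuse a recent fallback coordinate if one exists.
    cached = _fallback_cache.get(ip_text)
    if cached is not None and now - cached[0] < FALLBACK_TTL_SECONDS:
        return cached[1], cached[2], "fallback"
    lat = random.uniform(*US_LAT_RANGE)
    lon = random.uniform(*US_LON_RANGE)
    _fallback_cache[ip_text] = (now, lat, lon)
    return lat, lon, "fallback"


print(resolve_client("8.8.8.8")[2])       # database
print(resolve_client("192.168.0.12")[2])  # fallback
```

Because the fallback coordinate is random, a client that hits this path can see a cache list centered on an arbitrary part of the country, which matches the symptom described in the next paragraph.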

If the list of caches you see being tried looks like it has a geographic center, but not the _correct_ geographic center, you might try determining the IP
address the Director sees when the client contacts it. This can be done by running:
```bash
curl ifconfig.me
```
and running the resulting IP address through [MaxMind's GeoLite City demo](https://www.maxmind.com/en/geoip-demo). If the location it determines is incorrect,
has a large accuracy radius, or appears to be otherwise invalid, that's likely causing a problem.
> **NOTE**: The database used by this demo is not exactly the same database used by the Director. If you see a problem here, there's definitely an issue, but if
this step yields the expected results, there may still be issues with the Director's database. If the list of caches tried by your client(s) appears to have a
geographic center that's incorrect, contact your federation administrators to ask if they can create a manual override for your IP range.

#### Cache/Origin Resolution
Alternatively, the client's location may be known by the Director, but locations for some caches (or even origins) in the list can't be determined. While the
Director can use a client's IP address directly for geo-location, it uses a DNS lookup against cache/origin hostnames to determine IP addresses. Failure to produce
an IP address in this step means something more fundamental is wrong with the cache/origin, and that it should be fixed before receiving any requests. However,
it's still possible that the resolved IP address is incorrect, or has the same types of issues client IPs might have with the MaxMind database. When this happens,
the server should be sorted to the end of the potential list of servers.
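
As a rough illustration of distance-based ordering, the sketch below sorts servers by great-circle distance from the client and pushes any server without a known location to the end of the list. The haversine formula, the server names, and the coordinates are all assumptions chosen for this example; the text above doesn't specify how the Director computes distance.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def sort_servers(client, servers):
    """Sort (name, location) pairs nearest first; location None sorts last."""
    def key(server):
        _name, location = server
        if location is None:
            return (1, 0.0)  # unknown locations go after every known one
        return (0, haversine_km(client, location))
    return sorted(servers, key=key)


# Hypothetical servers; None means the location could not be resolved.
servers = [
    ("cache-unknown.example", None),
    ("cache-san-diego.example", (32.72, -117.16)),
    ("cache-madison.example", (43.07, -89.40)),
]
client = (41.88, -87.63)  # roughly Chicago

ordered = [name for name, _ in sort_servers(client, servers)]
print(ordered)
# ['cache-madison.example', 'cache-san-diego.example', 'cache-unknown.example']
```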

Errors can compound if both of these issues (client _and_ cache/origin geo-location failures) occur. If the cache list you see from the Director has no discernible
geographic center, you should contact your federation administrators for help debugging.

> <a id="geoloc-fn1">**1**</a>: Directors may implement more intelligent cache selection schemes. For a full list of options, see the documentation for the
Director's [`Director.CacheSortMethod`](./parameters.mdx#Director-CacheSortMethod) config parameter.<br />
<a id="geoloc-fn2">**2**</a>: In particular, the Director uses the [MaxMind GeoLite City database](https://www.maxmind.com/en/geolite-free-ip-geolocation-data), which it
updates twice weekly on Wednesdays and Fridays (shortly after the databases are updated upstream by MaxMind).<br />
