Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support http-path in index provider address #117

Open
bajtos opened this issue Feb 13, 2025 · 12 comments
Open

Support http-path in index provider address #117

bajtos opened this issue Feb 13, 2025 · 12 comments
Assignees

Comments

@bajtos
Copy link
Member

bajtos commented Feb 13, 2025

Some IPNI index providers serve advertisements at a non-root location. We need to improve piece-indexer to support such providers.

Example provider:

https://cid.contact/providers/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

Address in the multiaddr format:

/dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

Base URL we need to build from the above multiaddr:

https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

Full advertisement URL:
https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5/ipni/v1/ad/baguqeerakqnjugtecgj5hhfmnh46pthk5k3vy6vf4bw3vpmwmudknlatun5a

Current piece-indexer ingestion status:
https://pix.filspark.com/ingestion-status/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

Index provider advertises over an unsupported protocol:
/dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
@bajtos
Copy link
Member Author

bajtos commented Feb 13, 2025

FWIW, the example provider shown above is running Curio.

❯ libp2p-lookup direct --address /dns/f03303347-market.duckdns.org/tcp/443/wss
Local peer id: 12D3KooWFNh7x5CaD36A96iiwkz7h4iHZWXb5GF5gCzZDRRbWZDm
Lookup for peer with id PeerId("12D3KooWCtiN7tAjeLKL4mashteXdH4htUrzWu8bWN9kDU3qbKjQ") succeeded.

Protocol version: ""
Agent version: "curio-1.24.4+mainnet+git_da64b37_2025-02-10T15:57:50+00:00"
Observed address: /ip4/192.168.0.101/tcp/39418/ws
Listen addresses:
        - /dns/f03303347-market.duckdns.org/tcp/443/wss
Protocols:
        - /fil/retrieval/transports/1.0.0
        - /fil/storage/ask/1.1.0
        - /fil/storage/mk/1.2.0
        - /fil/storage/mk/1.2.1
        - /fil/storage/status/1.2.0
        - /ipfs/id/1.0.0
        - /ipfs/id/push/1.0.0
        - /ipfs/ping/1.0.0
        - /libp2p/autonat/1.0.0

@NikolasHaimerl
Copy link

@bajtos I have a question for understanding.
The way that the piece indexer currently resolved the path is that it takes a given multiaddr format like so:
/dns/f03303347-market.duckdns.org/https/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
and then pieces together the URL with a regex pattern that matches the pattern: /^https?:\/\//.
If we look at the curio pattern we see that .../https/http-path... is in the path instead of only .../https/..... This would lead me to believe the way we can handle this is to cut out the part that indicates the http-path. Are my assumptions correct?

@bajtos
Copy link
Member Author

bajtos commented Feb 19, 2025

Great question.

then pieces together the URL with a regex pattern that matches the pattern: /^https?:///.

My understanding is different. That regexp is used to check if the index provider talks HTTP(s). I don't think we need any changes there.

We are converting the multiaddr string to a URL in multiaddrToHttpUrl, see indexer/lib/vendored/multiaddr.js. That file is mirrored from spark-checker, so it's best to make the changes in spark-checker first and then mirror them to this repository. (If you do so, then please remember to mention CheckerNetwork/roadmap#239 in the spark-checker pull request.)

You can learn more about multiaddr and http-path here:

we can handle this is to cut out the part that indicates the http-path.

We must preserve the http-path segment and include it in the base URL.

Test vector:

  • Input:
    /dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
    
  • Output:
    https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
    

@NikolasHaimerl
Copy link

NikolasHaimerl commented Feb 19, 2025

I see , thanks for clarifying. So if we convert a regular Addr field like for example: '/ip4/103.166.228.25/tcp/6122/http' to "http://103.166.228.25:6122"
we need to convert
/dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5 to https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5 and that would be the output of a function which does something similar to multiaddrToHttpUrl but for http-path protocols?

@NikolasHaimerl
Copy link

@bajtos is there a reason we never used the multiaddress javascript implementation to resolve multiaddresses?

@NikolasHaimerl
Copy link

Context

The error that the protocol is unsupported occurs while processing the next advertisement. It expects a string that matches the pattern /^https?:\/\//. This indicates that we are only looking for protocols that support http or https. If we follow the stack trace to where the providerAddress inside of ProviderInfo is created we will see that the function getProvidersWithMetadata creates the ProviderInfo and the ProviderAddress is set here and here. The former sets the multiaddress string and the latter sets the actual https?:// link. If the latter fails then the mutliaddress string will remain and thus lead to a failure while processing the processing the next advertisement. To receive a link which will not raise an error, we need to account for http-path multiaddresses inside of multiaddrToHttpUrl.

Implementation Plan

There exists a dedicated javascript library for dealing with multiaddresses. We can use it to handle the different forms of multiaddresses and give us information about which protocols are used and what the values for these protocols are. There are multiple ways how to extract the information stored in the multiaddresses. One way is to use the function stringTuples, which yields an array of tuples containing the code for a certain protocol and the value for that protocol. You can find the codes for each protocol here. Similarly to how the function multiaddrToHttpUrl currently fails if neither http nor https are supported we can make the new implementation of it fail if the code for http or https is not included in the list of tuples. From the tests for the js-multiaddr library I derived that one can decode the URI from the value of the code 481 by calling decodeURIComponent. This is how we can extract the path that is embedded in the http-path multiaddress. According to the documentation the http-path should be appended to an existing multiaddress which means that the part that extracts the host, port, etc. should be handled the same way whether there exists an http-path multiaddress inside the overall mutliaddress or not.

I am keen to hear your feedback on the proposed changes @bajtos @juliangruber @pyropy

@bajtos
Copy link
Member Author

bajtos commented Feb 20, 2025

I think I understand now where the confusion comes from.

The error message makes it look like the problem is in the code checking whether the address matches /^https?:///

Index provider advertises over an unsupported protocol:
/dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

However, that's not the real cause. Take a look here:

let providerAddress = p.Publisher.Addrs[0]
try {
providerAddress = multiaddrToHttpUrl(providerAddress)
} catch (err) {
debug('Cannot convert address to HTTP(s) URL (provider: %s): %s', providerId, err)
}

If we cannot convert the multiaddress to URL (multiaddrToHttpUrl throws), we store the original multiaddr as providerAddress.

You can think of providerAddress as a Rust enumeration type:

enum ProviderAddress {
  Supported { url: string },
  Unsupported { multiaddr: string },
}

All packed into a single JavaScript string variable, where the check /^https?:\/\// is used to discriminate between supported vs unsupported values.

In hindsight, it was not the best design decision on my side.


To implement support for http-path, I think it will be faster to improve our existing parser (the function multiaddrToHttpUrl ) to support an http-path segment.
Here is the place where to start:
https://github.com/CheckerNetwork/spark-checker/blob/d16b3a11b4d7243e5fd51f02d725f9ceae16f53c/lib/multiaddr.js#L22-L27

Essentially, we want to support two cases:

  • rest is an empty array (current implementation)
  • rest is ['http-path', pathValue] (the new flavour we need to support)

@bajtos
Copy link
Member Author

bajtos commented Feb 20, 2025

When I was looking into conversion of multiaddr into HTTP(S) URL, I reviewed existing npm packages and found them unsuitable to our needs: 1) how they handle non-http protocols like TCP 2) how easy it is to interpret parsing errors.

Here are test-cases covering those requirements:

I did not find a js-multiaddr API I could use to do the conversion.

I was looking at https://www.npmjs.com/package/@multiformats/multiaddr-to-uri, too. This is what they have in their README:

console.log(multiaddrToUri('/ip4/127.0.0.1/tcp/8080'))
// -> http://127.0.0.1:8080/

That's a problem for Spark, because an address like /ip4/127.0.0.1/tcp/8080 typically indicates a Graphsync endpoint.

@bajtos
Copy link
Member Author

bajtos commented Feb 20, 2025

Implementation Plan

There exists a dedicated javascript library for dealing with multiaddresses. We can use it to handle the different forms of multiaddresses and give us information about which protocols are used and what the values for these protocols are. There are multiple ways how to extract the information stored in the multiaddresses. One way is to use the function stringTuples, which yields an array of tuples containing the code for a certain protocol and the value for that protocol. You can find the codes for each protocol here. Similarly to how the function multiaddrToHttpUrl currently fails if neither http nor https are supported we can make the new implementation of it fail if the code for http or https is not included in the list of tuples. From the tests for the js-multiaddr library I derived that one can decode the URI from the value of the code 481 by calling decodeURIComponent.

I think such solution would work, but it seems like a lot of extra work to me - wouldn't you have to write most parts of multiaddrToHttpUrl again?

See #117 (comment) for my suggestion for a solution that would IMO require less work.

@NikolasHaimerl
Copy link

NikolasHaimerl commented Feb 20, 2025

Here's the structured GitHub comment with the second part in Bash formatting:


@bajtos Thank you for your detailed answer!

When I first examined how multiaddrToHttpUrl was implemented—especially the way we split the multiaddress here—it seemed that we are making quite a few assumptions about the structure of a multiaddress.

For example, consider the two cases that multiaddrToHttpUrl should support after adding support for http-path:

  1. /ip4/103.166.228.25/tcp/6122/http
    • This fits our assumption well, as http (called scheme in the code) is at the 5th position, and rest appears at the 6th position when splitting by /.
  2. /dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
    • In this case, https appears at the 3rd position, and there is no tcp port defined. So our assumption on where which protocol is defined would be wrong.

If we only consider rest to look for the http-path, we might miss http-path in one of the prior parts of the split. Given these assumptions, I started looking for existing solutions to handle any multiaddr, assuming that since this standard was created by someone, there would also be a library to process it properly. That led me to js-multiaddr.

I attempted a quick integration of js-multiaddr into spark-checker, but I couldn't get it to pass all checks immediately. While researching multiformats, I noticed that http-path is defined as a concatenation of an existing multiaddress. We could leverage this by splitting the original multiaddress at http-path:

/dns/f03303347-market.duckdns.org/https/

Would be processed as usual, while the second part:

http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

Contains the encoded path, which we can extract using decodeURIComponent.

This approach should be a minimal change and avoids the added complexity of introducing js-multiaddr.

WDYT?

@bajtos bajtos moved this from 📥 next to 📋 planned in Space Meridian Feb 21, 2025
@bajtos
Copy link
Member Author

bajtos commented Feb 21, 2025

Here's the structured GitHub comment with the second part in Bash formatting:

@bajtos Thank you for your detailed answer!

When I first examined how multiaddrToHttpUrl was implemented—especially the way we split the multiaddress here—it seemed that we are making quite a few assumptions about the structure of a multiaddress.

For example, consider the two cases that multiaddrToHttpUrl should support after adding support for http-path:

  1. /ip4/103.166.228.25/tcp/6122/http

    • This fits our assumption well, as http (called scheme in the code) is at the 5th position, and rest appears at the 6th position when splitting by /.
  2. /dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5

    • In this case, https appears at the 3rd position, and there is no tcp port defined. So our assumption on where which protocol is defined would be wrong.

Thank you for a detailed explanation. You are right, our current implementation in multiaddrToHttpUrl does handle the case when the multiaddr does not contain /tcp/{port} pair. I did not realise that.

If we only consider rest to look for the http-path, we might miss http-path in one of the prior parts of the split. Given these assumptions, I started looking for existing solutions to handle any multiaddr, assuming that since this standard was created by someone, there would also be a library to process it properly. That led me to js-multiaddr.

I was thinking the same back in the day, that's how I discovered multiaddr-to-uri. (And found it's limitations.)

I attempted a quick integration of js-multiaddr into spark-checker, but I couldn't get it to pass all checks immediately. While researching multiformats, I noticed that http-path is defined as a concatenation of an existing multiaddress. We could leverage this by splitting the original multiaddress at http-path:

/dns/f03303347-market.duckdns.org/https/
Would be processed as usual, while the second part:

http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5
Contains the encoded path, which we can extract using decodeURIComponent.

This approach should be a minimal change and avoids the added complexity of introducing js-multiaddr.

I like the idea to split the original multaddress at http-path 👍🏻

However, we still need to fix the current code parsing the protocol+hostname+port. It should support values like /dns/f03303347-market.duckdns.org/https (with no port specified and no http-path pair). If I remember correctly, Curio uses such multiaddress as the retrieval provider address advertised to IPNI. (This will be covered by CheckerNetwork/spark-checker#112).

I see two tasks here:

  1. Modify multiaddrToHttpUrl() to handle values like /dns/f03303347-market.duckdns.org/https.
  2. Modify multiaddrToHttpUrl() to handle http-path, regardless of how the protocol+hostname+port is specified.

Considering the original goal of this issue - support Curio multiaddr like /dns/f03303347-market.duckdns.org/https/http-path/%2Fipni-provider%2F12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5, I propose to implement your suggestion handling http-path and defer the second item for later, until we confirm it's needed.

In other words, I agree with your proposal 👍🏻

@NikolasHaimerl
Copy link

NikolasHaimerl commented Feb 21, 2025

Sounds good, I created a POC for what we just discussed and all tests are passing including a new case for the curio example. I will work on more thorough testing next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 📋 planned
Development

No branches or pull requests

2 participants