Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WARC-IP-Address header to WARCWriterChainProcessor #396

Closed
anjackson opened this issue May 31, 2021 · 4 comments · Fixed by #409
Closed

Add WARC-IP-Address header to WARCWriterChainProcessor #396

anjackson opened this issue May 31, 2021 · 4 comments · Fixed by #409
Labels

Comments

@anjackson
Copy link
Collaborator

The original WARCWriterProcessor added a WARC-IP-Address header if the IP of the host is known:

headers.addLabelValue(HEADER_KEY_IP, getHostAddress(curi));

The newer WARCWriterChainProcessor uses a builder for HTTP responses, but this does not include the IP address:

public WARCRecordInfo buildRecord(CrawlURI curi, URI concurrentTo) throws IOException {

Other builders do, like this:

String ip = (String)curi.getData().get(A_DNS_SERVER_IP_LABEL);
if (ip != null && ip.length() > 0) {
recordInfo.addExtraHeader(HEADER_KEY_IP, ip);
}

It's not clear the original complexity is needed:

protected String getHostAddress(CrawlURI curi) {
// special handling for DNS URIs: want address of DNS server
if (curi.getUURI().getScheme().toLowerCase().equals("dns")) {
return (String)curi.getData().get(A_DNS_SERVER_IP_LABEL);
}
// otherwise, host referenced in URI
// TODO:FIXME: have fetcher insert exact IP contacted into curi,
// use that rather than inferred by CrawlHost lookup
CrawlHost h = getServerCache().getHostFor(curi.getUURI());
if (h == null) {
throw new NullPointerException("Crawlhost is null for " +
curi + " " + curi.getVia());
}
InetAddress a = h.getIP();
if (a == null) {
throw new NullPointerException("Address is null for " +
curi + " " + curi.getVia() + ". Address " +
((h.getIpFetched() == CrawlHost.IP_NEVER_LOOKED_UP)?
"was never looked up.":
(System.currentTimeMillis() - h.getIpFetched()) +
" ms ago."));
}
return h.getIP().getHostAddress();
}

But if the IP is handy, it should be included in the HTTP response.

@ato
Copy link
Collaborator

ato commented Jun 23, 2021

I don't think the IP is handy - A_DNS_SERVER_IP_LABEL is only populated for DNS requests. This explains why the original code looks up the address in the cache.

We could probably do what that TODO:FIXME comment suggests though and have RecordingHttpClientConnection stash the exact ip in the Recorder since it has access to the underlying Socket object when it wraps the input and output streams for recording.

@ato
Copy link
Collaborator

ato commented Jun 23, 2021

Oh except Recorder is in webarchive-commons. Gah.

@ato
Copy link
Collaborator

ato commented Jun 25, 2021

Looks like DNS records are also being written with ip/port label like /127.0.0.53:53 rather than just the IP address:

WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:localhost
WARC-Date: 2021-06-25T04:04:35Z
WARC-IP-Address: /127.0.0.53:53
WARC-Record-ID: <urn:uuid:38d9d425-3a4f-4158-86bf-8a5862ecdd0b>
Content-Type: text/dns
Content-Length: 44
``

@ato
Copy link
Collaborator

ato commented Jun 25, 2021

It also turns out BaseWARCRecordBuilder.getHostAddress(curi) helper is completely broken with the default config since the RecordBuilders are instantiated by Java code rather than Spring XML so the ServerCache is never autowired into them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants