Add option to set fetcherName for Tika >= 2.0.0
In Tika >= 2.0.0, fetching remote files via the server is done using so-called [fetchers](https://cwiki.apache.org/confluence/display/TIKA/tika-pipes). If you are running a Tika Server that is configured to use an HTTP fetcher, the client must tell the server which fetcher to use by adding the HTTP header `fetcherName` to the request. Furthermore, the URL of the remote file to be fetched must be passed in a `fetchKey` header instead of the `fileUrl` header used with Tika 1.x.x.

This adds a public API method to set the fetcher name, and replaces the `fileUrl` header with `fetcherName` and `fetchKey` if a fetcher name is set. If no fetcher name is set, the `fileUrl` header is still added to the request as usual to keep Tika 1.x.x compatibility.
relthyg committed Aug 14, 2023
1 parent d0db71f commit d760b8d
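
The header selection described in the commit message can be sketched as a standalone function. This is a minimal illustration, not code from the library: `buildRemoteFileHeaders()` is a hypothetical helper that mirrors the branch added to `WebClient::getParameters()`, and the URL and fetcher name are placeholders.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library) that mirrors the header
 * selection this commit adds to WebClient::getParameters().
 *
 * @return string[] header lines in "Name:value" form
 */
function buildRemoteFileHeaders(?string $file, ?string $fetcherName = null): array
{
    $headers = [];

    // Only values starting with "http" are treated as remote files
    if (!empty($file) && preg_match('/^http/', $file)) {
        if ($fetcherName) {
            // Fetcher name set: send the Tika >= 2.0.0 header pair
            $headers[] = "fetcherName:$fetcherName";
            $headers[] = "fetchKey:$file";
        } else {
            // No fetcher name: keep the Tika 1.x.x header
            $headers[] = "fileUrl:$file";
        }
    }

    return $headers;
}

// Placeholder values, for illustration only
print_r(buildRemoteFileHeaders('https://example.com/sample.pdf', 'yourFetcherName'));
print_r(buildRemoteFileHeaders('https://example.com/sample.pdf'));
```

With a fetcher name the function yields the `fetcherName` and `fetchKey` pair. Without one it falls back to `fileUrl`, and for a local path it returns no headers at all.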
Showing 2 changed files with 34 additions and 1 deletion.
11 changes: 11 additions & 0 deletions README.md
@@ -122,6 +122,11 @@ You can use an URL instead of a file path and the library will download the file
**no need** to add `-enableUnsecureFeatures -enableFileUrl` to command line when starting the server, as described
[here](https://wiki.apache.org/tika/TikaJAXRS#Specifying_a_URL_Instead_of_Putting_Bytes).

If you use Apache Tika >= 2.0.0, you *can* [define an HttpFetcher](https://cwiki.apache.org/confluence/display/TIKA/tika-pipes)
and use the option `-enableUnsecureFeatures -enableFileUrl` when starting the server to make the server download remote
files when passing a URL instead of a filename to `$client->getText()`. In order to do so, you must set the name of
the HttpFetcher using `$client->setFetcherName('yourFetcherName')`.

### Methods

Here is the full list of available methods
@@ -254,6 +259,12 @@ $client->setOCRLanguages($languages);
$client->getOCRLanguages();
```

Set HTTP fetcher name (for Tika >= 2.0.0 only, see https://cwiki.apache.org/confluence/display/TIKA/tika-pipes)

```php
$client->setFetcherName($fetcherName);
```

### Breaking changes

Since 1.0 version there are some breaking changes:
24 changes: 23 additions & 1 deletion src/Clients/WebClient.php
@@ -51,6 +51,13 @@ class WebClient extends Client
*/
protected $retries = 3;

/**
* Name of the fetcher to be used (for Tika >= 2.0.0 only)
*
* @var string|null
*/
protected $fetcherName = null;

/**
* Default cURL options
*
@@ -208,6 +215,16 @@ public function setRetries(int $retries): self
return $this;
}

/**
* Set the name of the fetcher to be used (for Tika >= 2.0.0 only)
*/
public function setFetcherName(string $fetcherName): self
{
$this->fetcherName = $fetcherName;

return $this;
}

/**
* Get all the options
*/
@@ -626,7 +643,12 @@ protected function getParameters(string $type, string $file = null): array

 if(!empty($file) && preg_match('/^http/', $file))
 {
-$headers[] = "fileUrl:$file";
+if($this->fetcherName) {
+$headers[] = "fetcherName:$this->fetcherName";
+$headers[] = "fetchKey:$file";
+} else {
+$headers[] = "fileUrl:$file";
+}
 }

switch($type)