Diskover v2 Usage
- Hardware Requirements (Essential version)
- License files(Essential version)
- Config files
- Building indices
- Add additional directory tree to existing index (Essential version)
- Find duplicate files (dupes) in existing index (Pro version)
- Auto-tagging an existing index (Pro version)
- Copy tags from one index to another (Pro version)
- Add extra meta data using crawl Plugins
- Media Info plugin (Media Edition)
- Log Warnings
- Index Lifecycle Management
- Index State Management on AWS ES
- Indexing Shell Script Example (Pro version)
- diskover-web Task Panel (Essential version)
- diskover-web Indexing Tasks (Essential version)
- Run diskoverd as a service (Essential version)
- Indexing AWS S3 buckets (Essential version)
- Diskover-web API (Pro version)
- Restrict diskover-web API Access (Pro version)
- Securing Elasticsearch and Kibana
- Optimizing (Essential version)
- Troubleshooting
For smaller environments:
- Elasticsearch hosts: 8 CPU cores, 16 GB RAM (8 GB reserved for ES memory heap) (1 ES node for testing, minimum 3 for production)
- diskover-web web server: 2 CPU cores, 4 GB RAM
- indexing host(s): 4 CPU cores, 4 GB RAM
For larger environments:
- Elasticsearch hosts: 16 CPU cores, 32 GB RAM (16 GB reserved for ES memory heap) (1 ES node for testing, minimum 3 for production)
- diskover-web web server: 4 CPU cores, 8 GB RAM
- indexing host(s): 8-16 CPU cores, 8 GB RAM
Note: It is recommended to separate ES, web server and indexing host(s). Indices ideally should be on SSD. NFS data stores do not usually perform well for indices.
The diskover config file is located in:
- Linux:
~/.config/diskover/config.yaml
- macOS:
~/Library/Application Support/diskover/config.yaml
- Windows:
%APPDATA%\diskover\config.yaml
Default configs are located in configs_sample/. There are separate configs for the diskover crawler, autotag, dupes-finder, diskoverd, etc.
The default config files are not used directly by the diskover crawler, etc.; they are default/sample configs and need to be copied to the appropriate directory based on the OS.
For example, on Linux the config files are in ~/.config/<appName>/config.yaml. Each config file has an appName setting that matches the directory name where the config file is located. For the diskover dupes-finder, for example, this would be ~/.config/diskover_dupesfinder/config.yaml.
Use spaces in config files, not tabs.
If you get an error message when starting diskover.py like Config ERROR: diskover.excludes.dirs not found, check config for errors or missing settings from default config, check that your config file is not missing any lines from the default/sample config and that there are no errors in your config such as missing values.
To use an alternate config file, set an environment variable to the directory containing the config.yaml. Example on Linux for diskover:
export DISKOVERDIR=/path/altconfigdir
The environment variable name depends on the appName setting in the config. Example on Linux to set an alternate config directory for diskover_dupesfinder, which has appName: diskover_dupesfinder in its config:
export DISKOVER_DUPESFINDERDIR=/path/altconfigdir
License files should be copied to/located in:
- diskover:
/opt/diskover/diskover.lic
- diskover-web:
/var/www/diskover-web/src/diskover/diskover-web.lic
License permissions:
Check that the diskover-web.lic file is owned by the nginx user and that permissions are 644:
chown nginx:nginx diskover-web.lic && chmod 644 diskover-web.lic
Run a crawl in foreground printing all log output to screen:
python3 diskover.py -i diskover-<indexname> <tree_dir>
See all cli options:
python3 diskover.py -h
- Multiple directory trees (tree_dir) can be set to index multiple top paths into single index (Essential version).
- UNC paths and drive maps are supported in Windows.
- Index name requires the diskover- prefix. Recommended index name: diskover-<mountname>-<datetime>
- Index name is optional; by default indices will be named diskover-<treedir>-<datetime>
On Linux or macOS, to run a crawl in the background and redirect all output to a log file:
nohup python3 diskover.py ... > /var/log/<logname>.log 2>&1 &
- Log settings including log level (logLevel) and logging to a file (logToFile) instead of screen can be found in diskover config.
To add an additional directory tree to an existing index, use the -a option:
python3 diskover.py -i diskover-<indexname> -a <tree_dir>
The advantage of running multiple index tasks is speed: you can run them in parallel (in the background or on separate indexing machines), and you don't have to wait for a long directory tree to finish scanning before the index is usable in diskover-web, etc. For example:
diskover.py -i diskover-nas1 /mnt/stor1
diskover.py -i diskover-nas2 /mnt/stor2
will be better than
diskover.py -i diskover-nas /mnt/stor1 /mnt/stor2
since stor2 may have many more files/directories and you won't be able to use the diskover-nas index until both finish scanning.
diskover uses threads for walking a directory tree. For example, if maxthreads in the diskover config is set to 20, up to a maximum of 20 sub directories under the index top path (top directory path / mount point / volume) can be scanned and indexed at once. This matters if you have very many or very few sub directories at level 1 in /mnt/toppath. If /mnt/toppath has only a single sub directory at level 1, crawls will be slower since there will only ever be 1 thread running.
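To see how many sub directories exist at level 1 of a top path (and therefore roughly how many threads a crawl of it can use), you could count them with a quick shell one-liner, for example:
ls -d /mnt/toppath/*/ | wc -l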
Check that you have the config file in ~/.config/diskover_dupesfinder/config.yaml, if not, copy from default config folder in configs_sample/diskover_dupesfinder/config.yaml.
Post indexing plugins, such as the dupes finder, are located in the plugins_postindex/ directory.
To use the default hashing mode (xxhash), you will first need to install the xxhash Python module:
pip3 install xxhash
Run dupes finder for a completed index:
python3 diskover-dupesfinder.py diskover-<indexname>
Run dupes finder for multiple completed indices:
python3 diskover-dupesfinder.py diskover-<indexname1> diskover-<indexname2>
See all cli options:
python3 diskover-dupesfinder.py -h
For example, using the diskover-cache sqlite3 db with -u to store file hashes, or using an existing index to look up hashes with -U, can save time by not having to hash the same file again. -a will add all file hashes generated to the index, not just the ones for dupe files found.
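For example (index names are placeholders), to find dupes in a new index while reusing hashes from a previous index and adding all generated hashes to the new index:
python3 diskover-dupesfinder.py diskover-<indexname> -U diskover-<previous_indexname> -a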
Calculating file hash checksums is an expensive cpu/disk operation so you may want to control what files in the index you want to hash. There are various config options for limiting what gets hashed and marked as a dupe in the index (is_dupe set to True). The ES index fields for file type that get updated are hash and is_dupe.
Config settings:
- hash mode, either xxhash or md5
- file extensions you want to hash
- minsize and maxsize of files to hash
- an additional ES query (otherquery) used when searching the index for which files to hash
- replacepaths for translating paths from source (index path) to destination path, for example on Windows translating /z_drive/ to Z:\
Check that you have the config file in ~/.config/diskover_autotag/config.yaml, if not, copy from default config folder in configs_sample/diskover_autotag/config.yaml.
Post indexing plugins, such as auto tag, are located in the plugins_postindex/ directory.
python3 diskover-autotag.py diskover-<indexname>
- Auto-tagging rules for tagging files and directories can be found in the diskover-autotag config file
- Auto-tagging can also be done during the crawl by enabling autotag in the diskover config and setting the rules in the diskover config file
- You can also tag manually in diskover-web or by using the diskover-web API (see the example after this list)
- All tags are stored in the tags field in the index; there is no limit to the number of tags
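For example, a minimal sketch of tagging files through the diskover-web API with Python requests (see the Diskover-web API section later on this page for the full endpoint reference; the index name and file paths below are placeholders):
import json
import requests

url = "http://localhost:8000/api.php"
index = "diskover-<indexname>"  # placeholder index name

# tag two files with the "archive" tag using the tagfiles endpoint
d = {"tags": ["archive"], "files": ["/mnt/stor1/file1.png", "/mnt/stor1/file2.png"]}
r = requests.put('%s/%s/tagfiles' % (url, index), data=json.dumps(d))
print(r.text)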
Check that you have the config file in ~/.config/diskover_tagcopier/config.yaml, if not, copy from default config folder in configs_sample/diskover_tagcopier/config.yaml.
Post indexing plugins, such as tag copier, are located in the plugins_postindex/ directory.
python3 diskover-tagcopier.py diskover-<source_indexname> diskover-<dest_indexname>
See all cli options:
python3 diskover-tagcopier.py -h
You can add extra meta data, for example unix permissions, to an index during crawl time by adding a plugin to Diskover crawler. Plugins are stored in plugins/ folder in the root directory of diskover. There are a few examples in the plugins folder to get you started.
For example, a unix permissions plugin could allow you to see all open permissions (rwx) for files and directories. Some other examples are database lookups to apply extra tags, content indexing that tags a file if a keyword is found, or copying/backing up a file if it matches certain criteria, etc. This is all done at crawl time.
To create a plugin:
- make a directory in plugins with the name of the plugin, for example myplugin
- create a file in the myplugin directory named __init__.py
- copy the code from one of the example plugins and edit it to create your plugin. There are six required function names, but they can be edited however you want as long as the return value type stays the same (a minimal skeleton is also sketched after the list below).
The six required function names for plugins are:
- add_mappings
- add_meta
- add_tags
- for_type
- init
- close
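A minimal skeleton of a plugin __init__.py is sketched below. The function signatures and return values shown here are assumptions for illustration only; copy one of the example plugins shipped in the plugins/ folder for the authoritative versions.
# plugins/myplugin/__init__.py - illustrative skeleton only; signatures are assumptions,
# use an example plugin from the plugins/ folder as your real starting point.

def add_mappings(mappings):
    """Add any extra Elasticsearch field mappings this plugin needs (assumed signature/structure)."""
    mappings['mappings']['properties']['myplugin_field'] = {'type': 'keyword'}
    return mappings

def add_meta(path, osstat):
    """Return a dict of extra meta data to add to the doc for path (assumed signature)."""
    return {'myplugin_field': 'example value'}

def add_tags(metadict):
    """Return a list of extra tags for the doc, or None for no tags (assumed signature)."""
    return None

def for_type(doc_type):
    """Return True if the plugin should run for this doc type, e.g. 'file' or 'directory'."""
    return doc_type == 'file'

def init(diskover_globals):
    """Called once before the crawl starts; open connections or load lookup data here."""
    pass

def close(diskover_globals):
    """Called once after the crawl finishes; clean up anything opened in init."""
    pass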
To enable or disable plugins, edit the diskover config file diskover > plugins section.
For example, to enable the mediainfo plugin, add 'mediainfo' to the files list in the plugins section.
To list all plugins that will be used during crawl:
python3 diskover.py -l
The mediainfo plugin will enable additional meta data for video files to be indexed. The mediainfo plugin uses FFmpeg/ffprobe; you will need to install the FFmpeg package and check that ffprobe is in the PATH before using the plugin.
How to install FFmpeg on CentOS 7
To enable the mediainfo plugin, edit the diskover config file diskover > plugins section and add 'mediainfo' to the files list.
files: ['mediainfo']
New indices will use the plugin and any video file will get additional media info added to its Elasticsearch doc's media_info field.
You can now view media info in diskover-web since it is stored in a new field, media_info, for video files. To add it as a new field in diskover-web, go to settings and then, in additional fields, add
media_info|MediaInfo
Then you can search for a video file and its media info will be shown in that column. You can do searches on media info like this:
media_info.<key>:<value>
Example to find video file with 1920x1080 resolution:
media_info.resolution:1920x1080
Another example to find every video file that is not 1920x1080:
media_info:* AND NOT media_info.resolution:"1920x1080"
- urllib3.connectionpool - WARNING - Connection pool is full, discarding connection: localhost
If you are seeing this ES warning, there are two things you can try: in your diskover config, lower maxthreads to something like 16 or 20, and/or raise the maxsize setting for ES connections to 40 or more.
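For example (a sketch of the two settings; check the default/sample config for the exact section each one lives in):
maxthreads: 16
maxsize: 40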
More information about Index Lifecycle Management can be found on elastic.co.
Example:
- Your elasticsearch server is accessible at http://elasticsearch:9200
- You want your indices to be purged after thirty days (30d)
- Your policy name will be created as cleanup_policy_diskover
- Create a policy that deletes indices after one month
curl -X PUT "http://elasticsearch:9200/_ilm/policy/cleanup_policy_diskover?pretty" \
-H 'Content-Type: application/json' \
-d '{
"policy": {
"phases": {
"hot": {
"actions": {}
},
"delete": {
"min_age": "30d",
"actions": { "delete": {} }
}
}
}
}'
- Apply this policy to all existing diskover indices
curl -X PUT "http://elasticsearch:9200/diskover-*/_settings?pretty" \
-H 'Content-Type: application/json' \
-d '{ "lifecycle.name": "cleanup_policy_diskover" }'
- Create a template to apply this policy to new diskover indices
curl -X PUT "http://elasticsearch:9200/_template/logging_policy_template?pretty" \
-H 'Content-Type: application/json' \
-d '{
"index_patterns": ["diskover-*"],
"settings": { "index.lifecycle.name": "cleanup_policy_diskover" }
}'
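To verify that your indices are being managed by the policy, you can use the standard Elasticsearch ILM explain API:
curl -X GET "http://elasticsearch:9200/diskover-*/_ilm/explain?pretty"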
More information about Index State Management in Amazon ES can be found on aws docs.
Example:
- Your AWS Elasticsearch Service endpoint url is
<aws es endpoint>
- You want your indices to be purged after thirty days (30d)
- Your policy name will be created as cleanup_policy_diskover
- Create a policy that deletes indices after one month for new diskover indices
curl -u username:password -X PUT "https://<aws es endpoint>:443/_opendistro/_ism/policies/cleanup_policy_diskover" \
-H 'Content-Type: application/json' \
-d '{
"policy": {
"description": "Cleanup policy for diskover indices on AWS ES.",
"schema_version": 1,
"default_state": "current",
"states": [{
"name": "current",
"actions": [],
"transitions": [{
"state_name": "delete",
"conditions": {
"min_index_age": "30d"
}
}]
},
{
"name": "delete",
"actions": [{
"delete": {}
}],
"transitions": []
}
],
"ism_template": {
"index_patterns": ["diskover-*"],
"priority": 100
}
}
}'
- Apply this policy to all existing diskover indices
curl -u username:password -X POST "https://<aws es endpoint>:443/_opendistro/_ism/add/diskover-*" \
-H 'Content-Type: application/json' \
-d '{ "policy_id": "cleanup_policy_diskover" }'
#!/bin/bash
###
### Example diskover shell script to build an index and after completion,
### run a find dupes on the index using the previous days index name as a
### file hash lookup index.
###
### Set VARS
# DIRS is the directory level ABOVE the top level paths that are set to be indexed. For example:
# If you have a directory structure with /storage/path1 and /storage/path2, set DIRS to /storage/*
# in order to create a single index with both path1 and path2 as separate top level directories
# inside of it.
DIRS=$(ls -d /storage/*)
# TODAY is today's date.
TODAY=$(date +%d-%m)
# YDAY is yesterday's date. Used by find dupes command. On MacOS use date -v-1d +%d-%m.
YDAY=$(date -d 'yesterday' +%d-%m)
# INDEX is the name you want to give your indices. Prefix diskover- is required. TODAY will be added
# to the end of the index name.
INDEX=diskover-storage
### Indexing
# Build Index
python3 /opt/diskover/diskover.py -i $INDEX-$TODAY $DIRS
### Find Dupes
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder.py $INDEX-$TODAY -U $INDEX-$YDAY -a
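To run a script like this nightly you could schedule it with cron, for example (the script path and schedule below are only an illustration; adjust for your environment):
0 1 * * * /bin/bash /opt/diskover/scripts/index_storage.sh > /var/log/diskover_index_storage.log 2>&1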
Task panel can be used to schedule building indices or running any type of file action task such as copying files, running duplicate file finding, checking permissions on directories, etc. The task panel is a swiss-army knife for data management.
To get started with the task panel, check you have the json files in diskover-web public/tasks/ directory.
cd /var/www/diskover-web/public/tasks
ls *.json
tasklog.json tasks.json templates.json workers.json
You should see the above .json files which are used by the Task Panel for storing task and worker related data. If you only see .json.sample files, copy the sample/default files
for f in *.json.sample; do cp $f "${f%.*}"; done
chmod 660 *.json
chown nginx:nginx *.json
You will need to start at least one diskover worker daemon (diskoverd) to work on tasks. diskoverd can run on the diskover host or on any host. diskoverd requires access to the diskover-web REST API, which is located at http://<diskover-web-host>:<port>/api.php.
Check that you have access to the web API, for example:
curl http://<diskover-web-host>:<port>/api.php
If you have access, you should receive a response like
{
"status": true,
"message": {
"version": "diskover REST API",
"message": "endpoint not found"
}
}
To get diskoverd up and running and working on tasks, you'll first need to copy the default/sample diskoverd config file from configs_sample/diskoverd/config.yaml in your diskover directory to, for example on Linux, ~/.config/diskoverd/.
mkdir -p ~/.config/diskoverd/
cd /opt/diskover/configs_sample/diskoverd/
cp config.yaml ~/.config/diskoverd/
After copying the config file, edit the diskoverd config in ~/.config/diskoverd/config.yaml and change the API endpoint setting apiurl to your diskover-web URL. Also in the diskoverd config, set your local worker timezone, and set the email server settings if you want to receive emails after a task completes.
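For example (a sketch of the relevant settings; replace the host, port and timezone with your own):
apiurl: http://<diskover-web-host>:<port>/api.php
timezone: America/Vancouver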
To start up a diskoverd worker, run the below command:
python3 diskoverd.py
With no cli options, diskoverd uses a unique worker name (hostname + unique id) each time it is started.
To see all cli options, such as setting a worker name, use -h
python3 diskoverd.py -h
To enable logging to a file and set log level, edit the config and set logLevel, logToFile and logDirectory and stop and restart diskoverd.
After diskoverd has started, it will appear in the diskover-web Task Panel on the workers page. From there you can see the health of the worker (online/offline), disable the worker, etc. A worker will show as offline if it does not send a heartbeat for 10 min. diskoverd tries to send a heartbeat to the diskover-web API every 2 min.
To copy/roll tags and/or dupe hashes from one index to the next, you will need to create and use the indexing task Post-Crawl command. In the diskover directory there is a directory named scripts with an example post command bash script, task-postcommands-example.sh.
Copy the task-postcommands-example.sh script to task-postcommands.sh (or any file name you want) and edit it for any post commands you want to run. In the example script, diskover-tagcopier.py is run to copy tags from the previous index to the new index.
To use the post command script, edit the task in the diskover-web task panel and in the Post-Crawl Command add
/bin/bash
and in the Post-Crawl Command Args add
./scripts/task-postcommands.sh {indexname}
{indexname} gets passed into the shell script as argument 1 and is translated by the diskoverd task daemon to the real index name.
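A minimal sketch of what task-postcommands.sh might contain (how the previous index name is determined here is an assumption; the shipped task-postcommands-example.sh is the authoritative starting point):
#!/bin/bash
# $1 is the new index name, passed in as {indexname} by the diskoverd task daemon.
NEWINDEX="$1"
# Previous index to copy tags from (assumption: built from yesterday's date,
# matching the indexing shell script example earlier on this page).
PREVINDEX="diskover-storage-$(date -d 'yesterday' +%d-%m)"
# Copy tags from the previous index to the new index (adjust the script path for your install).
python3 /opt/diskover/plugins_postindex/diskover-tagcopier.py "$PREVINDEX" "$NEWINDEX"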
Setting up diskoverd task worker daemon as a service in CentOS 7.
Set the timezone setting in the diskoverd config file to match the task worker's time zone. List of Timezones.
vi /root/.config/diskoverd/config.yaml
timezone: America/Vancouver
Enable logging to a file in the diskoverd and diskover config files by setting the logToFile setting to True.
Set up the diskoverd service by creating the below systemd service file.
sudo vi /etc/systemd/system/diskoverd.service
[Unit]
Description=diskoverd task worker daemon
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/diskover/
ExecStart=/usr/bin/python3 /opt/diskover/diskoverd.py -n worker-%H
Restart=always
[Install]
WantedBy=multi-user.target
Set permissions, enable and start the diskoverd service:
sudo chmod 644 /etc/systemd/system/diskoverd.service
sudo systemctl daemon-reload
sudo systemctl enable diskoverd.service
sudo systemctl start diskoverd.service
sudo systemctl status diskoverd.service
Now you should have a diskoverd task service running and ready to work on tasks.
Starting, stopping and seeing the status of diskoverd service:
sudo systemctl stop diskoverd.service
sudo systemctl start diskoverd.service
sudo systemctl restart diskoverd.service
sudo systemctl status diskoverd.service
Accessing logs for diskoverd service:
journalctl -u diskoverd
Additional log files for diskoverd can be found in the directory set in the diskoverd config file's logDirectory setting.
diskover has the ability to add additional alternate scanners besides the default scandir python module. The scanners/ directory is the location of alternate python modules for scanning. Included in the directory is a python module scandir_s3 for scanning AWS S3 buckets.
To use the s3 alternate scanner first install the boto3 python module:
pip3 install boto3
After that, you will need to set up and configure AWS credentials, etc. for boto3:
- https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
- https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
Scan and index an S3 bucket bucketname using an auto index name:
python3 diskover.py --altscanner scandir_s3 s3://bucketname
To use a different S3 endpoint URL (Wasabi, etc.), set the S3_ENDPOINT_URL environment variable before running the crawl.
export S3_ENDPOINT_URL=https://<endpoint>
Diskover-web has a REST API for getting and updating index data.
Get (with curl or web browser)
Getting file/directory tag info is done with the GET method.
Curl example:
curl -X GET http://localhost:8000/api.php/indexname/endpoint
List all diskover indices and stats for each:
GET http://localhost:8000/api.php/list
List all files with no tag (untagged):
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=file
List all directories with no tag (untagged):
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=directory
List files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=file
List directories with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=directory
List files/directories (all items) with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201
List files with tag "archive":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=archive&type=file
List directories with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=delete&type=directory
List total size (in bytes) of files for each tag:
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?type=file
List total size (in bytes) of files with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=delete&type=file
List total size (in bytes) of files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=version%201&type=file
List total number of files for each tag:
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?type=file
List total number of files with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=delete&type=file
List total number of files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=version+1&type=file
Search index using ES query syntax:
GET http://localhost:8000/api.php/diskover-2018.01.17/search?query=extension:png%20AND%20type:file%20AND%20size:>1048576
Get latest index name for top path in index:
GET http://localhost:8000/api.php/latest?toppath=/dirpath
For "tags", and "search" endpoints, you can set the page number and result size with ex. &page=1 and &size=100. Default is page 1 and size 1000.
Update (with JSON object)
Updating file/directory tags is done with the PUT method. You can send a JSON object in the body. The call returns the status and number of items updated.
Curl example:
curl -X PUT http://localhost:8000/api.php/index/endpoint -d '{}'
Tag files "delete":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Tag files with tags "archive" and "version 1":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["archive", "version 1"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Remove tag "delete" for files which are tagged "delete":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Remove all tags for files:
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": [], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Tag directory "archive" (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}
Tag directories and all files in directories with tags "archive" and "version 1" (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive", "version 1"], "dirs": ["/Users/shirosai/Downloads", "/Users/shirosai/Documents"], "tagfiles": "true"}
Tag directory and all sub dirs (no files) with tag "version 1" (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true"}
Tag directory and all items (files/directories) in directory and all sub dirs with tag "version 1" (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}
Remove tag "archive" from directory which is tagged "archive":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}
Remove all tags from directory and all files in directory (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "tagfiles": "true"}
Remove all tags from directory and all items (files/directories) in directory and all sub dirs (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}
"""example usage of diskover-web rest-api using requests and urllib
"""
import requests
try:
from urllib import quote
except ImportError:
from urllib.parse import quote
import json
url = "http://localhost:8000/api.php"
# list all diskover indices
r = requests.get('%s/list' % url)
print(r.url + "\n")
print(r.text + "\n")
# list total number of files for each tag in diskover-index index
index = "diskover-index"
r = requests.get('%s/%s/tagcount?type=file' % (url, index))
print(r.url + "\n")
print(r.text + "\n")
# list png files larger than 1 MB in diskover-index index
q = quote("extension:png AND type:file AND size:>1048576")
r = requests.get('%s/%s/search?query=%s' % (url, index, q))
print(r.url + "\n")
print(r.text + "\n")
# tag directory and all files in directory with tag "archive" (non-recursive)
d = {'tags': ['archive'], 'dirs': ['/Users/cp/Downloads'], 'tagfiles': 'true'}
r = requests.put('%s/%s/tagdirs' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")
To limit api access to certain hosts or networks, you can add an additional location block with allow/deny rules to your diskover-web nginx config /etc/nginx/conf.d/diskover-web.conf.
The nginx location block below needs to go above the other location block that starts with location ~ \.php(/|$) {
Change 1.2.3.4 to the IP address you want to allow to access the API. You can add additional allow lines if you want to allow more hosts/networks to access the API. The deny all line needs to come after all allow lines.
location ~ /api\.php(/|$) {
allow 1.2.3.4;
deny all;
fastcgi_split_path_info ^(.+\.php)(/.+)$;
set $path_info $fastcgi_path_info;
fastcgi_param PATH_INFO $path_info;
try_files $fastcgi_script_name =404;
fastcgi_pass unix:/var/run/php-fpm/php-fpm.sock;
#fastcgi_pass 127.0.0.1:9000;
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_read_timeout 900;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
}
Restart nginx
systemctl restart nginx
Then verify you can access the API with curl or a web browser from an allowed host:
curl http://<diskover-web-host>:<port>/api.php
You should see this:
{
"status": true,
"message": {
"version": "diskover REST API v2.0-b.3",
"message": "endpoint not found"
}
}
Other hosts will now be blocked with a 403 Forbidden HTTP error page.
By default Elasticsearch has no security enabled. Follow this user guide to set up security.
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/secure-cluster.html
We recommend having more, smaller indices rather than a few very large indices. Rather than indexing at the very top level of your storage mounts, you could index one level down into multiple indices and then run parallel diskover.py index processes, which will be much faster for indexing a really large share with hundreds of millions of files.
You can optimize your indices by setting number of shards and replicas in diskover config file. By default in diskover config, shards are set to 1 and replicas are set to 0. It is important to note that these settings are not meant for production as they provide no load balancing or fault tolerance.
You want to aim for shard sizes between 10 GB and 50 GB, ideally somewhere between 20 and 40 GB per shard.
Examples:
- For an index that is 60 GB in size, set shards to 3 and replicas to 1 or 2, spread across 3 ES nodes.
- For an index that is 5 GB in size, set shards to 1 and replicas to 1 or 2, on 1 ES node or spread across 3 ES nodes (recommended).
Note: Replicas help with search performance and provide fault tolerance. When you change shard/replica numbers, you have to delete the index and re-index.
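To check the size and current shard/replica counts of your existing indices, you can use the standard Elasticsearch cat indices API:
curl "http://elasticsearch:9200/_cat/indices/diskover-*?v&h=index,pri,rep,store.size,pri.store.size"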
It is recommended to index on a separate host so as not to impact ES. Also, run diskover-web on a separate host from ES and the indexing host(s).
Another important Elasticsearch configuration is the Java heap memory size. It should be half your ES host RAM, up to 32 GB.
The indexing host uses a separate thread for each directory at level 1 of the top crawl directory. If you have many directories at level 1, you will want to increase the number of CPU cores and adjust maxthreads in the diskover config.
Enable debug logging in config by setting logLevel to DEBUG and enable logging to a file by setting logToFile to True.
logLevel: DEBUG
logToFile: True
logDirectory: /tmp/
Run and redirect all stdout/stderr output to a log file:
python3 diskover.py ... > /var/log/<logname>.log 2>&1
Do a hard refresh of your browser
- this will load any new files from the web server and not use the ones cached locally
Clear all diskover-web cookies
- go to the settings page, in the clear cookies section click the "Clear" button, then log out of diskover-web and log back in
Clear cookies and cached data for diskover-web from your browser settings menu
View nginx error logs for any php/es warnings/errors:
tail -f /var/log/nginx/error.log
If you see any permission errors, see diskover-web v2 install guide and check diskover-web permissions and user/group changes required.