Diskover v2 Usage
- Hardware Requirements (Essential version)
- License files(Essential version)
- Config files
- Building indices
- Add additional directory tree to existing index (Essential version)
- Find duplicate files (dupes) in existing index (Pro version)
- Auto-tagging an existing index (Pro version)
- Copy tags from one index to another (Pro version)
- Add extra meta data using crawl Plugins
- Media Info plugin (Media Edition)
- Log Warnings
- Index Lifecycle Management
- Index State Management on AWS ES
- Indexing Shell Script Example (Pro version)
- diskover-web Task Panel (Essential version)
- diskover-web Indexing Tasks (Essential version)
- Run diskoverd as a service (Essential version)
- Indexing AWS S3 buckets (Essential version)
- Diskover-web API (Pro version)
- Restrict diskover-web API Access (Pro version)
- Securing Elasticsearch and Kibana
- Optimizing (Essential version)
- Troubleshooting
For smaller environments:
- Elasticsearch hosts: 8 CPU cores, 16 GB RAM (8 GB reserved for ES memory heap) (1 ES node for testing, minimum 3 for production)
- diskover-web web server: 2 CPU cores, 4 GB RAM
- indexing host(s): 4 CPU cores, 4 GB RAM
For larger environments:
- Elasticsearch hosts: 16 CPU cores, 32 GB RAM (16 GB reserved for ES memory heap) (1 ES node for testing, minimum 3 for production)
- diskover-web web server: 4 CPU cores, 8 GB RAM
- indexing host(s): 8-16 CPU cores, 8 GB RAM
Note: It is recommended to separate ES, web server and indexing host(s). Indices ideally should be on SSD. NFS data stores do not usually perform well for indices.
The diskover config file is located in:
- Linux:
~/.config/diskover/config.yaml
- macOS:
~/Library/Application Support/diskover/config.yaml
- Windows:
%APPDATA%\diskover\config.yaml
Default configs are located in configs_sample/. There are separate configs for the diskover crawler, autotag, dupes-finder, diskoverd, etc.
The default config files are not used directly by the diskover crawler, etc.; they are default/sample configs and need to be copied to the appropriate directory based on the OS.
For example, on Linux the config files are in ~/.config/<appName>/config.yaml. Each config file has an appName setting that matches the directory name where the config file is located. For the diskover dupes-finder, for example, this would be ~/.config/diskover_dupesfinder/config.yaml.
Use spaces in config files, not tabs.
If you get an error message when starting diskover.py like Config ERROR: diskover.excludes.dirs not found, check config for errors or missing settings from default config, check that your config file is not missing any lines from the default/sample config and that there are no errors in your config such as missing values.
To use an alternate config file, set an environment variable to the directory containing the config.yaml. Example on Linux for diskover:
export DISKOVERDIR=/path/altconfigdir
The environment variable name depends on the appName setting in the config. Example on Linux to set an alternate config directory for diskover_dupesfinder, which has appName: diskover_dupesfinder in its config:
export DISKOVER_DUPESFINDERDIR=/path/altconfigdir
License files should be copied to/located in:
- diskover:
/opt/diskover/diskover.lic
- diskover-web:
/var/www/diskover-web/src/diskover/diskover-web.lic
License permissions:
Check that the diskover-web.lic file is owned by the nginx user and that permissions are 644:
chown nginx:nginx diskover-web.lic && chmod 644 diskover-web.lic
Run a crawl in foreground printing all log output to screen:
python3 diskover.py -i diskover-<indexname> <tree_dir>
See all cli options:
python3 diskover.py -h
- Multiple directory trees (tree_dir) can be set to index multiple top paths into single index (Essential version).
- UNC paths and drive maps are supported in Windows.
- Index name requires the diskover- prefix. Recommended index name: diskover-<mountname>-<datetime>
- Index name is optional; by default indices will be named diskover-<treedir>-<datetime>
On Linux or macOS, to run a crawl in the background and redirect all output to a log file:
nohup python3 diskover.py ... > /var/log/<logname>.log 2>&1 &
- Log settings including log level (logLevel) and logging to a file (logToFile) instead of screen can be found in diskover config.
To add an additional directory tree to an existing index, use the -a option:
python3 diskover.py -i diskover-<indexname> -a <tree_dir>
The advantage of running multiple index tasks is speed: you can run them in parallel (in the background or on separate indexing machines), and you don't have to wait for a long directory tree to finish scanning before the index is usable in diskover-web, etc. For example:
diskover.py -i diskover-nas1 /mnt/stor1
diskover.py -i diskover-nas2 /mnt/stor2
will be better than
diskover.py -i diskover-nas /mnt/stor1 /mnt/stor2
since stor2 may have many more files/directories and you won't be able to use the diskover-nas index until both finish scanning.
diskover uses threads for walking a directory tree. For example, if maxthreads in the diskover config is set to 20, up to a maximum of 20 sub directories under the index top path (top directory path / mount point / volume) can be scanned and indexed at once. This matters if you have very many or very few sub directories at level 1 in /mnt/toppath. If /mnt/toppath has only a single sub directory at level 1, crawls will be slower since there will only ever be 1 thread running.
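To see how many sub directories exist at level 1 of a top path (and therefore roughly how many threads a crawl of it can use), you could count them with a quick shell one-liner, for example:
ls -d /mnt/toppath/*/ | wc -l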
Check that you have the config file in ~/.config/diskover_dupesfinder/config.yaml, if not, copy from default config folder in configs_sample/diskover_dupesfinder/config.yaml.
Post indexing plugins, such as the dupes finder, are located in the plugins_postindex/ directory.
To use the default hashing mode (xxhash), you will first need to install the xxhash Python module:
pip3 install xxhash
Run dupes finder for a completed index:
python3 diskover-dupesfinder.py diskover-<indexname>
Run dupes finder for multiple completed indices:
python3 diskover-dupesfinder.py diskover-<indexname1> diskover-<indexname2>
See all cli options:
python3 diskover-dupesfinder.py -h
For example, using the diskover-cache sqlite3 db with -u to store file hashes, or using an existing index to look up hashes with -U, can save time by not having to hash the same file again. -a will add all file hashes generated to the index, not just the ones for dupe files found.
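For example (index names are placeholders), to find dupes in a new index while reusing hashes from a previous index and adding all generated hashes to the new index:
python3 diskover-dupesfinder.py diskover-<indexname> -U diskover-<previous_indexname> -a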
Calculating file hash checksums is an expensive cpu/disk operation so you may want to control what files in the index you want to hash. There are various config options for limiting what gets hashed and marked as a dupe in the index (is_dupe set to True). The ES index fields for file type that get updated are hash and is_dupe.
Config settings:
- hash mode, either xxhash or md5
- file extensions you want to hash
- minsize and maxsize of files to hash
- an additional ES query (otherquery) used when searching the index for which files to hash
- replacepaths for translating paths from source (index path) to destination path, for example on Windows translating /z_drive/ to Z:\
Check that you have the config file in ~/.config/diskover_autotag/config.yaml, if not, copy from default config folder in configs_sample/diskover_autotag/config.yaml.
Post indexing plugins, such as auto tag, are located in the plugins_postindex/ directory.
python3 diskover-autotag.py diskover-<indexname>
- Auto-tagging rules for tagging files and directories can be found in the diskover-autotag config file
- Auto-tagging can also be done during the crawl by enabling autotag in the diskover config and setting the rules in the diskover config file
- You can also tag manually in diskover-web or by using the diskover-web API (see the example after this list)
- All tags are stored in the tags field in the index; there is no limit to the number of tags
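For example, a minimal sketch of tagging files through the diskover-web API with Python requests (see the Diskover-web API section later on this page for the full endpoint reference; the index name and file paths below are placeholders):
import json
import requests

url = "http://localhost:8000/api.php"
index = "diskover-<indexname>"  # placeholder index name

# tag two files with the "archive" tag using the tagfiles endpoint
d = {"tags": ["archive"], "files": ["/mnt/stor1/file1.png", "/mnt/stor1/file2.png"]}
r = requests.put('%s/%s/tagfiles' % (url, index), data=json.dumps(d))
print(r.text)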
Check that you have the config file in ~/.config/diskover_tagcopier/config.yaml, if not, copy from default config folder in configs_sample/diskover_tagcopier/config.yaml.
Post indexing plugins, such as tag copier, are located in the plugins_postindex/ directory.
python3 diskover-tagcopier.py diskover-<source_indexname> diskover-<dest_indexname>
See all cli options:
python3 diskover-tagcopier.py -h
You can add extra meta data, for example unix permissions, to an index during crawl time by adding a plugin to Diskover crawler. Plugins are stored in plugins/ folder in the root directory of diskover. There are a few examples in the plugins folder to get you started.
For example, a unix permissions plugin could allow you to see all open permissions (rwx) for files and directories. Some other examples are database lookups to apply extra tags, content indexing that tags a file if a keyword is found, or copying/backing up a file if it matches certain criteria, etc. This is all done at crawl time.
To create a plugin:
- make a directory in plugins with the name of the plugin, for example myplugin
- create a file in the myplugin directory named __init__.py
- copy the code from one of the example plugins and edit it to create your plugin. There are six required function names, but they can be edited however you want as long as the return value type stays the same (a minimal skeleton is also sketched after the list below).
The six required function names for plugins are:
- add_mappings
- add_meta
- add_tags
- for_type
- init
- close
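A minimal skeleton of a plugin __init__.py is sketched below. The function signatures and return values shown here are assumptions for illustration only; copy one of the example plugins shipped in the plugins/ folder for the authoritative versions.
# plugins/myplugin/__init__.py - illustrative skeleton only; signatures are assumptions,
# use an example plugin from the plugins/ folder as your real starting point.

def add_mappings(mappings):
    """Add any extra Elasticsearch field mappings this plugin needs (assumed signature/structure)."""
    mappings['mappings']['properties']['myplugin_field'] = {'type': 'keyword'}
    return mappings

def add_meta(path, osstat):
    """Return a dict of extra meta data to add to the doc for path (assumed signature)."""
    return {'myplugin_field': 'example value'}

def add_tags(metadict):
    """Return a list of extra tags for the doc, or None for no tags (assumed signature)."""
    return None

def for_type(doc_type):
    """Return True if the plugin should run for this doc type, e.g. 'file' or 'directory'."""
    return doc_type == 'file'

def init(diskover_globals):
    """Called once before the crawl starts; open connections or load lookup data here."""
    pass

def close(diskover_globals):
    """Called once after the crawl finishes; clean up anything opened in init."""
    pass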
To enable or disable plugins, edit the diskover config file diskover > plugins section.
For example, to enable the mediainfo plugin, add 'mediainfo' to the files list in the plugins section.
To list all plugins that will be used during crawl:
python3 diskover.py -l
The mediainfo plugin will enable additional meta data for video files to be indexed. The mediainfo plugin uses FFmpeg/ffprobe; you will need to install the FFmpeg package and check that ffprobe is in the PATH before using the plugin.
How to install FFmpeg on CentOS 7
To enable the mediainfo plugin, edit the diskover config file diskover > plugins section and add 'mediainfo' to the files list.
files: ['mediainfo']
New indices will use the plugin and any video file will get additional media info added to its Elasticsearch doc's media_info field.
You can now view media info in diskover-web since it is stored in a new field, media_info, for video files. To add it as a new field in diskover-web, go to settings and then, in additional fields, add
media_info|MediaInfo
Then you can search for a video file and its media info will be shown in that column. You can do searches on media info like this:
media_info.<key>:<value>
Example to find video file with 1920x1080 resolution:
media_info.resolution:1920x1080
Another example to find every video file that is not 1920x1080:
media_info:* AND NOT media_info.resolution:"1920x1080"
- urllib3.connectionpool - WARNING - Connection pool is full, discarding connection: localhost
If you are seeing this ES warning, there are two things you can try: in your diskover config, lower maxthreads to something like 16 or 20, and/or raise the maxsize setting for ES connections to 40 or more.
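For example (a sketch of the two settings; check the default/sample config for the exact section each one lives in):
maxthreads: 16
maxsize: 40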
More information about Index Lifecycle Management can be found on elastic.co.
Example:
- Your elasticsearch server is accessible at http://elasticsearch:9200
- You want your indices to be purged after thirty days (30d)
- Your policy name will be created as cleanup_policy_diskover
- Create a policy that deletes indices after one month
curl -X PUT "http://elasticsearch:9200/_ilm/policy/cleanup_policy_diskover?pretty" \
-H 'Content-Type: application/json' \
-d '{
"policy": {
"phases": {
"hot": {
"actions": {}
},
"delete": {
"min_age": "30d",
"actions": { "delete": {} }
}
}
}
}'
- Apply this policy to all existing diskover indices
curl -X PUT "http://elasticsearch:9200/diskover-*/_settings?pretty" \
-H 'Content-Type: application/json' \
-d '{ "lifecycle.name": "cleanup_policy_diskover" }'
- Create a template to apply this policy to new diskover indices
curl -X PUT "http://elasticsearch:9200/_template/logging_policy_template?pretty" \
-H 'Content-Type: application/json' \
-d '{
"index_patterns": ["diskover-*"],
"settings": { "index.lifecycle.name": "cleanup_policy_diskover" }
}'
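To verify that your indices are being managed by the policy, you can use the standard Elasticsearch ILM explain API:
curl -X GET "http://elasticsearch:9200/diskover-*/_ilm/explain?pretty"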
More information about Index State Management in Amazon ES can be found on aws docs.
Example:
- Your AWS Elasticsearch Service endpoint url is
<aws es endpoint>
- You want your indices to be purged after thirty days (30d)
- Your policy name will be created as cleanup_policy_diskover
- Create a policy that deletes indices after one month for new diskover indices
curl -u username:password -X PUT "https://<aws es endpoint>:443/_opendistro/_ism/policies/cleanup_policy_diskover" \
-H 'Content-Type: application/json' \
-d '{
"policy": {
"description": "Cleanup policy for diskover indices on AWS ES.",
"schema_version": 1,
"default_state": "current",
"states": [{
"name": "current",
"actions": [],
"transitions": [{
"state_name": "delete",
"conditions": {
"min_index_age": "30d"
}
}]
},
{
"name": "delete",
"actions": [{
"delete": {}
}],
"transitions": []
}
],
"ism_template": {
"index_patterns": ["diskover-*"],
"priority": 100
}
}
}'
- Apply this policy to all existing diskover indices
curl -u username:password -X POST "https://<aws es endpoint>:443/_opendistro/_ism/add/diskover-*" \
-H 'Content-Type: application/json' \
-d '{ "policy_id": "cleanup_policy_diskover" }'
#!/bin/bash
###
### Example diskover shell script to build an index and after completion,
### run a find dupes on the index using the previous days index name as a
### file hash lookup index.
###
### Set VARS
# DIRS is the directory level ABOVE the top level paths that are set to be indexed. For example:
# If you have a directory structure with /storage/path1 and /storage/path2, set DIRS to /storage/*
# in order to create a single index with both path1 and path2 as separate top level directories
# inside of it.
DIRS=$(ls -d /storage/*)
# TODAY is today's date.
TODAY=$(date +%d-%m)
# YDAY is yesterday's date. Used by find dupes command. On MacOS use date -v-1d +%d-%m.
YDAY=$(date -d 'yesterday' +%d-%m)
# INDEX is the name you want to give your indices. Prefix diskover- is required. TODAY will be added
# to the end of the index name.
INDEX=diskover-storage
### Indexing
# Build Index
python3 /opt/diskover/diskover.py -i $INDEX-$TODAY $DIRS
### Find Dupes
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder.py $INDEX-$TODAY -U $INDEX-$YDAY -a
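To run a script like this nightly you could schedule it with cron, for example (the script path and schedule below are only an illustration; adjust for your environment):
0 1 * * * /bin/bash /opt/diskover/scripts/index_storage.sh > /var/log/diskover_index_storage.log 2>&1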
Task panel can be used to schedule building indices or running any type of file action task such as copying files, running duplicate file finding, checking permissions on directories, etc. The task panel is a swiss-army knife for data management.
To get started with the task panel, check you have the json files in diskover-web public/tasks/ directory.
cd /var/www/diskover-web/public/tasks
ls *.json
tasklog.json tasks.json templates.json workers.json
You should see the above .json files which are used by the Task Panel for storing task and worker related data. If you only see .json.sample files, copy the sample/default files
for f in *.json.sample; do cp $f "${f%.*}"; done
chmod 660 *.json
chown nginx:nginx *.json
You will need to start at least one diskover worker daemon (diskoverd) to work on tasks. diskoverd can run on the diskover host or on any host. diskoverd requires access to the diskover-web REST API, which is located at http://<diskover-web-host>:<port>/api.php.
Check that you have access to the web API, for example:
curl http://<diskover-web-host>:<port>/api.php
If you have access, you should receive a response like
{
"status": true,
"message": {
"version": "diskover REST API",
"message": "endpoint not found"
}
}
To get diskoverd up and running and working on tasks, you'll first need to copy the default/sample diskoverd config file from configs_sample/diskoverd/config.yaml in your diskover directory to, for example on Linux, ~/.config/diskoverd/.
mkdir -p ~/.config/diskoverd/
cd /opt/diskover/configs_sample/diskoverd/
cp config.yaml ~/.config/diskoverd/
After copying the config file, edit the diskoverd config in ~/.config/diskoverd/config.yaml and change the API endpoint setting apiurl to your diskover-web URL. Also in the diskoverd config, set your local worker timezone, and set the email server settings if you want to receive emails after a task completes.
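For example (a sketch of the relevant settings; replace the host, port and timezone with your own):
apiurl: http://<diskover-web-host>:<port>/api.php
timezone: America/Vancouver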
To start up a diskoverd worker, run the below command:
python3 diskoverd.py
With no cli options, diskoverd uses a unique worker name (hostname + unique id) each time it is started.
To see all cli options, such as setting a worker name, use -h
python3 diskoverd.py -h
To enable logging to a file and set log level, edit the config and set logLevel, logToFile and logDirectory and stop and restart diskoverd.
After diskoverd has started, it will appear in the diskover-web Task Panel on the workers page. From there you can see the health of the worker (online/offline), disable the worker, etc. A worker will show as offline if it does not send a heartbeat for 10 min. diskoverd tries to send a heartbeat to the diskover-web API every 2 min.
To copy/roll tags and/or dupe hashes from one index to the next, you will need to create and use the indexing task Post-Crawl command. In the diskover directory there is a directory named scripts with an example post command bash script, task-postcommands-example.sh.
Copy the task-postcommands-example.sh script to task-postcommands.sh (or any file name you want) and edit it for any post commands you want to run. In the example script, diskover-tagcopier.py is run to copy tags from the previous index to the new index.
To use the post command script, edit the task in the diskover-web task panel and in the Post-Crawl Command add
/bin/bash
and in the Post-Crawl Command Args add
./scripts/task-postcommands.sh {indexname}
{indexname} gets passed into the shell script as argument 1 and is translated by the diskoverd task daemon to the real index name.
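A minimal sketch of what task-postcommands.sh might contain (how the previous index name is determined here is an assumption; the shipped task-postcommands-example.sh is the authoritative starting point):
#!/bin/bash
# $1 is the new index name, passed in as {indexname} by the diskoverd task daemon.
NEWINDEX="$1"
# Previous index to copy tags from (assumption: built from yesterday's date,
# matching the indexing shell script example earlier on this page).
PREVINDEX="diskover-storage-$(date -d 'yesterday' +%d-%m)"
# Copy tags from the previous index to the new index (adjust the script path for your install).
python3 /opt/diskover/plugins_postindex/diskover-tagcopier.py "$PREVINDEX" "$NEWINDEX"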
Setting up diskoverd task worker daemon as a service in CentOS 7.
Set the timezone setting in the diskoverd config file to match the task worker's time zone. List of Timezones.
vi /root/.config/diskoverd/config.yaml
timezone: America/Vancouver
Enable logging to a file in the diskoverd and diskover config files by setting the logToFile setting to True.
Set up the diskoverd service by creating the below systemd service file.
sudo vi /etc/systemd/system/diskoverd.service
[Unit]
Description=diskoverd task worker daemon
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/diskover/
ExecStart=/usr/bin/python3 /opt/diskover/diskoverd.py -n worker-%H
Restart=always
[Install]
WantedBy=multi-user.target
Set permissions, enable and start the diskoverd service:
sudo chmod 644 /etc/systemd/system/diskoverd.service
sudo systemctl daemon-reload
sudo systemctl enable diskoverd.service
sudo systemctl start diskoverd.service
sudo systemctl status diskoverd.service
Now you should have a diskoverd task service running and ready to work on tasks.
Starting, stopping and seeing the status of diskoverd service:
sudo systemctl stop diskoverd.service
sudo systemctl start diskoverd.service
sudo systemctl restart diskoverd.service
sudo systemctl status diskoverd.service
Accessing logs for diskoverd service:
journalctl -u diskoverd
Additional log files for diskoverd can be found in the directory set in the diskoverd config file's logDirectory setting.
diskover has the ability to add additional alternate scanners besides the default scandir python module. The scanners/ directory is the location of alternate python modules for scanning. Included in the directory is a python module scandir_s3 for scanning AWS S3 buckets.
To use the s3 alternate scanner first install the boto3 python module:
pip3 install boto3
After that, you will need to set up and configure AWS credentials, etc. for boto3:
- https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
- https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
Scan and index an S3 bucket bucketname using an auto index name:
python3 diskover.py --altscanner scandir_s3 s3://bucketname
To use a different S3 endpoint URL (Wasabi, etc.), set the S3_ENDPOINT_URL environment variable before running the crawl.
export S3_ENDPOINT_URL=https://<endpoint>
Diskover-web has a REST API for getting and updating index data.
Get (with curl or web browser)
Getting file/directory tag info is done with the GET method.
Curl example:
curl -X GET http://localhost:8000/api.php/indexname/endpoint
List all diskover indices and stats for each:
GET http://localhost:8000/api.php/list
List all files with no tag (untagged):
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=file
List all directories with no tag (untagged):
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=directory
List files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=file
List directories with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=directory
List files/directories (all items) with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201
List files with tag "archive":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=archive&type=file
List directories with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=delete&type=directory
List total size (in bytes) of files for each tag:
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?type=file
List total size (in bytes) of files with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=delete&type=file
List total size (in bytes) of files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=version%201&type=file
List total number of files for each tag:
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?type=file
List total number of files with tag "delete":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=delete&type=file
List total number of files with tag "version 1":
GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=version+1&type=file
Search index using ES query syntax:
GET http://localhost:8000/api.php/diskover-2018.01.17/search?query=extension:png%20AND%20type:file%20AND%20size:>1048576
Get latest index name for top path in index:
GET http://localhost:8000/api.php/latest?toppath=/dirpath
For "tags", and "search" endpoints, you can set the page number and result size with ex. &page=1 and &size=100. Default is page 1 and size 1000.
Update (with JSON object)
Updating file/directory tags is done with the PUT method. You can send a JSON object in the body. The call returns the status and number of items updated.
Curl example:
curl -X PUT http://localhost:8000/api.php/index/endpoint -d '{}'
Tag files "delete":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Tag files with tags "archive" and "version 1":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["archive", "version 1"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Remove tag "delete" for files which are tagged "delete":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Remove all tags for files:
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": [], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}
Tag directory "archive" (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}
Tag directories and all files in directories with tags "archive" and "version 1" (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive", "version 1"], "dirs": ["/Users/shirosai/Downloads", "/Users/shirosai/Documents"], "tagfiles": "true"}
Tag directory and all sub dirs (no files) with tag "version 1" (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true"}
Tag directory and all items (files/directories) in directory and all sub dirs with tag "version 1" (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}
Remove tag "archive" from directory which is tagged "archive":
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}
Remove all tags from directory and all files in directory (non-recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "tagfiles": "true"}
Remove all tags from directory and all items (files/directories) in directory and all sub dirs (recursive):
PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}
"""example usage of diskover-web rest-api using requests and urllib
"""
import requests
try:
from urllib import quote
except ImportError:
from urllib.parse import quote
import json
url = "http://localhost:8000/api.php"
# list all diskover indices
r = requests.get('%s/list' % url)
print(r.url + "\n")
print(r.text + "\n")
# list total number of files for each tag in diskover-index index
index = "diskover-index"
r = requests.get('%s/%s/tagcount?type=file' % (url, index))
print(r.url + "\n")
print(r.text + "\n")
# list png files larger than 1 MB in diskover-index index
q = quote("extension:png AND type:file AND size:>1048576")
r = requests.get('%s/%s/search?query=%s' % (url, index, q))
print(r.url + "\n")
print(r.text + "\n")
# tag directory and all files in directory with tag "archive" (non-recursive)
d = {'tags': ['archive'], 'dirs': ['/Users/cp/Downloads'], 'tagfiles': 'true'}
r = requests.put('%s/%s/tagdirs' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")
To limit api access to certain hosts or networks, you can add an additional location block with allow/deny rules to your diskover-web nginx config /etc/nginx/conf.d/diskover-web.conf.
The nginx location block below needs to go above the other location block that starts with location ~ \.php(/|$) {
Change 1.2.3.4 to the IP address you want to allow to access the API. You can add additional allow lines if you want to allow more hosts/networks to access the API. The deny all line needs to come after all allow lines.
location ~ /api\.php(/|$) {
allow 1.2.3.4;
deny all;
fastcgi_split_path_info ^(.+\.php)(/.+)$;
set $path_info $fastcgi_path_info;
fastcgi_param PATH_INFO $path_info;
try_files $fastcgi_script_name =404;
fastcgi_pass unix:/var/run/php-fpm/php-fpm.sock;
#fastcgi_pass 127.0.0.1:9000;
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_read_timeout 900;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
}
Restart nginx
systemctl restart nginx
Then verify you can access the API with curl or a web browser from an allowed host:
curl http://<diskover-web-host>:<port>/api.php
You should see this:
{
"status": true,
"message": {
"version": "diskover REST API v2.0-b.3",
"message": "endpoint not found"
}
}
Other hosts will now be blocked with a 403 Forbidden HTTP error page.
By default Elasticsearch has no security enabled. Follow this user guide to set up security.
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/secure-cluster.html
We recommend having more, smaller indices rather than a few very large indices. Rather than indexing at the very top level of your storage mounts, you could index one level down into multiple indices and then run parallel diskover.py index processes, which will be much faster for indexing a really large share with hundreds of millions of files.
You can optimize your indices by setting number of shards and replicas in diskover config file. By default in diskover config, shards are set to 1 and replicas are set to 0. It is important to note that these settings are not meant for production as they provide no load balancing or fault tolerance.
You want to aim for shard sizes between 10 GB and 50 GB, ideally somewhere between 20 and 40 GB per shard.
Examples:
- For an index that is 60 GB in size, set shards to 3 and replicas to 1 or 2, spread across 3 ES nodes.
- For an index that is 5 GB in size, set shards to 1 and replicas to 1 or 2, on 1 ES node or spread across 3 ES nodes (recommended).
Note: Replicas help with search performance and provide fault tolerance. When you change shard/replica numbers, you have to delete the index and re-index.
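To check the size and current shard/replica counts of your existing indices, you can use the standard Elasticsearch cat indices API:
curl "http://elasticsearch:9200/_cat/indices/diskover-*?v&h=index,pri,rep,store.size,pri.store.size"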
It is recommended to index on a separate host so as not to impact ES. Also, run diskover-web on a separate host from ES and the indexing host(s).
Another important Elasticsearch configuration is the Java heap memory size. It should be half your ES host RAM, up to 32 GB.
The indexing host uses a separate thread for each directory at level 1 of the top crawl directory. If you have many directories at level 1, you will want to increase the number of CPU cores and adjust maxthreads in the diskover config.
Enable debug logging in config by setting logLevel to DEBUG and enable logging to a file by setting logToFile to True.
logLevel: DEBUG
logToFile: True
logDirectory: /tmp/
Run and redirect all stdout/stderr output to a log file:
python3 diskover.py ... > /var/log/<logname>.log 2>&1
Do a hard refresh of your browser
- this will load any new files from the web server and not use the ones cached locally
Clear all diskover-web cookies
- go to the settings page, in the clear cookies section click the "Clear" button, then log out of diskover-web and log back in
Clear cookies and cached data for diskover-web from your browser settings menu
View nginx error logs for any php/es warnings/errors:
tail -f /var/log/nginx/error.log
If you see any permission errors, see diskover-web v2 install guide and check diskover-web permissions and user/group changes required.