MetadataHarvester is a file metadata extraction tool for cybersecurity professionals, researchers, and analysts. It scans websites for downloadable files, extracts their metadata with ExifTool, and stores the results in a structured format for later analysis. It can also route its requests through the Tor network, which extends its reach to .onion sites on the deep web, and it handles a wide range of file types.
- Comprehensive Metadata Extraction: Extract detailed metadata from various file types, including PDF, DOC, DOCX, JPG, PNG, and many more.
- Tor Network Compatibility: Seamlessly integrates with the Tor network to ensure anonymity and access to .onion domains, expanding its reach into the deep web.
- Automatic Data Logging: Store metadata in SQLite databases for easy management and future analysis.
- User-Defined File Types: Customize file type searches based on specific needs, or scan for all supported file types.
- Efficient Web Crawling: Employs user-agent rotation and randomized delays to crawl web pages without triggering security defenses.
- Integrated with ExifTool: Leverages the power of ExifTool to provide accurate and detailed metadata extraction from supported files.
- Simple Output Options: Save the results in a database or as a simple text file.
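The user-agent rotation and randomized delays mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `USER_AGENTS` pool, the function names, and the delay bounds are all assumptions made for the example.

```python
import random
import time

# Hypothetical pool of User-Agent strings; the real tool's list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:115.0) Gecko/20100101 Firefox/115.0",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_pause(min_seconds=1.0, max_seconds=5.0):
    """Sleep for a random interval in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A crawler would typically call `random_headers()` for each request and `polite_pause()` between requests, so that neither the client signature nor the timing stays constant.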
Before using MetadataHarvester, ensure you have the required dependencies installed.
- Python 3.6+
- Tor service installed and running
- ExifTool installed (`sudo apt-get install libimage-exiftool-perl` on Debian-based systems)
- Clone the repository and change into its directory: `git clone https://github.com/n4rr34n6/MetadataHarvester.git && cd MetadataHarvester`
- Install dependencies: `pip3 install -r requirements.txt`
- Ensure the Tor service is active and configured correctly: `sudo service tor start`
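Before a first scan it can help to confirm that the prerequisites are actually in place. The sketch below assumes Tor's SOCKS listener is on its usual default of `127.0.0.1:9050` and that ExifTool is installed as `exiftool` on the `PATH`; the function names are hypothetical and not part of MetadataHarvester. It also shows the shape of a proxies mapping that the `requests` library accepts for SOCKS routing (this needs the optional PySocks dependency).

```python
import shutil
import socket

# Assumed defaults: Tor's SOCKS listener on 127.0.0.1:9050.
TOR_HOST = "127.0.0.1"
TOR_PORT = 9050


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tor_proxies(host=TOR_HOST, port=TOR_PORT):
    """Build a proxies mapping in the form the requests library accepts.

    The socks5h scheme asks the proxy to resolve hostnames, which
    .onion addresses need.
    """
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def check_prerequisites():
    """Report whether ExifTool is on PATH and the Tor SOCKS port answers."""
    return {
        "exiftool": shutil.which("exiftool") is not None,
        "tor": port_is_open(TOR_HOST, TOR_PORT),
    }


if __name__ == "__main__":
    print(check_prerequisites())
```

If either value comes back `False`, the corresponding installation step above is the first thing to revisit.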
Run the script by specifying the target URL and output file:
`python3 MetadataHarvester.py -u https://example.com -o output.db`
You can also specify the file types to search for:
`python3 MetadataHarvester.py -u https://example.com -o output.db -t pdf,docx`
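To make the options above concrete, here is one way a script could parse `-u`, `-o`, and `-t` with `argparse`. Only the three short flags come from the commands shown; the `dest` names, the helper functions, and the choice of `None` to mean "all supported file types" are assumptions for this sketch, and the real script's parsing may differ.

```python
import argparse


def parse_file_types(value):
    """Turn 'pdf,docx' into ['pdf', 'docx'], ignoring blanks, dots and case."""
    return [
        part.strip().lower().lstrip(".")
        for part in value.split(",")
        if part.strip()
    ]


def build_parser():
    """Create a parser for the three short options used in the examples."""
    parser = argparse.ArgumentParser(description="Sketch of the command-line options")
    parser.add_argument("-u", dest="url", required=True, help="target URL to scan")
    parser.add_argument("-o", dest="output", required=True, help="output file, e.g. output.db")
    # None stands for "all supported file types" in this sketch.
    parser.add_argument(
        "-t",
        dest="file_types",
        type=parse_file_types,
        default=None,
        help="comma-separated file types, e.g. pdf,docx",
    )
    return parser


# Parse the same arguments as the second command above.
args = build_parser().parse_args(
    ["-u", "https://example.com", "-o", "output.db", "-t", "pdf,docx"]
)
print(args.url, args.output, args.file_types)
```

Normalizing the type list once at parse time keeps later extension checks simple, since every entry is lowercase and has no leading dot.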
- Web Scraping: Uses `BeautifulSoup` for HTML parsing and `requests` to handle HTTP and HTTPS connections.
- Tor Integration: Uses SOCKS5 proxies for routing traffic through the Tor network.
- ExifTool: Extracts metadata from files, and results are stored in SQLite databases or text files for flexible output options.
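The ExifTool and SQLite steps in the list above could fit together roughly as shown below. This is a sketch under stated assumptions rather than the project's implementation: the single-table schema, the table and column names, and all three function names are hypothetical. The only external behavior relied on is that `exiftool -json FILE` prints a JSON array with one object per file.

```python
import json
import sqlite3
import subprocess


def read_metadata(path):
    """Run ExifTool on one file and return its tags as a dict."""
    result = subprocess.run(
        ["exiftool", "-json", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,               # raise CalledProcessError on failure
    )
    return json.loads(result.stdout)[0]


def open_store(db_path):
    """Open (or create) the SQLite file with a hypothetical one-table schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "source_url TEXT, tag TEXT, value TEXT)"
    )
    return conn


def save_metadata(conn, source_url, tags):
    """Insert one row per tag and return the number of rows written."""
    rows = [(source_url, tag, str(value)) for tag, value in tags.items()]
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

For a downloaded file, a caller would pass the result of `read_metadata(path)` to `save_metadata(conn, url, tags)`. Storing one row per tag avoids fixing a column set in advance, which suits the varied tag names that different file types produce.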
MetadataHarvester is intended for use in lawful research, cybersecurity analysis, and file management. Unauthorized scanning or data extraction from websites may violate terms of service and legal statutes. The developers are not responsible for any misuse of this tool.
This project is provided under the GNU Affero General Public License v3.0. You can find the full license text in the LICENSE file.