URL Extraction

Install

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt

URL Feature Extractor

Extracts 111 features from the URLs in your own database.

How to use

Run:

$ python run.py <input-urls> <output-dataset>

<input-urls>: Text file containing one or more URLs
<output-dataset>: File format of choice (bytes, csv, etc.)
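The input and output handling can be sketched roughly as follows. This is not the code in run.py; it assumes the input file holds one URL per line and that CSV is the chosen output format. The helper names `read_urls` and `write_dataset` are hypothetical.

```python
import csv


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumption: the input file lists one URL per line.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_dataset(rows, path):
    """Write a list of feature dicts (all with the same keys) as CSV."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Each row would carry one URL's features, so the header of the output file doubles as the feature list.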

Features implemented

| LEXICAL | | | |
| --- | --- | --- | --- |
| Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL |
| Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL |
| Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL |
| Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL |
| Count (%) in URL | URL Length | Number of TLDs in URL | Count (.) in Domain |
| Count (-) in Domain | Count (_) in Domain | Count (/) in Domain | Count (?) in Domain |
| Count (=) in Domain | Count (@) in Domain | Count (&) in Domain | Count (!) in Domain |
| Count ( ) in Domain | Count (~) in Domain | Count (,) in Domain | Count (+) in Domain |
| Count (*) in Domain | Count (#) in Domain | Count ($) in Domain | Count (%) in Domain |
| Domain Length | Count of vowels in Domain | URL domain in IP address format | Domain contains the keywords "server" or "client" |
| Count (.) in Directory | Count (-) in Directory | Count (_) in Directory | Count (/) in Directory |
| Count (?) in Directory | Count (=) in Directory | Count (@) in Directory | Count (&) in Directory |
| Count (!) in Directory | Count ( ) in Directory | Count (~) in Directory | Count (,) in Directory |
| Count (+) in Directory | Count (*) in Directory | Count (#) in Directory | Count ($) in Directory |
| Count (%) in Directory | Directory Length | Count (.) in file | Count (-) in file |
| Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file |
| Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file |
| Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file |
| Count (#) in file | Count ($) in file | Count (%) in file | File length |
| Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters |
| Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters |
| Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters |
| Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters |
| Count (%) in parameters | Length of parameters | TLD presence in arguments | Number of parameters |
| Email present in URL | | | |
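
Most of the lexical features are character counts over one part of the URL. The sketch below shows one way to compute them with the standard library. It is not the project's own implementation, and it rests on several assumptions: URLs include a scheme, "directory" means the path up to and including the last `/`, "file" is whatever follows it, and "parameters" is the query string. The function names and feature keys are made up for illustration.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

# The 17 characters counted for each component in the LEXICAL features.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def split_url(url):
    """Split a URL into the components the feature list refers to."""
    parts = urlsplit(url)
    # Assumption: directory ends at the last "/", file is the remainder.
    directory, sep, filename = parts.path.rpartition("/")
    return {
        "url": url,
        "domain": parts.hostname or "",
        "directory": directory + sep,
        "file": filename,
        "parameters": parts.query,
    }


def is_ip_address(host):
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def lexical_features(url):
    """Return a dict of lexical features (hypothetical key names)."""
    components = split_url(url)
    features = {}
    for name, text in components.items():
        for char in SPECIAL_CHARS:
            features[f"count_{char}_in_{name}"] = text.count(char)
        features[f"{name}_length"] = len(text)
    domain = components["domain"]
    features["vowels_in_domain"] = sum(c in "aeiou" for c in domain)
    features["domain_is_ip"] = is_ip_address(domain)
    features["server_or_client_in_domain"] = "server" in domain or "client" in domain
    features["parameter_count"] = len(
        parse_qsl(components["parameters"], keep_blank_values=True)
    )
    return features
```

For example, `lexical_features("http://sub.example.com/a/b/file.php?x=1&y=2")` reports 3 dots in the URL, 2 dots in the domain, and 2 parameters. The TLD and email features are left out because they need a TLD list and a pattern that this README does not specify.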

| HOST | | | |
| --- | --- | --- | --- |
| Domain lookup response time | Domain has SPF? | | |
| AS Number (or ASN) | Time (in days) of domain activation | Time (in days) of domain expiration | |
| Number of resolved IPs | Number of resolved name servers (NameServers - NS) | Number of MX Servers | Time-to-live (TTL) value associated with hostname |
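
The host features all depend on live network lookups. As a rough sketch of two of them (lookup response time and number of resolved IPs), the example below uses only `socket.getaddrinfo` from the standard library. It does not show how the project itself performs lookups. SPF, MX, NS, TTL, ASN, and WHOIS dates need a DNS or WHOIS client and are not covered here. The function name, the `resolver` parameter, and the result keys are assumptions for illustration.

```python
import socket
import time


def lookup_features(hostname, resolver=socket.getaddrinfo):
    """Time a name lookup and count the distinct IPs it returns.

    `resolver` is injectable so the function can be exercised offline;
    it must accept (host, port) like socket.getaddrinfo.
    """
    start = time.perf_counter()
    try:
        records = resolver(hostname, None)
    except socket.gaierror:
        # Unresolvable host: report zero IPs rather than failing.
        records = []
    elapsed = time.perf_counter() - start
    # Each record is (family, type, proto, canonname, sockaddr),
    # and sockaddr[0] is the IP address string.
    ips = {record[4][0] for record in records}
    return {"lookup_time_s": elapsed, "resolved_ips": len(ips)}
```

Calling `lookup_features("example.com")` performs a real lookup, so the timing depends on the local resolver and its cache.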

| OTHERS | | | |
| --- | --- | --- | --- |
| Valid TLS / SSL Certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google |
| Uses URL shortener service | | | |
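
Of these, the URL shortener check is the only one that can be done without a network request. A minimal sketch, assuming the check is a hostname lookup against a list of known services: the list below is a small illustrative sample, not the list the project uses, and `uses_shortener` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real check needs a longer, maintained list.
KNOWN_SHORTENERS = frozenset(
    {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly", "is.gd"}
)


def uses_shortener(url, shorteners=KNOWN_SHORTENERS):
    """True if the URL's host is one of the known shortener domains."""
    host = urlsplit(url).hostname or ""
    # Treat "www.bit.ly" the same as "bit.ly".
    if host.startswith("www."):
        host = host[4:]
    return host in shorteners
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as `https://example.com/bit.ly`.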