Polars extension for IP address parsing and enrichment including geolocation

erichutchins/polars_iptools

Polars IPTools

Polars IPTools is a Rust-based extension that accelerates IP address manipulation and enrichment in Polars dataframes. The library includes utility functions for working with IPv4 and IPv6 addresses, plus GeoIP and anonymization/proxy enrichment using MaxMind databases.

Install

pip install polars-iptools

Examples

Simple enrichments

IPTools' Rust implementation gives you speedy answers to basic IP questions like "is this a private IP?"

>>> import polars as pl
>>> import polars_iptools as ip
>>> df = pl.DataFrame({'ip': ['8.8.8.8', '2606:4700::1111', '192.168.100.100', '172.21.1.1', '172.34.5.5', 'a.b.c.d']})
>>> df.with_columns(ip.is_private(pl.col('ip')).alias('is_private'))
shape: (6, 2)
┌─────────────────┬────────────┐
│ ip              ┆ is_private │
│ ---             ┆ ---        │
│ str             ┆ bool       │
╞═════════════════╪════════════╡
│ 8.8.8.8         ┆ false      │
│ 2606:4700::1111 ┆ false      │
│ 192.168.100.100 ┆ true       │
│ 172.21.1.1      ┆ true       │
│ 172.34.5.5      ┆ false      │
│ a.b.c.d         ┆ false      │
└─────────────────┴────────────┘
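
To spot-check a handful of addresses without Polars, a minimal pure-Python sketch using the standard library's `ipaddress` module can serve as a reference. The helper name `is_private_py` is hypothetical and not part of polars_iptools, and Python's notion of "private" may differ from the Rust implementation for some special-purpose ranges. Unparseable input is mapped to False to mirror the table above.

```python
import ipaddress

def is_private_py(value: str) -> bool:
    """Hypothetical helper: True if value parses as a private IPv4/IPv6 address."""
    try:
        return ipaddress.ip_address(value).is_private
    except ValueError:
        # Unparseable strings such as 'a.b.c.d' are treated as not private
        return False

ips = ['8.8.8.8', '2606:4700::1111', '192.168.100.100', '172.21.1.1', '172.34.5.5', 'a.b.c.d']
results = [is_private_py(v) for v in ips]
```

This loops in Python one value at a time, so it is only suitable for small checks; the extension exists precisely to avoid that per-row overhead.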

is_in but for network ranges

Pandas and Polars have is_in functions to perform membership lookups. IPTools extends this idea to test whether IP addresses belong to a set of IP networks. The function works with both IPv4 and IPv6 addresses and converts the specified networks into a Level-Compressed trie (LC-Trie) for fast, efficient lookups.

>>> import polars as pl
>>> import polars_iptools as ip
>>> df = pl.DataFrame({'ip': ['8.8.8.8', '1.1.1.1', '2606:4700::1111']})
>>> networks = ['8.8.8.0/24', '2606:4700::/32']
>>> df.with_columns(ip.is_in(pl.col('ip'), networks).alias('is_in'))
shape: (3, 2)
┌─────────────────┬───────┐
│ ip              ┆ is_in │
│ ---             ┆ ---   │
│ str             ┆ bool  │
╞═════════════════╪═══════╡
│ 8.8.8.8         ┆ true  │
│ 1.1.1.1         ┆ false │
│ 2606:4700::1111 ┆ true  │
└─────────────────┴───────┘
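
The membership semantics can be sketched in plain Python with the standard library's `ipaddress` module. This is a linear scan over the networks, not the LC-Trie the extension builds, so it illustrates the result rather than the performance. The helper name `is_in_py` is hypothetical, and returning False for unparseable input is an assumption, since the example above does not show how the library treats invalid addresses.

```python
import ipaddress

def is_in_py(value, networks):
    """Hypothetical helper: True if value falls inside any of the parsed networks."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        # Assumption: invalid input is reported as "not a member"
        return False
    # Compare only against networks of the same IP version (v4 vs v6)
    return any(addr.version == net.version and addr in net for net in networks)

networks = [ipaddress.ip_network(n) for n in ['8.8.8.0/24', '2606:4700::/32']]
matches = [is_in_py(v, networks) for v in ['8.8.8.8', '1.1.1.1', '2606:4700::1111']]
```

A scan like this costs one comparison per network per address, which is why a trie becomes worthwhile once the network list grows large.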

GeoIP enrichment

Using MaxMind's GeoLite2-ASN.mmdb and GeoLite2-City.mmdb databases, IPTools provides offline enrichment of network ownership and geolocation.

ip.geoip.full returns a Polars struct containing all available metadata parameters. If you just want the ASN and AS organization, you can use ip.geoip.asn.

>>> import polars as pl
>>> import polars_iptools as ip

>>> df = pl.DataFrame({"ip":["8.8.8.8", "192.168.1.1", "2606:4700::1111", "999.abc.def.123"]})
>>> df.with_columns([ip.geoip.full(pl.col("ip")).alias("geoip")])

shape: (4, 2)
┌─────────────────┬─────────────────────────────────┐
│ ip              ┆ geoip                           │
│ ---             ┆ ---                             │
│ str             ┆ struct[11]                      │
╞═════════════════╪═════════════════════════════════╡
│ 8.8.8.8         ┆ {15169,"GOOGLE","","NA","","",… │
│ 192.168.1.1     ┆ {0,"","","","","","","",0.0,0.… │
│ 2606:4700::1111 ┆ {13335,"CLOUDFLARENET","","","… │
│ 999.abc.def.123 ┆ {null,null,null,null,null,null… │
└─────────────────┴─────────────────────────────────┘

>>> df.with_columns([ip.geoip.asn(pl.col("ip")).alias("asn")])
shape: (4, 2)
┌─────────────────┬───────────────────────┐
│ ip              ┆ asn                   │
│ ---             ┆ ---                   │
│ str             ┆ str                   │
╞═════════════════╪═══════════════════════╡
│ 8.8.8.8         ┆ AS15169 GOOGLE        │
│ 192.168.1.1     ┆                       │
│ 2606:4700::1111 ┆ AS13335 CLOUDFLARENET │
│ 999.abc.def.123 ┆                       │
└─────────────────┴───────────────────────┘
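
If you need the AS number and organization as separate values, the string returned by ip.geoip.asn can be split after the fact. The sketch below assumes the `AS<number> <organization>` layout seen in the output above and treats an empty or missing string as "no match"; both are inferred from the example rather than from a documented format, and `split_asn` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the sample output: "AS15169 GOOGLE"
_ASN_RE = re.compile(r"^AS(\d+)\s+(.*)$")

def split_asn(value):
    """Hypothetical helper: split 'AS15169 GOOGLE' into (15169, 'GOOGLE')."""
    match = _ASN_RE.match(value or "")
    if match is None:
        # Empty or unexpected strings yield no ASN information
        return None, None
    return int(match.group(1)), match.group(2)

number, org = split_asn("AS15169 GOOGLE")
```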

Spur enrichment

Spur is a commercial service that provides "data to detect VPNs, residential proxies, and bots". One of its offerings is a feed in MaxMind mmdb format covering at most 2,000,000 of the "busiest" Anonymous or Anonymous+Residential IPs.

ip.spur.full returns a Polars struct containing all available metadata parameters.

>>> import polars as pl
>>> import polars_iptools as ip

>>> df = pl.DataFrame({"ip":["8.8.8.8", "192.168.1.1", "999.abc.def.123"]})
>>> df.with_columns([ip.spur.full(pl.col("ip")).alias("spur")])

shape: (3, 2)
┌─────────────────┬─────────────────────────────────┐
│ ip              ┆ spur                            │
│ ---             ┆ ---                             │
│ str             ┆ struct[7]                       │
╞═════════════════╪═════════════════════════════════╡
│ 8.8.8.8         ┆ {0.0,"","","","","",null}       │
│ 192.168.1.1     ┆ {0.0,"","","","","",null}       │
│ 999.abc.def.123 ┆ {null,null,null,null,null,null… │
└─────────────────┴─────────────────────────────────┘

Environment Configuration

IPTools uses two MaxMind databases: GeoLite2-ASN.mmdb and GeoLite2-City.mmdb. You only need these files if you call the geoip functions.

Set the MAXMIND_MMDB_DIR environment variable to tell the extension where these files are located.

export MAXMIND_MMDB_DIR=/path/to/your/mmdb/files
# or, for Windows users
set MAXMIND_MMDB_DIR=c:\path\to\your\mmdb\files

If the environment variable is not set, polars_iptools will check two other common locations (on Mac/Linux):

/usr/local/share/GeoIP
/opt/homebrew/var/GeoIP
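
The lookup order described above can be sketched in Python as follows. This illustrates the documented behavior and is not the extension's actual code: the function name `resolve_mmdb_dir`, the order in which the two fallback directories are tried, and returning the configured directory without validating it are all assumptions.

```python
import os

# Fallback locations listed in the README (order of checking is assumed)
FALLBACK_DIRS = ["/usr/local/share/GeoIP", "/opt/homebrew/var/GeoIP"]

def resolve_mmdb_dir(environ=os.environ, exists=os.path.isdir):
    """Hypothetical helper: pick the directory expected to hold the .mmdb files."""
    configured = environ.get("MAXMIND_MMDB_DIR")
    if configured:
        # An explicitly configured directory takes precedence
        return configured
    for candidate in FALLBACK_DIRS:
        if exists(candidate):
            return candidate
    # Nothing configured and no fallback directory present
    return None

mmdb_dir = resolve_mmdb_dir()
```

Running this before calling the geoip functions is a quick way to confirm which directory your environment would point at.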

Spur Environment

If you're a Spur customer, export the feed as spur.mmdb and specify its location using the SPUR_MMDB_DIR environment variable.

export SPUR_MMDB_DIR=/path/to/spur/mmdb
# or, for Windows users
set SPUR_MMDB_DIR=c:\path\to\spur\mmdb

Credit

Developing this extension was super easy thanks to Marco Gorelli's tutorial and cookiecutter template.
