Python Libraries for Data Formats: Raster and Vector

Carlos Lizarraga-Celaya edited this page Sep 17, 2024 · 11 revisions

(This page: https://github.com/ua-datalab/Geospatial_Workshops/wiki/Python-Libraries-for-Data-Formats:-Raster-and-Vector)

(Image credit: NASA, Unsplash)

Watch a Recording of the Workshop

Introduction

Earth science, or geoscience, includes all fields of natural science related to the planet Earth. It deals with the planet's physical, chemical, and biological constitution and the linkages among them. Earth science encompasses four main branches of study: the biosphere, the hydrosphere, the atmosphere, and the lithosphere, each of which is further divided into more specialized fields.

Python is a widely used, open-source programming language. In Earth science, scientific programming languages like Python help you speed up and automate lengthy tasks, such as selecting and downloading large datasets or performing repetitive calculations that you might otherwise have to do manually.

Depending on the scientific application, sensor, or measuring device, geoscience data is stored in many formats and data types. We need to be aware of this so that we can plan how to read the data into our analysis environment.

(Image credit: Florent Poux. Towards Data Science, Medium.)


Geospatial Data File Formats

Vector File Formats

These files are composed of vertices and paths. The basic elements for vector data are points, lines and polygons (areas).

| Extension | File type | Description |
| --- | --- | --- |
| .shp, .shx, .dbf | Shapefile | The ESRI Shapefile has become an industry-standard geospatial data format. A shapefile is a set of companion files: .shp stores the geometry, .shx the geometry index, and .dbf the attribute table. |
| .geojson, .json | Geographic JavaScript Object Notation (GeoJSON) | GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JSON format. |
| .kml, .kmz | Keyhole Markup Language (KML) | KML is an XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers. A .kmz file is a zipped KML file. |
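Because GeoJSON is plain JSON, a small example can be built and read back with only the Python standard library. The sketch below is illustrative: the coordinates and the `name` properties are made-up values, and it only shows the structure of a feature collection with one point and one polygon.

```python
import json

# A hypothetical feature collection: one point and one polygon.
# GeoJSON coordinates are ordered (longitude, latitude).
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-110.95, 32.23]},
            "properties": {"name": "Sample station"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # One exterior ring; the first and last vertices coincide.
                "coordinates": [[
                    [-111.0, 32.0], [-110.0, 32.0], [-110.0, 33.0],
                    [-111.0, 33.0], [-111.0, 32.0],
                ]],
            },
            "properties": {"name": "Sample area"},
        },
    ],
}

# Serialize to GeoJSON text and parse it back.
geojson_text = json.dumps(feature_collection, indent=2)
parsed = json.loads(geojson_text)

# Each feature pairs a geometry with its non-spatial attributes.
geometry_types = [f["geometry"]["type"] for f in parsed["features"]]
print(geometry_types)
```

In practice a library such as geopandas or fiona would normally handle the reading and writing; the standard-library version is used here only to make the file structure visible.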

Raster File Formats

| Extension | File type | Description |
| --- | --- | --- |
| .dem | Digital Elevation Model (DEM) | DEM is a raster format used by the USGS to record elevation information. |
| .tif | GeoTIFF | A TIFF file whose header carries georeferencing metadata, such as the coordinate reference system and the geographic extent of the data. |
| .tif | Cloud Optimized GeoTIFF (COG) | A GeoTIFF with an internal layout (tiles and overviews) designed for cloud-native geospatial processing. COG files normally keep the .tif extension. |
| .las, .laz, .xyz | LiDAR point cloud | File formats designed for the interchange and archiving of lidar point cloud data. LAZ is the compressed counterpart of LAS. |
| .copc.laz | Cloud Optimized Point Cloud (COPC) | A COPC file is a LAZ 1.4 file that stores point data organized in a clustered octree. |
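The georeferencing that a raster file carries can be thought of as a mapping from pixel indices to map coordinates. The following is a minimal sketch in plain Python, assuming a north-up raster with no rotation; the origin, pixel size, and raster shape are hypothetical values chosen for illustration, and the function names are not taken from any library.

```python
# Hypothetical georeferencing for a small raster (values are illustrative).
n_rows, n_cols = 100, 200               # raster shape in pixels
x_origin, y_origin = -111.0, 33.0       # coordinates of the upper-left corner
pixel_width, pixel_height = 0.01, 0.01  # pixel size in coordinate units


def pixel_to_coords(row, col):
    """Return the (x, y) coordinates of the center of pixel (row, col)."""
    x = x_origin + (col + 0.5) * pixel_width
    y = y_origin - (row + 0.5) * pixel_height  # y decreases as row increases
    return x, y


def raster_bounds():
    """Return (left, bottom, right, top) for the whole raster."""
    left = x_origin
    top = y_origin
    right = x_origin + n_cols * pixel_width
    bottom = y_origin - n_rows * pixel_height
    return left, bottom, right, top


bounds = raster_bounds()
center_of_first_pixel = pixel_to_coords(0, 0)
print(bounds, center_of_first_pixel)
```

Libraries such as rasterio and GDAL read this information from the file header, so it rarely needs to be computed by hand.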

Multitemporal File Formats

| Extension | File type | Description |
| --- | --- | --- |
| .nc | Network Common Data Form (NetCDF) | NetCDF is a data format that supports access to and sharing of array-oriented scientific data. |
| .hdf | Hierarchical Data Format (HDF) | HDF is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. |
| .grib | General Regularly-distributed Information in Binary form (GRIB) | GRIB is a concise data format commonly used in meteorology to store historical and forecast weather data. |
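Array-oriented, multitemporal data can be pictured as a stack of grids indexed by time, latitude, and longitude. The sketch below uses NumPy with synthetic values (not read from a NetCDF, HDF, or GRIB file) to show two typical reductions: a temporal mean per grid cell and an area mean per time step. The variable names and grid sizes are assumptions made for illustration.

```python
import numpy as np

# Synthetic data: 4 time steps on a 3 x 5 latitude/longitude grid.
rng = np.random.default_rng(seed=0)
time = np.arange(4)                      # e.g. a day index
lat = np.linspace(31.0, 33.0, 3)
lon = np.linspace(-112.0, -108.0, 5)
temperature = 20.0 + rng.normal(size=(time.size, lat.size, lon.size))

# Reduce over the time axis to get one map of temporal means.
temporal_mean = temperature.mean(axis=0)

# Reduce over both spatial axes to get one time series of area means.
area_mean_series = temperature.mean(axis=(1, 2))

print(temporal_mean.shape, area_mean_series.shape)
```

A labeled-array library such as xarray expresses the same reductions by dimension name instead of axis number, which is less error-prone for larger datasets.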

Basic Python Libraries

A set of general-purpose Python libraries supports data analysis and data visualization. They provide data structures, mathematical operations, statistical analysis functions, and a collection of functions for visualizing data properties.

  • numpy. NumPy is the fundamental library for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and more.
  • matplotlib. Matplotlib is the main plotting library for Python. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits. Seaborn is a higher-level statistical visualization library built on top of Matplotlib.
  • pandas. Pandas is a Python software library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.
  • scipy. SciPy is a free and open-source Python library used for scientific computing and technical computing. It includes modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
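As a small illustration of the NumPy masked arrays mentioned above, the sketch below computes statistics on an elevation grid while ignoring missing cells. The grid values and the `NODATA` marker of -9999 are hypothetical.

```python
import numpy as np

NODATA = -9999.0  # hypothetical no-data marker

# A small synthetic elevation grid (meters) with two missing cells.
elevation = np.array([
    [1200.0, 1210.0, NODATA],
    [1190.0, NODATA, 1225.0],
    [1185.0, 1195.0, 1230.0],
])

# Mask the no-data cells so they are excluded from the statistics.
masked = np.ma.masked_equal(elevation, NODATA)

mean_elevation = masked.mean()   # mean of the valid cells only
max_elevation = masked.max()     # maximum of the valid cells only
valid_count = masked.count()     # number of unmasked cells

print(mean_elevation, max_elevation, valid_count)
```

Without the mask, the -9999 values would be treated as real elevations and distort the mean.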

Python Libraries for Geospatial Data

A collection of Python libraries is available for working with spatially distributed data. The most relevant ones are listed below.

Essential General Python Libraries

  • arraylake. Arraylake is a data lake platform for managing multidimensional arrays and metadata in the cloud.
  • dask. Dask is a flexible library for parallel computing in Python.
  • kerchunk. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage.
  • numba. Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.
  • pooch. Pooch is a data file downloader.
  • pystac. PySTAC is a library for working with SpatioTemporal Asset Catalogs (STAC).
  • stackstac. Stackstac loads a STAC collection into xarray with dask.
  • torchgeo. TorchGeo is a geospatial machine learning library built on PyTorch.
  • xarray. Xarray is an open-source project and Python package that introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone user experience.
  • xarray-spatial. Xarray-Spatial implements common raster analysis functions using Numba and provides an easy-to-install, easy-to-extend codebase for raster analysis.
  • xCDAT. xCDAT (Xarray Climate Data Analysis Tools) is an extension of xarray for climate data analysis on structured grids.
  • zarr. Zarr is a format for the storage of chunked, compressed, N-dimensional arrays.

Essential Geospatial Python Libraries for Raster Data

  • GDAL. GDAL is a translator library for raster and vector geospatial data formats that is released under an MIT style Open Source License by the Open Source Geospatial Foundation.
  • RasterFrames. RasterFrames brings together Earth-observation (EO) data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity and a huge challenge to the data analysis community.
  • rasterio. Rasterio allows access to geospatial raster data.
  • rasterstats. Rasterstats is a Python module for summarizing geospatial raster datasets based on vector geometries. It includes functions for zonal statistics and interpolated point queries. The command-line interface allows for easy interoperability with other GeoJSON tools.
  • RSGISLib. The Remote Sensing and Geographical Information Systems software Library (RSGISLib) contains a number of algorithms for processing and analysing remote sensing data that are the product of research carried out by the authors and their collaborators.

Essential Geospatial Python Libraries for Vector Data

  • fiona. Fiona focuses on reading and writing data in standard Python IO style and relies upon familiar Python types and protocols such as files, dictionaries, mappings, and iterators. Fiona can read and write real-world data using multi-layered GIS formats and zipped virtual file systems and integrates readily with other Python GIS packages such as pyproj, Rtree, and Shapely.

  • GDAL/OGR. OGR is the vector component of GDAL. Many software programs rely on the GDAL/OGR libraries to read and write multiple GIS formats.

  • geomesa. GeoMesa is an open source suite of tools that enables large-scale geospatial querying and analytics on distributed computing systems. GeoMesa provides spatio-temporal indexing on top of the Accumulo, HBase, Google Bigtable and Cassandra databases for massive storage of point, line, and polygon data. GeoMesa also provides near real time stream processing of spatio-temporal data by layering spatial semantics on top of Apache Kafka.

  • geopandas. GeoPandas is an open source Python library for working with geospatial data. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona (or pyogrio in recent releases) for file access and matplotlib for data visualization.

  • pyproj. Python interface to PROJ (cartographic projections and coordinate transformations library).

  • shapely. Shapely is a BSD-license Python package for manipulation and analysis of planar geometric objects.
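To make "manipulation and analysis of planar geometric objects" concrete, the following sketch implements two classic planar operations in plain Python: polygon area via the shoelace formula and a ray-casting point-in-polygon test. This is a standard-library illustration, not shapely code. Shapely exposes equivalent operations directly (for example `Polygon.area` and `Polygon.contains`) and handles many edge cases that this sketch ignores. The function names and the sample square are made up for the example.

```python
def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula.

    `ring` is a list of (x, y) vertices; it may be open or closed.
    """
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def point_in_polygon(point, ring):
    """Ray-casting test for a point against a simple polygon.

    Points exactly on the boundary are not handled specially.
    """
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# A 4 x 3 rectangle used as a sample polygon.
rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]

area = polygon_area(rectangle)
inner = point_in_polygon((1.0, 1.0), rectangle)
outer = point_in_polygon((5.0, 1.0), rectangle)
print(area, inner, outer)
```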

Essential Geospatial Python Libraries for Point Clouds

  • PDAL. PDAL is a C++ library for translating and manipulating point cloud data. It is very much like the GDAL library which handles raster and vector data.
  • laspy. LAS (and its compressed counterpart LAZ) is a popular format for lidar point cloud and full-waveform data. Laspy reads and writes these formats and provides a Python API based on NumPy arrays.
  • numpy. NumPy is the fundamental library for scientific computing in Python. Combined with a reader/writer library like laspy, it lets us store point cloud data in a NumPy array and filter or process the data. NumPy is also useful across the geospatial domain in general.
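Once point coordinates are held in a NumPy array, filtering reduces to boolean indexing. The sketch below uses a synthetic point cloud rather than data read from a LAS/LAZ file, and the 2 m height threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# Synthetic point cloud: one row per point, columns are x, y, z.
rng = np.random.default_rng(seed=42)
n_points = 1000
points = np.column_stack([
    rng.uniform(0.0, 100.0, size=n_points),  # x
    rng.uniform(0.0, 100.0, size=n_points),  # y
    rng.uniform(0.0, 30.0, size=n_points),   # z (height)
])

# Boolean mask selecting points above a hypothetical height threshold.
height_threshold = 2.0
above = points[:, 2] > height_threshold
high_points = points[above]

# Simple summaries of the filtered subset.
fraction_kept = high_points.shape[0] / n_points
max_height = high_points[:, 2].max()

print(high_points.shape, fraction_kept, max_height)
```

With a real file, the x, y, and z columns would come from a reader such as laspy or PDAL instead of a random number generator.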

Essential Python Libraries for Geospatial Data Visualization

  • Matplotlib. Matplotlib is the main library for creating static, animated, and interactive visualizations in Python.
  • Bokeh. Bokeh is an interactive visualization library for modern web browsers.
  • Datashader. Datashader is a graphics pipeline system for creating meaningful representations of large datasets.
  • Folium. Folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library.
  • GeoPandas. GeoPandas is an open source project to make working with geospatial data in Python easier.
  • geoplotlib. Geoplotlib is a Python toolbox for visualizing geographical data and making maps.
  • GeoViews. GeoViews is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets.
  • HoloViews. HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple.
  • hvPlot. hvPlot is a high-level plotting API for data exploration and visualization.
  • ipyleaflet. Ipyleaflet is a Jupyter widget for Leaflet.js, enabling interactive maps in the Jupyter notebook.
  • kepler.gl. Kepler.gl is a geospatial analysis tool for large-scale data sets.
  • samgeo. A Python package for segmenting geospatial data with the Segment Anything Model (SAM).

📝 See Jupyter Notebook Examples 👈

More Planetary Computer Examples


R packages

rgdal

lidR

raster

sf

rgeos

sp

randomForest

C50

ForestTools

ggplot2

rTLS

TreeLS

spatstat

gstat

doParallel

tmap


General Geospatial Data Science References

Geospatial data analysis using Python

Geospatial data analysis using R

List of general Geospatial Data Science Applications

Below you will find a collection of available online resources:

Geospatial Applications

Geospatial Datasets

Medium articles

Other application examples:


Created: 08/18/2022; Updated: 09/10/2024

Carlos Lizárraga.