This repository contains code and data for the worked examples described in Koger et al. (n.d.). Full descriptions of the examples are below; you can use the links there to navigate through the notebooks for each example and use the example datasets linked below to run the notebooks on existing data. Beyond using the provided data, we encourage researchers to use and modify the code to suit their own needs. If you find the code or the paper useful in your own studies, we ask that you cite this project:
Koger, B., Deshpande, A., Kerby, J.T., Graving, J.M., Costelloe, B.R., Couzin, I.D. Multi-animal behavioral tracking and environmental reconstruction using drones and computer vision in the wild.
The data required to run these examples can be downloaded from Edmond.
To run the model training or inference (object detection) steps in full, the user will require a GPU that supports PyTorch. This includes a local NVIDIA GPU with enough memory, or most computing clusters and cloud computing services with GPU support. We suggest a GPU with a minimum of 8 GB of memory (ideally 10+ GB). We provide our already-trained models so that researchers who aren't able to train their own can still explore the object detection step with our datasets. Additionally, storing the extracted video frames will require approximately 420 GB for the entire ungulates example video and 45 GB for the entire gelada example video. To explore our examples, you may decide to use only a clip from the video to reduce storage requirements.
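Before starting the heavier steps, a quick check (not part of the repository's notebooks) can confirm that PyTorch sees a CUDA-capable GPU and how much memory it reports:

```python
import torch

# Sanity check: is a CUDA-capable GPU visible to PyTorch, and how much memory does it have?
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected; training and inference will be very slow or fail.")
```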
The dataset includes intermediate outputs for the worked examples so people interested in exploring just one part of the method can do so.
To run the code in the provided notebooks, you will need to have the following packages installed:
- cv2
- Detectron2
- fvcore
- gdal
- imutils
- matplotlib
- numpy
- pandas
- pycocotools
- requests
- scipy
- tabulate
- torch
- utm
- yaml
These notebooks load and save data on the user's local system. Users therefore need to define local file paths where various data types can be saved and retrieved. For this, we use a single local-paths.json file for each project that all notebooks can refer to in order to get the appropriate local paths. These paths only need to be set once per project.
Each project has a demo-local-paths.json with dummy paths as an example, the actual local-paths.json file with empty paths, and a local-paths-info.md markdown file that describes how each particular path is used. Users should edit the local-paths.json to define the necessary file paths.
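For illustration, a notebook can read the shared path definitions along the following lines. The exact keys for each project are described in its local-paths-info.md; 'pix4D_folder' is one path referenced later in this README.

```python
import json

# Read the project's local-paths.json once and reuse the parsed dict in a notebook.
with open("local-paths.json") as f:
    local_paths = json.load(f)

print(local_paths["pix4D_folder"])
```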
This example processes aerial video footage of gelada monkeys (Theropithecus gelada). The recordings were provided by the Guassa Gelada Research Project, and were collected between October 16, 2019 and February 28, 2020 at the Guassa Community Conservation Area in the Ethiopian highlands. Our analyses here focus on a single video observation. Geladas were recorded with a DJI Mavic 2 Pro.
In this example, we start with the raw videos and build an annotated dataset from scratch. The provided notebooks work through the steps listed below. The step numbers correspond to the step numbers in the main text and supplement of Koger et al. (n.d.).
Note that Step 2 and Step 4 require the use of third-party software to complete image annotation and Structure-from-Motion (SfM) tasks. We use Labelbox for annotation and Pix4D for SfM, but there are other options available.
-
Step 1: Video Recording
- See the paper for information on appropriate video recording techniques. We provide the example gelada video here.
-
Step 2: Detection
- Annotation
- We extract frames from our videos that we use to build our annotated training, validation, and test set.
- We use Labelbox with a free educational license to label these extracted frames. We annotate bounding boxes and have three classes: gelada-adult-male, gelada-other, and human-observer.
- We generate standard COCO-format JSON files from the Labelbox-format .json export file; these are what we use to train our detection model.
- The multiple COCO-format .json files can be combined or split into separate training/validation/test files with the notebook combine_or_split_jsons.ipynb.
- We can also get statistics about our annotations, including counts for the different classes and object sizes.
- (Optional) After training a model (see the next step), we can use it for model-assisted labeling to speed up further annotation. (Note: this applies only to Labelbox, and the code isn't very intuitive; let us know how to do it better.)
- Model Training
- After building an annotated data set, we train a detection model.
- We can visualize the trained model's detections to get an intuition of the model's performance.
- We can also quantify the trained model's performance by calculating values like precision and recall, among others.
- If model performance isn't high enough, further annotation may help (see the model-assisted labeling notebook mentioned above).
- Video Processing
- We extract all video frames from the observation video.
- Note: this takes a lot of disk space and is unnecessary if you just want to run the detection model on this video. We do this because the frames are reused in various parts of this process.
- We then use the trained model to detect the geladas in each frame (see the inference sketch below).
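As a rough illustration of the detection step, the sketch below runs a trained Detectron2 model on a single extracted frame. The base config, checkpoint path, frame path, and score threshold are placeholders, not the repository's actual settings.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Build a config matching a trained detector; values here are placeholders.
cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3               # gelada-adult-male, gelada-other, human-observer
cfg.MODEL.WEIGHTS = "path/to/trained_model.pth"   # placeholder checkpoint path
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
frame = cv2.imread("path/to/extracted_frame.jpg")  # one extracted video frame
outputs = predictor(frame)

instances = outputs["instances"].to("cpu")
boxes = instances.pred_boxes.tensor.numpy()        # (N, 4) boxes in xyxy pixel coordinates
scores = instances.scores.numpy()
classes = instances.pred_classes.numpy()
```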
-
Step 3: Tracking
- After detecting individuals in the observation, we connect these detections together into track segments (a minimal linking sketch follows this list).
- We then use a GUI that lets us visually connect and correct the generated track segments.
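To illustrate the idea behind building track segments, the sketch below links detections in consecutive frames by Hungarian assignment on centroid distance (using scipy, already listed as a dependency). The notebooks' actual association logic may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_frames(prev_centroids, curr_centroids, max_dist=50.0):
    """Match detections between two frames by centroid distance (in pixels).

    prev_centroids, curr_centroids: (N, 2) and (M, 2) arrays of (x, y).
    Returns a list of (prev_index, curr_index) pairs for matched detections.
    """
    if len(prev_centroids) == 0 or len(curr_centroids) == 0:
        return []
    # Pairwise distances form the assignment cost matrix.
    cost = np.linalg.norm(
        prev_centroids[:, None, :] - curr_centroids[None, :, :], axis=-1
    )
    rows, cols = linear_sum_assignment(cost)
    # Discard matches that jump farther than a plausible between-frame movement.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```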
-
Step 4: Landscape Reconstruction and Geographic Coordinate Transformation
- We first extract anchor frames from the complete observation that we use with structure from motion (SfM) software to build the 3D landscape model. The selected anchor frames are saved in a user-specified folder for SfM processing with the user's preferred software. Unlike the ungulate worked example below, we do this without drone logs, so we don't have latitude, longitude, and elevation coordinates for the drone at each anchor frame.
- We used Pix4D with an educational license for SfM processing. To follow this notebook without Pix4D, find the generated outputs here. For high-quality georeferencing, ground control points (GCPs) should be incorporated at this point in the process with your chosen SfM software. We extract GCPs from visible features in Google Earth. (To find the overall observation location on external maps, 10.330651 N, 39.798676 E is the location of one of the large bushes.)
- We export the generated map products into the local_paths['pix4D_folder'].
- We then calculate how the drone (camera) moves between anchor frames. We can then optionally confirm that the local movement estimation was accurate. Combining this local drone movement with the SfM outputs, we project the detected animal tracks into the 3D landscape and do some initial visualization of the tracks in the landscape.
- We can additionally visualize the georeferenced tracks on the 3D landscapes exactly as visualized in the paper. (This code is more involved and may be less intuitive than the visualization mentioned previously.)
- Human Validation
- After getting tracks in coordinates of the 3D landscape (and the world if properly georeferenced), we can visually check the correspondence between animal locations in the video frames and corresponding locations in the 3D landscape.
- For a set of random track locations, we generate a series of crop and location files. Using a GUI, a human can view both crops and click the location in the 3D landscape that corresponds to where the target animal is standing in the video.
- Then, to evaluate the accuracy of our location projections, we measure the distance between the true animal location in the 3D landscape, as indicated by the human, and the location generated by our method.
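As a sketch of this comparison (with placeholder values, not data from the paper): once both locations are expressed in metric map coordinates, the projection error is a Euclidean distance. The utm package listed above converts latitude/longitude references into such coordinates.

```python
import numpy as np
import utm

# Convert a latitude/longitude reference (here, the bush location given above)
# into metric UTM coordinates.
easting, northing, zone_number, zone_letter = utm.from_latlon(10.330651, 39.798676)

# Hypothetical human-clicked and method-projected positions in the same metric
# landscape coordinates; the error is their Euclidean distance.
human_xy = np.array([easting, northing])
method_xy = human_xy + np.array([0.4, -0.2])  # placeholder ~0.45 m offset
error_m = np.linalg.norm(human_xy - method_xy)
print(f"projection error: {error_m:.2f} m")
```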
-
Step 5: Body-part Keypoint Detection
- In this gelada example we don't use keypoint detection. Please see the ungulates worked example for information on this step.
-
Step 6: Landscape Quantification
- Many important landscape features, like ground topology, elevation, and color, are already quantified during the structure from motion step. For an example of extracting more complex landscape features, see our example of extracting possible game trails from the landscape in the ungulates worked example below.
To work through this example in sequence, download the data and start here.
This example processes aerial video footage of African ungulate herds. We recorded ungulate groups at Ol Pejeta and Mpala Conservancies in Laikipia, Kenya over two field seasons, from November 2 to 16, 2017 and from March 30 to April 19, 2018. In total, we recorded thirteen species, but here we focus most of our analyses on a single 50-minute observation of a herd of 18 Grevy’s zebras (Equus grevyi). We used DJI Phantom 4 Pro drones, and deployed two drones sequentially in overlapping relays to achieve continuous observations longer than a single battery duration.
In this example, we start with a pre-annotated dataset (we previously annotated it with now-outdated software). For an example of building an annotated dataset from scratch, please see the gelada example. Our annotated image set contains five classes (zebra, impala, buffalo, waterbuck, and other) spanning 1,913 annotated video frames. See the main text of the paper or annotated_data_stats.ipynb for more details on the annotated dataset.
The provided notebooks work through the steps listed below. The step numbers correspond to the step numbers in the main text and supplement of Koger et al. (n.d.).
Note that Step 4 requires the use of third-party software to complete Structure-from-Motion tasks. We use Pix4D, but there are other options available.
-
Step 1: Video Recording
- See the paper for information on appropriate video recording techniques. We provide the videos from the example Grevy's zebra observation here.
-
Step 2: Detection
- We start by training and then evaluating a model that detects various ungulate species. We then use this model to process an observation of Grevy's zebras that spans three overlapping drone flights.
- Before processing, we extract the individual video frames from the complete observation. These will be used throughout the pipeline, including during detection.
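A minimal frame-extraction sketch with OpenCV is shown below (file paths and frame naming are placeholders; the notebook's conventions may differ).

```python
import cv2
from pathlib import Path

video_path = "path/to/observation_video.MP4"   # placeholder input video
out_dir = Path("path/to/frames")               # placeholder output folder
out_dir.mkdir(parents=True, exist_ok=True)

cap = cv2.VideoCapture(video_path)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video (or read error)
    cv2.imwrite(str(out_dir / f"frame_{frame_idx:06d}.jpg"), frame)
    frame_idx += 1
cap.release()
```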
-
Step 3: Tracking
- After detecting individuals in the observation, we connect these detections together into track segments.
- We then use a GUI that lets us visually connect and correct the generated track segments.
-
Step 4: Landscape Reconstruction and Geographic Coordinate Transformation
- We first extract anchor frames from the complete observation that we use with structure from motion (SfM) software to build the 3D landscape model. The selected anchor frames are saved in a user-specified folder for SfM processing with the user's preferred software. The latitude, longitude, and elevation coordinates of the drone for each anchor frame, as recorded in the drone logs, are saved as a .csv in the input format used by Pix4D (a minimal CSV-writing sketch follows this list).
- We used Pix4D with an educational license for SfM processing. To follow this notebook without Pix4D, find the generated outputs here. For high quality geo-referencing, ground control points should be incorporated at this point in the process with your chosen SfM software.
- We then calculate how the drone (camera) moves between anchor frames. We can then optionally confirm that the local movement estimation was accurate. Combining this local drone movement with the SfM outputs we project the detected animal tracks into the 3D landscape.
- We can visualize the georeferenced tracks on the 3D landscapes.
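As a rough sketch, per-anchor-frame drone geotags can be written to a .csv with pandas. The image names and coordinate values below are placeholders, and the exact column names and order expected by Pix4D (or other SfM software) should be taken from that software's documentation.

```python
import pandas as pd

# Placeholder geotags for two anchor frames; replace with values parsed from the drone logs.
geotags = pd.DataFrame(
    {
        "image": ["anchor_000000.jpg", "anchor_001200.jpg"],  # hypothetical frame names
        "latitude": [0.0000, 0.0002],                         # placeholder coordinates
        "longitude": [36.9000, 36.9003],
        "altitude": [120.0, 121.5],
    }
)
geotags.to_csv("anchor_frame_geotags.csv", index=False)
```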
-
Step 5: Body-part Keypoint Detection
- We extract square crops around each tracked individual (see the cropping sketch below). These crops can be used with one of many existing open-source animal keypoint tracking tools, such as DeepLabCut, SLEAP, or DeepPoseKit.
- We provide complete keypoints for the observation previously generated by DeepPoseKit (performance stats) as well as a page describing how to use DeepLabCut to train a new model to use for keypoint detection in the context of this method.
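A minimal sketch of the cropping step (the crop size and padding strategy are illustrative choices, not necessarily those used in the notebook):

```python
import numpy as np

def square_crop(frame, cx, cy, size=256):
    """Cut a size x size crop centered on a tracked individual's (cx, cy) pixel location."""
    half = size // 2
    # Pad the frame so crops near the image edge keep the requested size.
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)), mode="constant")
    px, py = int(round(cx)) + half, int(round(cy)) + half
    return padded[py - half : py + half, px - half : px + half]
```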
-
Step 6: Landscape Quantification
- Many important landscape features, like ground topology, elevation, and color, are already quantified during the structure from motion step (see the elevation-sampling sketch after this list).
- For demonstration purposes, we also include a notebook for training a CNN to detect game trails in the landscape and another notebook for using this model on our 3D landscape maps.
- This is just meant as a demonstration of what is possible and hasn't been carefully validated beyond visual inspection.
- See the notebooks for details on the training data and training regime used.
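For illustration, elevation can be sampled from an SfM-generated DEM GeoTIFF with gdal along these lines (the file path is a placeholder, and the raster is assumed to be north-up, i.e., without rotation terms in its geotransform):

```python
from osgeo import gdal

# Open the DEM exported by the SfM software and read its affine geotransform.
dem = gdal.Open("path/to/sfm_dem.tif")          # placeholder path
elevation = dem.GetRasterBand(1).ReadAsArray()
x0, dx, _, y0, _, dy = dem.GetGeoTransform()    # assumes a north-up raster

def elevation_at(easting, northing):
    """Look up the DEM elevation at a georeferenced track location."""
    col = int((easting - x0) / dx)
    row = int((northing - y0) / dy)
    return elevation[row, col]
```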
To work through this example in sequence, download the data and start here.