Add temporal matching example to documentation

TUW-GEO · Jul 17, 2019 · bcaeea4
1 parent 1ab44ec
commit bcaeea4
Showing 3 changed files with 618 additions and 0 deletions.
diff --git a/docs/examples.rst b/docs/examples.rst
@@ -7,6 +7,14 @@ Examples
    validation_framework.rst
 
 
+Temporal matching of time series
+================================
+The following example explains temporal matching of time series.
+
+.. include::
+   temporal_matching.rst
+
+
 Calculating anomalies and climatologies
 =======================================
 

diff --git a/docs/temporal_matching.ipynb b/docs/temporal_matching.ipynb
@@ -0,0 +1,354 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Why is temporal matching important?\n",
+    "------------------------------------------------------\n",
+    "Satellite observations usually have an irregular temporal sampling pattern (intervals of 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. In-situ instruments and land surface models, on the other hand, generally sample at regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between time series coming from different sources, observation pairs (or triplets, etc.) that (nearly) coincide in time must be found. A simple way to identify such pairs is a nearest neighbor search. First, one time series needs to be selected as the temporal reference (i.e. all other time series will be matched to this reference). Second, a tolerance window (typically around 1-12 hours) has to be defined that characterizes the temporal correlation of neighboring observations (i.e. observations outside of the tolerance window are no longer considered representative neighbors). A special case can occur during the nearest neighbor search: several reference timestamps may share the same nearest neighbor, which leads to duplicated neighbors. Depending on the application and use case, the user needs to decide whether to keep the duplicates or to remove them before computing any statistics."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Matching two time series\n",
+    "------------------------------------\n",
+    "The following examples show how to match two time series with regular and irregular temporal sampling."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "            distance  matched_data               index\n",
+      "2007-01-01     0.375             0 2007-01-01 09:00:00\n",
+      "2007-01-02     0.375             1 2007-01-02 09:00:00\n",
+      "2007-01-03     0.375             2 2007-01-03 09:00:00\n",
+      "2007-01-04     0.375             3 2007-01-04 09:00:00\n",
+      "2007-01-05     0.375             4 2007-01-05 09:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from pytesmo.temporal_matching import df_match\n",
+    "\n",
+    "# create reference time series as dataframe\n",
+    "ref_index = pd.date_range(\"2007-01-01\", \"2007-01-05\", freq=\"D\")\n",
+    "ref_data = np.arange(len(ref_index))\n",
+    "ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
+    "\n",
+    "# create other time series as dataframe\n",
+    "match_index = pd.date_range(\"2007-01-01 09:00:00\", \"2007-01-05 09:00:00\", freq=\"D\")\n",
+    "match_data = np.arange(len(match_index))\n",
+    "match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
+    "\n",
+    "# match time series\n",
+    "matched = df_match(ref_df, match_df)\n",
+    "\n",
+    "# test if data and index are correct\n",
+    "print(matched)\n",
+    "np.testing.assert_allclose(5 * [9/24.], matched.distance.values)\n",
+    "np.testing.assert_allclose(np.arange(5), matched.matched_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "            distance  matched_data               index\n",
+      "2007-01-01  0.166667           0.0 2007-01-01 04:00:00\n",
+      "2007-01-02 -0.083333           1.0 2007-01-01 22:00:00\n",
+      "2007-01-03       NaN           NaN                 NaT\n",
+      "2007-01-04       NaN           NaN                 NaT\n",
+      "2007-01-05       NaN           NaN                 NaT\n"
+     ]
+    }
+   ],
+   "source": [
+    "# create other (irregular) time series as dataframe\n",
+    "match_irr_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 22:00:00\", \n",
+    "                                  \"2007-01-02 06:00:00\", \"2007-01-03 12:00:00\"])\n",
+    "match_irr_data = np.arange(len(match_irr_index))\n",
+    "match_irr_df = pd.DataFrame({\"matched_data\": match_irr_data}, index=match_irr_index)\n",
+    "\n",
+    "# match time series with 8 hour time window\n",
+    "matched = df_match(ref_df, match_irr_df, window=8/24.)\n",
+    "\n",
+    "# test if data and index are correct\n",
+    "print(matched)\n",
+    "np.testing.assert_allclose([4/24., -2/24., np.nan, np.nan, np.nan], matched.distance.values)\n",
+    "np.testing.assert_allclose([0, 1, np.nan, np.nan, np.nan], matched.matched_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "            distance  matched_data               index\n",
+      "2007-01-01  0.166667           0.0 2007-01-01 04:00:00\n",
+      "2007-01-02 -0.083333           1.0 2007-01-01 22:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# match time series with 8 hour time window and drop nan\n",
+    "matched = df_match(ref_df, match_irr_df, window=8/24., dropna=True)\n",
+    "\n",
+    "# test if data and index are correct\n",
+    "print(matched)\n",
+    "np.testing.assert_allclose([4/24., -2/24.], matched.distance.values)\n",
+    "np.testing.assert_allclose([0, 1], matched.matched_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Special case of duplicated neighbors\n",
+    "---------------------------------------------------"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     distance  matched_data               index\n",
+      "2007-01-01 04:00:00  0.208333             0 2007-01-01 09:00:00\n",
+      "2007-01-01 06:00:00  0.125000             0 2007-01-01 09:00:00\n",
+      "2007-01-02 06:00:00  0.125000             1 2007-01-02 09:00:00\n",
+      "2007-01-02 08:00:00  0.041667             1 2007-01-02 09:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# create reference time series as dataframe\n",
+    "ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n",
+    "                            \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\"])\n",
+    "ref_data = np.arange(len(ref_index))\n",
+    "ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
+    "\n",
+    "# create other time series (daily at 09:00) as dataframe\n",
+    "match_index = pd.date_range(\"2007-01-01 09:00:00\", \"2007-01-05 09:00:00\", freq=\"D\")\n",
+    "match_data = np.arange(len(match_index))\n",
+    "match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
+    "\n",
+    "# match time series\n",
+    "matched = df_match(ref_df, match_df)\n",
+    "\n",
+    "print(matched)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     distance  matched_data               index\n",
+      "2007-01-01 06:00:00  0.125000             0 2007-01-01 09:00:00\n",
+      "2007-01-02 08:00:00  0.041667             1 2007-01-02 09:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# match time series and drop duplicates\n",
+    "matched = df_match(ref_df, match_df, dropduplicates=True)\n",
+    "\n",
+    "print(matched)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Matching three or more time series\n",
+    "--------------------------------------------------"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     distance  matched_data               index\n",
+      "2007-01-01 04:00:00 -0.041667             1 2007-01-01 03:00:00\n",
+      "2007-01-01 06:00:00  0.000000             2 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000            10 2007-01-02 06:00:00\n",
+      "2007-01-02 08:00:00  0.041667            11 2007-01-02 09:00:00\n",
+      "2007-01-03 09:00:00  0.000000            19 2007-01-03 09:00:00\n",
+      "2007-01-03 10:00:00 -0.041667            19 2007-01-03 09:00:00\n",
+      "                     distance  matched_data               index\n",
+      "2007-01-01 04:00:00  0.083333             1 2007-01-01 06:00:00\n",
+      "2007-01-01 06:00:00  0.000000             1 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000             5 2007-01-02 06:00:00\n",
+      "2007-01-02 08:00:00 -0.083333             5 2007-01-02 06:00:00\n",
+      "2007-01-03 09:00:00 -0.125000             9 2007-01-03 06:00:00\n",
+      "2007-01-03 10:00:00  0.083333            10 2007-01-03 12:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# create reference time series as dataframe\n",
+    "ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n",
+    "                            \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\",\n",
+    "                            \"2007-01-03 09:00:00\", \"2007-01-03 10:00:00\"])\n",
+    "ref_data = np.arange(len(ref_index))\n",
+    "ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
+    "\n",
+    "# create first other time series (3-hourly) as dataframe\n",
+    "match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"3h\")\n",
+    "match_data = np.arange(len(match_index))\n",
+    "match_df1 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
+    "\n",
+    "# create second other time series (6-hourly) as dataframe\n",
+    "match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"6h\")\n",
+    "match_data = np.arange(len(match_index))\n",
+    "match_df2 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
+    "\n",
+    "# match time series (one result dataframe per matched series)\n",
+    "matched = df_match(ref_df, match_df1, match_df2)\n",
+    "\n",
+    "print(matched[0])\n",
+    "\n",
+    "print(matched[1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     distance  matched_data               index\n",
+      "2007-01-01 04:00:00 -0.041667             1 2007-01-01 03:00:00\n",
+      "2007-01-01 06:00:00  0.000000             2 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000            10 2007-01-02 06:00:00\n",
+      "2007-01-02 08:00:00  0.041667            11 2007-01-02 09:00:00\n",
+      "2007-01-03 09:00:00  0.000000            19 2007-01-03 09:00:00\n",
+      "                     distance  matched_data               index\n",
+      "2007-01-01 06:00:00  0.000000             1 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000             5 2007-01-02 06:00:00\n",
+      "2007-01-03 09:00:00 -0.125000             9 2007-01-03 06:00:00\n",
+      "2007-01-03 10:00:00  0.083333            10 2007-01-03 12:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# match time series and drop duplicates\n",
+    "matched = df_match(ref_df, match_df1, match_df2, dropduplicates=True)\n",
+    "\n",
+    "print(matched[0])\n",
+    "\n",
+    "print(matched[1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                     distance  matched_data               index\n",
+      "2007-01-01 04:00:00 -0.041667           1.0 2007-01-01 03:00:00\n",
+      "2007-01-01 06:00:00  0.000000           2.0 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000          10.0 2007-01-02 06:00:00\n",
+      "2007-01-02 08:00:00  0.041667          11.0 2007-01-02 09:00:00\n",
+      "2007-01-03 09:00:00  0.000000          19.0 2007-01-03 09:00:00\n",
+      "                     distance  matched_data               index\n",
+      "2007-01-01 06:00:00  0.000000           1.0 2007-01-01 06:00:00\n",
+      "2007-01-02 06:00:00  0.000000           5.0 2007-01-02 06:00:00\n",
+      "2007-01-03 10:00:00  0.083333          10.0 2007-01-03 12:00:00\n"
+     ]
+    }
+   ],
+   "source": [
+    "# match time series, 2 hour window and drop duplicates\n",
+    "matched = df_match(ref_df, match_df1, match_df2, window=2/24., dropduplicates=True)\n",
+    "\n",
+    "print(matched[0])\n",
+    "\n",
+    "print(matched[1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
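
The nearest neighbor search with a tolerance window described in the notebook can be approximated with pandas alone. The following is a minimal sketch, not pytesmo's implementation: it assumes `pandas.merge_asof` as a stand-in for `df_match`, and the names `other`, `matched_time` and `deduplicated` are hypothetical. It reuses the timestamps from the duplicated-neighbor cell, with a 6 hour tolerance chosen only for illustration.

```python
import pandas as pd

# Irregular reference series (timestamps from the duplicated-neighbor cell)
ref_index = pd.to_datetime(["2007-01-01 04:00", "2007-01-01 06:00",
                            "2007-01-02 06:00", "2007-01-02 08:00"])
ref = pd.DataFrame({"data": range(len(ref_index))}, index=ref_index)

# Regular daily series sampled at 09:00; keep its timestamps as a column
# so they are still available after the merge (hypothetical column name)
other_index = pd.to_datetime(["2007-01-01 09:00", "2007-01-02 09:00",
                              "2007-01-03 09:00"])
other = pd.DataFrame({"matched_data": range(len(other_index)),
                      "matched_time": other_index}, index=other_index)

# Nearest neighbor search: for every reference timestamp, take the closest
# timestamp of the other series, but only within a 6 hour tolerance window
matched = pd.merge_asof(ref, other, left_index=True, right_index=True,
                        direction="nearest",
                        tolerance=pd.Timedelta(hours=6))

# Signed temporal distance in days (neighbor time minus reference time)
matched["distance"] = (
    matched["matched_time"] - matched.index.to_series()
) / pd.Timedelta(days=1)

# One possible way to remove duplicated neighbors: for each neighbor,
# keep only the reference timestamp that is closest in time
deduplicated = (
    matched.assign(abs_distance=matched["distance"].abs())
    .sort_values("abs_distance")
    .drop_duplicates("matched_time")
    .sort_index()
    .drop(columns="abs_distance")
)

print(matched)
print(deduplicated)
```

With these inputs, both reference timestamps on each day share the same 09:00 neighbor. Keeping the closest pair per neighbor leaves the 06:00 and 08:00 reference timestamps, the same rows the notebook prints for `dropduplicates=True`.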