Commit

Add temporal matching example to documentation
sebhahn committed Jul 17, 2019
1 parent 1ab44ec commit bcaeea4
Showing 3 changed files with 618 additions and 0 deletions.
8 changes: 8 additions & 0 deletions docs/examples.rst
@@ -7,6 +7,14 @@ Examples
validation_framework.rst


Temporal matching of time series
================================
The following example explains temporal matching of time series.

.. include::
temporal_matching.rst


Calculating anomalies and climatologies
=======================================

354 changes: 354 additions & 0 deletions docs/temporal_matching.ipynb
@@ -0,0 +1,354 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is temporal matching important?\n",
"------------------------------------------------------\n",
"Satellite observations usually have an irregular temporal sampling pattern (intervals between 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. On the other hand, in-situ instruments or land surface models generally sample on regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between the time series coming different sources, it is required that observation pairs (or triplets, etc.) are found which (nearly) coincide in time. A simple way to identify such pairs is by using a nearest neighbor search. First, one time series needs to be selected as temporal reference (i.e. all other time series will be matched to this reference) and second, a tolerance window (typically around 1-12 hours) has to be defined characterizing the temporal correlation of neighboring observation (i.e. observations outside of the tolerance window are no longer be considered as representative neighbors). An important special case may occur during the nearest neighbor search, which leads to duplicated neighbors. Depending on the application and use-case, the user needs to decide whether to keep the duplicates or to remove them before computing any statistics."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matching two time series\n",
"------------------------------------\n",
"The following examples shows how to match two time series with regular and irregular temporal sampling."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 0.375 0 2007-01-01 09:00:00\n",
"2007-01-02 0.375 1 2007-01-02 09:00:00\n",
"2007-01-03 0.375 2 2007-01-03 09:00:00\n",
"2007-01-04 0.375 3 2007-01-04 09:00:00\n",
"2007-01-05 0.375 4 2007-01-05 09:00:00\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from pytesmo.temporal_matching import df_match\n",
"\n",
"# create reference time series as dataframe\n",
"ref_index = pd.date_range(\"2007-01-01\", \"2007-01-05\", freq=\"D\")\n",
"ref_data = np.arange(len(ref_index))\n",
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
"\n",
"# create other time series as dataframe\n",
"match_index = pd.date_range(\"2007-01-01 09:00:00\", \"2007-01-05 09:00:00\", freq=\"D\")\n",
"match_data = np.arange(len(match_index))\n",
"match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
"\n",
"# match time series\n",
"matched = df_match(ref_df, match_df)\n",
"\n",
"# test if data and index are correct\n",
"print(matched)\n",
"np.testing.assert_allclose(5 * [9/24.], matched.distance.values)\n",
"np.testing.assert_allclose(np.arange(5), matched.matched_data)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 0.166667 0.0 2007-01-01 04:00:00\n",
"2007-01-02 -0.083333 1.0 2007-01-01 22:00:00\n",
"2007-01-03 NaN NaN NaT\n",
"2007-01-04 NaN NaN NaT\n",
"2007-01-05 NaN NaN NaT\n"
]
}
],
"source": [
"# create other (irregular) time series as dataframe\n",
"match_irr_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 22:00:00\", \n",
" \"2007-01-02 06:00:00\", \"2007-01-03 12:00:00\"])\n",
"match_irr_data = np.arange(len(match_irr_index))\n",
"match_irr_df = pd.DataFrame({\"matched_data\": match_irr_data}, index=match_irr_index)\n",
"\n",
"# match time series with 8 hour time window\n",
"matched = df_match(ref_df, match_irr_df, window=8/24.)\n",
"\n",
"# test if data and index are correct\n",
"print(matched)\n",
"np.testing.assert_allclose([4/24., -2/24., np.nan, np.nan, np.nan], matched.distance.values)\n",
"np.testing.assert_allclose([0, 1, np.nan, np.nan, np.nan], matched.matched_data)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 0.166667 0.0 2007-01-01 04:00:00\n",
"2007-01-02 -0.083333 1.0 2007-01-01 22:00:00\n"
]
}
],
"source": [
"# match time series with 8 hour time window and drop nan\n",
"matched = df_match(ref_df, match_irr_df, window=8/24., dropna=True)\n",
"\n",
"# test if data and index are correct\n",
"print(matched)\n",
"np.testing.assert_allclose([4/24., -2/24.], matched.distance.values)\n",
"np.testing.assert_allclose([0, 1], matched.matched_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Special case of duplicated neighbor\n",
"---------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 04:00:00 0.208333 0 2007-01-01 09:00:00\n",
"2007-01-01 06:00:00 0.125000 0 2007-01-01 09:00:00\n",
"2007-01-02 06:00:00 0.125000 1 2007-01-02 09:00:00\n",
"2007-01-02 08:00:00 0.041667 1 2007-01-02 09:00:00\n"
]
}
],
"source": [
"# create reference time series as dataframe\n",
"ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n",
" \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\"])\n",
"ref_data = np.arange(len(ref_index))\n",
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
"\n",
"# create other time series as dataframe\n",
"ref_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"3h\")\n",
"match_data = np.arange(len(match_index))\n",
"match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
"\n",
"# match time series\n",
"matched = df_match(ref_df, match_df)\n",
"\n",
"print(matched)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 06:00:00 0.125000 0 2007-01-01 09:00:00\n",
"2007-01-02 08:00:00 0.041667 1 2007-01-02 09:00:00\n"
]
}
],
"source": [
"# match time series and drop duplicates\n",
"matched = df_match(ref_df, match_df, dropduplicates=True)\n",
"\n",
"print(matched)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matching three or more time series\n",
"--------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 04:00:00 -0.041667 1 2007-01-01 03:00:00\n",
"2007-01-01 06:00:00 0.000000 2 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 10 2007-01-02 06:00:00\n",
"2007-01-02 08:00:00 0.041667 11 2007-01-02 09:00:00\n",
"2007-01-03 09:00:00 0.000000 19 2007-01-03 09:00:00\n",
"2007-01-03 10:00:00 -0.041667 19 2007-01-03 09:00:00\n",
" distance matched_data index\n",
"2007-01-01 04:00:00 0.083333 1 2007-01-01 06:00:00\n",
"2007-01-01 06:00:00 0.000000 1 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 5 2007-01-02 06:00:00\n",
"2007-01-02 08:00:00 -0.083333 5 2007-01-02 06:00:00\n",
"2007-01-03 09:00:00 -0.125000 9 2007-01-03 06:00:00\n",
"2007-01-03 10:00:00 0.083333 10 2007-01-03 12:00:00\n"
]
}
],
"source": [
"# create reference time series as dataframe\n",
"ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n",
" \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\",\n",
" \"2007-01-03 09:00:00\", \"2007-01-03 10:00:00\"])\n",
"ref_data = np.arange(len(ref_index))\n",
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n",
"\n",
"# create other time series as dataframe\n",
"match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"3h\")\n",
"match_data = np.arange(len(match_index))\n",
"match_df1 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
"\n",
"# create other time series as dataframe\n",
"match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"6h\")\n",
"match_data = np.arange(len(match_index))\n",
"match_df2 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n",
"\n",
"# match time series\n",
"matched = df_match(ref_df, match_df1, match_df2)\n",
"\n",
"print(matched[0])\n",
"\n",
"print(matched[1])"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 04:00:00 -0.041667 1 2007-01-01 03:00:00\n",
"2007-01-01 06:00:00 0.000000 2 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 10 2007-01-02 06:00:00\n",
"2007-01-02 08:00:00 0.041667 11 2007-01-02 09:00:00\n",
"2007-01-03 09:00:00 0.000000 19 2007-01-03 09:00:00\n",
" distance matched_data index\n",
"2007-01-01 06:00:00 0.000000 1 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 5 2007-01-02 06:00:00\n",
"2007-01-03 09:00:00 -0.125000 9 2007-01-03 06:00:00\n",
"2007-01-03 10:00:00 0.083333 10 2007-01-03 12:00:00\n"
]
}
],
"source": [
"# match time series and drop duplicates\n",
"matched = df_match(ref_df, match_df1, match_df2, dropduplicates=True)\n",
"\n",
"print(matched[0])\n",
"\n",
"print(matched[1])"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" distance matched_data index\n",
"2007-01-01 04:00:00 -0.041667 1.0 2007-01-01 03:00:00\n",
"2007-01-01 06:00:00 0.000000 2.0 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 10.0 2007-01-02 06:00:00\n",
"2007-01-02 08:00:00 0.041667 11.0 2007-01-02 09:00:00\n",
"2007-01-03 09:00:00 0.000000 19.0 2007-01-03 09:00:00\n",
" distance matched_data index\n",
"2007-01-01 06:00:00 0.000000 1.0 2007-01-01 06:00:00\n",
"2007-01-02 06:00:00 0.000000 5.0 2007-01-02 06:00:00\n",
"2007-01-03 10:00:00 0.083333 10.0 2007-01-03 12:00:00\n"
]
}
],
"source": [
"# match time series, 2 hour window and drop duplicates\n",
"matched = df_match(ref_df, match_df1, match_df2, window=2/24., dropduplicates=True)\n",
"\n",
"print(matched[0])\n",
"\n",
"print(matched[1])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
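
The nearest neighbor search, tolerance window and duplicate handling described in the notebook's introduction can be sketched with plain pandas. This is a minimal illustration and not how `df_match` is implemented: the helpers `nearest_match` and `drop_duplicate_neighbours` are hypothetical names, the use of `pd.merge_asof` is a substitute technique, and dropping unmatched rows together with duplicates is a choice made only for this sketch. As in the notebook, the window is given in days and time stamps are assumed to be sorted and timezone-naive.

```python
import pandas as pd


def nearest_match(ref_index, other, window=None):
    """Nearest-neighbour match of `other` onto the time stamps in `ref_index`.

    Hypothetical helper, not pytesmo code. `window` is a tolerance in days;
    neighbours farther away are reported as missing (NaN/NaT).
    Assumes sorted, timezone-naive time stamps.
    """
    # Cast both keys to the same resolution, since merge_asof needs equal dtypes.
    left = pd.DataFrame(
        {"time": pd.DatetimeIndex(ref_index).astype("datetime64[ns]")})
    right = other.copy()
    right.index = pd.DatetimeIndex(right.index).astype("datetime64[ns]")
    right["index"] = right.index  # keep the matched time stamp as a column
    right = right.rename_axis("time").reset_index()

    tolerance = None if window is None else pd.Timedelta(days=window)
    matched = pd.merge_asof(left, right, on="time",
                            direction="nearest", tolerance=tolerance)
    # Signed distance in days, positive if the neighbour is later.
    matched["distance"] = (
        (matched["index"] - matched["time"]) / pd.Timedelta(days=1))
    return matched.set_index("time")


def drop_duplicate_neighbours(matched):
    """Keep, per matched observation, only the closest reference time stamp.

    Unmatched rows are dropped as well (a choice made for this sketch).
    """
    found = matched.dropna(subset=["index"])
    order = found["distance"].abs().sort_values(kind="stable").index
    return found.loc[order].drop_duplicates(subset="index").sort_index()


# Irregular series matched to a daily reference with an 8 hour window.
ref_index = pd.date_range("2007-01-01", "2007-01-05", freq="D")
irregular = pd.DataFrame(
    {"matched_data": range(4)},
    index=pd.to_datetime(["2007-01-01 04:00:00", "2007-01-01 22:00:00",
                          "2007-01-02 06:00:00", "2007-01-03 12:00:00"]))
matched = nearest_match(ref_index, irregular, window=8 / 24.)
print(matched)

# Two reference time stamps per day share the same daily neighbour.
dup_ref = pd.to_datetime(["2007-01-01 04:00:00", "2007-01-01 06:00:00",
                          "2007-01-02 06:00:00", "2007-01-02 08:00:00"])
daily = pd.DataFrame(
    {"matched_data": range(5)},
    index=pd.date_range("2007-01-01 09:00:00", "2007-01-05 09:00:00", freq="D"))
unique = drop_duplicate_neighbours(nearest_match(dup_ref, daily))
print(unique)
```

With the same inputs as the notebook cells, the first call should give distances of 4/24 and -2/24 days followed by three unmatched rows, and the second should keep only the 06:00 and 08:00 reference time stamps.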