Commit
Add temporal matching example to documentation
Showing 3 changed files with 618 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,354 @@ | ||
{ | ||
"cells": [ | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Why is temporal matching important?\n", | |
"------------------------------------------------------\n", | |
"Satellite observations usually have an irregular temporal sampling pattern (intervals between 6 and 36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. In-situ instruments and land surface models, on the other hand, generally sample at regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). In order to compute error/performance statistics (such as RMSD, bias, correlation) between time series from different sources, observation pairs (or triplets, etc.) that (nearly) coincide in time must be found. A simple way to identify such pairs is a nearest neighbor search. First, one time series is selected as the temporal reference (i.e. all other time series are matched to this reference). Second, a tolerance window (typically around 1-12 hours) is defined, which characterizes the temporal correlation of neighboring observations (i.e. observations outside of the tolerance window are no longer considered representative neighbors). An important special case can occur during the nearest neighbor search: when several reference time stamps share the same nearest neighbor, that neighbor is matched more than once (duplicated neighbors). Depending on the application and use case, the user needs to decide whether to keep the duplicates or to remove them before computing any statistics." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Matching two time series\n", | |
"------------------------------------\n", | |
"The following examples show how to match two time series with regular and irregular temporal sampling. The tolerance window is passed in days (e.g. 8/24. for 8 hours), and the reported distance uses the same unit." | |
] | |
}, | |
{ | ||
"cell_type": "code", | ||
"execution_count": 42, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 0.375 0 2007-01-01 09:00:00\n", | ||
"2007-01-02 0.375 1 2007-01-02 09:00:00\n", | ||
"2007-01-03 0.375 2 2007-01-03 09:00:00\n", | ||
"2007-01-04 0.375 3 2007-01-04 09:00:00\n", | ||
"2007-01-05 0.375 4 2007-01-05 09:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import numpy as np\n", | ||
"import pandas as pd\n", | ||
"\n", | ||
"from pytesmo.temporal_matching import df_match\n", | ||
"\n", | ||
"# create reference time series as dataframe\n", | ||
"ref_index = pd.date_range(\"2007-01-01\", \"2007-01-05\", freq=\"D\")\n", | ||
"ref_data = np.arange(len(ref_index))\n", | ||
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n", | ||
"\n", | ||
"# create other time series as dataframe\n", | ||
"match_index = pd.date_range(\"2007-01-01 09:00:00\", \"2007-01-05 09:00:00\", freq=\"D\")\n", | ||
"match_data = np.arange(len(match_index))\n", | ||
"match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n", | ||
"\n", | ||
"# match time series\n", | ||
"matched = df_match(ref_df, match_df)\n", | ||
"\n", | ||
"# test if data and index are correct\n", | ||
"print(matched)\n", | ||
"np.testing.assert_allclose(5 * [9/24.], matched.distance.values)\n", | ||
"np.testing.assert_allclose(np.arange(5), matched.matched_data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 43, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 0.166667 0.0 2007-01-01 04:00:00\n", | ||
"2007-01-02 -0.083333 1.0 2007-01-01 22:00:00\n", | ||
"2007-01-03 NaN NaN NaT\n", | ||
"2007-01-04 NaN NaN NaT\n", | ||
"2007-01-05 NaN NaN NaT\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# create other (irregular) time series as dataframe\n", | ||
"match_irr_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 22:00:00\", \n", | ||
" \"2007-01-02 06:00:00\", \"2007-01-03 12:00:00\"])\n", | ||
"match_irr_data = np.arange(len(match_irr_index))\n", | ||
"match_irr_df = pd.DataFrame({\"matched_data\": match_irr_data}, index=match_irr_index)\n", | ||
"\n", | ||
"# match time series with 8 hour time window\n", | ||
"matched = df_match(ref_df, match_irr_df, window=8/24.)\n", | ||
"\n", | ||
"# test if data and index are correct\n", | ||
"print(matched)\n", | ||
"np.testing.assert_allclose([4/24., -2/24., np.nan, np.nan, np.nan], matched.distance.values)\n", | ||
"np.testing.assert_allclose([0, 1, np.nan, np.nan, np.nan], matched.matched_data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 44, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 0.166667 0.0 2007-01-01 04:00:00\n", | ||
"2007-01-02 -0.083333 1.0 2007-01-01 22:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# match time series with 8 hour time window and drop nan\n", | ||
"matched = df_match(ref_df, match_irr_df, window=8/24., dropna=True)\n", | ||
"\n", | ||
"# test if data and index are correct\n", | ||
"print(matched)\n", | ||
"np.testing.assert_allclose([4/24., -2/24.], matched.distance.values)\n", | ||
"np.testing.assert_allclose([0, 1], matched.matched_data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Special case of duplicated neighbor\n", | ||
"---------------------------------------------------" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 45, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 04:00:00 0.208333 0 2007-01-01 09:00:00\n", | ||
"2007-01-01 06:00:00 0.125000 0 2007-01-01 09:00:00\n", | ||
"2007-01-02 06:00:00 0.125000 1 2007-01-02 09:00:00\n", | ||
"2007-01-02 08:00:00 0.041667 1 2007-01-02 09:00:00\n" | ||
] | ||
} | ||
], | |
"source": [ | |
"# create reference time series as dataframe\n", | |
"ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n", | |
" \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\"])\n", | |
"ref_data = np.arange(len(ref_index))\n", | |
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n", | |
"\n", | |
"# create other time series as dataframe (daily at 09:00, so the two reference\n", | |
"# time stamps of each day share the same nearest neighbor)\n", | |
"match_index = pd.date_range(\"2007-01-01 09:00:00\", \"2007-01-05 09:00:00\", freq=\"D\")\n", | |
"match_data = np.arange(len(match_index))\n", | ||
"match_df = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n", | ||
"\n", | ||
"# match time series\n", | ||
"matched = df_match(ref_df, match_df)\n", | ||
"\n", | ||
"print(matched)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 46, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 06:00:00 0.125000 0 2007-01-01 09:00:00\n", | ||
"2007-01-02 08:00:00 0.041667 1 2007-01-02 09:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# match time series and drop duplicates\n", | ||
"matched = df_match(ref_df, match_df, dropduplicates=True)\n", | ||
"\n", | ||
"print(matched)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Matching three or more time series\n", | ||
"--------------------------------------------------" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 47, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 04:00:00 -0.041667 1 2007-01-01 03:00:00\n", | ||
"2007-01-01 06:00:00 0.000000 2 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 10 2007-01-02 06:00:00\n", | ||
"2007-01-02 08:00:00 0.041667 11 2007-01-02 09:00:00\n", | ||
"2007-01-03 09:00:00 0.000000 19 2007-01-03 09:00:00\n", | ||
"2007-01-03 10:00:00 -0.041667 19 2007-01-03 09:00:00\n", | ||
" distance matched_data index\n", | ||
"2007-01-01 04:00:00 0.083333 1 2007-01-01 06:00:00\n", | ||
"2007-01-01 06:00:00 0.000000 1 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 5 2007-01-02 06:00:00\n", | ||
"2007-01-02 08:00:00 -0.083333 5 2007-01-02 06:00:00\n", | ||
"2007-01-03 09:00:00 -0.125000 9 2007-01-03 06:00:00\n", | ||
"2007-01-03 10:00:00 0.083333 10 2007-01-03 12:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# create reference time series as dataframe\n", | ||
"ref_index = pd.to_datetime([\"2007-01-01 04:00:00\", \"2007-01-01 06:00:00\", \n", | ||
" \"2007-01-02 06:00:00\", \"2007-01-02 08:00:00\",\n", | ||
" \"2007-01-03 09:00:00\", \"2007-01-03 10:00:00\"])\n", | ||
"ref_data = np.arange(len(ref_index))\n", | ||
"ref_df = pd.DataFrame({\"data\": ref_data}, index=ref_index)\n", | ||
"\n", | ||
"# create other time series as dataframe\n", | ||
"match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"3h\")\n", | ||
"match_data = np.arange(len(match_index))\n", | ||
"match_df1 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n", | ||
"\n", | ||
"# create other time series as dataframe\n", | ||
"match_index = pd.date_range(\"2007-01-01 00:00:00\", \"2007-01-05 00:00:00\", freq=\"6h\")\n", | ||
"match_data = np.arange(len(match_index))\n", | ||
"match_df2 = pd.DataFrame({\"matched_data\": match_data}, index=match_index)\n", | ||
"\n", | ||
"# match time series\n", | ||
"matched = df_match(ref_df, match_df1, match_df2)\n", | ||
"\n", | ||
"print(matched[0])\n", | ||
"\n", | ||
"print(matched[1])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 48, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 04:00:00 -0.041667 1 2007-01-01 03:00:00\n", | ||
"2007-01-01 06:00:00 0.000000 2 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 10 2007-01-02 06:00:00\n", | ||
"2007-01-02 08:00:00 0.041667 11 2007-01-02 09:00:00\n", | ||
"2007-01-03 09:00:00 0.000000 19 2007-01-03 09:00:00\n", | ||
" distance matched_data index\n", | ||
"2007-01-01 06:00:00 0.000000 1 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 5 2007-01-02 06:00:00\n", | ||
"2007-01-03 09:00:00 -0.125000 9 2007-01-03 06:00:00\n", | ||
"2007-01-03 10:00:00 0.083333 10 2007-01-03 12:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# match time series and drop duplicates\n", | ||
"matched = df_match(ref_df, match_df1, match_df2, dropduplicates=True)\n", | ||
"\n", | ||
"print(matched[0])\n", | ||
"\n", | ||
"print(matched[1])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 49, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" distance matched_data index\n", | ||
"2007-01-01 04:00:00 -0.041667 1.0 2007-01-01 03:00:00\n", | ||
"2007-01-01 06:00:00 0.000000 2.0 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 10.0 2007-01-02 06:00:00\n", | ||
"2007-01-02 08:00:00 0.041667 11.0 2007-01-02 09:00:00\n", | ||
"2007-01-03 09:00:00 0.000000 19.0 2007-01-03 09:00:00\n", | ||
" distance matched_data index\n", | ||
"2007-01-01 06:00:00 0.000000 1.0 2007-01-01 06:00:00\n", | ||
"2007-01-02 06:00:00 0.000000 5.0 2007-01-02 06:00:00\n", | ||
"2007-01-03 10:00:00 0.083333 10.0 2007-01-03 12:00:00\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# match time series, 2 hour window and drop duplicates\n", | ||
"matched = df_match(ref_df, match_df1, match_df2, window=2/24., dropduplicates=True)\n", | ||
"\n", | ||
"print(matched[0])\n", | ||
"\n", | ||
"print(matched[1])" | ||
] | ||
} | |
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
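
The nearest neighbor search with a tolerance window described in the notebook's first cell can also be sketched with plain pandas. This is not how `df_match` is implemented; it is a minimal illustration that swaps in `pandas.merge_asof` with `direction="nearest"`, reusing the daily reference and the irregular series from the notebook. The column names `ref_time` and `other_time` are assumptions made for this sketch, and the distance is expressed in days (other minus reference) to mirror the notebook's printed output.

```python
import numpy as np
import pandas as pd

# reference series: one value per day at midnight (as in the notebook)
ref = pd.DataFrame({
    "ref_time": pd.date_range("2007-01-01", "2007-01-05", freq="D"),
    "data": np.arange(5),
})

# irregular series to be matched against the reference (as in the notebook)
other = pd.DataFrame({
    "other_time": pd.to_datetime(["2007-01-01 04:00:00", "2007-01-01 22:00:00",
                                  "2007-01-02 06:00:00", "2007-01-03 12:00:00"]),
    "matched_data": np.arange(4),
})

# merge_asof needs sorted keys of identical dtype, so cast both explicitly
ref["ref_time"] = ref["ref_time"].astype("datetime64[ns]")
other["other_time"] = other["other_time"].astype("datetime64[ns]")

# nearest neighbor in time, accepted only within an 8 hour tolerance window
matched = pd.merge_asof(
    ref, other,
    left_on="ref_time", right_on="other_time",
    direction="nearest",
    tolerance=pd.Timedelta(hours=8),
)

# signed distance in days; NaN where no neighbor fell inside the window
matched["distance"] = (matched["other_time"] - matched["ref_time"]) / pd.Timedelta(days=1)

print(matched)
```

Reference time stamps without a neighbor inside the window keep `NaT`/`NaN` in the matched columns, so calling `matched.dropna()` afterwards plays a similar role to the `dropna=True` option shown in the notebook.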