\n
Using ML to define an astrometrically clean sample of stars
\n
Follows Gaia EDR3 performance verification paper DPACP-81 (Smart et al.) in classifying astrometric solutions as good or bad
via supervised ML. Employs a Random Forest classifier plus appropriately defined training sets - see
\n
https://arxiv.org/abs/2012.02061
\n
for further details. The workflow implemented here closely follows that described in Section 2, “GCNS Generation”
(GCNS = Gaia Catalogue of Nearby Stars) and is designed to clean up a 100pc (= nearby) sample.
\n
This version employs the newer, richer DataFrame API in PySpark ML.
\n
IMPORTANT NOTE: current deployment has Spark 2.4.7 installed. That specific version’s API is documented here:
\n
https://spark.apache.org/docs/2.4.7/ml-classification-regression.html#random-forest-classifier
\n
Be wary of help and examples from online message boards and other forums: more often than not they describe and link to different versions, and the API is evolving all the time.
\n
"
+ > }
+ > ]
+ > },
+ > "apps": [],
+ > "jobName": "paragraph_1613126076679_1211627861",
+ > "id": "20201013-131059_546082898",
+ > "dateCreated": "Feb 12, 2021 10:34:36 AM",
+ > "dateStarted": "Feb 15, 2021 8:15:17 AM",
+ > "dateFinished": "Feb 15, 2021 8:15:17 AM",
+ > "status": "FINISHED",
+ > "progressUpdateIntervalMs": 500
+ > }
+ > ....
+ > ....
+ > {
+ > "text": "%spark.pyspark\n\n# where are the NULLs in raw_sources features selection?\nfor feature in astrometric_features: print (spark.sql('SELECT COUNT(*) AS ' + feature + '_nulls FROM raw_sources WHERE ' + feature + ' IS NULL').show())\n# scan_direction_strength_k2 is the culprit!\n \n# alternatively could try:\n#Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}\n#Dict_Null\n \n",
+ > "user": "gaiauser",
+ > "dateUpdated": "Feb 13, 2021 6:14:46 PM",
+ > "config": {
+ > "editorSetting": {
+ > "language": "python",
+ > "editOnDblClick": false,
+ > "completionKey": "TAB",
+ > "completionSupport": true
+ > },
+ > "colWidth": 12,
+ > "editorMode": "ace/mode/python",
+ > "fontSize": 9,
+ > "results": {},
+ > "enabled": true
+ > },
+ > "settings": {
+ > "params": {},
+ > "forms": {}
+ > },
+ > "apps": [],
+ > "jobName": "paragraph_1613126076687_1356332997",
+ > "id": "20201124-171324_1960205489",
+ > "dateCreated": "Feb 12, 2021 10:34:36 AM",
+ > "dateStarted": "Feb 13, 2021 6:14:46 PM",
+ > "dateFinished": "Feb 13, 2021 6:29:25 PM",
+ > "status": "FINISHED",
+ > "errorMessage": "",
+ > "progressUpdateIntervalMs": 500
+ > }
+
+ #
+ # Select specific fields.
+ #
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FYRDDR17' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code
+ }
+ '
+
+ > {
+ > "title": "Paragraph 001",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 11:51:15 AM",
+ > "dateFinished": "Feb 15, 2021 11:51:17 AM"
+ > }
+ > ....
+ > ....
+
+ #
+ # Select text message lines that begin with '-'.
+ #
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FYRDDR17' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code,
+ output: (.results | select(.msg | length > 0) | .msg[] | select(.type == "TEXT") | .data | split("\n") | map(select(startswith("-"))))
+ }
+ '
+
+ > {
+ > "title": "Paragraph 001",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 11:51:15 AM",
+ > "dateFinished": "Feb 15, 2021 11:51:17 AM",
+ > "output": [
+ > "-rw-------. 1 fedora fedora 503 Feb 13 03:21 .bash_history",
+ > "-rw-r--r--. 1 fedora fedora 18 Feb 16 2019 .bash_logout",
+ > "-rw-r--r--. 1 fedora fedora 141 Feb 16 2019 .bash_profile",
+ > "-rw-r--r--. 1 fedora fedora 376 Feb 16 2019 .bashrc",
+ > "-rw-------. 1 fedora fedora 0 Feb 12 10:34 .scala_history"
+ > ]
+ > }
+
+ #
+ # Add the elapsed time calculation.
+ #
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FYRDDR17' \
+ | jq '.' \
+ | sed '
+ s/\("dateStarted":[[:space:]]*\)"\([[:alpha:]]*\)[[:space:]]*\([[:alpha:]]*\)[[:space:]]*\([[:digit:]]*\)[[:space:]]*\([[:digit:]]*:[[:digit:]]*:[[:digit:]]*\)[[:space:]]*\([[:alpha:]]*\)[[:space:]]*\([[:digit:]]*\)"/\1"\7 \3 \4 \5"/
+ s/\("dateFinished":[[:space:]]*\)"\([[:alpha:]]*\)[[:space:]]*\([[:alpha:]]*\)[[:space:]]*\([[:digit:]]*\)[[:space:]]*\([[:digit:]]*:[[:digit:]]*:[[:digit:]]*\)[[:space:]]*\([[:alpha:]]*\)[[:space:]]*\([[:digit:]]*\)"/\1"\7 \3 \4 \5"/
+ /"started":/ {
+ h
+ s/\([[:space:]]*\)"dateStarted":[[:space:]]*\("[^"]*"\).*$/\1\2/
+ x
+ }
+ /"finished":/ {
+ H
+ x
+ s/[[:space:]]*"dateFinished":[[:space:]]*\("[^"]*"\).*$/ \1/
+ s/\([[:space:]]*\)\(.*\)/\1echo "\1\\"elapsedTime\\": \\"$(datediff --format "%H:%M:%S" --input-format "%Y %b %d %H:%M:%S" \2)\\","/e
+ x
+ G
+ }
+ '
+
+ #
+ # !!!!! the dates are in a different format !
+ #
+
+ #
+ # Back to square one with the date format .. although they are not as bad as the previous ones.
+ # Might be able to do it without the sed processing step.
+ #
+
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FYRDDR17' \
+ | jq '.' \
+ | sed '
+ /"dateStarted":/ {
+ h
+ s/\([[:space:]]*\)"dateStarted":[[:space:]]*\("[^"]*"\).*$/\1\2/
+ x
+ }
+ /"dateFinished":/ {
+ H
+ x
+ s/[[:space:]]*"dateFinished":[[:space:]]*\("[^"]*"\).*$/ \1/
+ s/\([[:space:]]*\)\(.*\)/\1echo "\1\\"elapsedTime\\": \\"$(datediff --format "%H:%M:%S" --input-format "%b %d, %Y %H:%M:%S %p" \2)\\","/e
+ x
+ G
+ }
+ '
+
+ > {
+ > "status": "OK",
+ > "message": "",
+ > "body": {
+ > "paragraphs": [
+ > {
+ > "title": "Paragraph 001",
+ > "text": "%sh\nls -al\n",
+ > "user": "gaiauser",
+ > "dateUpdated": "Feb 15, 2021 11:51:15 AM",
+ > ....
+ > ....
+ > "id": "20210215-115033_1362835282",
+ > "dateCreated": "Feb 15, 2021 11:50:33 AM",
+ > "dateStarted": "Feb 15, 2021 11:51:15 AM",
+ > "dateFinished": "Feb 15, 2021 11:51:17 AM",
+ > "elapsedTime": "0:0:2",
+ > "status": "FINISHED",
+ > "progressUpdateIntervalMs": 500
+ > },
+ > ....
+ > ....
+ > "info": {}
+ > }
+ > }
+
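+ #
+ # Aside: the pipeline above calls datediff once per paragraph through the sed 'e' flag.
+ # The same arithmetic can be sketched as a small shell function. This is only a sketch:
+ # it assumes bash and GNU date, and assumes GNU date accepts the Zeppelin date strings
+ # as they appear above. The function name 'elapsed' is hypothetical, not part of these notes.
+ # The output is formatted as unpadded H:M:S to match the "elapsedTime" values shown above.
+ #

```shell
# Hypothetical helper: print the time between two Zeppelin date strings as H:M:S.
# Assumes GNU date; --utc avoids any daylight-saving shift between the two values.
elapsed() {
    local start finish delta
    start=$(date --utc --date "$1" '+%s')
    finish=$(date --utc --date "$2" '+%s')
    delta=$(( finish - start ))
    printf '%d:%d:%d\n' \
        $(( delta / 3600 )) \
        $(( (delta % 3600) / 60 )) \
        $(( delta % 60 ))
}

# Dates copied from the "Paragraph 001" output above.
elapsed 'Feb 15, 2021 11:51:15 AM' 'Feb 15, 2021 11:51:17 AM'
```

+ #
+ # This does not remove the sed step on its own, it only replaces the datediff call.
+ #
+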
+ #
+ # Add the field selection and output parser.
+ #
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FYRDDR17' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code,
+ output: (.results | select(.msg | length > 0) | .msg[] | select(.type == "TEXT") | .data | split("\n") | map(select(startswith("-"))))
+ }
+ ' \
+ | sed '
+ /"dateStarted":/ {
+ h
+ s/\([[:space:]]*\)"dateStarted":[[:space:]]*\("[^"]*"\).*$/\1\2/
+ x
+ }
+ /"dateFinished":/ {
+ H
+ x
+ s/[[:space:]]*"dateFinished":[[:space:]]*\("[^"]*"\).*$/ \1/
+ s/\([[:space:]]*\)\(.*\)/\1echo "\1\\"elapsedTime\\": \\"$(datediff --format "%H:%M:%S" --input-format "%b %d, %Y %H:%M:%S %p" \2)\\","/e
+ x
+ G
+ }
+ '
+
+ > {
+ > "title": "Paragraph 001",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 11:51:15 AM",
+ > "dateFinished": "Feb 15, 2021 11:51:17 AM",
+ > "elapsedTime": "0:0:2",
+ > "output": [
+ > "-rw-------. 1 fedora fedora 503 Feb 13 03:21 .bash_history",
+ > "-rw-r--r--. 1 fedora fedora 18 Feb 16 2019 .bash_logout",
+ > "-rw-r--r--. 1 fedora fedora 141 Feb 16 2019 .bash_profile",
+ > "-rw-r--r--. 1 fedora fedora 376 Feb 16 2019 .bashrc",
+ > "-rw-------. 1 fedora fedora 0 Feb 12 10:34 .scala_history"
+ > ]
+ > }
+
+
+# -----------------------------------------------------
+# Try it on the real notebook.
+#[user@desktop]
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FX82FMTH' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code,
+ output: (.results | select(.msg | length > 0) | .msg[] | select(.type == "TEXT") | .data | split("\n") | map(select(startswith("-"))))
+ }
+ ' \
+ | sed '
+ /"dateStarted":/ {
+ h
+ s/\([[:space:]]*\)"dateStarted":[[:space:]]*\("[^"]*"\).*$/\1\2/
+ x
+ }
+ /"dateFinished":/ {
+ H
+ x
+ s/[[:space:]]*"dateFinished":[[:space:]]*\("[^"]*"\).*$/ \1/
+ s/\([[:space:]]*\)\(.*\)/\1echo "\1\\"elapsedTime\\": \\"$(datediff --format "%H:%M:%S" --input-format "%b %d, %Y %H:%M:%S %p" \2)\\","/e
+ x
+ G
+ }
+ '
+
+ > {
+ > "title": "Select sources",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 2:25:36 PM",
+ > "dateFinished": "Feb 15, 2021 2:27:03 PM",
+ > "elapsedTime": "0:1:27",
+ > "output": []
+ > }
+ > {
+ > "title": "Hertzsprung-Russell diagram",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 2:27:03 PM",
+ > "dateFinished": "Feb 15, 2021 2:30:44 PM",
+ > "elapsedTime": "0:3:41",
+ > "output": []
+ > }
+ > {
+ > "title": "Selecting training data",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 2:30:44 PM",
+ > "dateFinished": "Feb 15, 2021 2:37:18 PM",
+ > "elapsedTime": "0:6:34",
+ > "output": []
+ > }
+
+ #
+ # Missing some elements ..
+ #
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FX82FMTH' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code
+ }
+ ' \
+ | sed '
+ /"dateStarted":/ {
+ h
+ s/\([[:space:]]*\)"dateStarted":[[:space:]]*\("[^"]*"\).*$/\1\2/
+ x
+ }
+ /"dateFinished":/ {
+ H
+ x
+ s/[[:space:]]*"dateFinished":[[:space:]]*\("[^"]*"\).*$/ \1/
+ s/\([[:space:]]*\)\(.*\)/\1echo "\1\\"elapsedTime\\": \\"$(datediff --format "%H:%M:%S" --input-format "%b %d, %Y %H:%M:%S %p" \2)\\","/e
+ x
+ G
+ }
+ '
+
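+ #
+ # A possible explanation for the missing paragraphs, offered as a sketch rather than a diagnosis.
+ # In jq, an object constructor emits nothing when one of its value expressions yields no output,
+ # so the select() calls in the 'output:' expression can drop a whole paragraph.
+ # The sample JSON below is made up for illustration and is not taken from the real notebook.
+ # Wrapping the expression in [ ... ] collects the matches into an array that may be empty,
+ # at the cost of changing the shape of the 'output' field.
+ #

```shell
# Hypothetical two-paragraph sample: the second paragraph has no results at all.
sample='{"body":{"paragraphs":[
  {"title":"has output","results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"1724028"}]}},
  {"title":"no output"}
]}}'

# Same style as the filter used above: select() yields nothing for the
# second paragraph, so no object is emitted for it.
dropped=$(printf '%s' "$sample" | jq --compact-output '
    .body.paragraphs[] | {
        title,
        output: (.results | select(.msg | length > 0) | .msg[] | select(.type == "TEXT") | .data)
        }
    ')

# Array form: the matches are collected, so a paragraph without TEXT
# output still produces an object, with an empty array.
kept=$(printf '%s' "$sample" | jq --compact-output '
    .body.paragraphs[] | {
        title,
        output: [ .results.msg[]? | select(.type == "TEXT") | .data ]
        }
    ')

printf 'select form:\n%s\n' "$dropped"
printf 'array form:\n%s\n' "$kept"
```

+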
+ # Skips rows if output is null ?
+
+ curl \
+ --silent \
+ --cookie '/tmp/cookies' \
+ 'http://128.232.227.222:8080/api/notebook/2FX82FMTH' \
+ | jq '
+ .body.paragraphs[] | {
+ title,
+ status,
+ dateStarted,
+ dateFinished,
+ status: .results.code,
+ output: (.results | select(.msg | length > 0) | .msg[] | select(.type == "TEXT"))
+ }
+ '
+
+ > {
+ > "title": "Select sources",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 2:25:36 PM",
+ > "dateFinished": "Feb 15, 2021 2:27:03 PM",
+ > "output": {
+ > "type": "TEXT",
+ > "data": "1724028"
+ > }
+ > }
+ > {
+ > "title": "Hertzsprung-Russell diagram",
+ > "status": "SUCCESS",
+ > "dateStarted": "Feb 15, 2021 2:27:03 PM",
+ > "dateFinished": "Feb 15, 2021 2:30:44 PM",
+ > "output": {
+ > "type": "TEXT",
+ > "data": "