---
title: "FHIR for Research - Exercise 2: Kids First (R version)"
output:
  html_document:
    df_print: paged
---
## Learning Objectives and Key Concepts
Workshop attendees will learn how to query FHIR resources in various ways, to enable visualizing and analyzing data.
What will participants do as part of the exercise?
- Connecting to Kids First
- Fetching and Examining Demographic Data
- Finding a ResearchStudy
- Fetching Patients enrolled in a ResearchStudy
- Dealing with Extensions (e.g., age of onset)
- Identifying Patients with desired diagnosis and data elements across multiple studies/datasets
- Utilize APIs to explore the data (e.g., demographics)
- Utilize APIs for research analyses (e.g., phenotype analysis)
- Building Graphs from FHIR data
  - Demographics
  - Most Frequent Diagnoses
  - Age at Diagnosis
  - Overall Survival
### Icons in this Guide
📘 A link to a useful external reference related to the section the icon appears in
🖐 A hands-on section where you will code something or interact with the server
## Scenario
In this exercise we're going to explore how to access the data needed to generate the summary information from the Kids First dashboard in a few different ways. A snapshot of the Kids First dashboard is shown below:
![KF Dashboard](img/kf_dashboard.png)
The Kids First Data Portal is accessible at https://portal.kidsfirstdrc.org/explore (login required, though signup is free with any Google account).
For this exercise we'll be focusing on the following 4 graphs:
- Demographics
- Most frequent diagnoses
- Age at diagnosis
- Overall survival
(Note that the image shown depicts the statistics for the entire Kids First population, whereas all graphs in this exercise will be based on specific sub-cohorts of the population, so the graphs we generate today will look a little different.)
## Environment setup
Load needed libraries:
```{r setup}
library(fhircrackr)
# Support cookie authentication required for access to Kids First data
kf_cookie_url <<- "https://github.com/mitre/fhir-exercises/raw/main/kf_cookie.txt"
source("exercise_2_fhircrackr_patch.R")
library(tidyverse)
library(skimr)
library(summarytools)
library(table1)
# Used for direct RESTful queries against the FHIR server
library(httr)
library(jsonlite)
# Visualizations
library(ggthemes)
theme_set(ggthemes::theme_economist_white())
# Survival analysis
library(survival)
library(survminer)
```
Kids First uses an [HTTP cookie](https://en.wikipedia.org/wiki/HTTP_cookie) for authentication, which isn't supported natively by `fhircrackr`. The `setup` block above loads a patched version of `fhircrackr` to support this.
If you see the message "Could not authenticate with Kids First. The cookie may need to be updated"
when running the code block above, then let the instructors know ASAP so they can fetch a new cookie, or [see these instructions to fetch a cookie](https://github.com/kids-first/kf-api-fhir-service#authenticate-to-access-server-environment) and then re-run the setup block above.
## 1. Demographics
Our first step will be to show how to review basic demographic information for a patient cohort. Let's explore a few approaches for constructing a patient cohort.
### 1.1. Just the first N patients on the server
For the simplest example, let's just query for the first set of Patients on the server and see what that looks like.
🖐 Knowledge Check: Fill in the query to select Patients on the server.
(Note that there are over 10,000 Patient resources on this server, so we don't want to query them all or follow all the pagination. For performance reasons, all the examples in this notebook are intended to run with only a single page of results, but in a real-world use case, you would want to follow the pagination as shown in the previous exercise, to make sure you fetched all the requested data for a given query.)
```{r}
fhir_server <- "https://kf-api-fhir-service.kidsfirstdrc.org"
request <- fhir_url(______________)
patient_bundle <- fhir_search(request = request, max_bundles = 10)
```
Let's filter the bundle down to just the first Patient resource to see what it contains:
```{r}
xml2::xml_find_first(x = patient_bundle[[1]], xpath = "./entry[1]/resource") %>%
  paste0 %>%
  cat
```
Looking at this XML, it appears to contain the data to construct a data frame of patients with some basic demographics:
|`id`|`gender`|`race`|`ethnicity`|
|-|-|-|-|
|103070|male|Not Reported|Not Reported|
|...|...|...|...|
Gender is relatively easy to extract, but race and ethnicity are a little trickier because they are recorded as _extensions_. Extensions are used to represent information that is not part of the basic definition of a resource.
Every element in a resource or data type includes an optional "extension" child element that may be present any number of times. Extensions contain a defining `url` and either a `value[x]` or sub-extensions (but not both).
This brings us to choice types, such as that `value[x]`. Choice types allow different instances to use different data types as appropriate. Only one of the choices is allowed at a time on a given resource instance.
A simple example of choice types is the `Patient.deceased[x]` field indicating if the individual is deceased or not. `deceased[x]` is allowed to be either a `boolean` or `dateTime`.
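
To make the naming convention concrete: in the JSON representation, the chosen type is appended to the field name, so `deceased[x]` appears as either `deceasedBoolean` or `deceasedDateTime`. The following is a minimal sketch in base R, using a hypothetical Patient represented as a list (the `id` and the date are invented for illustration):

```r
# Hypothetical Patient, shaped like the list a JSON parser would return
patient <- list(
  resourceType = "Patient",
  id = "example",
  deceasedDateTime = "2015-02-14T13:42:00+10:00"
)

# Find which variant of the deceased[x] choice type is present
deceased_field <- grep("^deceased", names(patient), value = TRUE)

# The suffix after "deceased" names the data type that was chosen
deceased_type <- sub("^deceased", "", deceased_field)
```

Here `deceased_field` holds the field name as it appears on the resource, and `deceased_type` tells you how to interpret its value.
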
Note that extensions are also allowed on primitive types. If you are looking at the JSON representation of FHIR resources (see Exercise 1), extensions on primitive types are represented by prepending the field name with an underscore `_` to create a new object-type field where the extension field can be added. The following example demonstrates the "birthTime" extension on the `Patient.birthDate` field:
```
{
  "resourceType": "Patient",
  ...
  "birthDate": "1987-06-05",
  "_birthDate": {
    "extension": [
      {
        "url": "http://hl7.org/fhir/StructureDefinition/patient-birthTime",
        "valueDateTime": "1987-06-05T04:32:01Z"
      }
    ]
  }
}
```
The XML version looks like this:
```
<birthDate value="1987-06-05">
  <extension url="http://hl7.org/fhir/StructureDefinition/patient-birthTime">
    <valueDateTime value="1987-06-05T04:32:01Z"/>
  </extension>
</birthDate>
```
We'll see more instances like this later in the exercise.
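
As a small illustration of how such an extension can be read, the sketch below parses the XML fragment shown above with `xml2` and pulls out both the primitive value and the extension value. It assumes a stand-alone fragment without an XML namespace; the variable names are hypothetical.

```r
library(xml2)

# Stand-alone copy of the XML example above (no namespace declared)
birth_date_xml <- '<birthDate value="1987-06-05">
  <extension url="http://hl7.org/fhir/StructureDefinition/patient-birthTime">
    <valueDateTime value="1987-06-05T04:32:01Z"/>
  </extension>
</birthDate>'

node <- read_xml(birth_date_xml)

# The primitive value lives in the `value` attribute of the element itself
birth_date <- xml_attr(node, "value")

# The extension is a child element, selected by its defining `url`
birth_time_node <- xml_find_first(
  node,
  "./extension[@url='http://hl7.org/fhir/StructureDefinition/patient-birthTime']/valueDateTime"
)
birth_time <- xml_attr(birth_time_node, "value")
```

The same pattern (select the `extension` by `url`, then step into its `value[x]` child) applies to extensions on complex elements as well.
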
📘[Read more about Extensions in FHIR](https://www.hl7.org/fhir/extensibility.html)
Getting back to race and ethnicity: these extensions are defined within [US Core](https://www.hl7.org/fhir/us/core/), an implementation guide that defines the base set of requirements for FHIR implementations in the US and reflects the ONC U.S. Core Data for Interoperability required data fields. Further details about US Core are outside the scope of this exercise, but for now understand that nearly all FHIR data within the US will use US Core.
Both the Race and Ethnicity extensions use sub-extensions to represent the concept in 3 possible ways:
- OMB Category, based on the OMB standards (<https://www.govinfo.gov/content/pkg/FR-1997-10-30/pdf/97-28653.pdf>)
  - `url` is "ombCategory"
  - `valueCoding` from the [OMB Race Categories ValueSet](https://hl7.org/fhir/us/core/STU4/ValueSet-omb-race-category.html) or [OMB Ethnicity Categories ValueSet](https://www.hl7.org/fhir/us/core/ValueSet-omb-ethnicity-category.html)
- Detailed, based on CDC Race and Ethnicity codes
  - `url` is "detailed"
  - `valueCoding` from the [Detailed race ValueSet](https://www.hl7.org/fhir/us/core/ValueSet-detailed-race.html) or [Detailed ethnicity ValueSet](https://www.hl7.org/fhir/us/core/ValueSet-detailed-ethnicity.html)
- Text, free text (required)
  - `url` is "text"
  - `valueString` is free text
📘[Read more about the US Core Race Extension](https://hl7.org/fhir/us/core/STU4/StructureDefinition-us-core-race.html)
📘[Read more about the US Core Ethnicity Extension](https://hl7.org/fhir/us/core/STU4/StructureDefinition-us-core-ethnicity.html)
----
Given the above, let's define the XPath queries needed to find the Race and Ethnicity on a Patient resource.
🖐 Fill in the blank XPath queries below to extract the race and ethnicity values out of the extensions on a Patient resource:
```{r}
# Identify which elements of the FHIR resource we want to capture in our data frame - see Exercise 0 for details
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  cols = c(
    id = "id",
    gender = "gender",
    race_string = str_c(
      "extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
      "/extension[@url=\"text\"]",
      "/valueString"
    ),
    # The resources we are working with store race and ethnicity as strings rather than
    # codes. If you did need to extract the codes, this is what the XPath queries would
    # look like (the codings live in the "ombCategory" or "detailed" sub-extensions):
    #
    # race_coding_display = str_c(
    #   "extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
    #   "/extension[@url=\"ombCategory\"]",
    #   "/valueCoding",
    #   "/display"
    # ),
    # race_coding_code = str_c(
    #   "extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
    #   "/extension[@url=\"ombCategory\"]",
    #   "/valueCoding",
    #   "/code"
    # ),
    # 🖐 Fill in the XPath query to extract the ethnicity from the `valueString` of the extension:
    ethnicity_string = str_c(
      _____
    )
  )
)
# Convert to R data frame
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient
```
Let's look at some descriptive statistics:
```{r}
df_patient %>% freq(gender)
```
```{r}
df_patient %>% freq(race_string)
```
```{r}
df_patient %>% freq(ethnicity_string)
```
This data frame can also easily produce charts:
```{r}
ggplot(df_patient, aes(x="", y=factor(1), fill=gender)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() +
  scale_fill_brewer(palette="Blues")
```
```{r}
ggplot(df_patient, aes(x="", y=factor(1), fill=race_string)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() +
  scale_fill_brewer(palette="Blues")
```
```{r}
ggplot(df_patient, aes(x="", y=factor(1), fill=ethnicity_string)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() +
  scale_fill_brewer(palette="Blues")
```
### 1.2. Patients with a given Condition
In the previous steps we reviewed what is essentially a random set of Patients, just the first set that the server returned when we asked for all Patients. Now let's get more targeted and query for just patients who have a diagnosis of a particular Condition. Then we can use the same process and functions we've already defined to analyze/visualize it.
Kids First uses the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/) for describing Conditions. Other servers may use different ontologies or code systems, such as SNOMED-CT, ICD-10, or others. A simple browser for finding Mondo codes by description is available at https://www.ebi.ac.uk/ols/ontologies/mondo . Using this browser, we can look at a few sample codes:
| code | description |
| --- | --- |
| MONDO:0005015 | diabetes mellitus |
| MONDO:0005961 | sinusitis |
| MONDO:0008903 | lung cancer |
| MONDO:0021640 | grade III glioma |
Let's use grade III glioma as our condition of interest, with **MONDO:0021640** as our code of interest going forward.
----
In Exercise 1 we saw an instance of basic querying, when we searched for MedicationRequests associated with a given Patient. (Reminder: `"{FHIR_SERVER}/MedicationRequest?patient=10098"`) This is one of the most basic and fundamental types of query, where we get resources from a server, filtered by some aspect of the resource itself. In the previous example with medications, the MedicationRequest resource has a reference back to the Patient in the `patient` field, so we can query that directly.
But what if we want to go in the other direction? For example, find all Patients that are taking a given Medication, or Patients that have been diagnosed with a given Condition?
Enter "chaining" and "reverse chaining". These are capabilities of FHIR that allow for more complex queries that can save a client and/or server from having to perform a series of operations.
The FHIR documentation offers the following examples of chaining:
> In order to save a client from performing a series of search operations, reference parameters may be "chained" by appending them with a period (.) followed by the name of a search parameter defined for the target resource. This can be done recursively, following a logical path through a graph of related resources, separated by `.`. For instance, given that the resource `DiagnosticReport` has a search parameter named `subject`, which is usually a reference to a `Patient` resource, and the `Patient` resource includes a parameter `name` which searches on patient name, then the search
>
> `GET [base]/DiagnosticReport?subject.name=peter`
>
> is a request to return all the lab reports that have a `subject` whose `name` includes "peter". Because the Diagnostic Report subject can be one of a set of different resources, it's necessary to limit the search to a particular type:
>
> `GET [base]/DiagnosticReport?subject:Patient.name=peter`
>
> This request returns all the lab reports that have a subject which is a patient, whose name includes "peter".
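
As a rough illustration of how such a chained query is assembled, the sketch below builds the two URLs quoted above with base R. The `build_search_url()` helper and the `https://example.org/fhir` base URL are hypothetical; they are not part of `fhircrackr` or the Kids First server.

```r
# Hypothetical helper: join a base URL, a resource type, and named search parameters
build_search_url <- function(base, resource, params) {
  query <- paste0(names(params), "=", unlist(params), collapse = "&")
  paste0(base, "/", resource, "?", query)
}

base <- "https://example.org/fhir"  # placeholder, not a real server

# Chain from DiagnosticReport.subject to the `name` search parameter of the target
chained <- build_search_url(base, "DiagnosticReport", list("subject.name" = "peter"))

# Same chain, restricted to subjects that are Patient resources
typed_chain <- build_search_url(base, "DiagnosticReport", list("subject:Patient.name" = "peter"))
```

The only difference between the two is the `:Patient` type modifier on the reference parameter.
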
In the case of "Patients diagnosed with a given Condition", we want the opposite direction - search for resources based on what links back to them. This is done with the `_has` search parameter.
The `_has` search parameter uses the colon character `:` to separate fields, and requires a few sub-parameters:
- the resource type to search for references back from
- the field on that resource which would link back to the current resource
- a field on that resource to filter by
A complete example is:
`[base]/Patient?_has:Observation:patient:code=1234-5`
This asks the server to return Patient resources where the Patient is referred to by at least one Observation that has a code of 1234-5, and where the Observation refers to the Patient via its `patient` search parameter.
Admittedly, the syntax is a little confusing. It may be easiest to read this query as "Get Patients that have an Observation that links back to this Patient and has a code of 1234-5".
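
One way to keep the pieces straight is to assemble the `_has` parameter name from its three parts. The following is a minimal sketch in base R; `has_param()` is a hypothetical helper, and the values are the ones from the Observation example.

```r
# Hypothetical helper: build the name of a reverse-chaining search parameter
has_param <- function(referring_resource, reference_param, filter_param) {
  paste("_has", referring_resource, reference_param, filter_param, sep = ":")
}

# "Observations whose `patient` reference points back at this Patient,
#  filtered on the Observation's `code`"
param_name <- has_param("Observation", "patient", "code")

# Combine with the resource being searched for and the value to filter by
query <- paste0("Patient?", param_name, "=", "1234-5")
```

Reading the arguments left to right mirrors the plain-language version of the query: which resource links back, through which reference, filtered by which field.
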
📘 [Read more about FHIR Search Chaining and Reverse Chaining](https://hl7.org/fhir/r4/search.html#chaining)
Let's use this approach to find Patients based on a diagnosis.
🖐 Fill in the search query (in the `parameters` argument) to find Patients that have a Condition of grade III glioma.
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list(________))
patient_bundle <- fhir_search(request = request, max_bundles = 1)
# Can use the same table description as we set up above
df_patient_glioma <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
```
Let's look at the descriptive statistics for the first 50 glioma patients. We'll use the excellent `table1` library this time:
```{r}
table1(~ gender + race_string + ethnicity_string, data = df_patient_glioma, overall = "Glioma")
```
### 1.3. Patients within a given Research Study
The Kids First portal comprises multiple research studies.
See more at <https://portal.kidsfirstdrc.org/studies> or <https://www.notion.so/Studies-and-Access-a5d2f55a8b40461eac5bf32d9483e90f>
In this step we'll explore how to query for patients specifically associated to one of these research studies. Let's pick the "Pediatric Brain Tumor Atlas: CBTTC" as an example, because it has a large number of participants.
First let's find the study we are interested in as a ResearchStudy. There are a few possible ways we can do this, for example a search on `ResearchStudy.title`, but we don't know that the title of the FHIR resource will match the titles in those other lists.
Let's list all the ResearchStudies on the server and see what we can find.
```{r}
request <- fhir_url(url = fhir_server, resource = "ResearchStudy")
research_study_bundle <- fhir_search(request = request)
```
Let's look at the XML for the first ResearchStudy resource instance returned:
```{r}
xml2::xml_find_first(x = research_study_bundle[[1]], xpath = "./entry[1]/resource") %>%
  paste0 %>%
  cat
```
Based on this, we can construct the XPath queries to pull these resources into a data frame:
```{r}
table_desc_research_study <- fhir_table_description(
  resource = "ResearchStudy",
  cols = c(
    id = "id",
    title = "title"
  )
)
# Convert to R data frame
df_study <- fhir_crack(bundles = research_study_bundle, design = table_desc_research_study, verbose = 0)
df_study
```
We want ID **76758**, which actually has title "Pediatric Brain Tumor Atlas - Children's Brain Tumor Tissue Consortium". We'll continue to use this ResearchStudy for future steps in this exercise.
```{r}
df_study %>% filter(id == 76758)
```
We can query for Patient resources by ResearchStudy via ResearchSubject resources, which link a Patient to a study (notice the reference to a Patient in the `individual` field), and again run our same analysis. (Hint: sounds like reverse-chaining again!)
🖐 Fill in the query to find Patients that are associated with ResearchStudy 76758:
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list(_______))
patient_bundle <- fhir_search(request = request, max_bundles = 1)
# Can use the same table description as we set up above
df_patient_study <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
table1(~ gender + race_string + ethnicity_string, data = df_patient_study, overall = "Study 76758")
```
## 2. Most Frequent Diagnoses
Our second step will be to show how to perform queries that enable basic prevalence analysis. Again there are a few different ways we can build a cohort for this.
In this step we'll be looking at diagnoses, which are represented by the Condition resource.
📘 Read more about the [FHIR Condition resource](https://www.hl7.org/fhir/condition.html).
### 2.1. Just the first conditions on the server
As before, let's start with the simplest possible approach of just selecting an unfiltered and unsorted set of Condition resources. This time, let's tell the server we want 250 Conditions.
(Why 250? In this case it's the most the server will return in one response.)
📘 Refresher: read more about [requesting a certain number of resources](https://www.hl7.org/fhir/search.html#count).
🖐 Fill in the query to select 250 Condition resources from the server
```{r}
request <- fhir_url(url = fhir_server, resource = "Condition", parameters = list(_____))
condition_bundle <- fhir_search(request = request, max_bundles = 1)
# The first few only had `code.text` - change the index in `entry[4]` to other integers until
# you see the expected nested `code.coding.code` structure
xml2::xml_find_first(x = condition_bundle[[1]], xpath = "./entry[4]/resource") %>%
  paste0 %>%
  cat
```
The key to what this Condition represents is nested within the `code` field, but there's a lot of information there. Let's dig into three very important types in FHIR: `code`, `Coding`, and `CodeableConcept`.
#### code
`code` is a FHIR primitive based on string. `code`s are generally taken from a controlled set of strings defined elsewhere, and are restricted in that `code`s may not contain leading whitespace, trailing whitespace, or more than 1 consecutive whitespace character. `"9283-4"` is an example of a `code`.
#### Coding
[`Coding`](https://www.hl7.org/fhir/datatypes.html#Coding) is a general purpose datatype that builds on top of `code`. A `Coding` is a representation of a defined concept using a symbol from a defined code system. `Coding` includes fields for `code`, the code `system` it comes from, the `version` of the system, a human-readable `display`, and `userSelected` to indicate if this coding was chosen directly by the user. An example `Coding`:
In JSON:
```
{
  "system": "http://snomed.info/sct",
  "code": "444814009",
  "display": "Viral sinusitis (disorder)"
}
```
In XML:
```
<coding>
  <system value="http://snomed.info/sct"/>
  <code value="444814009"/>
  <display value="Viral sinusitis (disorder)"/>
</coding>
```
#### CodeableConcept
[`CodeableConcept`](https://www.hl7.org/fhir/datatypes.html#CodeableConcept) is a general purpose datatype that builds further on top of `Coding`. A `CodeableConcept` represents a value that is usually supplied by providing a reference to one or more terminologies or ontologies but may also be defined by the provision of text. Most resources that are defined by specific clinical concepts will include a `CodeableConcept` type field.
`CodeableConcept` includes fields for an array of `coding`s and optional `text`.
An example `CodeableConcept` in JSON:
```
{
  "coding": [
    {
      "system": "http://snomed.info/sct",
      "code": "260385009",
      "display": "Negative"
    }, {
      "system": "https://acme.lab/resultcodes",
      "code": "NEG",
      "display": "Negative"
    }
  ],
  "text": "Negative for Chlamydia Trachomatis rRNA"
}
```
And in XML:
```
<valueCodeableConcept>
  <coding>
    <system value="http://snomed.info/sct"/>
    <code value="260385009"/>
    <display value="Negative"/>
  </coding>
  <coding>
    <system value="https://acme.lab/resultcodes"/>
    <code value="NEG"/>
    <display value="Negative"/>
  </coding>
  <text value="Negative for Chlamydia Trachomatis rRNA"/>
</valueCodeableConcept>
```
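
To show how the three types relate in practice, here is a minimal sketch in base R that works with the `CodeableConcept` from the JSON example, represented as a nested list. The two helper functions are hypothetical and only illustrate one reasonable way to pick a code or a label.

```r
# The CodeableConcept from the JSON example, as a nested list
codeable_concept <- list(
  coding = list(
    list(system = "http://snomed.info/sct", code = "260385009", display = "Negative"),
    list(system = "https://acme.lab/resultcodes", code = "NEG", display = "Negative")
  ),
  text = "Negative for Chlamydia Trachomatis rRNA"
)

# Hypothetical helper: return the code from a given code system, or NA if absent
code_for_system <- function(cc, system) {
  matches <- Filter(function(coding) identical(coding$system, system), cc$coding)
  if (length(matches) == 0) NA_character_ else matches[[1]]$code
}

# Hypothetical helper: prefer the free text, else the first coding's display
display_label <- function(cc) {
  if (!is.null(cc$text)) return(cc$text)
  if (length(cc$coding) > 0 && !is.null(cc$coding[[1]]$display)) {
    return(cc$coding[[1]]$display)
  }
  NA_character_
}

snomed_code <- code_for_system(codeable_concept, "http://snomed.info/sct")
loinc_code <- code_for_system(codeable_concept, "http://loinc.org")
label <- display_label(codeable_concept)
```

Matching on `system` before reading `code` matters because the same concept can be coded in several systems at once, and a bare code is ambiguous without the system it comes from.
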
In this case all we really want is a consistent human-readable display, so let's get these into a data frame and map that `code` field into something appropriate.
🖐 Fill in the XPath queries below to extract the `text` of the CodeableConcept, and the `code`, `display`, and `system` of the contained Coding.
```{r}
table_desc_condition <- fhir_table_description(
  resource = "Condition",
  cols = c(
    id = "id",
    patient_id = "subject/reference",
    codeableconcept_text = "___",
    coding_code = "___",
    coding_display = "___",
    coding_system = "___"
  )
)
# Convert to R data frame
df_condition <- fhir_crack(bundles = condition_bundle, design = table_desc_condition, verbose = 0)
df_condition
```
Now let's create a table of the top 10 most prevalent conditions:
```{r}
df_condition %>% count(codeableconcept_text, sort = TRUE) %>% head(10)
```
Now let's create a graph of the top 10 most prevalent conditions:
```{r}
ggplot(
  df_condition %>% count(codeableconcept_text, sort = TRUE) %>% head(10),
  aes(x = reorder(codeableconcept_text, n), y = n)
) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("Condition") +
  scale_y_continuous(breaks=c(0,2,4,6,8,10))
```
### 2.2. Patients in the Research Study
In the previous steps, we looked at just a random sampling of Conditions: the first 250 that the server happened to return. Now let's return to the Research Study and see how we can query for just those Conditions.
One might expect we can just chain even further, for example:
```
/Condition?subject._has:ResearchSubject:individual:study=76758
```
However, that's not going to work here. (It seems to hang the entire server for about 2 minutes, so please don't actually run it.)
Instead, let's combine two search concepts:
- get the Patients by ResearchStudy, as we saw before ("reverse chaining")
- include the Conditions that reference back to each Patient
We've seen how to find a resource, based on another resource that references it, but we haven't yet seen how to include multiple resource types in a single search. This leads us to new search parameters we haven't seen before: `_include` and `_revinclude`.
`_include` allows for including resources that the queried resource references out to. (For example, Condition references out to a Patient and Encounter)
`_revinclude`, i.e., "reverse include", allows for including resources that reference back to the queried resource. (For example, Patient is referenced by Condition)
These parameters specify a search parameter to search on, which includes 3 parts:
- The name of the source resource where the reference field exists
- The field name of the reference
- (optionally) a specific type of target resource, for cases when multiple resource types are allowed.
Some simple examples:
```
GET [base]/MedicationRequest?_include=MedicationRequest:patient
GET [base]/MedicationRequest?_revinclude=Provenance:target
```
The first search requests all matching MedicationRequests and also includes any Patient that the prescriptions in the result set refer to. The second search requests all matching MedicationRequests and also returns all the Provenance resources that refer to them.
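
Because `_include` and `_revinclude` add other resource types to the result, a single page of results can mix resource types. The sketch below, in base R, separates a hypothetical set of bundle entries by type. The entries and ids are invented, `search_mode` is a flattened stand-in for the entry's search mode, and real entries carry full resources rather than these stubs.

```r
# Hypothetical entries from a search such as
# MedicationRequest?_include=MedicationRequest:patient
entries <- list(
  list(resourceType = "MedicationRequest", id = "mr1", search_mode = "match"),
  list(resourceType = "MedicationRequest", id = "mr2", search_mode = "match"),
  list(resourceType = "Patient", id = "p1", search_mode = "include")
)

# Group the entries by resource type
resource_types <- vapply(entries, function(entry) entry$resourceType, character(1))
entries_by_type <- split(entries, resource_types)

# Count how many resources of each type came back
counts <- lengths(entries_by_type)
```

In this notebook the separation happens implicitly: each table description names a single `resource`, so cracking a mixed bundle with it yields rows for that resource type only.
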
📘[Read more about including other resources in search results](https://www.hl7.org/fhir/search.html#include)
🖐 Implement the query to select Patients within the ResearchStudy of interest and include their Conditions
Reminder: the ResearchStudy id = **76758**
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list(_____))
bundle <- fhir_search(request = request, max_bundles = 1)
# Can use the same table description as we set up above
df_condition_study <- fhir_crack(bundles = bundle, design = table_desc_condition, verbose = 0)
df_condition_study %>% count(codeableconcept_text, sort = TRUE)
```
Here's the graph version:
```{r}
ggplot(
  df_condition_study %>% count(codeableconcept_text, sort = TRUE) %>% head(10),
  aes(x = reorder(codeableconcept_text, n), y = n)
) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("Condition")
```
Now we have a more useful graph - the most common diagnoses among a research study cohort. (Note however that this represents only the first page of results from the server, not necessarily the entire cohort. Pagination, as seen in the previous exercise, may be necessary to fetch the entire cohort.)
## 3. Age at Diagnosis
Our third step will be to see how we can recreate the Age at Diagnosis chart.
To calculate age at diagnosis, we need two pieces of information:
- Date of Birth
- Date of Diagnosis
However, in order to de-identify the data, Kids First has removed date of birth information from Patient resources. Instead they use relative dates via an extension.
In FHIR, these two pieces of information may be captured in different resources that we may need to cross-reference:
- `Patient.birthDate`
- `Condition.onset[x]`
- `Condition.recordedDate`
Let's take a look at how the Kids First server represents these important concepts.
### 3.1. Diagnoses of a particular Condition
Let's start by querying for Conditions of a given code. We'll stick with **MONDO:0021640** (grade III glioma) as our condition of interest.
🖐 Fill in the query to select Conditions by this code
Then we'll look at one instance to see what it contains.
```{r}
request <- fhir_url(url = fhir_server, resource = "Condition", parameters = list(_____))
bundle <- fhir_search(request = request, max_bundles = 1)
xml2::xml_find_first(x = bundle[[1]], xpath = "./entry[1]/resource") %>%
  paste0 %>%
  cat
```
What we see here is that the Condition has a `recordedDate` field with an extension "http://hl7.org/fhir/StructureDefinition/relative-date", then nested below that are 3 sub-extensions representing the parts of a "relative date":
- The event that this Condition is relative to
- The relationship (before/after)
- The numerical offset
See more about the relative-date extension here: http://hl7.org/fhir/R4/extension-relative-date.html
Now let's put this into a data frame:
🖐 Fill in the blank parts of the XPath query to extract the value and units.
```{r}
table_desc_condition_glioma <- fhir_table_description(
  resource = "Condition",
  cols = c(
    id = "id",
    patient_id = "subject/reference",
    recorded_duration = str_c(
      "recordedDate",
      "/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
      "/extension[@url=\"offset\"]",
      ______,
      ______
    ),
    recorded_duration_units = str_c(
      ____
    )
  )
)
df_condition_glioma <- fhir_crack(bundles = bundle, design = table_desc_condition_glioma, verbose = 0)
df_condition_glioma
```
Note: for data aggregated from multiple sources, you may encounter data in very different forms. For the dataset we are working with in this step, we can safely assume that all recordedDate extensions will be of this form if present: relative to birth, after birth, and recorded in days.
Given this assumption, convert the `recorded_duration` column into age in years:
```{r}
df_condition_glioma <- df_condition_glioma %>%
  mutate(
    onsetAgeInYears = as.numeric(recorded_duration) / 365
  )
df_condition_glioma
```
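The conversion above can be sketched on toy values using base R only. The offsets below are made up for illustration and are not from the Kids First server. Note that this section divides by 365 while Section 4 divides by 365.25; the sketch shows both so the (small) difference is visible.

```r
# Hypothetical offsets in days since birth, as character strings
# (fhir_crack returns character columns, hence the as.numeric() step)
offset_days <- c("3650", "7305", NA)

days <- as.numeric(offset_days)

# Two common conventions for converting days to years
age_years_365    <- days / 365     # convention used in this section
age_years_365_25 <- days / 365.25  # convention used later in Section 4

data.frame(days, age_years_365, age_years_365_25)
```

Missing offsets stay `NA` after conversion, which is why later steps filter them out before plotting or summarizing.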
Now let's graph the ages with a basic histogram:
```{r}
ggplot(df_condition_glioma, aes(onsetAgeInYears)) +
geom_histogram(binwidth = 1)
```
### 3.2. Patients in the Research Study
Now let's go back to our selected Research Study and see how we can get the Conditions for those Patients in the study. We've seen before that doubly-nested references may not work, so instead we can combine multiple approaches as we saw in section 2.2, to fetch Patients by ResearchStudy, and then include their diagnosed Conditions.
(Note: this is the same query we did back in Section 2.2.)
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:ResearchSubject:individual:study" = "76758", "_revinclude" = "Condition:subject"))
bundle <- fhir_search(request = request, max_bundles = 1)
# Can use the same table description as we set up above
df_condition_study <- fhir_crack(bundles = bundle, design = table_desc_condition_glioma, verbose = 0)
df_condition_study
```
Note that this query also gets the Patient resources, but we don't need these for our analysis so we can ignore them.
Not every Condition will have a `recordedDate`, so filter to just those that do and convert the offset to age at onset in years:
```{r}
df_condition_study <- df_condition_study %>%
mutate(
recorded_duration = as.numeric(recorded_duration)
) %>%
filter(
!is.na(recorded_duration)
) %>%
mutate(
onsetAgeInYears = recorded_duration / 365
)
```
Now let's graph the ages again with a basic histogram:
```{r}
ggplot(df_condition_study, aes(onsetAgeInYears)) +
geom_histogram(binwidth = 1)
```
## 4. Overall Survival
### 4.1. Patients in the Research Study
Our final step in this exercise will be to reproduce the Overall Survival graph. The data requirements for this graph build on top of the previous steps, so now we need to know the relationship between date of death, or last recorded survival, and date of onset.
As before, Kids First data has been deidentified, so there are generally no absolute dates, but relative dates are enough as long as there is a common reference point. Fortunately, most of KF uses dates relative to birth or to enrollment in a clinical trial.
First let's see how KF reports death information. One possibility is in the `Patient.deceased[x]` field, so let's see if anything on the server has that populated.
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = c("deceased" = "true"))
bundle <- fhir_search(request = request, max_bundles = 1)
# Show total records returned by the query
xml2::xml_find_first(x = bundle[[1]], xpath = "./total") %>%
paste0 %>%
cat
```
Looks like that's a no. That's fine, there are other options. We'll spare the reader the full exploration process, but we know that in this case, Clinical Status of "Alive" or "Dead" is captured as an Observation with SNOMED code "263493007". Observations can be thought of as a clinical question of sorts, where the question is captured as the `code` and the answer is captured as the `value`.
📘 [Read more about the Observation resource](https://www.hl7.org/fhir/observation.html)
Let's look at an example of one of these:
```{r}
request <- fhir_url(url = fhir_server, resource = "Observation", parameters = c("code" = "263493007"))
bundle <- fhir_search(request = request, max_bundles = 1)
xml2::xml_find_first(x = bundle[[1]], xpath = "./entry[1]/resource") %>%
paste0 %>%
cat
```
Note: there are other ways this query could have been run. For example the code system could have been specified like `fhir_url(url = fhir_server, resource = "Observation", parameters = c("code" = "http://snomed.info/sct|263493007"))`.
📘 Read more about [token search](https://www.hl7.org/fhir/search.html#token)
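To make the two forms of the token parameter concrete, here is a minimal base R sketch that builds the search URLs by hand. The base URL `https://example.org/fhir` and the helper `build_query` are hypothetical, used only for illustration; in the exercise itself `fhir_url()` builds the request for you.

```r
# Hypothetical FHIR base URL (not the Kids First server)
base_url <- "https://example.org/fhir"

# Hypothetical helper: build an Observation search on the `code` parameter
build_query <- function(token) {
  paste0(
    base_url, "/Observation?code=",
    utils::URLencode(token, reserved = TRUE)  # percent-encode ":", "/" and "|"
  )
}

# Code only: matches this code regardless of code system
url_code_only <- build_query("263493007")

# system|code: matches this code only within the SNOMED CT code system
url_system_and_code <- build_query("http://snomed.info/sct|263493007")

url_code_only
url_system_and_code
```

The `system|code` form is the stricter of the two, which matters when different code systems could reuse the same code value.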
As with the Condition resources in previous examples, we see an `_effectiveDateTime` with extensions describing a date relative to birth. There's our common reference point, so let's gather all our data and put it together.
In this case, we want Patients, Conditions, and Observations. There are multiple possible approaches we could take here. One possible approach is to make 1 query to find Patients, 1 query to find all Conditions, then 1 query to find all Observations, then join the results and drop any mismatches. In this case let's see if we can do it in one single query.
🖐 Fill in the query to fetch Patients, Conditions, and Observations, for Patients in our ResearchStudy of interest.
Reminder: the ResearchStudy id = **76758**
```{r}
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list())
bundle <- fhir_search(request = request, max_bundles = 1)
```
Note that this query we just ran selected ALL Conditions and Observations linked to the selected Patients. If we need to filter the results further, we can only do that by post-processing and not within the FHIR query itself.
Fortunately there only appears to be one Observation per Patient in this dataset, so there is no need to filter further.
Let's break this Bundle out into separate data frames.
We'll inspect an Observation resource first because this is the first time we're seeing Observations.
```{r}
xml2::xml_find_first(x = bundle[[1]], xpath = "//*[contains (name(), \"Observation\")]") %>%
paste0 %>%
cat
```
To calculate survival, we have to subtract the onset date from the latest clinical status date (Observation).
As with Condition `_recordedDate`, these Observations use a relative date via an extension on `_effectiveDateTime`.
Let's break that out into a single number. Fortunately the format is exactly the same as before, so we can reuse the same approach we used earlier with Condition.
```{r}
table_desc_observation <- fhir_table_description(
resource = "Observation",
cols = c(
id = "id",
patient_id = "subject/reference",
effectiveDateTime_duration = str_c(
"effectiveDateTime",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/value"
),
effectiveDateTime_duration_units = str_c(
"effectiveDateTime",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/unit"
),
# Get the code identifying the Observation as well
code = "code/coding/code",
code_display = "code/coding/display",
code_system = "code/coding/system",
# Get the value for the observation
value_code = "valueCodeableConcept/coding/code",
value_display = "valueCodeableConcept/coding/display",
value_system = "valueCodeableConcept/coding/system"
)
)
df_observation <- fhir_crack(bundles = bundle, design = table_desc_observation, verbose = 0)
df_observation
```
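To see what the XPath expressions in the table description are pointing at, here is a small sketch on a hand-written XML fragment. The fragment is an assumption for illustration: it omits the FHIR XML namespace and most Observation fields, and the offset of 4018 days is made up. It reads the `value` attributes explicitly with `xml2`, rather than going through `fhir_crack()`.

```r
library(xml2)

# Hand-written, simplified fragment (illustrative values, no namespace)
fragment <- read_xml('
<Observation>
  <effectiveDateTime>
    <extension url="http://hl7.org/fhir/StructureDefinition/relative-date">
      <extension url="relationship"><valueCode value="after"/></extension>
      <extension url="offset">
        <valueDuration>
          <value value="4018"/>
          <unit value="days"/>
        </valueDuration>
      </extension>
    </extension>
  </effectiveDateTime>
</Observation>')

# Same path structure as in the table description above
offset_path <- paste0(
  "./effectiveDateTime",
  "/extension[@url='http://hl7.org/fhir/StructureDefinition/relative-date']",
  "/extension[@url='offset']",
  "/valueDuration"
)

# In FHIR XML, primitive values live in the `value` attribute of each element
offset_value <- xml_attr(xml_find_first(fragment, paste0(offset_path, "/value")), "value")
offset_unit  <- xml_attr(xml_find_first(fragment, paste0(offset_path, "/unit")), "value")

offset_value
offset_unit
```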
We expect all observations to have `code=263493007 (Clinical status)`. Let's verify:
```{r}
df_observation$code %>% freq
```
And we expect all observations to have either `alive` or `deceased` as the value (stored in `valueCodeableConcept`):
```{r}
ctable(df_observation$value_code, df_observation$value_display)
```
We also expect only one observation per Patient -- let's verify:
```{r}
(df_observation %>% count(patient_id))$n %>% max
```
Looks like this is true, so we can use `df_observation` to calculate both the time under observation and the endpoint for the survival analysis.
For time under observation, we will use the `effectiveDateTime_duration` variable, which is time since birth. Let's verify the units are consistent:
```{r}
df_observation$effectiveDateTime_duration_units %>% freq
```
If there are any `NA` values for the units, that means no `effectiveDateTime` is recorded. Let's drop any such records to keep this exercise simple; in real research, missing dates would warrant deeper investigation.
```{r}
df_observation <- df_observation %>%
filter(!is.na(effectiveDateTime_duration_units))
df_observation$effectiveDateTime_duration_units %>% freq
```
For easier interpretability, let's change this from days to years:
```{r}
df_observation <- df_observation %>%
  mutate(
    # Convert the offset from character to numeric, then from days to years
    effectiveDateTime_duration = as.numeric(effectiveDateTime_duration),
    observationEndAgeInYears = effectiveDateTime_duration / 365.25
  )
df_observation
```
The Condition resource gives us the age at which observation began, so let's extract what we need:
```{r}
table_desc_condition <- fhir_table_description(
resource = "Condition",
cols = c(
id = "id",
patient_id = "subject/reference",
recorded_duration = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/value"
),
recorded_duration_units = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/unit"
),
code_code = "code/coding/code",
code_display = "code/text"
)
)
df_condition <- fhir_crack(bundles = bundle, design = table_desc_condition, verbose = 0)
df_condition
```
There are multiple Conditions for each Patient. For the purposes of this analysis, we will assume the smallest `recorded_duration` (i.e., closest to birth) Condition represents the beginning of observed time.
First, let's sanity check the units:
```{r}
df_condition$recorded_duration_units %>% freq
```
```{r}
df_condition_min_recorded_duration <- df_condition %>%
  mutate(
    recorded_duration = as.numeric(recorded_duration)
  ) %>%
  # Remove rows with a missing recorded duration (so min() cannot return NA)
  filter(!is.na(recorded_duration)) %>%
  # Get the minimum recorded duration for each patient_id, converted to years
  group_by(patient_id) %>%
  summarize(
    min_recorded_duration_years = min(recorded_duration) / 365.25
  )
df_condition_min_recorded_duration
```
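The "earliest Condition per Patient" step can also be sketched in base R on toy data, which makes it easy to check by eye. The patient ids and offsets below are invented for illustration, and the `_toy` names are hypothetical.

```r
# Toy Conditions: Patient/1 has two, Patient/3 has none with an offset
df_condition_toy <- data.frame(
  patient_id = c("Patient/1", "Patient/1", "Patient/2", "Patient/3"),
  recorded_duration = c("4000", "3650", "7305", NA),
  stringsAsFactors = FALSE
)

df_condition_toy$recorded_duration <- as.numeric(df_condition_toy$recorded_duration)

# Drop Conditions without an offset, then take the minimum per patient
with_offset <- df_condition_toy[!is.na(df_condition_toy$recorded_duration), ]
min_days <- tapply(with_offset$recorded_duration, with_offset$patient_id, min)

df_min_toy <- data.frame(
  patient_id = names(min_days),
  min_recorded_duration_years = as.numeric(min_days) / 365.25
)
df_min_toy
```

Patients whose Conditions all lack an offset (here `Patient/3`) drop out of the result entirely, which is worth keeping in mind for the join that follows.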
Now we can merge with the observations to get our final data frame for input into the survival analysis:
```{r}
df_survival <- df_observation %>%
left_join(
df_condition_min_recorded_duration,
by = "patient_id"
)
df_survival
```
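A left join keeps every Observation row, even when a Patient has no matching onset age. The base R sketch below uses `merge(..., all.x = TRUE)` as a stand-in for `left_join()` on invented data; the column `years_under_observation` is a hypothetical name showing the subtraction described earlier (status age minus onset age).

```r
# Toy inputs: Patient/3 has an Observation but no onset age
df_observation_toy <- data.frame(
  patient_id = c("Patient/1", "Patient/2", "Patient/3"),
  observationEndAgeInYears = c(12.5, 21.0, 8.0)
)
df_onset_toy <- data.frame(
  patient_id = c("Patient/1", "Patient/2"),
  min_recorded_duration_years = c(10.0, 20.0)
)

# all.x = TRUE keeps all rows from the left table (a left join)
df_survival_toy <- merge(df_observation_toy, df_onset_toy,
                         by = "patient_id", all.x = TRUE)

# Time under observation: age at last clinical status minus age at onset
df_survival_toy$years_under_observation <-
  df_survival_toy$observationEndAgeInYears -
  df_survival_toy$min_recorded_duration_years

df_survival_toy
```

Rows with an `NA` onset age produce an `NA` time under observation, so they need to be handled before fitting a survival model.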
Let's sanity check the two key variables we need for time in observation:
```{r}
df_survival %>% select(min_recorded_duration_years, observationEndAgeInYears) %>% skim
```