Collections

Aaron D Borden edited this page Mar 6, 2021 · 4 revisions

An email to VA explaining collections July 15, 2019:

I believe everything is in order, despite the odd counts. I've checked the counts against your data.json and they tally up. The confusion comes from a quirky feature called Collections. Currently, I see 3793 datasets in the VA data.json. In catalog, I see 1527 non-collection members and 2266 collection members, totaling 3793. I'll explain:

catalog.data.gov groups some datasets together, called Collections. When we group datasets, the total count appears lower, because we're essentially counting all the datasets in a collection as a single dataset. This is a historical decision to avoid inflating the dataset counts on catalog with datasets that are very similar.

VA can specify collections in their data.json. We read the isPartOf field on each dataset. The identifier specified with isPartOf is what we call the parent dataset. The parent dataset is what appears on the catalog (this is why you see only 1527 datasets in the VA organization). The collection members are the datasets that have an isPartOf attribute and are accessible from the parent dataset in the catalog. So datasets are either collection members (having isPartOf) or non-collection members (not having isPartOf).
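The split described above can be sketched in a few lines of Python. This is only an illustration, not the catalog's harvester code: the function names `split_by_collection` and `members_per_parent` and the sample records are hypothetical, and it assumes each dataset entry is a dictionary as found in the `dataset` list of a data.json file.

```python
from collections import Counter


def split_by_collection(datasets):
    """Split data.json dataset entries into (collection members, non-members).

    A dataset is treated as a collection member when it has a non-empty
    isPartOf field. Parent datasets have no isPartOf themselves, so they
    land in the non-member list, which is what the catalog displays.
    """
    members = [d for d in datasets if d.get("isPartOf")]
    non_members = [d for d in datasets if not d.get("isPartOf")]
    return members, non_members


def members_per_parent(datasets):
    """Count collection members for each parent identifier."""
    return Counter(d["isPartOf"] for d in datasets if d.get("isPartOf"))


# Hypothetical sample records, for illustration only.
sample = [
    {"identifier": "parent-1"},
    {"identifier": "child-1", "isPartOf": "parent-1"},
    {"identifier": "child-2", "isPartOf": "parent-1"},
    {"identifier": "standalone-1"},
]

members, non_members = split_by_collection(sample)
print(len(members), len(non_members), len(sample))
```

Run against the VA data.json from the email, the two lists would be expected to hold 2266 and 1527 entries respectively, which together account for the 3793 datasets in the file.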

Here's an example of a collection: VA Veterans Health Administration Access Data which appears as a single dataset.

You can view all 197 collection members from here: https://catalog.data.gov/dataset?collection_package_id=f02ac089-2b1f-47e8-9b1d-71317c488724

Collections have a special visual marker (image: Screenshot from 2019-07-15 16-39-24.png).

Unfortunately, it's not easy to get all the collection member counts through the UI. I actually got the counts above through the catalog API and documented the technical details. I'm happy to follow up on any questions; I know this feature isn't very intuitive.

Collection Package ID

You can get an organization's entire list of datasets that belong to a collection via the API: https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*%20AND%20organization:gsa-gov&rows=1000

This works across harvest types, for example USGS (which is harvested via geospatial metadata and WAFs) can also be seen in this way: https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*%20AND%20organization:usgs-gov&rows=1000
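The queries above can also be built and run from a script. The following is a minimal sketch using only the Python standard library. The helper names `collection_search_url` and `count_collection_members` are hypothetical, and the counting helper assumes the usual CKAN `package_search` response shape, where the total number of matches is reported under `result.count`.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = "https://catalog.data.gov/api/action/package_search"


def collection_search_url(organization, rows=1000):
    """Build a package_search URL for all collection members of an organization."""
    params = {
        "fq": f"collection_package_id:* AND organization:{organization}",
        "rows": rows,
    }
    # Leave ':' and '*' unescaped so the URL reads like the examples above.
    return API_BASE + "?" + urlencode(params, safe=":*", quote_via=quote)


def count_collection_members(organization):
    """Return the number of collection members reported for an organization.

    Assumes the response carries the total under result.count; rows=0 asks
    for the count without returning the datasets themselves.
    """
    with urlopen(collection_search_url(organization, rows=0)) as resp:
        payload = json.load(resp)
    return payload["result"]["count"]


print(collection_search_url("gsa-gov"))
```

Calling `count_collection_members("usgs-gov")` would issue a live request to the catalog, so it is left out of the example run.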

Collections by harvest source

Each harvester might implement collections in its own way, and not all of them do.

TODO