
Production Data to Power Homepage Visualization #6238

Closed
djbrooke opened this issue Oct 1, 2019 · 11 comments

@djbrooke
Contributor

djbrooke commented Oct 1, 2019

In #5603 we're adding a visualization to the Dataverse home page. In IQSS/dataverse-sample-data#8 (comment) we settled on the format and Jess from Harvard Library requested MORE DATA. Let's provide the sample data to her, in the structure defined in #8.

@djbrooke djbrooke changed the title from "Production Data to Power Homepage Visualization, Method for Updating" to "Production Data to Power Homepage Visualization" on Oct 1, 2019
pdurbin added a commit that referenced this issue Oct 2, 2019
@pdurbin
Member

pdurbin commented Oct 3, 2019

@scolapasta and I met to review the format discussed in IQSS/dataverse-sample-data#8 and he's going to help me with the SQL (phew!). This is the main visual we were looking at:

[screenshot: the main homepage visualization mock-up]

This is what I had so far, based mostly on Gustavo's previous (longer) script:

select fmd.label as filename,
       dsfv.value as dataset_name,
       'N/A' as dataverse_level_1_alias,
       dsv.releasetime as publication_date
from filemetadata fmd
join datasetversion dsv on fmd.datasetversion_id = dsv.id
join datasetfield dsf on dsv.id = dsf.datasetversion_id
join datasetfieldvalue dsfv on dsf.id = dsfv.datasetfield_id
join dvobject dvo on fmd.datafile_id = dvo.id
where dsf.datasetfieldtype_id = 1; -- field type 1 here is the dataset title
      filename       |              dataset_name               | dataverse_level_1_alias |    publication_date     
---------------------+-----------------------------------------+-------------------------+-------------------------
 triad-data-5466.tab | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 N.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 N_err.fits          | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 test.pdf            | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 M_err.fits          | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 M.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 DatasetDiagram.png  | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 T.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 L_over_M.fits       | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 L.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
(10 rows)

On my list is to ask @TaniaSchlatter if "publication date" is supposed to be for the file or the dataset. I'm pretty sure it's supposed to be for the file.

@pdurbin
Member

pdurbin commented Oct 3, 2019

@scolapasta cooked up some good stuff for me fast so I'm back to my hacking! Thanks!!

@pdurbin
Member

pdurbin commented Oct 4, 2019

@sekmiller helped me a ton too. Thanks! 🎉

I just emailed the following to Jess:

"Subject: 50,000 files, 329,267 files, no subjects

Attached please find a zip called files.zip that contains two files:

  • 50k.tsv (50,000 files)
  • all.tsv (329,267 files)

Please note that unlike what we talked about, subjects are not yet included. I'll work on this next but I thought you might like playing around with the three levels of hierarchy, which is included."

I don't think I'll attach the files here because it's production data from Harvard Dataverse. (Well, it's about a week old, since we refresh our local copy on Sundays.) I think I'm correctly including only files from published datasets, but some code review might be nice before we make this data available publicly.

Also, I always forget the syntax for using psql to create a TSV file so here it is so I have it handy next week when I start working on subjects:

psql -U dvnapp dvndb -f file-parents.sql -F $'\t' --no-align --pset footer -o all.tsv
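For anyone who'd rather post-process query results instead of wrangling psql flags, here's a minimal Python sketch that writes the same tab-separated layout (the rows are hypothetical stand-ins for the query output, not real data):

```python
import csv
import io

# Hypothetical rows standing in for the query results above.
rows = [
    ("triad-data-5466.tab", "10+ FITS files and subsetted data files", "N/A", "2019-10-02 12:34:41.422"),
    ("N.fits", "10+ FITS files and subsetted data files", "N/A", "2019-10-02 12:34:41.422"),
]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["filename", "dataset_name", "dataverse_level_1_alias", "publication_date"])
writer.writerows(rows)
tsv = out.getvalue()
```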

@pdurbin
Member

pdurbin commented Oct 7, 2019

@sekmiller helped me add subjects to the query in e8686c8 and it works fine on my laptop with a small database but it's taking way longer to run in the "copy of production" database. Before I made this change the query was taking a minute or two but now it's still going. I'm heading home for the day but I guess I'll check the query in the morning.

I guess I'll attach here the results from my small local database so @TaniaSchlatter or others (Jess) can take a look: devdb.tsv.txt. I'm using a semicolon as a delimiter when there are multiple subjects.
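A sketch of that collapsing step, assuming the subject query returns one (filename, subject) row per subject (the pairs below are hypothetical, not real query output):

```python
from collections import defaultdict

# Hypothetical (filename, subject) pairs, one row per subject,
# as a join between files and dataset subjects might return them.
pairs = [
    ("N.fits", "Astronomy and Astrophysics"),
    ("N.fits", "Physics"),
    ("test.pdf", "Social Sciences"),
]

subjects_by_file = defaultdict(list)
for filename, subject in pairs:
    subjects_by_file[filename].append(subject)

# One row per file; multiple subjects joined with a semicolon.
rows = {f: ";".join(subjects) for f, subjects in subjects_by_file.items()}
```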

@pdurbin
Member

pdurbin commented Oct 9, 2019

I used 4b56213 to create "2018.zip" which I sent to @TaniaSchlatter and Jess. Please let me know if you're happy with the data.

@TaniaSchlatter
Member

TaniaSchlatter commented Oct 10, 2019

I think I wrote too quickly. I see 3 date columns in the file: File Creation, File Publication, Dataset Publication. I see some files with dates published before the dataset date published. I see a lot of files with a publication date of 2019. I'd like to review the dates in more detail with someone from the team.

@djbrooke
Contributor Author

djbrooke commented Oct 10, 2019

(see above, I take this back :))

Closing, as data looks good (great even) and has been sent to Jess. We will create another issue for the next iteration.

@TaniaSchlatter TaniaSchlatter self-assigned this Oct 10, 2019
@sekmiller
Contributor

As we discussed we will add a filter to only select those files that were originally published in 2018. Additionally we will verify the dataset publish date to make sure it makes sense compared to the file publish date.

@sekmiller
Contributor

By the way, the dataset publication date we are selecting is actually the publication date of the latest version, which explains why in some (many?) cases it is after the file publication date (which is the original file pub date).
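The distinction can be sketched like this, with hypothetical release times for one dataset's published versions (the real fix would adjust the SQL to pick the original version's releasetime instead of the latest):

```python
from datetime import datetime

# Hypothetical release times for one dataset's published versions.
version_releasetimes = [
    datetime(2018, 3, 1),   # version 1.0 (original publication)
    datetime(2019, 6, 15),  # version 2.0 (latest)
]

latest_pub_date = max(version_releasetimes)    # what the query was selecting
original_pub_date = min(version_releasetimes)  # what a 2018 filter should use
```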

@pdurbin pdurbin self-assigned this Oct 10, 2019
@pdurbin
Member

pdurbin commented Oct 11, 2019

In fce1289 I implemented the new requirements (with much help, as always from @sekmiller ) and sent a new file to @TaniaSchlatter and Jess.

@pdurbin pdurbin removed their assignment Oct 11, 2019
@TaniaSchlatter
Member

This looks great! Thank you!
