
Production Data to Power Homepage Visualization #6238

Closed
djbrooke opened this issue Oct 1, 2019 · 11 comments

@djbrooke
Contributor

djbrooke commented Oct 1, 2019

In #5603 we're adding a visualization to the Dataverse home page. In IQSS/dataverse-sample-data#8 (comment) we settled on the format and Jess from Harvard Library requested MORE DATA. Let's provide the sample data to her, in the structure defined in #8.

@djbrooke djbrooke changed the title from "Production Data to Power Homepage Visualization, Method for Updating" to "Production Data to Power Homepage Visualization" on Oct 1, 2019
pdurbin added a commit that referenced this issue Oct 2, 2019
@pdurbin
Member

pdurbin commented Oct 3, 2019

@scolapasta and I met to review the format discussed in IQSS/dataverse-sample-data#8 and he's going to help me with the SQL (phew!). This is the main visual we were looking at:

[screenshot: the main homepage visualization mock-up]

This is what I had so far, based mostly on Gustavo's previous (longer) script:

select fmd.label as filename,
       dsfv.value as dataset_name,
       'N/A' as dataverse_level_1_alias,
       dsv.releasetime as publication_date
from filemetadata fmd
join datasetversion dsv on fmd.datasetversion_id = dsv.id
join datasetfield dsf on dsv.id = dsf.datasetversion_id
join datasetfieldvalue dsfv on dsf.id = dsfv.datasetfield_id
join dvobject dvo on fmd.datafile_id = dvo.id
where dsf.datasetfieldtype_id = 1; -- field type 1 here is the dataset title
      filename       |              dataset_name               | dataverse_level_1_alias |    publication_date     
---------------------+-----------------------------------------+-------------------------+-------------------------
 triad-data-5466.tab | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 N.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 N_err.fits          | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 test.pdf            | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 M_err.fits          | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 M.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 DatasetDiagram.png  | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 T.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 L_over_M.fits       | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
 L.fits              | 10+ FITS files and subsetted data files | N/A                     | 2019-10-02 12:34:41.422
(10 rows)

On my list is to ask @TaniaSchlatter if "publication date" is supposed to be for the file or the dataset. I'm pretty sure it's supposed to be for the file.

@pdurbin
Member

pdurbin commented Oct 3, 2019

@scolapasta cooked up some good stuff for me fast so I'm back to my hacking! Thanks!!

@pdurbin
Member

pdurbin commented Oct 4, 2019

@sekmiller helped me a ton too. Thanks! 🎉

I just emailed the following to Jess:

"Subject: 50,000 files, 329,267 files, no subjects

Attached please find a zip called files.zip that contains two files:

  • 50k.tsv (50,000 files)
  • all.tsv (329,267 files)

Please note that unlike what we talked about, subjects are not yet included. I'll work on this next but I thought you might like playing around with the three levels of hierarchy, which is included."

I don't think I'll attach the files here because it's production data from Harvard Dataverse. (Well, it's about a week old, since we refresh our local copy on Sundays.) I think I'm correctly including only files from published datasets, but some code review might be nice before we make this data available publicly.

Also, I always forget the syntax for using psql to create a TSV file so here it is so I have it handy next week when I start working on subjects:

psql -U dvnapp dvndb -f file-parents.sql -F $'\t' --no-align --pset footer -o all.tsv
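For anyone who'd rather post-process query results instead of wrangling psql flags, here's a minimal Python sketch that writes the same tab-separated layout (the rows are hypothetical stand-ins for the query output, not real data):

```python
import csv
import io

# Hypothetical rows standing in for the query results above.
rows = [
    ("triad-data-5466.tab", "10+ FITS files and subsetted data files", "N/A", "2019-10-02 12:34:41.422"),
    ("N.fits", "10+ FITS files and subsetted data files", "N/A", "2019-10-02 12:34:41.422"),
]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["filename", "dataset_name", "dataverse_level_1_alias", "publication_date"])
writer.writerows(rows)
tsv = out.getvalue()
```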

@pdurbin
Member

pdurbin commented Oct 7, 2019

@sekmiller helped me add subjects to the query in e8686c8 and it works fine on my laptop with a small database but it's taking way longer to run in the "copy of production" database. Before I made this change the query was taking a minute or two but now it's still going. I'm heading home for the day but I guess I'll check the query in the morning.

I guess I'll attach here the results from my small local database so @TaniaSchlatter or others (Jess) can take a look: devdb.tsv.txt. I'm using a semicolon as a delimiter when there are multiple subjects.
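A sketch of that collapsing step, assuming the subject query returns one (filename, subject) row per subject (the pairs below are hypothetical, not real query output):

```python
from collections import defaultdict

# Hypothetical (filename, subject) pairs, one row per subject,
# as a join between files and dataset subjects might return them.
pairs = [
    ("N.fits", "Astronomy and Astrophysics"),
    ("N.fits", "Physics"),
    ("test.pdf", "Social Sciences"),
]

subjects_by_file = defaultdict(list)
for filename, subject in pairs:
    subjects_by_file[filename].append(subject)

# One row per file; multiple subjects joined with a semicolon.
rows = {f: ";".join(subjects) for f, subjects in subjects_by_file.items()}
```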

@pdurbin
Member

pdurbin commented Oct 9, 2019

I used 4b56213 to create "2018.zip" which I sent to @TaniaSchlatter and Jess. Please let me know if you're happy with the data.

@TaniaSchlatter
Member

TaniaSchlatter commented Oct 10, 2019

I think I wrote too quickly. I see 3 date columns in the file: File Creation, File Publication, Dataset Publication. I see some files with dates published before the dataset date published. I see a lot of files with a publication date of 2019. I'd like to review the dates in more detail with someone from the team.

@djbrooke
Contributor Author

djbrooke commented Oct 10, 2019

(see above, I take this back :))

Closing, as data looks good (great even) and has been sent to Jess. We will create another issue for the next iteration.

@TaniaSchlatter TaniaSchlatter self-assigned this Oct 10, 2019
@sekmiller
Contributor

As we discussed we will add a filter to only select those files that were originally published in 2018. Additionally we will verify the dataset publish date to make sure it makes sense compared to the file publish date.

@sekmiller
Contributor

By the way, the dataset publication date we are selecting is actually the publication date of the latest version, which explains why in some (many?) cases it is after the file publication date (which is the original file pub date).
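The distinction can be sketched like this, with hypothetical release times for one dataset's published versions (the real fix would adjust the SQL to pick the original version's releasetime instead of the latest):

```python
from datetime import datetime

# Hypothetical release times for one dataset's published versions.
version_releasetimes = [
    datetime(2018, 3, 1),   # version 1.0 (original publication)
    datetime(2019, 6, 15),  # version 2.0 (latest)
]

latest_pub_date = max(version_releasetimes)    # what the query was selecting
original_pub_date = min(version_releasetimes)  # what a 2018 filter should use
```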

@pdurbin pdurbin self-assigned this Oct 10, 2019
@pdurbin
Member

pdurbin commented Oct 11, 2019

In fce1289 I implemented the new requirements (with much help, as always from @sekmiller ) and sent a new file to @TaniaSchlatter and Jess.

@pdurbin pdurbin removed their assignment Oct 11, 2019
@TaniaSchlatter
Member

This looks great! Thank you!
