Merge pull request #3963 from IQSS/3767-CSV-injest-code
3767: CSV ingest improvements
kcondon authored Aug 2, 2017
2 parents e6a8b99 + 4e45054 commit 5622911
Showing 11 changed files with 889 additions and 556 deletions.
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/user/appendix.rst
@@ -24,6 +24,6 @@ Detailed below are what metadata schemas we support for Citation and Domain Spec
: These metadata elements can be mapped/exported to the International Virtual Observatory Alliance’s (IVOA)
`VOResource Schema format <http://www.ivoa.net/documents/latest/RM.html>`__ and is based on
`Virtual Observatory (VO) Discovery and Provenance Metadata <http://perma.cc/H5ZJ-4KKY>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/astrophysics.tsv>`__).
- `Life Sciences Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=2>`__: based on `ISA-Tab Specification <http://isatab.sourceforge.net/format.html>`__, along with controlled vocabulary from subsets of the `OBI Ontology <http://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__).
- `Life Sciences Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=2>`__: based on `ISA-Tab Specification <http://isa-tools.org/format/specification/>`__, along with controlled vocabulary from subsets of the `OBI Ontology <http://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__).

See also the `Dataverse 4.0 Metadata Crosswalk: DDI, DataCite, DC, DCTerms, VO, ISA-Tab <https://docs.google.com/spreadsheets/d/10Luzti7svVTVKTA-px27oq3RxCUM-QbiTkm8iMd5C54/edit?usp=sharing>`__ document.
75 changes: 67 additions & 8 deletions doc/sphinx-guides/source/user/tabulardataingest/csv.rst
@@ -7,24 +7,83 @@ CSV
Ingest of Comma-Separated Values files as tabular data.
-------------------------------------------------------

Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data.
Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data, using the `Apache CSV parser <https://commons.apache.org/proper/commons-csv/>`_.

Main formatting requirements:
-----------------------------

The first line must contain a comma-separated list of the variable names;
The first row in the document will be treated as the CSV's header, containing variable names for each column.

All the lines that follow must contain the same number of comma-separated values as the first, variable name line.
Each following row must contain the same number of comma-separated values ("cells") as that header.

Limitations:
As of the Dataverse 4.8 release, we allow ingest of CSV files with commas and line breaks within cells. A string with any number of commas and line breaks enclosed within double quotes is recognized as a single cell. Double quotes can be encoded as two double quotes in a row (``""``).

For example, the following lines:

.. code-block:: none

    a,b,"c,d
    efgh""ijk""l",m,n

are recognized as a **single** row with **5** comma-separated values (cells):

.. code-block:: none

    a
    b
    c,d\nefgh"ijk"l
    m
    n

(where ``\n`` is a new line character)
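
As an illustration, here is a minimal standalone sketch (not the actual Dataverse ingest code) that runs the example above through the Apache Commons CSV parser mentioned earlier; the class name ``QuotedCellDemo`` is made up for this example:

.. code-block:: java

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class QuotedCellDemo {
        public static void main(String[] args) throws IOException {
            // The two physical lines from the example above form one logical record:
            String csv = "a,b,\"c,d\nefgh\"\"ijk\"\"l\",m,n";
            try (CSVParser parser = new CSVParser(new StringReader(csv), CSVFormat.DEFAULT)) {
                for (CSVRecord record : parser) {
                    System.out.println("cells: " + record.size()); // prints "cells: 5"
                    for (String cell : record) {
                        System.out.println(cell);
                    }
                }
            }
        }
    }

Running the sketch prints the five cell values, confirming that the quoted field containing a comma, a line break, and escaped double quotes is read as a single cell.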


Limitations:
------------

Except for the variable names supplied in the top line, very little information describing the data can be obtained from a CSV file. We strongly recommend using one of the supported rich file formats (Stata, SPSS and R) to provide more descriptive metadata (informative labels, categorical values and labels, and more) that cannot be encoded in a CSV file.
Compared to other formats, relatively little information about the data ("variable-level metadata") can be extracted from a CSV file. Aside from the variable names supplied in the top line, the ingest will make an educated guess about the data type of each comma-separated column. One of the supported rich file formats (Stata, SPSS and R) should be used if you need to provide more descriptive variable-level metadata (variable labels, categorical values and labels, explicitly defined data types, etc.).

Recognized data types and formatting:
-------------------------------------

The application will attempt to recognize numeric, string, and date/time values in the individual comma-separated columns.


For dates, the ``yyyy-MM-dd`` format is recognized.

For date-time values, the following 2 formats are recognized:

``yyyy-MM-dd HH:mm:ss``

``yyyy-MM-dd HH:mm:ss z`` (same format as the above, with the time zone specified)
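
Both patterns are expressed in Java date-format notation. As a hedged illustration (a sketch of the format semantics, not necessarily how the ingest plugin itself is implemented), they can be exercised with ``SimpleDateFormat``:

.. code-block:: java

    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    public class DateTimeFormatDemo {
        public static void main(String[] args) throws ParseException {
            // The two date-time patterns recognized on ingest:
            SimpleDateFormat plain = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            SimpleDateFormat zoned = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z");
            plain.setLenient(false); // reject out-of-range values such as "2017-13-45 99:00:00"
            zoned.setLenient(false);
            System.out.println(plain.parse("2017-08-02 14:30:00"));
            System.out.println(zoned.parse("2017-08-02 14:30:00 EDT"));
        }
    }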

For numeric variables, the following special values are recognized:

``inf``, ``+inf`` - as a special IEEE 754 "positive infinity" value;

``NaN`` - as a special IEEE 754 "not a number" value;

An empty value (i.e., a comma followed immediately by another comma, or the line end), or ``NA`` - as a *missing value*.

``null`` - as a numeric *zero*.

(Any combination of lower and upper case is allowed in the notations above.)
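
To make the rules above concrete, here is a sketch of the token handling (illustrative only; ``parseNumericCell`` is a hypothetical helper, not the actual ingest code):

.. code-block:: java

    // Interprets one comma-separated cell according to the rules above;
    // returns null to represent a missing value.
    public static Double parseNumericCell(String cell) {
        if (cell == null || cell.isEmpty() || cell.equalsIgnoreCase("NA")) {
            return null; // missing value
        }
        String token = cell.toLowerCase();
        if (token.equals("inf") || token.equals("+inf")) {
            return Double.POSITIVE_INFINITY;
        }
        if (token.equals("nan")) {
            return Double.NaN;
        }
        if (token.equals("null")) {
            return 0.0; // per the rules above, "null" is read as a numeric zero
        }
        return Double.valueOf(cell); // throws NumberFormatException for non-numeric input
    }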

In character strings, an empty value (a comma followed by another comma, or the line end) is treated as an empty string (NOT as a *missing value*).

Any non-Latin characters are allowed in character string values, **as long as the encoding is UTF-8**.


**Note:** When the ingest recognizes a CSV column as a numeric vector, or as a date/time value, this information is reflected and saved in the database as the *data variable metadata*. To inspect that metadata, click on the *Download* button next to a tabular data file, and select *Variable Metadata*. This will export the variable records in the DDI XML format. (Alternatively, this metadata fragment can be downloaded via the Data Access API; for example: ``http://localhost:8080/api/access/datafile/<FILEID>/metadata/ddi``).
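
For example, assuming a local installation on port 8080 and a hypothetical file id of 42, the DDI fragment could be retrieved like this:

.. code-block:: none

    curl http://localhost:8080/api/access/datafile/42/metadata/ddi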

The most immediate implication is in the calculation of the UNF signatures for the data vectors, as different normalization rules are applied to numeric, character, and date/time values. (see the :doc:`/developers/unf/index` section for more information). If it is important to you that the UNF checksums of your data are accurately calculated, check that the numeric and date/time columns in your file were recognized as such (as ``type=numeric`` and ``type=character, category=date(time)``, respectively). If, for example, a column that was supposed to be numeric is recognized as a vector of character values (strings), double-check that the formatting of the values is consistent. Remember, a single improperly-formatted value in the column will turn it into a vector of character strings, and result in a different UNF. Fix any formatting errors you find, delete the file from the dataset, and try to ingest it again.
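
For instance, in the small file below, the single malformed cell ``n/a`` on the third line (unlike ``NA``, not a recognized missing-value token) would turn the entire ``score`` column into a vector of character strings, and thus change its UNF:

.. code-block:: none

    id,score
    1,10.5
    2,n/a
    3,12.0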

The application will however make an attempt to recognize numeric, string and date/time values in CSV files.

Tab-delimited Data Files:
-------------------------

Tab-delimited files could be ingested by replacing the TABs with commas.
Presently, tab-delimited files can be ingested by replacing the TABs with commas.
(We are planning to add direct support for tab-delimited files in an upcoming release).
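
For example, on a Unix-like system a quick (and admittedly naive) conversion can be done with ``tr``; this blind substitution is only safe when no cell itself contains a comma, a double quote, or a TAB (the file names are illustrative):

.. code-block:: none

    tr '\t' ',' < mydata.tsv > mydata.csv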



6 changes: 5 additions & 1 deletion src/main/java/Bundle.properties
@@ -1689,4 +1689,8 @@ authenticationProvider.name.github=GitHub
authenticationProvider.name.google=Google
authenticationProvider.name.orcid=ORCiD
authenticationProvider.name.orcid-sandbox=ORCiD Sandbox
authenticationProvider.name.shib=Shibboleth
ingest.csv.invalidHeader=Invalid header row. One of the cells is empty.
ingest.csv.lineMismatch=Mismatch between line counts in first and final passes! {0} found on first pass, but {1} found on second.
ingest.csv.recordMismatch=Reading mismatch, line {0} of the Data file: {1} delimited values expected, {2} found.
ingest.csv.nullStream=Stream can't be null.
28 changes: 9 additions & 19 deletions src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java
@@ -7,14 +7,14 @@
package edu.harvard.iq.dataverse.api;

import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.DataFileServiceBean;
import edu.harvard.iq.dataverse.DataTable;
import edu.harvard.iq.dataverse.DatasetServiceBean;
import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.ingest.IngestServiceBean;
import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataFileReader;
import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataIngest;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.StringUtil;
import java.io.BufferedInputStream;
import java.util.logging.Logger;
import javax.ejb.EJB;
@@ -32,6 +32,7 @@
import javax.ws.rs.core.HttpHeaders;
import javax.ws.rs.core.UriInfo;
import javax.servlet.http.HttpServletResponse;
import javax.ws.rs.QueryParam;



@@ -56,49 +57,38 @@
public class TestIngest {
private static final Logger logger = Logger.getLogger(TestIngest.class.getCanonicalName());

@EJB
DataFileServiceBean dataFileService;
@EJB
DatasetServiceBean datasetService;
@EJB
IngestServiceBean ingestService;

//@EJB

@Path("test/{fileName}/{fileType}")
@Path("test/file")
@GET
@Produces({ "text/plain" })
public String datafile(@PathParam("fileName") String fileName, @PathParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ {
public String datafile(@QueryParam("fileName") String fileName, @QueryParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ {
String output = "";

if (fileName == null || fileType == null || "".equals(fileName) || "".equals(fileType)) {
output = output.concat("Usage: java edu.harvard.iq.dataverse.ingest.IngestServiceBean <file> <type>.");
if (StringUtil.isEmpty(fileName) || StringUtil.isEmpty(fileType)) {
output = output.concat("Usage: /api/ingest/test/file?fileName=PATH&fileType=TYPE");
return output;
}

BufferedInputStream fileInputStream = null;

String absoluteFilePath = null;
if (fileType.equals("x-stata")) {
absoluteFilePath = "/usr/share/data/retest_stata/reingest/" + fileName;
} else if (fileType.equals("x-spss-sav")) {
absoluteFilePath = "/usr/share/data/retest_sav/reingest/" + fileName;
} else if (fileType.equals("x-spss-por")) {
absoluteFilePath = "/usr/share/data/retest_por/reingest/" + fileName;
}

try {
fileInputStream = new BufferedInputStream(new FileInputStream(new File(absoluteFilePath)));
fileInputStream = new BufferedInputStream(new FileInputStream(new File(fileName)));
} catch (FileNotFoundException notfoundEx) {
fileInputStream = null;
}

if (fileInputStream == null) {
output = output.concat("Could not open file "+absoluteFilePath+".");
output = output.concat("Could not open file "+fileName+".");
return output;
}

fileType = "application/"+fileType;
TabularDataFileReader ingestPlugin = ingestService.getTabDataReaderByMimeType(fileType);

if (ingestPlugin == null) {
@@ -123,7 +113,7 @@ public String datafile(@PathParam("fileName") String fileName, @PathParam("fileT
&& tabFile != null
&& tabFile.exists()) {

String tabFilename = FileUtil.replaceExtension(absoluteFilePath, "tab");
String tabFilename = FileUtil.replaceExtension(fileName, "tab");

java.nio.file.Files.copy(Paths.get(tabFile.getAbsolutePath()), Paths.get(tabFilename), StandardCopyOption.REPLACE_EXISTING);

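Based on the usage string above, an example invocation of the reworked test endpoint might look like the following (host, port, and the Stata file path are assumptions for illustration):

    curl "http://localhost:8080/api/ingest/test/file?fileName=/tmp/mydata.dta&fileType=x-stata"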
78 changes: 40 additions & 38 deletions src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java
@@ -15,6 +15,7 @@
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.JoinColumn;
import javax.persistence.Lob;
import javax.persistence.ManyToOne;
import javax.persistence.Table;
import javax.persistence.Temporal;
@@ -39,86 +40,87 @@ public Long getId() {
public void setId(Long id) {
this.id = id;
}
public static int INGEST_TYPE_TABULAR = 1;
public static int INGEST_TYPE_METADATA = 2;
public static int INGEST_STATUS_INPROGRESS = 1;
public static int INGEST_STATUS_SUCCESS = 2;
public static int INGEST_STATUS_FAILURE = 3;

public static int INGEST_TYPE_TABULAR = 1;
public static int INGEST_TYPE_METADATA = 2;

public static int INGEST_STATUS_INPROGRESS = 1;
public static int INGEST_STATUS_SUCCESS = 2;
public static int INGEST_STATUS_FAILURE = 3;

@ManyToOne
@JoinColumn(nullable=false)
private DataFile dataFile;

private String report;

private int type;


@Lob
private String report;

private int type;

private int status;

@Temporal(value = TemporalType.TIMESTAMP)
private Date startTime;
private Date startTime;

@Temporal(value = TemporalType.TIMESTAMP)
private Date endTime;
private Date endTime;

public int getType() {
return type;
return type;
}

public void setType(int type) {
this.type = type;
}

public int getStatus() {
return status;
return status;
}

public void setStatus(int status) {
this.status = status;
}

public boolean isFailure() {
return status == INGEST_STATUS_FAILURE;
}

public void setFailure() {
this.status = INGEST_STATUS_FAILURE;
}

public String getReport() {
return report;
}

public void setReport(String report) {
this.report = report;
this.report = report;
}

public DataFile getDataFile() {
return dataFile;
}

public void setDataFile(DataFile dataFile) {
this.dataFile = dataFile;
this.dataFile = dataFile;
}

public Date getStartTime() {
return startTime;
return startTime;
}

public void setStartTime(Date startTime) {
this.startTime = startTime;
}

public Date getEndTime() {
return endTime;
return endTime;
}

public void setEndTime(Date endTime) {
this.endTime = endTime;
}

@Override
public int hashCode() {
int hash = 0;
@@ -143,5 +145,5 @@ public boolean equals(Object object) {
public String toString() {
return "edu.harvard.iq.dataverse.ingest.IngestReport[ id=" + id + " ]";
}

}
@@ -544,8 +544,8 @@ public void produceContinuousSummaryStatistics(DataFile dataFile, File generated
for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) {
if (dataFile.getDataTable().getDataVariables().get(i).isIntervalContinuous()) {
logger.fine("subsetting continuous vector");
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
if ("float".equals(dataFile.getDataTable().getDataVariables().get(i).getFormat())) {
Float[] variableVector = TabularSubsetGenerator.subsetFloatVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
logger.fine("Calculating summary statistics on a Float vector;");
@@ -576,8 +576,8 @@ public void produceDiscreteNumericSummaryStatistics(DataFile dataFile, File gene
if (dataFile.getDataTable().getDataVariables().get(i).isIntervalDiscrete()
&& dataFile.getDataTable().getDataVariables().get(i).isTypeNumeric()) {
logger.fine("subsetting discrete-numeric vector");
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
Long[] variableVector = TabularSubsetGenerator.subsetLongVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
// We are discussing calculating the same summary stats for
// all numerics (the same kind of sumstats that we've been calculating
@@ -610,8 +610,8 @@ public void produceCharacterSummaryStatistics(DataFile dataFile, File generatedT

for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) {
if (dataFile.getDataTable().getDataVariables().get(i).isTypeCharacter()) {
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
logger.fine("subsetting character vector");
String[] variableVector = TabularSubsetGenerator.subsetStringVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
//calculateCharacterSummaryStatistics(dataFile, i, variableVector);
