diff --git a/doc/sphinx-guides/source/user/appendix.rst b/doc/sphinx-guides/source/user/appendix.rst index ba38c51fd6f..f21e6600b5f 100755 --- a/doc/sphinx-guides/source/user/appendix.rst +++ b/doc/sphinx-guides/source/user/appendix.rst @@ -24,6 +24,6 @@ Detailed below are what metadata schemas we support for Citation and Domain Spec : These metadata elements can be mapped/exported to the International Virtual Observatory Alliance’s (IVOA) `VOResource Schema format `__ and is based on `Virtual Observatory (VO) Discovery and Provenance Metadata `__ (`see .tsv version `__). -- `Life Sciences Metadata `__: based on `ISA-Tab Specification `__, along with controlled vocabulary from subsets of the `OBI Ontology `__ and the `NCBI Taxonomy for Organisms `__ (`see .tsv version `__). +- `Life Sciences Metadata `__: based on `ISA-Tab Specification `__, along with controlled vocabulary from subsets of the `OBI Ontology `__ and the `NCBI Taxonomy for Organisms `__ (`see .tsv version `__). See also the `Dataverse 4.0 Metadata Crosswalk: DDI, DataCite, DC, DCTerms, VO, ISA-Tab `__ document. diff --git a/doc/sphinx-guides/source/user/tabulardataingest/csv.rst b/doc/sphinx-guides/source/user/tabulardataingest/csv.rst index f716a6d4428..c29708f31c9 100644 --- a/doc/sphinx-guides/source/user/tabulardataingest/csv.rst +++ b/doc/sphinx-guides/source/user/tabulardataingest/csv.rst @@ -7,24 +7,83 @@ CSV Ingest of Comma-Separated Values files as tabular data. ------------------------------------------------------- -Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data. +Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data, using the `Apache CSV parser `_. 
-Main formatting requirements: +Main formatting requirements: ----------------------------- -The first line must contain a comma-separated list of the variable names; +The first row in the document will be treated as the CSV's header, containing variable names for each column. -All the lines that follow must contain the same number of comma-separated values as the first, variable name line. +Each following row must contain the same number of comma-separated values ("cells") as that header. -Limitations: +As of the Dataverse 4.8 release, we allow ingest of CSV files with commas and line breaks within cells. A string with any number of commas and line breaks enclosed within double quotes is recognized as a single cell. Double quotes can be encoded as two double quotes in a row (``""``). + +For example, the following lines: + +.. code-block:: none + + a,b,"c,d + efgh""ijk""l",m,n + +are recognized as a **single** row with **5** comma-separated values (cells): + +.. code-block:: none + + a + b + c,d\nefgh"ijk"l + m + n + +(where ``\n`` is a new line character) + + +Limitations: ------------ -Except for the variable names supplied in the top line, very little information describing the data can be obtained from a CSV file. We strongly recommend using one of the supported rich files formats (Stata, SPSS and R) to provide more descriptive metadata (informatinve lables, categorical values and labels, and more) that cannot be encoded in a CSV file. +Compared to other formats, relatively little information about the data ("variable-level metadata") can be extracted from a CSV file. Aside from the variable names supplied in the top line, the ingest will make an educated guess about the data type of each comma-separated column. One of the supported rich file formats (Stata, SPSS and R) should be used if you need to provide more descriptive variable-level metadata (variable labels, categorical values and labels, explicitly defined data types, etc.). 
+ +Recognized data types and formatting: +------------------------------------- + +The application will attempt to recognize numeric, string, and date/time values in the individual comma-separated columns. + + +For dates, the ``yyyy-MM-dd`` format is recognized. + +For date-time values, the following two formats are recognized: + +``yyyy-MM-dd HH:mm:ss`` + +``yyyy-MM-dd HH:mm:ss z`` (same format as the above, with the time zone specified) + +For numeric variables, the following special values are recognized: + +``inf``, ``+inf`` - as a special IEEE 754 "positive infinity" value; + +``NaN`` - as a special IEEE 754 "not a number" value; + +An empty value (i.e., a comma followed immediately by another comma, or the line end), or ``NA`` - as a *missing value*. + +``null`` - as a numeric *zero*. + +(Any combination of lower and upper case is allowed in the notations above.) + +In character strings, an empty value (a comma followed by another comma, or the line end) is treated as an empty string (NOT as a *missing value*). + +Any non-Latin characters are allowed in character string values, **as long as the encoding is UTF-8**. + + +**Note:** When the ingest recognizes a CSV column as a numeric vector, or as a date/time value, this information is reflected and saved in the database as the *data variable metadata*. To inspect that metadata, click on the *Download* button next to a tabular data file, and select *Variable Metadata*. This will export the variable records in the DDI XML format. (Alternatively, this metadata fragment can be downloaded via the Data Access API; for example: ``http://localhost:8080/api/access/datafile//metadata/ddi``). + +The most immediate implication is in the calculation of the UNF signatures for the data vectors, as different normalization rules are applied to numeric, character, and date/time values (see the :doc:`/developers/unf/index` section for more information).
If it is important to you that the UNF checksums of your data are accurately calculated, check that the numeric and date/time columns in your file were recognized as such (as ``type=numeric`` and ``type=character, category=date(time)``, respectively). If, for example, a column that was supposed to be numeric is recognized as a vector of character values (strings), double-check that the formatting of the values is consistent. Remember, a single improperly-formatted value in the column will turn it into a vector of character strings, and result in a different UNF. Fix any formatting errors you find, delete the file from the dataset, and try to ingest it again. -The application will however make an attempt to recognize numeric, string and date/time values in CSV files. Tab-delimited Data Files: ------------------------- -Tab-delimited files could be ingested by replacing the TABs with commas. +Presently, tab-delimited files can be ingested by replacing the TABs with commas. +(We are planning to add direct support for tab-delimited files in an upcoming release.) + + diff --git a/src/main/java/Bundle.properties b/src/main/java/Bundle.properties index 62890b78b3f..50bf8100734 100755 --- a/src/main/java/Bundle.properties +++ b/src/main/java/Bundle.properties @@ -1689,4 +1689,8 @@ authenticationProvider.name.github=GitHub authenticationProvider.name.google=Google authenticationProvider.name.orcid=ORCiD authenticationProvider.name.orcid-sandbox=ORCiD Sandbox -authenticationProvider.name.shib=Shibboleth \ No newline at end of file +authenticationProvider.name.shib=Shibboleth +ingest.csv.invalidHeader=Invalid header row. One of the cells is empty. +ingest.csv.lineMismatch=Mismatch between line counts in first and final passes: {0} found on first pass, but {1} found on second. +ingest.csv.recordMismatch=Reading mismatch, line {0} of the Data file: {1} delimited values expected, {2} found. +ingest.csv.nullStream=Stream can't be null.
\ No newline at end of file diff --git a/src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java b/src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java index 971e9119adc..76b6ae9aa18 100644 --- a/src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java +++ b/src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java @@ -7,7 +7,6 @@ package edu.harvard.iq.dataverse.api; import edu.harvard.iq.dataverse.DataFile; -import edu.harvard.iq.dataverse.DataFileServiceBean; import edu.harvard.iq.dataverse.DataTable; import edu.harvard.iq.dataverse.DatasetServiceBean; import edu.harvard.iq.dataverse.FileMetadata; @@ -15,6 +14,7 @@ import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataFileReader; import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataIngest; import edu.harvard.iq.dataverse.util.FileUtil; +import edu.harvard.iq.dataverse.util.StringUtil; import java.io.BufferedInputStream; import java.util.logging.Logger; import javax.ejb.EJB; @@ -32,6 +32,7 @@ import javax.ws.rs.core.HttpHeaders; import javax.ws.rs.core.UriInfo; import javax.servlet.http.HttpServletResponse; +import javax.ws.rs.QueryParam; @@ -56,8 +57,6 @@ public class TestIngest { private static final Logger logger = Logger.getLogger(TestIngest.class.getCanonicalName()); - @EJB - DataFileServiceBean dataFileService; @EJB DatasetServiceBean datasetService; @EJB @@ -65,40 +64,31 @@ public class TestIngest { //@EJB - @Path("test/{fileName}/{fileType}") + @Path("test/file") @GET @Produces({ "text/plain" }) - public String datafile(@PathParam("fileName") String fileName, @PathParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ { + public String datafile(@QueryParam("fileName") String fileName, @QueryParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context 
HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ { String output = ""; - if (fileName == null || fileType == null || "".equals(fileName) || "".equals(fileType)) { - output = output.concat("Usage: java edu.harvard.iq.dataverse.ingest.IngestServiceBean ."); + if (StringUtil.isEmpty(fileName) || StringUtil.isEmpty(fileType)) { + output = output.concat("Usage: /api/ingest/test/file?fileName=PATH&fileType=TYPE"); return output; } BufferedInputStream fileInputStream = null; - String absoluteFilePath = null; - if (fileType.equals("x-stata")) { - absoluteFilePath = "/usr/share/data/retest_stata/reingest/" + fileName; - } else if (fileType.equals("x-spss-sav")) { - absoluteFilePath = "/usr/share/data/retest_sav/reingest/" + fileName; - } else if (fileType.equals("x-spss-por")) { - absoluteFilePath = "/usr/share/data/retest_por/reingest/" + fileName; - } try { - fileInputStream = new BufferedInputStream(new FileInputStream(new File(absoluteFilePath))); + fileInputStream = new BufferedInputStream(new FileInputStream(new File(fileName))); } catch (FileNotFoundException notfoundEx) { fileInputStream = null; } if (fileInputStream == null) { - output = output.concat("Could not open file "+absoluteFilePath+"."); + output = output.concat("Could not open file "+fileName+"."); return output; } - fileType = "application/"+fileType; TabularDataFileReader ingestPlugin = ingestService.getTabDataReaderByMimeType(fileType); if (ingestPlugin == null) { @@ -123,7 +113,7 @@ public String datafile(@PathParam("fileName") String fileName, @PathParam("fileT && tabFile != null && tabFile.exists()) { - String tabFilename = FileUtil.replaceExtension(absoluteFilePath, "tab"); + String tabFilename = FileUtil.replaceExtension(fileName, "tab"); java.nio.file.Files.copy(Paths.get(tabFile.getAbsolutePath()), Paths.get(tabFilename), StandardCopyOption.REPLACE_EXISTING); diff --git 
a/src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java b/src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java index ae540ad6e28..31208abf839 100644 --- a/src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java +++ b/src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java @@ -15,6 +15,7 @@ import javax.persistence.Id; import javax.persistence.Index; import javax.persistence.JoinColumn; +import javax.persistence.Lob; import javax.persistence.ManyToOne; import javax.persistence.Table; import javax.persistence.Temporal; @@ -39,86 +40,87 @@ public Long getId() { public void setId(Long id) { this.id = id; } - - public static int INGEST_TYPE_TABULAR = 1; - public static int INGEST_TYPE_METADATA = 2; - - public static int INGEST_STATUS_INPROGRESS = 1; - public static int INGEST_STATUS_SUCCESS = 2; - public static int INGEST_STATUS_FAILURE = 3; - + + public static int INGEST_TYPE_TABULAR = 1; + public static int INGEST_TYPE_METADATA = 2; + + public static int INGEST_STATUS_INPROGRESS = 1; + public static int INGEST_STATUS_SUCCESS = 2; + public static int INGEST_STATUS_FAILURE = 3; + @ManyToOne @JoinColumn(nullable=false) private DataFile dataFile; - - private String report; - - private int type; - + + @Lob + private String report; + + private int type; + private int status; - + @Temporal(value = TemporalType.TIMESTAMP) - private Date startTime; - + private Date startTime; + @Temporal(value = TemporalType.TIMESTAMP) - private Date endTime; - + private Date endTime; + public int getType() { - return type; + return type; } - + public void setType(int type) { this.type = type; } - + public int getStatus() { - return status; + return status; } - + public void setStatus(int status) { this.status = status; } - + public boolean isFailure() { return status == INGEST_STATUS_FAILURE; } - + public void setFailure() { this.status = INGEST_STATUS_FAILURE; } - + public String getReport() { return report; } - + public void setReport(String report) { - 
this.report = report; + this.report = report; } public DataFile getDataFile() { return dataFile; } - + public void setDataFile(DataFile dataFile) { - this.dataFile = dataFile; + this.dataFile = dataFile; } - + public Date getStartTime() { - return startTime; + return startTime; } - + public void setStartTime(Date startTime) { this.startTime = startTime; } - + public Date getEndTime() { - return endTime; + return endTime; } - + public void setEndTime(Date endTime) { this.endTime = endTime; } - + @Override public int hashCode() { int hash = 0; @@ -143,5 +145,5 @@ public boolean equals(Object object) { public String toString() { return "edu.harvard.iq.dataverse.ingest.IngestReport[ id=" + id + " ]"; } - + } diff --git a/src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java index 116f0b7472d..ccaad85577e 100644 --- a/src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java @@ -544,8 +544,8 @@ public void produceContinuousSummaryStatistics(DataFile dataFile, File generated for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) { if (dataFile.getDataTable().getDataVariables().get(i).isIntervalContinuous()) { logger.fine("subsetting continuous vector"); - DataFileIO dataFileIO = dataFile.getDataFileIO(); - dataFileIO.open(); + //DataFileIO dataFileIO = dataFile.getDataFileIO(); + //dataFileIO.open(); if ("float".equals(dataFile.getDataTable().getDataVariables().get(i).getFormat())) { Float[] variableVector = TabularSubsetGenerator.subsetFloatVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue()); logger.fine("Calculating summary statistics on a Float vector;"); @@ -576,8 +576,8 @@ public void produceDiscreteNumericSummaryStatistics(DataFile dataFile, File gene if (dataFile.getDataTable().getDataVariables().get(i).isIntervalDiscrete() && 
dataFile.getDataTable().getDataVariables().get(i).isTypeNumeric()) { logger.fine("subsetting discrete-numeric vector"); - DataFileIO dataFileIO = dataFile.getDataFileIO(); - dataFileIO.open(); + //DataFileIO dataFileIO = dataFile.getDataFileIO(); + //dataFileIO.open(); Long[] variableVector = TabularSubsetGenerator.subsetLongVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue()); // We are discussing calculating the same summary stats for // all numerics (the same kind of sumstats that we've been calculating @@ -610,8 +610,8 @@ public void produceCharacterSummaryStatistics(DataFile dataFile, File generatedT for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) { if (dataFile.getDataTable().getDataVariables().get(i).isTypeCharacter()) { - DataFileIO dataFileIO = dataFile.getDataFileIO(); - dataFileIO.open(); + //DataFileIO dataFileIO = dataFile.getDataFileIO(); + //dataFileIO.open(); logger.fine("subsetting character vector"); String[] variableVector = TabularSubsetGenerator.subsetStringVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue()); //calculateCharacterSummaryStatistics(dataFile, i, variableVector); diff --git a/src/main/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReader.java b/src/main/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReader.java index 655a9f93092..130b896070c 100644 --- a/src/main/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReader.java +++ b/src/main/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReader.java @@ -19,58 +19,69 @@ */ package edu.harvard.iq.dataverse.ingest.tabulardata.impl.plugins.csv; -import java.io.*; import java.io.FileReader; import java.io.InputStreamReader; -import java.text.*; -import java.util.logging.*; -import java.util.*; -import java.security.NoSuchAlgorithmException; - -import 
javax.inject.Inject; import edu.harvard.iq.dataverse.DataTable; import edu.harvard.iq.dataverse.datavariable.DataVariable; -import edu.harvard.iq.dataverse.ingest.plugin.spi.*; import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataFileReader; import edu.harvard.iq.dataverse.ingest.tabulardata.spi.TabularDataFileReaderSpi; import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataIngest; +import edu.harvard.iq.dataverse.util.BundleUtil; +import java.io.BufferedInputStream; +import java.io.BufferedReader; +import java.io.File; +import java.io.FileWriter; +import java.io.IOException; +import java.io.PrintWriter; import java.math.BigDecimal; import java.math.MathContext; import java.math.RoundingMode; -import javax.naming.Context; -import javax.naming.InitialContext; -import javax.naming.NamingException; - -import org.apache.commons.lang.RandomStringUtils; -import org.apache.commons.lang.ArrayUtils; +import java.text.ParseException; +import java.text.ParsePosition; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.logging.Logger; +import org.apache.commons.csv.CSVFormat; import org.apache.commons.lang.StringUtils; +import org.apache.commons.csv.CSVParser; +import org.apache.commons.csv.CSVPrinter; +import org.apache.commons.csv.CSVRecord; /** * Dataverse 4.0 implementation of TabularDataFileReader for the * plain CSV file with a variable name header. * * - * @author Leonid Andreev + * @author Oscar Smith * - * This implementation uses external R-Scripts to do the bulk of the processing. 
+ * This implementation uses the Apache CSV Parser */ public class CSVFileReader extends TabularDataFileReader { private static final Logger dbglog = Logger.getLogger(CSVFileReader.class.getPackage().getName()); - private static final int DIGITS_OF_PRECISION_DOUBLE = 15; + private static final int DIGITS_OF_PRECISION_DOUBLE = 15; private static final String FORMAT_IEEE754 = "%+#." + DIGITS_OF_PRECISION_DOUBLE + "e"; private MathContext doubleMathContext; - private char delimiterChar = ','; - + private CSVFormat inFormat = CSVFormat.EXCEL; + private Set firstNumCharSet = new HashSet<>(); + // DATE FORMATS - private static SimpleDateFormat[] DATE_FORMATS = new SimpleDateFormat[] { - new SimpleDateFormat("yyyy-MM-dd") + private static SimpleDateFormat[] DATE_FORMATS = new SimpleDateFormat[]{ + new SimpleDateFormat("yyyy-MM-dd"), //new SimpleDateFormat("yyyy/MM/dd"), + //new SimpleDateFormat("MM/dd/yyyy"), + //new SimpleDateFormat("MM-dd-yyyy"), }; - + // TIME FORMATS - private static SimpleDateFormat[] TIME_FORMATS = new SimpleDateFormat[] { + private static SimpleDateFormat[] TIME_FORMATS = new SimpleDateFormat[]{ // Date-time up to seconds with timezone, e.g. 2013-04-08 13:14:23 -0500 new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z"), // Date-time up to seconds and no timezone, e.g. 2013-04-08 13:14:23 @@ -83,20 +94,23 @@ public CSVFileReader(TabularDataFileReaderSpi originator) { private void init() throws IOException { doubleMathContext = new MathContext(DIGITS_OF_PRECISION_DOUBLE, RoundingMode.HALF_EVEN); + firstNumCharSet.addAll(Arrays.asList(new Character[]{'+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'})); } - + /** * Reads a CSV file, converts it into a dataverse DataTable. * * @param stream a BufferedInputStream. - * @param ignored * @return an TabularDataIngest object * @throws java.io.IOException if a reading error occurs. 
*/ @Override public TabularDataIngest read(BufferedInputStream stream, File dataFile) throws IOException { init(); - + + if (stream == null) { + throw new IOException(BundleUtil.getStringFromBundle("ingest.csv.nullStream")); + } TabularDataIngest ingesteddata = new TabularDataIngest(); DataTable dataTable = new DataTable(); @@ -105,13 +119,12 @@ public TabularDataIngest read(BufferedInputStream stream, File dataFile) throws File tabFileDestination = File.createTempFile("data-", ".tab"); PrintWriter tabFileWriter = new PrintWriter(tabFileDestination.getAbsolutePath()); - int lineCount = readFile(localBufferedReader, dataTable, tabFileWriter); - - dbglog.fine("CSV ingest: found "+lineCount+" data cases/observations."); - dbglog.fine("Tab file produced: "+tabFileDestination.getAbsolutePath()); - + int lineCount = readFile(localBufferedReader, dataTable, tabFileWriter); + + dbglog.fine("Tab file produced: " + tabFileDestination.getAbsolutePath()); + dataTable.setUnf("UNF:6:NOTCALCULATED"); - + ingesteddata.setTabDelimitedFile(tabFileDestination); ingesteddata.setDataTable(dataTable); return ingesteddata; @@ -119,312 +132,193 @@ public TabularDataIngest read(BufferedInputStream stream, File dataFile) throws } public int readFile(BufferedReader csvReader, DataTable dataTable, PrintWriter finalOut) throws IOException { - - String line; - String[] valueTokens; - - int lineCounter = 0; - - // Read first line: - - line = csvReader.readLine(); - line = line.replaceFirst("[\r\n]*$", ""); - valueTokens = line.split("" + delimiterChar, -2); - - if (valueTokens == null || valueTokens.length < 1) { - throw new IOException("Failed to read first, variable name line of the CSV file."); - } - int variableCount = valueTokens.length; - - // Create variables: - - List variableList = new ArrayList(); - - for (int i = 0; i < variableCount; i++) { - String varName = valueTokens[i]; - - if (varName == null || varName.equals("")) { - // TODO: + List variableList = new ArrayList<>(); + 
CSVParser parser = new CSVParser(csvReader, inFormat.withHeader()); + Map headers = parser.getHeaderMap(); + + int i = 0; + for (String varName : headers.keySet()) { + if (varName == null || varName.isEmpty()) { + // TODO: // Add a sensible variable name validation algorithm. // -- L.A. 4.0 alpha 1 - throw new IOException ("Invalid variable names in the first line! - First line of a CSV file must contain a comma-separated list of the names of the variables."); + throw new IOException(BundleUtil.getStringFromBundle("ingest.csv.invalidHeader")); } - + DataVariable dv = new DataVariable(); dv.setName(varName); dv.setLabel(varName); - dv.setInvalidRanges(new ArrayList()); - dv.setSummaryStatistics(new ArrayList()); + dv.setInvalidRanges(new ArrayList<>()); + dv.setSummaryStatistics(new ArrayList<>()); dv.setUnf("UNF:6:NOTCALCULATED"); - dv.setCategories(new ArrayList()); + dv.setCategories(new ArrayList<>()); variableList.add(dv); dv.setTypeCharacter(); dv.setIntervalDiscrete(); dv.setFileOrder(i); dv.setDataTable(dataTable); + i++; } - - dataTable.setVarQuantity(new Long(variableCount)); + + dataTable.setVarQuantity((long) variableList.size()); dataTable.setDataVariables(variableList); - - boolean[] isNumericVariable = new boolean[variableCount]; - boolean[] isIntegerVariable = new boolean[variableCount]; - boolean[] isTimeVariable = new boolean[variableCount]; - boolean[] isDateVariable = new boolean[variableCount]; - - for (int i = 0; i < variableCount; i++) { - // OK, let's assume that every variable is numeric; - // but we'll go through the file and examine every value; the - // moment we find a value that's not a legit numeric one, we'll - // assume that it is in fact a String. 
- isNumericVariable[i] = true; + + boolean[] isNumericVariable = new boolean[headers.size()]; + boolean[] isIntegerVariable = new boolean[headers.size()]; + boolean[] isTimeVariable = new boolean[headers.size()]; + boolean[] isDateVariable = new boolean[headers.size()]; + + for (i = 0; i < headers.size(); i++) { + // OK, let's assume that every variable is numeric; + // but we'll go through the file and examine every value; the + // moment we find a value that's not a legit numeric one, we'll + // assume that it is in fact a String. + isNumericVariable[i] = true; isIntegerVariable[i] = true; - isDateVariable[i] = true; - isTimeVariable[i] = true; + isDateVariable[i] = true; + isTimeVariable[i] = true; } // First, "learning" pass. // (we'll save the incoming stream in another temp file:) - - SimpleDateFormat[] selectedDateTimeFormat = new SimpleDateFormat[variableCount]; - SimpleDateFormat[] selectedDateFormat = new SimpleDateFormat[variableCount]; - - - File firstPassTempFile = File.createTempFile("firstpass-", ".tab"); - PrintWriter firstPassWriter = new PrintWriter(firstPassTempFile.getAbsolutePath()); - - - while ((line = csvReader.readLine()) != null) { - // chop the line: - line = line.replaceFirst("[\r\n]*$", ""); - valueTokens = line.split("" + delimiterChar, -2); - - if (valueTokens == null) { - throw new IOException("Failed to read line " + (lineCounter + 1) + " of the Data file."); - } - - int tokenCount = valueTokens.length; - - if (tokenCount > variableCount) { - - // we'll make another attempt to parse the fields - there could be commas - // inside character strings. The only way to disambiguate this situation - // we are going to support, for now, is to allow commas inside tokens - // wrapped in double quotes. We may potentially add other mechanisms, - // such as allowing to specify a custom string wrapper character (something other - // than the double quote), or maybe recognizing escaped commas ("\,") as - // non-separating ones. - // -- L.A. 
4.0.2 - - valueTokens = null; - valueTokens = new String[variableCount]; - - int tokenStart = 0; - boolean quotedStringMode = false; - boolean potentialDoubleDoubleQuote = false; - tokenCount = 0; - - for (int i = 0; i < line.length(); i++) { - if (tokenCount > variableCount) { - throw new IOException("Reading mismatch, line " + (lineCounter + 1) + " of the data file contains more than " - + variableCount + " comma-delimited values."); - } - - char c = line.charAt(i); - - if (tokenStart == i && c == '"') { - quotedStringMode = true; - } else if (c == ',' && !quotedStringMode) { - valueTokens[tokenCount] = line.substring(tokenStart, i); - tokenCount++; - tokenStart = i+1; - } else if (i == line.length() - 1) { - valueTokens[tokenCount] = line.substring(tokenStart, line.length()); - tokenCount++; - } else if (quotedStringMode && c == '"') { - quotedStringMode = false; - //unless this is a double double quote in the middle of a quoted - // string; apparently a standard notation for encoding double - // quotes inside quoted strings (??) - potentialDoubleDoubleQuote = true; - } else if (potentialDoubleDoubleQuote && c == '"') { - // OK, that was a "double double" quote. - // going back into the quoted mode: - quotedStringMode = true; - potentialDoubleDoubleQuote = false; - // TODO: figure out what we do with such double double quote - // sequences in the final tab file. Do we want to convert - // them back to a "single double" quote? - // -- L.A. 
4.0.2/4.1 - } - + SimpleDateFormat[] selectedDateTimeFormat = new SimpleDateFormat[headers.size()]; + SimpleDateFormat[] selectedDateFormat = new SimpleDateFormat[headers.size()]; + + File firstPassTempFile = File.createTempFile("firstpass-", ".csv"); + + try (CSVPrinter csvFilePrinter = new CSVPrinter( + // TODO allow other parsers of tabular data to use this parser by changing inFormat + new FileWriter(firstPassTempFile.getAbsolutePath()), inFormat)) { + //Write headers + csvFilePrinter.printRecord(headers.keySet()); + for (CSVRecord record : parser.getRecords()) { + // Checks if #records = #columns in header + if (!record.isConsistent()) { + List args = Arrays.asList(new String[]{"" + (parser.getCurrentLineNumber() - 1), + "" + headers.size(), + "" + record.size()}); + throw new IOException(BundleUtil.getStringFromBundle("ingest.csv.recordMismatch", args)); }
(NumberFormatException ex) { - // the token failed to parse as a double number; - // so we'll have to assume it's just a string variable. - } - } - - if (!isNumeric) { - isNumericVariable[i] = false; - } else if (isIntegerVariable[i]) { - try { - Integer testIntegerValue = new Integer(valueTokens[i]); - isInteger = true; - } catch (NumberFormatException ex) { - // the token failed to parse as an integer number; - // we'll assume it's a non-integere numeric... - } - if (!isInteger) { - isIntegerVariable[i] = false; + for (i = 0; i < headers.size(); i++) { + String varString = record.get(i); + isIntegerVariable[i] = isIntegerVariable[i] + && varString != null + && (varString.isEmpty() + || varString.equals("null") + || (firstNumCharSet.contains(varString.charAt(0)) + && StringUtils.isNumeric(varString.substring(1)))); + if (isNumericVariable[i]) { + // If variable might be "numeric" test to see if this value is a parsable number: + if (varString != null && !varString.isEmpty()) { + + boolean isNumeric = false; + boolean isInteger = false; + + if (varString.equalsIgnoreCase("NaN") + || varString.equalsIgnoreCase("NA") + || varString.equalsIgnoreCase("Inf") + || varString.equalsIgnoreCase("+Inf") + || varString.equalsIgnoreCase("-Inf") + || varString.equalsIgnoreCase("null")) { + continue; + } else { + try { + Double testDoubleValue = new Double(varString); + continue; + } catch (NumberFormatException ex) { + // the token failed to parse as a double + // so the column is a string variable. 
+ } } + isNumericVariable[i] = false; } } - } - - // And if we have concluded that this is not a numeric column, - // let's see if we can parse the string token as a date or - // a date-time value: - - if (!isNumericVariable[i]) { - - Date dateResult = null; - - if (isTimeVariable[i]) { - if (valueTokens[i] != null && (!valueTokens[i].equals(""))) { - boolean isTime = false; - - if (selectedDateTimeFormat[i] != null) { - dbglog.fine("will try selected format " + selectedDateTimeFormat[i].toPattern()); - ParsePosition pos = new ParsePosition(0); - dateResult = selectedDateTimeFormat[i].parse(valueTokens[i], pos); - - if (dateResult == null) { - dbglog.fine(selectedDateTimeFormat[i].toPattern() + ": null result."); - } else if (pos.getIndex() != valueTokens[i].length()) { - dbglog.fine(selectedDateTimeFormat[i].toPattern() + ": didn't parse to the end - bad time zone?"); - } else { - // OK, successfully parsed a value! - isTime = true; - dbglog.fine(selectedDateTimeFormat[i].toPattern() + " worked!"); - } - } else { - for (SimpleDateFormat format : TIME_FORMATS) { - dbglog.fine("will try format " + format.toPattern()); + + // If this is not a numeric column, see if it is a date column + // by parsing the cell as a date or date-time value: + if (!isNumericVariable[i]) { + + Date dateResult = null; + + if (isTimeVariable[i]) { + if (varString != null && !varString.isEmpty()) { + boolean isTime = false; + + if (selectedDateTimeFormat[i] != null) { ParsePosition pos = new ParsePosition(0); - dateResult = format.parse(valueTokens[i], pos); - if (dateResult == null) { - dbglog.fine(format.toPattern() + ": null result."); - continue; + dateResult = selectedDateTimeFormat[i].parse(varString, pos); + + if (dateResult != null && pos.getIndex() == varString.length()) { + // OK, successfully parsed a value!
+ isTime = true; } - if (pos.getIndex() != valueTokens[i].length()) { - dbglog.fine(format.toPattern() + ": didn't parse to the end - bad time zone?"); - continue; + } else { + for (SimpleDateFormat format : TIME_FORMATS) { + ParsePosition pos = new ParsePosition(0); + dateResult = format.parse(varString, pos); + if (dateResult != null && pos.getIndex() == varString.length()) { + // OK, successfully parsed a value! + isTime = true; + selectedDateTimeFormat[i] = format; + break; + } } - // OK, successfully parsed a value! - isTime = true; - dbglog.fine(format.toPattern() + " worked!"); - selectedDateTimeFormat[i] = format; - break; } - } - if (!isTime) { - isTimeVariable[i] = false; - // OK, the token didn't parse as a time value; - // But we will still try to parse it as a date, below. - // unless of course we have already decided that this column - // is NOT a date. - } else { - // And if it is a time value, we are going to assume it's - // NOT a date. - isDateVariable[i] = false; + if (!isTime) { + isTimeVariable[i] = false; + // if the token didn't parse as a time value, + // we will still try to parse it as a date, below. + // unless this column is NOT a date. + } else { + // And if it is a time value, we are going to assume it's + // NOT a date. + isDateVariable[i] = false; + } } } - } - if (isDateVariable[i]) { - if (valueTokens[i] != null && (!valueTokens[i].equals(""))) { - boolean isDate = false; - - // TODO: - // Strictly speaking, we should be doing the same thing - // here as with the time formats above; select the - // first one that works, then insist that all the - // other values in this column match it... but we - // only have one, as of now, so it should be ok. - // -- L.A. 4.0 beta - - for (SimpleDateFormat format : DATE_FORMATS) { - // Strict parsing - it will throw an - // exception if it doesn't parse! 
- format.setLenient(false); - dbglog.fine("will try format " + format.toPattern()); - try { - dateResult = format.parse(valueTokens[i]); - dbglog.fine("format " + format.toPattern() + " worked!"); - isDate = true; - selectedDateFormat[i] = format; - break; - } catch (ParseException ex) { - //Do nothing - dbglog.fine("format " + format.toPattern() + " didn't work."); + if (isDateVariable[i]) { + if (varString != null && !varString.isEmpty()) { + boolean isDate = false; + + // TODO: + // Strictly speaking, we should be doing the same thing + // here as with the time formats above; select the + // first one that works, then insist that all the + // other values in this column match it... but we + // only have one, as of now, so it should be ok. + // -- L.A. 4.0 beta + for (SimpleDateFormat format : DATE_FORMATS) { + // Strict parsing - it will throw an + // exception if it doesn't parse! + format.setLenient(false); + try { + format.parse(varString); + isDate = true; + selectedDateFormat[i] = format; + break; + } catch (ParseException ex) { + //Do nothing + } } + isDateVariable[i] = isDate; } - if (!isDate) { - isDateVariable[i] = false; - } } } } + + csvFilePrinter.printRecord(record); } - - firstPassWriter.println(line); - lineCounter++; } - - firstPassWriter.close(); + dataTable.setCaseQuantity(parser.getRecordNumber()); + parser.close(); csvReader.close(); - dataTable.setCaseQuantity(new Long(lineCounter)); - // Re-type the variables that we've determined are numerics: - - for (int i = 0; i < variableCount; i++) { + for (i = 0; i < headers.size(); i++) { if (isNumericVariable[i]) { dataTable.getDataVariables().get(i).setTypeNumeric(); @@ -444,207 +338,131 @@ public int readFile(BufferedReader csvReader, DataTable dataTable, PrintWriter f dataTable.getDataVariables().get(i).setFormatCategory("time"); } } - // Second, final pass. 
- - // Re-open the saved file and reset the line counter: - - BufferedReader secondPassReader = new BufferedReader(new FileReader(firstPassTempFile)); - lineCounter = 0; - String[] caseRow = new String[variableCount]; - - - while ((line = secondPassReader.readLine()) != null) { - // chop the line: - line = line.replaceFirst("[\r\n]*$", ""); - valueTokens = line.split("" + delimiterChar, -2); - - if (valueTokens == null) { - throw new IOException("Failed to read line " + (lineCounter + 1) + " during the second pass."); - } - - int tokenCount = valueTokens.length; - - if (tokenCount > variableCount) { - - // again, similar check for quote-encased strings that contain - // commas inside them. - // -- L.A. 4.0.2 - - valueTokens = null; - valueTokens = new String[variableCount]; - - int tokenStart = 0; - boolean quotedStringMode = false; - boolean potentialDoubleDoubleQuote = false; - tokenCount = 0; - - for (int i = 0; i < line.length(); i++) { - if (tokenCount > variableCount) { - throw new IOException("Reading mismatch, line " + (lineCounter + 1) + " of the data file contains more than " - + variableCount + " comma-delimited values."); - } - - char c = line.charAt(i); - - if (tokenStart == i && c == '"') { - quotedStringMode = true; - } else if (c == ',' && !quotedStringMode) { - valueTokens[tokenCount] = line.substring(tokenStart, i); - tokenCount++; - tokenStart = i+1; - } else if (i == line.length() - 1) { - valueTokens[tokenCount] = line.substring(tokenStart, line.length()); - tokenCount++; - } else if (quotedStringMode && c == '"') { - quotedStringMode = false; - potentialDoubleDoubleQuote = true; - } else if (potentialDoubleDoubleQuote && c == '"') { - quotedStringMode = true; - potentialDoubleDoubleQuote = false; - } - + try (BufferedReader secondPassReader = new BufferedReader(new FileReader(firstPassTempFile))) { + parser = new CSVParser(secondPassReader, inFormat.withHeader()); + String[] caseRow = new String[headers.size()]; + + for (CSVRecord record : 
parser) { + if (!record.isConsistent()) { + List args = Arrays.asList(new String[]{"" + (parser.getCurrentLineNumber() - 1), + "" + headers.size(), + "" + record.size()}); + throw new IOException(BundleUtil.getStringFromBundle("ingest.csv.recordMismatch", args)); } - } - - // TODO: - // isolate CSV parsing into its own method/class, to avoid - // code duplication in the 2 passes, above; - // do not save the result of the 1st pass - simply reopen the - // original file (?). - // -- L.A. 4.0.2/4.1 - - if (tokenCount != variableCount) { - throw new IOException("Reading mismatch, line " + (lineCounter + 1) + " during the second pass: " - + variableCount + " delimited values expected, " + valueTokens.length + " found."); - } - - for (int i = 0; i < variableCount; i++) { - if (isNumericVariable[i]) { - if (valueTokens[i] == null || valueTokens[i].equalsIgnoreCase("") || valueTokens[i].equalsIgnoreCase("NA")) { - // Missing value - represented as an empty string in - // the final tab file - caseRow[i] = ""; - } else if (valueTokens[i].equalsIgnoreCase("NaN")) { - // "Not a Number" special value: - caseRow[i] = "NaN"; - } else if (valueTokens[i].equalsIgnoreCase("Inf") - || valueTokens[i].equalsIgnoreCase("+Inf")) { - // Positive infinity: - caseRow[i] = "Inf"; - } else if (valueTokens[i].equalsIgnoreCase("-Inf")) { - // Negative infinity: - caseRow[i] = "-Inf"; - } else if (valueTokens[i].equalsIgnoreCase("null")) { - // By request from Gus - "NULL" is recognized as a - // numeric zero: - if (isIntegerVariable[i]) { - caseRow[i] = "0"; + + for (i = 0; i < headers.size(); i++) { + String varString = record.get(i); + if (isNumericVariable[i]) { + if (varString == null || varString.isEmpty() || varString.equalsIgnoreCase("NA")) { + // Missing value - represented as an empty string in + // the final tab file + caseRow[i] = ""; + } else if (varString.equalsIgnoreCase("NaN")) { + // "Not a Number" special value: + caseRow[i] = "NaN"; + } else if 
(varString.equalsIgnoreCase("Inf") + || varString.equalsIgnoreCase("+Inf")) { + // Positive infinity: + caseRow[i] = "Inf"; + } else if (varString.equalsIgnoreCase("-Inf")) { + // Negative infinity: + caseRow[i] = "-Inf"; + } else if (varString.equalsIgnoreCase("null")) { + // By request from Gus - "NULL" is recognized as a + // numeric zero: + caseRow[i] = isIntegerVariable[i] ? "0" : "0.0"; } else { - caseRow[i] = "0.0"; - } - } else { - /* No re-formatting is done on any other numeric values. - * We'll save them as they were, for archival purposes. - * The alternative solution - formatting in sci. notation - * is commented-out below. - */ - caseRow[i] = valueTokens[i]; - /* - if (isIntegerVariable[i]) { - try { - Integer testIntegerValue = new Integer(valueTokens[i]); - caseRow[i] = testIntegerValue.toString(); - } catch (NumberFormatException ex) { - throw new IOException ("Failed to parse a value recognized as an integer in the first pass! (?)"); + /* No re-formatting is done on any other numeric values. + * We'll save them as they were, for archival purposes. + * The alternative solution - formatting in sci. notation + * is commented-out below. + */ + caseRow[i] = varString; + /* + if (isIntegerVariable[i]) { + try { + Integer testIntegerValue = new Integer(varString); + caseRow[i] = testIntegerValue.toString(); + } catch (NumberFormatException ex) { + throw new IOException("Failed to parse a value recognized as an integer in the first pass! 
(?)"); + } + } else { + try { + Double testDoubleValue = new Double(varString); + if (testDoubleValue.equals(0.0)) { + caseRow[i] = "0.0"; + } else { + // One possible implementation: + // + // Round our fractional values to 15 digits + // (minimum number of digits of precision guaranteed by + // type Double) and format the resulting representations + // in a IEEE 754-like "scientific notation" - for ex., + // 753.24 will be encoded as 7.5324e2 + BigDecimal testBigDecimal = new BigDecimal(varString, doubleMathContext); + caseRow[i] = String.format(FORMAT_IEEE754, testBigDecimal); + + // Strip meaningless zeros and extra + signs: + caseRow[i] = caseRow[i].replaceFirst("00*e", "e"); + caseRow[i] = caseRow[i].replaceFirst("\\.e", ".0e"); + caseRow[i] = caseRow[i].replaceFirst("e\\+00", ""); + caseRow[i] = caseRow[i].replaceFirst("^\\+", ""); + } + } catch (NumberFormatException ex) { + throw new IOException("Failed to parse a value recognized as numeric in the first pass! (?)"); + } } + */ + } + } else if (isTimeVariable[i] || isDateVariable[i]) { + // Time and Dates are stored NOT quoted (don't ask). + if (varString != null) { + // Dealing with quotes: + // remove the leading and trailing quotes, if present: + varString = varString.replaceFirst("^\"*", ""); + varString = varString.replaceFirst("\"*$", ""); + caseRow[i] = varString; } else { - try { - Double testDoubleValue = new Double(valueTokens[i]); - if (testDoubleValue.equals(0.0)) { - caseRow[i] = "0.0"; - } else { - // One possible implementation: - // - // Round our fractional values to 15 digits - // (minimum number of digits of precision guaranteed by - // type Double) and format the resulting representations - // in a IEEE 754-like "scientific notation" - for ex., - // 753.24 will be encoded as 7.5324e2 - BigDecimal testBigDecimal = new BigDecimal(valueTokens[i], doubleMathContext); - // an experiment - what's gonna happen if we just - // use the string representation of the bigdecimal object - // above? 
-                                //caseRow[i] = testBigDecimal.toString();
-=
-                                caseRow[i] = String.format(FORMAT_IEEE754, testBigDecimal);
-
-                                // Strip meaningless zeros and extra + signs:
-                                caseRow[i] = caseRow[i].replaceFirst("00*e", "e");
-                                caseRow[i] = caseRow[i].replaceFirst("\\.e", ".0e");
-                                caseRow[i] = caseRow[i].replaceFirst("e\\+00", "");
-                                caseRow[i] = caseRow[i].replaceFirst("^\\+", "");
-                            }
-
-                        } catch (NumberFormatException ex) {
-                            throw new IOException("Failed to parse a value recognized as numeric in the first pass! (?)");
-                        }
-                        }
-                        */
-                    }
-                } else if (isTimeVariable[i] || isDateVariable[i]) {
-                    // Time and Dates are stored NOT quoted (don't ask).
-                    if (valueTokens[i] != null) {
-                        String charToken = valueTokens[i];
-                        // Dealing with quotes:
-                        // remove the leading and trailing quotes, if present:
-                        charToken = charToken.replaceFirst("^\"*", "");
-                        charToken = charToken.replaceFirst("\"*$", "");
-                        caseRow[i] = charToken;
-                    } else {
-                        caseRow[i] = "";
-                    }
-                } else {
-                    // Treat as a String:
-                    // Strings are stored in tab files quoted;
-                    // Missing values are stored as tab-delimited nothing -
-                    // i.e., an empty string between two tabs (or one tab and
-                    // the new line);
-                    // Empty strings stored as "" (quoted empty string).
-                    // For the purposes of this CSV ingest reader, we are going
-                    // to assume that all the empty strings in the file are
-                    // indeed empty strings, and NOT missing values:
-                    if (valueTokens[i] != null) {
-                        String charToken = valueTokens[i];
-                        // Dealing with quotes:
-                        // remove the leading and trailing quotes, if present:
-                        charToken = charToken.replaceFirst("^\"", "");
-                        charToken = charToken.replaceFirst("\"$", "");
-                        // escape the remaining ones:
-                        charToken = charToken.replace("\"", "\\\"");
-                        // final pair of quotes:
-                        charToken = "\"" + charToken + "\"";
-                        caseRow[i] = charToken;
                     } else {
-                        caseRow[i] = "\"\"";
+                        // Treat as a String:
+                        // Strings are stored in tab files quoted;
+                        // Missing values are stored as an empty string
+                        // between two tabs (or one tab and the new line);
+                        // Empty strings stored as "" (quoted empty string).
+                        // For the purposes of this CSV ingest reader, we are going
+                        // to assume that all the empty strings in the file are
+                        // indeed empty strings, and NOT missing values:
+                        if (varString != null) {
+                            // escape the quotes, newlines, and tabs:
+                            varString = varString.replace("\"", "\\\"");
+                            varString = varString.replace("\n", "\\n");
+                            varString = varString.replace("\t", "\\t");
+                            // final pair of quotes:
+                            varString = "\"" + varString + "\"";
+                            caseRow[i] = varString;
+                        } else {
+                            caseRow[i] = "\"\"";
+                        }
                     }
                 }
+                finalOut.println(StringUtils.join(caseRow, "\t"));
             }
-
-            finalOut.println(StringUtils.join(caseRow, "\t"));
-            lineCounter++;
-
-        }
-
-        secondPassReader.close();
+        long linecount = parser.getRecordNumber();
         finalOut.close();
-
-        if (dataTable.getCaseQuantity().intValue() != lineCounter) {
-            throw new IOException("Mismatch between line counts in first and final passes!");
+        parser.close();
+        dbglog.fine("Tmp File: " + firstPassTempFile);
+        // Firstpass file is deleted to prevent tmp from filling up.
+        firstPassTempFile.delete();
+        if (dataTable.getCaseQuantity().intValue() != linecount) {
+            List<String> args = Arrays.asList(new String[]{"" + dataTable.getCaseQuantity().intValue(),
+                "" + linecount});
+            throw new IOException(BundleUtil.getStringFromBundle("ingest.csv.line_mismatch", args));
         }
-
-        return lineCounter;
+        return (int) linecount;
     }
 }
diff --git a/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/BrokenCSV.csv b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/BrokenCSV.csv
new file mode 100644
index 00000000000..4b84b5d601a
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/BrokenCSV.csv
@@ -0,0 +1,4 @@
+1,2,3,4,5,6
+1,3,4,5,6,7
+"1,2",3,4,5,6,4
+3,1,3,4
diff --git a/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReaderTest.java b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReaderTest.java
new file mode 100644
index 00000000000..85c8dadf2ac
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReaderTest.java
@@ -0,0 +1,422 @@
+/*
+ * To change this license header, choose License Headers in Project Properties.
+ * To change this template file, choose Tools | Templates
+ * and open the template in the editor.
+ */
+package edu.harvard.iq.dataverse.ingest.tabulardata.impl.plugins.csv;
+
+import edu.harvard.iq.dataverse.DataTable;
+import edu.harvard.iq.dataverse.dataaccess.TabularSubsetGenerator;
+import edu.harvard.iq.dataverse.datavariable.DataVariable.VariableInterval;
+import edu.harvard.iq.dataverse.datavariable.DataVariable.VariableType;
+import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataIngest;
+import edu.harvard.iq.dataverse.util.BundleUtil;
+import java.io.BufferedInputStream;
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.logging.Logger;
+import org.dataverse.unf.UNFUtil;
+import org.dataverse.unf.UnfException;
+import org.junit.Test;
+import static org.junit.Assert.*;
+
+/**
+ *
+ * @author oscardssmith
+ */
+public class CSVFileReaderTest {
+
+    private static final Logger logger = Logger.getLogger(CSVFileReaderTest.class.getCanonicalName());
+
+    /**
+     * Test CSVFileReader with a hellish CSV containing everything nasty I could
+     * think of to throw at it.
+     */
+    @Test
+    public void testRead() {
+        String testFile = "src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/IngestCSV.csv";
+        String[] expResult = {"-199 \"hello\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"2017/06/20\" 0.0 1 \"2\" \"823478788778713\"",
+            "2 \"Sdfwer\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"1100/06/20\" Inf 2 \"NaN\" \",1,2,3\"",
+            "0 \"cjlajfo.\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"3000/06/20\" -Inf 3 \"inf\" \"\\casdf\"",
+            "-1 \"Mywer\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"06-20-2011\" 3.141592653 4 \"4.8\" \"  \\\" \"",
+            "266128 \"Sf\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"06-20-1917\" 0 5 \"Inf+11\" \"\"",
+            "0 \"null\" 2013-04-08 13:14:23 2013-04-08 13:14:23 2017-06-20 \"03/03/1817\" 123 6.000001 \"11-2\" \"\\\"adf\\0\\na\\td\\nsf\\\"\"",
+            "-2389 \"\" 2013-04-08 13:14:23 2013-04-08 13:14:72 2017-06-20 \"2017-03-12\" NaN 2 \"nap\" \"💩⌛👩🏻■\""};
+        BufferedReader result = null;
+        try (BufferedInputStream stream = new BufferedInputStream(
+                new FileInputStream(testFile))) {
+            CSVFileReader instance = new CSVFileReader(new CSVFileReaderSpi());
+            File outFile = instance.read(stream, null).getTabDelimitedFile();
+            result = new BufferedReader(new FileReader(outFile));
+            logger.fine("Final pass: " + outFile.getPath());
+        } catch (IOException ex) {
+            fail("" + ex);
+        }
+
+        String foundLine = null;
+        assertNotNull(result);
+        int line = 0;
+        for (String expLine : expResult) {
+            try {
+                foundLine = result.readLine();
+            } catch (IOException ex) {
+                fail();
+            }
+            assertEquals("Error on line " + line, expLine, foundLine);
+            line++;
+        }
+
+    }
+
+    /*
+     * This test will read the CSV File From Hell, above, then will inspect
+     * the DataTable object produced by the plugin, and verify that the
+     * individual DataVariables have been properly typed.
+     */
+    @Test
+    public void testVariables() {
+        String testFile = "src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/IngestCSV.csv";
+
+        String[] expectedVariableNames = {"ints", "Strings", "Times", "Not quite Times", "Dates", "Not quite Dates",
+            "Numbers", "Not quite Ints", "Not quite Numbers", "Column that hates you, contains many comas, and is verbose and long enough that it would cause ingest to fail if ingest failed when a header was more than 256 characters long. Really, it's just sadistic. Also to make matters worse, the space at the begining of this sentance was a special unicode space designed to make you angry."};
+
+        VariableType[] expectedVariableTypes = {VariableType.NUMERIC, VariableType.CHARACTER,
+            VariableType.CHARACTER, VariableType.CHARACTER, VariableType.CHARACTER, VariableType.CHARACTER,
+            VariableType.NUMERIC, VariableType.NUMERIC, VariableType.CHARACTER, VariableType.CHARACTER};
+
+        VariableInterval[] expectedVariableIntervals = {VariableInterval.DISCRETE, VariableInterval.DISCRETE,
+            VariableInterval.DISCRETE, VariableInterval.DISCRETE, VariableInterval.DISCRETE, VariableInterval.DISCRETE,
+            VariableInterval.CONTINUOUS, VariableInterval.CONTINUOUS, VariableInterval.DISCRETE, VariableInterval.DISCRETE};
+
+        String[] expectedVariableFormatCategories = {null, null, "time", "time", "date", null, null, null, null, null};
+
+        String[] expectedVariableFormats = {null, null, "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd", null, null, null, null, null};
+
+        Long expectedNumberOfCases = 7L; // aka the number of lines in the TAB file produced by the ingest plugin
+
+        DataTable result = null;
+        try (BufferedInputStream stream = new BufferedInputStream(
+                new FileInputStream(testFile))) {
+            CSVFileReader instance = new CSVFileReader(new CSVFileReaderSpi());
+            result = instance.read(stream, null).getDataTable();
+        } catch (IOException ex) {
+            fail("" + ex);
+        }
+
+        assertNotNull(result);
+
+        assertNotNull(result.getDataVariables());
+
+        assertEquals(result.getVarQuantity(), new Long(result.getDataVariables().size()));
+
+        assertEquals(result.getVarQuantity(), new Long(expectedVariableTypes.length));
+
+        assertEquals(expectedNumberOfCases, result.getCaseQuantity());
+
+        // OK, let's go through the individual variables:
+        for (int i = 0; i < result.getVarQuantity(); i++) {
+
+            assertEquals("variable " + i + ":", expectedVariableNames[i], result.getDataVariables().get(i).getName());
+
+            assertEquals("variable " + i + ":", expectedVariableTypes[i], result.getDataVariables().get(i).getType());
+
+            assertEquals("variable " + i + ":", expectedVariableIntervals[i], result.getDataVariables().get(i).getInterval());
+
+            assertEquals("variable " + i + ":", expectedVariableFormatCategories[i], result.getDataVariables().get(i).getFormatCategory());
+
+            assertEquals("variable " + i + ":", expectedVariableFormats[i], result.getDataVariables().get(i).getFormat());
+        }
+    }
+
+    /*
+     * This test will read a CSV file, then attempt to subset
+     * the resulting tab-delimited file and verify that the individual variable vectors
+     * are legit.
+     */
+    @Test
+    public void testSubset() {
+        String testFile = "src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/election_precincts.csv";
+        Long expectedNumberOfVariables = 13L;
+        Long expectedNumberOfCases = 24L; // aka the number of lines in the TAB file produced by the ingest plugin
+
+        TabularDataIngest ingestResult = null;
+
+        File generatedTabFile = null;
+        DataTable generatedDataTable = null;
+
+        try (BufferedInputStream stream = new BufferedInputStream(
+                new FileInputStream(testFile))) {
+            CSVFileReader instance = new CSVFileReader(new CSVFileReaderSpi());
+
+            ingestResult = instance.read(stream, null);
+
+            generatedTabFile = ingestResult.getTabDelimitedFile();
+            generatedDataTable = ingestResult.getDataTable();
+        } catch (IOException ex) {
+            fail("" + ex);
+        }
+
+        assertNotNull(generatedDataTable);
+
+        assertNotNull(generatedDataTable.getDataVariables());
+
+        assertEquals(generatedDataTable.getVarQuantity(), new Long(generatedDataTable.getDataVariables().size()));
+
+        assertEquals(generatedDataTable.getVarQuantity(), expectedNumberOfVariables);
+
+        assertEquals(expectedNumberOfCases, generatedDataTable.getCaseQuantity());
+
+        // And now let's try and subset the individual vectors
+        // First, the "continuous" vectors (we should be able to read these as Double[]):
+        int[] floatColumns = {2};
+
+        Double[][] floatVectors = {
+            {1.0, 3.0, 4.0, 6.0, 7.0, 8.0, 11.0, 12.0, 76.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0, 77.0},
+        };
+
+        int vectorCount = 0;
+        for (int i : floatColumns) {
+            // We'll be subsetting the column vectors one by one, re-opening the
+            // file each time. Inefficient - but we don't care here.
+
+            if (!generatedDataTable.getDataVariables().get(i).isIntervalContinuous()) {
+                fail("Column " + i + " was not properly processed as \"continuous\"");
+            }
+            FileInputStream generatedTabInputStream = null;
+            try {
+                generatedTabInputStream = new FileInputStream(generatedTabFile);
+            } catch (FileNotFoundException ioex) {
+                fail("Failed to open generated tab-delimited file for reading" + ioex);
+            }
+
+            Double[] columnVector = TabularSubsetGenerator.subsetDoubleVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+
+            assertArrayEquals("column " + i + ":", floatVectors[vectorCount++], columnVector);
+        }
+
+        // Discrete Numerics (aka, integers):
+        int[] integerColumns = {1, 4, 6, 7, 8, 9, 10, 11, 12};
+
+        Long[][] longVectors = {
+            {1L, 3L, 4L, 6L, 7L, 8L, 11L, 12L, 76L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L, 77L},
+            {1L, 2L, 3L, 4L, 5L, 11L, 13L, 15L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L},
+            {85729227L, 85699791L, 640323976L, 85695847L, 637089796L, 637089973L, 85695001L, 85695077L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {205871733L, 205871735L, 205871283L, 258627915L, 257444575L, 205871930L, 260047422L, 262439738L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {205871673L, 205871730L, 205871733L, 205872857L, 258627915L, 257444584L, 205873413L, 262439738L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {25025000201L, 25025081001L, 25025000701L, 25025050901L, 25025040600L, 25025000502L, 25025040401L, 25025100900L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {250250502002L, 250250502003L, 250250501013L, 250250408011L, 250250503001L, 250250103001L, 250250406002L, 250250406001L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {250251011024001L, 250251011013003L, 250251304041007L, 250251011013006L, 250251010016000L, 250251011024002L, 250251001005004L, 250251002003002L, 1111111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L},
+            {2109L, 2110L, 2111L, 2120L, 2121L, 2115L, 2116L, 2122L, 11111L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L, 4444444L}
+        };
+
+        vectorCount = 0;
+
+        for (int i : integerColumns) {
+            if (!generatedDataTable.getDataVariables().get(i).isIntervalDiscrete()
+                    || !generatedDataTable.getDataVariables().get(i).isTypeNumeric()) {
+                fail("Column " + i + " was not properly processed as \"discrete numeric\"");
+            }
+            FileInputStream generatedTabInputStream = null;
+            try {
+                generatedTabInputStream = new FileInputStream(generatedTabFile);
+            } catch (FileNotFoundException ioex) {
+                fail("Failed to open generated tab-delimited file for reading" + ioex);
+            }
+
+            Long[] columnVector = TabularSubsetGenerator.subsetLongVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+
+            assertArrayEquals("column " + i + ":", longVectors[vectorCount++], columnVector);
+        }
+
+        // And finally, Strings:
+        int[] stringColumns = {0, 3, 5};
+
+        String[][] stringVectors = {
+            {"Dog", "Squirrel", "Antelope", "Zebra", "Lion", "Gazelle", "Cat", "Giraffe", "Cat", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey", "Donkey"},
+            {"East Boston", "Charlestown", "South Boston", "Bronx", "Roslindale", "Mission Hill", "Jamaica Plain", "Hyde Park", "Fenway/Kenmore", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens", "Queens"},
+            {"2-06", "1-09", "1-1A", "1-1B", "2-04", "3-05", "1-1C", "1-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A", "41-10A",}
+        };
+
+        vectorCount = 0;
+
+        for (int i : stringColumns) {
+            if (!generatedDataTable.getDataVariables().get(i).isTypeCharacter()) {
+                fail("Column " + i + " was not properly processed as a character vector");
+            }
+            FileInputStream generatedTabInputStream = null;
+            try {
+                generatedTabInputStream = new FileInputStream(generatedTabFile);
+            } catch (FileNotFoundException ioex) {
+                fail("Failed to open generated tab-delimited file for reading" + ioex);
+            }
+
+            String[] columnVector = TabularSubsetGenerator.subsetStringVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+
+            assertArrayEquals("column " + i + ":", stringVectors[vectorCount++], columnVector);
+        }
+    }
+
+    /*
+     * UNF test;
+     * I'd like to use a file with more interesting values - "special" numbers, freaky dates, accents, etc.
+     * for this. But checking it in with this simple file, for now.
+     * (thinking about it, the "csv file from hell" may be a better test case for the UNF test)
+     */
+    @Test
+    public void testVariableUNFs() {
+        String testFile = "src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/election_precincts.csv";
+        Long expectedNumberOfVariables = 13L;
+        Long expectedNumberOfCases = 24L; // aka the number of lines in the TAB file produced by the ingest plugin
+
+        String[] expectedUNFs = {
+            "UNF:6:wb7OATtNC/leh1sOP5IGDQ==",
+            "UNF:6:0V3xQ3ea56rzKwvGt9KBCA==",
+            "UNF:6:0V3xQ3ea56rzKwvGt9KBCA==",
+            "UNF:6:H9inAvq5eiIHW6lpqjjKhQ==",
+            "UNF:6:Bh0M6QvunZwW1VoTyioRCQ==",
+            "UNF:6:o5VTaEYz+0Kudf6hQEEupQ==",
+            "UNF:6:eJRvbDJkIeDPrfN2dYpRfA==",
+            "UNF:6:JD1wrtM12E7evrJJ3bRFGA==",
+            "UNF:6:xUKbK9hb5o0nL5/mYiy7Bw==",
+            "UNF:6:Mvq3BrdzoNhjndMiVr92Ww==",
+            "UNF:6:KkHM6Qlyv3QlUd+BKqqB3Q==",
+            "UNF:6:EWUVuyXKSpyllsrjHnheig==",
+            "UNF:6:ri9JsRJxM2xpWSIq17oWNw=="};
+
+        TabularDataIngest ingestResult = null;
+
+        File generatedTabFile = null;
+        DataTable generatedDataTable = null;
+
+        try (BufferedInputStream stream = new BufferedInputStream(
+                new FileInputStream(testFile))) {
+            CSVFileReader instance = new CSVFileReader(new CSVFileReaderSpi());
+
+            ingestResult = instance.read(stream, null);
+
+            generatedTabFile = ingestResult.getTabDelimitedFile();
+            generatedDataTable = ingestResult.getDataTable();
+        } catch (IOException ex) {
+            fail("" + ex);
+        }
+
+        assertNotNull(generatedDataTable);
+
+        assertNotNull(generatedDataTable.getDataVariables());
+
+        assertEquals(generatedDataTable.getVarQuantity(), new Long(generatedDataTable.getDataVariables().size()));
+
+        assertEquals(generatedDataTable.getVarQuantity(), expectedNumberOfVariables);
+
+        assertEquals(expectedNumberOfCases, generatedDataTable.getCaseQuantity());
+
+        for (int i = 0; i < expectedNumberOfVariables; i++) {
+            String unf = null;
+
+            if (generatedDataTable.getDataVariables().get(i).isIntervalContinuous()) {
+                FileInputStream generatedTabInputStream = null;
+                try {
+                    generatedTabInputStream = new FileInputStream(generatedTabFile);
+                } catch (FileNotFoundException ioex) {
+                    fail("Failed to open generated tab-delimited file for reading" + ioex);
+                }
+
+                Double[] columnVector = TabularSubsetGenerator.subsetDoubleVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+                try {
+                    unf = UNFUtil.calculateUNF(columnVector);
+                } catch (IOException | UnfException ioex) {
+                    fail("Failed to generate the UNF for variable number " + i + ", (" + generatedDataTable.getDataVariables().get(i).getName() + ", floating point)");
+                }
+
+            }
+            if (generatedDataTable.getDataVariables().get(i).isIntervalDiscrete()
+                    && generatedDataTable.getDataVariables().get(i).isTypeNumeric()) {
+
+                FileInputStream generatedTabInputStream = null;
+                try {
+                    generatedTabInputStream = new FileInputStream(generatedTabFile);
+                } catch (FileNotFoundException ioex) {
+                    fail("Failed to open generated tab-delimited file for reading" + ioex);
+                }
+
+                Long[] columnVector = TabularSubsetGenerator.subsetLongVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+
+                try {
+                    unf = UNFUtil.calculateUNF(columnVector);
+                } catch (IOException | UnfException ioex) {
+                    fail("Failed to generate the UNF for variable number " + i + ", (" + generatedDataTable.getDataVariables().get(i).getName() + ", integer)");
+                }
+
+            }
+            if (generatedDataTable.getDataVariables().get(i).isTypeCharacter()) {
+
+                FileInputStream generatedTabInputStream = null;
+                try {
+                    generatedTabInputStream = new FileInputStream(generatedTabFile);
+                } catch (FileNotFoundException ioex) {
+                    fail("Failed to open generated tab-delimited file for reading" + ioex);
+                }
+
+                String[] columnVector = TabularSubsetGenerator.subsetStringVector(generatedTabInputStream, i, generatedDataTable.getCaseQuantity().intValue());
+
+                String[] dateFormats = null;
+
+                // Special handling for Character strings that encode dates and times:
+                if ("time".equals(generatedDataTable.getDataVariables().get(i).getFormatCategory())
+                        || "date".equals(generatedDataTable.getDataVariables().get(i).getFormatCategory())) {
+
+                    dateFormats = new String[expectedNumberOfCases.intValue()];
+                    for (int j = 0; j < expectedNumberOfCases; j++) {
+                        dateFormats[j] = generatedDataTable.getDataVariables().get(i).getFormat();
+                    }
+                }
+
+                try {
+                    if (dateFormats == null) {
+                        unf = UNFUtil.calculateUNF(columnVector);
+                    } else {
+                        unf = UNFUtil.calculateUNF(columnVector, dateFormats);
+                    }
+                } catch (IOException | UnfException iex) {
+                    fail("Failed to generate the UNF for variable number " + i + ", (" + generatedDataTable.getDataVariables().get(i).getName() + ", " + (dateFormats == null ? "String" : "Date/Time value") + ")");
+                }
+            }
+
+            assertEquals("Variable number " + i + ":", expectedUNFs[i], unf);
+        }
+
+    }
+
+    /**
+     * Tests CSVFileReader with a CSV whose rows do not all match the header's
+     * column count. Tests CSVFileReader with a null CSV.
+     */
+    @Test
+    public void testBrokenCSV() {
+        String brokenFile = "src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/BrokenCSV.csv";
+        try {
+            new CSVFileReader(new CSVFileReaderSpi()).read(null, null);
+            fail("IOException not thrown on null csv");
+        } catch (NullPointerException ex) {
+            String expMessage = null;
+            assertEquals(expMessage, ex.getMessage());
+        } catch (IOException ex) {
+            String expMessage = BundleUtil.getStringFromBundle("ingest.csv.nullStream");
+            assertEquals(expMessage, ex.getMessage());
+        }
+        try (BufferedInputStream stream = new BufferedInputStream(
+                new FileInputStream(brokenFile))) {
+            new CSVFileReader(new CSVFileReaderSpi()).read(stream, null);
+            fail("IOException was not thrown when columns do not align.");
+        } catch (IOException ex) {
+            String expMessage = BundleUtil.getStringFromBundle("ingest.csv.recordMismatch",
+                    Arrays.asList(new String[]{"3", "6", "4"}));
+            assertEquals(expMessage, ex.getMessage());
+        }
+    }
+}
diff --git a/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/IngestCSV.csv b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/IngestCSV.csv
new file mode 100644
index 00000000000..c09b407916a
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/IngestCSV.csv
@@ -0,0 +1,9 @@
+ints,Strings,Times,Not quite Times,Dates,Not quite Dates,Numbers,Not quite Ints,Not quite Numbers,"Column that hates you, contains many comas, and is verbose and long enough that it would cause ingest to fail if ingest failed when a header was more than 256 characters long. Really, it's just sadistic. Also to make matters worse, the space at the begining of this sentance was a special unicode space designed to make you angry."
+-199,hello,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,2017/06/20,null,1,2,823478788778713
+2,Sdfwer,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,1100/06/20,INF,2,NaN,",1,2,3"
+0,cjlajfo.,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,3000/06/20,-inf,3,inf,\casdf
+-1,Mywer,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,06-20-2011,3.141592653,4,4.8,"  "" "
+266128,Sf,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,06-20-1917,0,5,Inf+11,
+null,null,2013-04-08 13:14:23,2013-04-08 13:14:23,2017-06-20,03/03/1817,123,6.000001,11-2,"""adf\0\na\td
+sf"""
+-2389,,2013-04-08 13:14:23,2013-04-08 13:14:72,2017-06-20,2017-03-12,nan,2,nap,💩⌛👩🏻■
diff --git a/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/election_precincts.csv b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/election_precincts.csv
new file mode 100644
index 00000000000..f3ef4cd74c4
--- /dev/null
+++ b/src/test/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/election_precincts.csv
@@ -0,0 +1,25 @@
+Animal,Election Precinct Int,Election Precinct Double,Neighborhood,Police District ID,Public Works ID,Tiger Line ID,Road Left Size,Road Right
Size,GEOID10 Census Tract,BG_ID_10 Census Block Group,Census Block FIPS15,ZIP5 +Dog,1,1.00000000000,East Boston,1,2-06,85729227,205871733,205871673,25025000201,250250502002,250251011024001,02109 +Squirrel,3,3.00000000000,Charlestown,2,1-09,85699791,205871735,205871730,25025081001,250250502003,250251011013003,02110 +Antelope,4,4.00000000000,South Boston,3,1-1A,640323976,205871283,205871733,25025000701,250250501013,250251304041007,02111 +Zebra,6,6.00000000000,Bronx,4,1-1B,85695847,258627915,205872857,25025050901,250250408011,250251011013006,02120 +Lion,7,7.00000000000,Roslindale,5,2-04,637089796,257444575,258627915,25025040600,250250503001,250251010016000,02121 +Gazelle,8,8.00000000000,Mission Hill,11,3-05,637089973,205871930,257444584,25025000502,250250103001,250251011024002,2115 +Cat,11,11.00000000000,Jamaica Plain,13,1-1C,85695001,260047422,205873413,25025040401,250250406002,250251001005004,2116 +Giraffe,12,12.00000000000,Hyde Park,15,1-10A,85695077,262439738,262439738,25025100900,250250406001,250251002003002,02122 +Cat,76,76.00000000000,Fenway/Kenmore,19,41-10A,1111111,1111111,1111111,1111111,1111111,1111111,11111 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 
+Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 +Donkey,77,77.00000000000,Queens,19,41-10A,4444444,4444444,4444444,4444444,4444444,4444444,4444444 \ No newline at end of file