Configure disabling the interpretation of dates #80

Closed
jehugaleahsa opened this issue Dec 17, 2020 · 7 comments
Comments

@jehugaleahsa

It would be nice if the SasFileReader interface let you disable date interpretation. For what I am working on, it would be beneficial to return the original number or an ISO date/time string. Even a System.setProperty would be a nice solution for backward compatibility.

For one, Java's Date is pretty legacy. Even though the conversion to Java 8 date types is pretty straightforward, it opens up a lot of questions, like whether to use LocalDate, LocalDateTime, ZonedDateTime, etc., that are really irrelevant to the SAS datasets I am processing.

Maybe more importantly, I lose whether the original value was days since the epoch or seconds since the epoch, and I don't want to have to write the inverse of your format-to-date logic, since there are so many formats. Other steps later on in the process already deal with converting the numeric values in SAS for comparison purposes, so I feel like I am wasting time undoing conversions or converting from Date to a different Java 8 date class.

@xantorohara
Contributor

Implementation proposal

SAS supports a huge number of date, time and datetime formats.

In general it supports more than a hundred formats (https://v8doc.sas.com/sashtml/lgref/z0309859.htm).
Each format can additionally be expanded into different variations using the "x", "w" and "d" modifiers.
So, in the end, SAS supports thousands of format variations.

I've implemented and unit-tested more than 500 variations.

Implementation details

  • All date-related logic has been moved into a separate package, "com.epam.parso.date".
  • Created different formatters for date, time and datetime processing.
  • Format functions are available via a single SasTemporalFormatter class.
  • Created 50+ unit tests that cover 500+ cases.
  • All new date-related datasets were added to a separate location, "com/epam/parso/date".
  • Added 50+ new test datasets with different date formats.
  • The SAS programs that create these datasets are placed alongside the datasets.
  • Added a new SasReader constructor that accepts one of these output options:
    • JAVA_DATE - Outputs the date as java.util.Date.
    • SAS_FORMAT - Outputs the date as a String formatted according to the column's format defined in the source SAS file.
    • SAS_VALUE - Outputs the date as the raw SAS value: the number of days (for the date type) or seconds (for the datetime type) since 1 January 1960, as a Java double.
    • EPOCH_SECONDS - Outputs the date as the number of seconds since 1 January 1970, as a Java double.
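To make the relationship between the SAS_VALUE and EPOCH_SECONDS representations concrete, here is a minimal sketch that uses only java.time. It is not Parso code: the class and method names are hypothetical, and it assumes the values are plain counts from 1 January 1960 with no time zone adjustment.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper, not part of Parso: converts raw SAS values by hand.
public final class SasValueConversions {
    // SAS counts dates and datetimes from 1 January 1960.
    private static final LocalDate SAS_EPOCH_DATE = LocalDate.of(1960, 1, 1);
    private static final LocalDateTime SAS_EPOCH_DATE_TIME = SAS_EPOCH_DATE.atStartOfDay();
    // 1960-01-01 to 1970-01-01 is 3653 days (ten years, three of them leap years).
    private static final long SAS_TO_UNIX_OFFSET_SECONDS = 3653L * 86_400L;

    private SasValueConversions() {}

    // Raw SAS date value (days since 1960-01-01) to a LocalDate.
    public static LocalDate dateFromSasDays(double sasDays) {
        return SAS_EPOCH_DATE.plusDays((long) Math.floor(sasDays));
    }

    // Raw SAS datetime value (seconds since 1960-01-01T00:00) to a LocalDateTime.
    public static LocalDateTime dateTimeFromSasSeconds(double sasSeconds) {
        long wholeSeconds = (long) Math.floor(sasSeconds);
        long nanos = Math.round((sasSeconds - wholeSeconds) * 1_000_000_000L);
        return SAS_EPOCH_DATE_TIME.plusSeconds(wholeSeconds).plusNanos(nanos);
    }

    // Raw SAS datetime value to seconds since 1970-01-01T00:00.
    public static double epochSecondsFromSasSeconds(double sasSeconds) {
        return sasSeconds - SAS_TO_UNIX_OFFSET_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(dateFromSasDays(0));              // 1960-01-01
        System.out.println(dateFromSasDays(3653));           // 1970-01-01
        System.out.println(epochSecondsFromSasSeconds(0.0)); // -3.156192E8
    }
}
```

The point of the sketch is that the same number means different things for date and datetime columns, so a caller working with raw values needs to know the column type.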

This implementation should not break existing clients of Parso, because by default it still returns dates as Java Dates, as it did before.

It does not yet fully support all possible SAS date formats.
For now, all formats from "com.epam.parso.impl.DateTimeConstants" are present in the new implementation, but some of them don't support the xwd modifiers and don't have test datasets or unit tests. There is still a lot of work to be done.

But the approach itself works and will remain the same as here: xantorohara/parso@master...xantorohara:feature/80-date-formats

@jehugaleahsa
Author

@xantorohara The SAS_VALUE option sounds exactly like what I need. This will help simplify some code on my end. Most of the time I am just comparing start dates to end dates, so as long as they are strings that can be compared lexicographically (ISO-8601 or yyyyMMdd) or are just numbers, the comparisons work as expected. How the dates get formatted in reports is dictated by another non-SAS organization anyway, so we essentially ignore SAS formats.
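A small illustration of why raw numbers and year-first strings compare correctly while other display formats do not. This is only a sketch with hypothetical names, using plain java.time and no Parso calls; the sample dates are arbitrary.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class: ordering checks on raw values and date strings.
public final class DateOrderingDemo {
    private DateOrderingDemo() {}

    // True when start is not after end, comparing raw numeric values.
    public static boolean inOrder(double start, double end) {
        return Double.compare(start, end) <= 0;
    }

    // True when start sorts at or before end as plain strings.
    // This matches chronological order only for fixed-width, year-first formats.
    public static boolean inOrder(String start, String end) {
        return start.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2020, 12, 17);
        LocalDate end = LocalDate.of(2021, 1, 25);

        // ISO-8601 (yyyy-MM-dd): string order equals date order.
        System.out.println(inOrder(start.toString(), end.toString())); // true

        // yyyyMMdd: also safe to compare as strings.
        DateTimeFormatter basic = DateTimeFormatter.BASIC_ISO_DATE;
        System.out.println(inOrder(start.format(basic), end.format(basic))); // true

        // Month-first text breaks: "12/..." sorts after "01/...".
        System.out.println(inOrder("12/17/2020", "01/25/2021")); // false
    }
}
```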

Thank you for your efforts!

xantorohara added a commit to xantorohara/parso that referenced this issue Jan 23, 2021
xantorohara added a commit to xantorohara/parso that referenced this issue Jan 23, 2021
printsev pushed a commit that referenced this issue Jan 25, 2021
* #80 support output of dates as Java Date, LocalDate or LocalDateTime, Epoch seconds or SAS value
@xantorohara
Contributor

So, for now, Parso can represent dates as:

  • raw SAS seconds
  • unix epoch milliseconds
  • Java Date
  • Java LocalDate or LocalDateTime

As for producing dates using SAS formats, it turned out to be more work than it seemed.
SAS uses very strange rounding rules. Each format has its own separate, buggy set of such rules.
So the most difficult part of formatting is reproducing the SAS bugs.
This requires a lot of investigation, and it is going to be implemented separately.

@printsev
Contributor

closed thanks to xantorohara

@jehugaleahsa
Author

Will this be available as 2.0.14 when it gets released? I checked earlier this week and all I saw was 2.0.13 on Maven Central. Once it's available, I will test it and let you know if I find anything. Thanks!

@printsev
Contributor

You can try this out in the latest snapshot (2.0.14-SNAPSHOT); we usually aggregate more changes before we make a release. Would that be OK for you?

@jehugaleahsa
Author

I gave this a try tonight - I finally had some time! 😅

The good thing is switching to SAS_VALUE caused a bunch of my unit tests to start failing where I was incorrectly converting LocalDate to seconds since the epoch (1960). Oops. That was an easy fix and had nothing to do with these changes.
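For readers who hit a similar mistake, here is a hedged sketch of converting a LocalDate to values counted from the 1960 SAS epoch. The original faulty code is not shown in this thread, so this is only one plausible shape of the fix; the class and method names are hypothetical. A common trap is LocalDate.toEpochDay(), which counts from 1970 rather than 1960.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: LocalDate to values counted from the SAS epoch (1960-01-01).
public final class LocalDateToSas {
    private static final LocalDate SAS_EPOCH = LocalDate.of(1960, 1, 1);
    private static final long SECONDS_PER_DAY = 86_400L;

    private LocalDateToSas() {}

    // Days since 1960-01-01, the scale of a SAS date value.
    public static long toSasDays(LocalDate date) {
        return ChronoUnit.DAYS.between(SAS_EPOCH, date);
    }

    // Seconds since 1960-01-01T00:00 at the start of the given day,
    // the scale of a SAS datetime value.
    public static long toSasSeconds(LocalDate date) {
        return toSasDays(date) * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        LocalDate unixEpoch = LocalDate.of(1970, 1, 1);
        System.out.println(unixEpoch.toEpochDay());  // 0 (counted from 1970)
        System.out.println(toSasDays(unixEpoch));    // 3653 (counted from 1960)
        System.out.println(toSasSeconds(unixEpoch)); // 315619200
    }
}
```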

I had a couple blocks of code converting Date to String or Date to Double (but I am pretty sure the Date interpretation stuff only happened in the case of numbers). I verified those branches of code never get executed after switching and then removed the code. All of my tests passed. I also ran this with some real world data and made sure my downstream processes were unaffected. Everything looks good.

I did some performance testing to see if avoiding Date made any difference. It did! For some of my datasets containing millions of records, working with the raw numbers shaved off several seconds. It's not a dramatic difference but, when you are processing dozens of massive datasets in parallel, it adds up.

I look forward to using this in production. Thanks again to everyone who helped to make this possible!


3 participants