-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wqbc incorrectly assumes that MDL_UNIT is the same as UNIT in rems data #158
Comments
So we have confirmation that this can't be altered in the database at this time, so the fix will have to happen here, or in rems. Here's a table of all the non-matching units in the two columns - we'll have to manually populate a third column with a conversion factor: library(rems)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)
unit_tbl_hist <- hist_tbl %>%
select(UNIT, MDL_UNIT) %>%
filter(UNIT != MDL_UNIT) %>%
distinct() %>%
collect()
unit_tbl_2yr <- get_ems_data("2yr") %>%
select(UNIT, MDL_UNIT) %>%
filter(UNIT != MDL_UNIT) %>%
distinct()
#> Fetching data from cache...
unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>%
distinct()
knitr::kable(unit_tbl, "simple")
Created on 2021-02-09 by the reprex package (v1.0.0) |
|
Good point. There are definitely some that are nonsensical (mg/L <=> Vehicles??). Here they are with counts: library(rems)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)
unit_tbl_hist <- hist_tbl %>%
filter(UNIT != MDL_UNIT) %>%
count(UNIT, MDL_UNIT) %>%
collect()
unit_tbl_2yr <- get_ems_data("2yr", dont_update = TRUE) %>%
filter(UNIT != MDL_UNIT) %>%
count(UNIT, MDL_UNIT)
#> Fetching data from cache...
unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>%
group_by(UNIT, MDL_UNIT) %>%
summarise(n = sum(n)) %>%
arrange(-n)
#> `summarise()` has grouped output by 'UNIT'. You can override using the `.groups` argument.
knitr::kable(unit_tbl, "simple")
Created on 2021-02-10 by the reprex package (v1.0.0) |
Units that are currently recognised by wqbc and can be used for permitted conversions:
Most cases are dealt with by |
In terms of internal package structure and future extensibility (i.e., adding more conversions as needed), would it be more efficient to use a lookup table (i.e., above, with a column of conversion factors)? |
I think it would be more efficient to expand the units recognized by wqbc - the code to do the conversions is relatively elegant. All we have to do is to add them to
|
I was looking at it, and I agree it's very elegant! One thing I wonder is if there are units within a unit type that are incompatible, and how we would treat those |
Can you provide an example? |
I can't provide an example because I'm just not sure - I don't know what a lot of those units even are or what type of quantity they represent... @JessicaPenno @KarHarker can you provide input on what unit combinations in the above table are important and/or don't make sense? One other question (which I haven't thought through, so this may be a silly question) - Using the multiplier approach currently implemented could get complicated when both the numerator and denominator are different (e.g., ug/g <=> mg/kg). |
I wonder if we can leverage the |
|
It doesn't have all the units, but it's easy to define new units and set conversion factors. Here's some experimenting I did, I think it looks pretty good. It takes care of the vast majority of them, and the rest are either really specific (e.g. relying on the molar mass of the chemical etc) or errors/nonsensical: library(rems)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(units)
#> udunits system database from /usr/local/share/udunits
install_symbolic_unit("MPN")
install_conversion_constant("MPN", "CFU", 1)
install_symbolic_unit("NTU")
install_conversion_constant("NTU", "JTU", 1)
install_conversion_constant("US_liquid_gallon", "USG", 1)
install_conversion_constant("UK_liquid_gallon", "IG", 1)
# These petroleum measures are actually only ever used
# as volumes (1m3 = 0.001 E3m3 = 1e-6), but can't use E3m3
# to install the unit so have to do it as m, units takes
# care of the rest when it is m3 <=> E3m3 etc.
install_conversion_constant("m", "E3m", .1)
install_conversion_constant("m", "E6m", .01)
install_conversion_constant("L", "E6L", 1e-6)
#> set_units(set_units(1, "m3"), "E3m3")
# 0.001 [E3m3]
#> set_units(set_units(1, "m3"), "E6m3")
# 1e-06 [E6m3]
convert_values2 <- function(x, from, to) {
to <- clean_unit(to)
from <- clean_unit(from)
x <- units::set_units(x, from, mode = "standard")
as.numeric(units::set_units(x, to, mode = "standard"))
}
clean_unit <- function(x) {
# Remove trailing A, W, wet etc. as well as percent type (V/V, W/W, Moratlity)
# Assuming they are not imporant in the unit conversion??
gsub("\\s*(W|wet|A|\\(W/W\\)|\\(V/V\\)|\\(Mortality\\))\\s*$", "", x)
}
con <- connect_historic_db()
#> Please remember to use 'disconnect_historic_db()' when you are finished querying the historic database.
hist_tbl <- attach_historic_data(con)
unit_tbl_hist <- hist_tbl %>%
filter(UNIT != MDL_UNIT) %>%
count(UNIT, MDL_UNIT) %>%
collect()
unit_tbl_2yr <- get_ems_data("2yr", dont_update = TRUE) %>%
filter(UNIT != MDL_UNIT) %>%
count(UNIT, MDL_UNIT)
#> Fetching data from cache...
unit_tbl <- bind_rows(unit_tbl_hist, unit_tbl_2yr) %>%
group_by(UNIT, MDL_UNIT) %>%
summarise(n = sum(n)) %>%
arrange(-n)
#> `summarise()` has grouped output by 'UNIT'. You can override using the `.groups` argument.
unit_tbl$converted_from_1 <- NA_real_
for (i in seq_len(nrow(unit_tbl))) {
unit_tbl$converted_from_1[i] <- tryCatch(
convert_values2(1, unit_tbl$UNIT[i], unit_tbl$MDL_UNIT[i]),
error = function(e) NA_real_
)
}
#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()
#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()
#> Warning: Could not parse expression: '`N`/'. Returning as a single symbolic
#> unit()
knitr::kable(arrange(unit_tbl, is.na(converted_from_1), desc(n)))
Created on 2021-02-11 by the reprex package (v1.0.0) |
Wow I didn't know you could set units in |
It certainly can live in rems. I think it would have to be a utility function that the user calls rather than happening automatically, as there are ways that a user can pull the data into R where we wouldn't be able to intervene with the unit conversion (e.g., calling |
Good point - maybe then its conceptually more consistent with the various package philosophies if it resides in wqbc? |
I've implemented the basics in rems. Unfortunately it's not that fast - about 11 seconds on 5000 rows... |
That's too bad - could it be run as part of the database creation process which is lengthy already anyway? |
It could, since for the SQLite backend the data does traverse through R - though ~2M rows with inconsistent units will need to be fixed, and if that timing scales linearly it will take over an hour! Also, when DuckDB (which is looking more and more promising) matures I do intend to use it as a backend - and it will create the db directly from the csv, so any R processing is then not an option. |
We can see how long it takes within wqbc/shinyrems workflows. |
There is an unexported function in library(microbenchmark)
library(units)
#> udunits system database from /usr/local/share/udunits
microbenchmark(
units:::ud_convert(1, "mg/L", "ug/L"),
set_units(set_units(1, "mg/L"), "ug/L")
)
#> Unit: microseconds
#> expr min lq mean
#> units:::ud_convert(1, "mg/L", "ug/L") 25.729 31.9175 37.05395
#> set_units(set_units(1, "mg/L"), "ug/L") 1995.143 2111.7080 2388.55648
#> median uq max neval
#> 35.0855 41.086 145.008 100
#> 2166.6645 2521.354 6220.956 100 Created on 2021-02-16 by the reprex package (v1.0.0) I'll open an issue in |
Ok I think I have a pretty good solution at bcgov/rems#57 |
Is it fast enough to apply universally to all the data? |
It does 1M rows in < 1s, but I won't do it at the db creation stage due to the likely future duckdb migration I mentioned above. |
Confirm closed by https://github.com/bcgov/rems/releases/tag/v0.7.0 |
wqbc currently assumes that MDL_UNIT is UNIT in rems data but this is not always the case.
Consider selenium concentrations on the elk river.
There are two possible fixes.
The first is to alter the rems database so that the two are always the same.
The second is to update wqbc to require both and adjust independently. This is quite a lot of work to ensure the unit types are consistent and convertible and to perform the conversion and to provide informative error messages when this is not the case.
The text was updated successfully, but these errors were encountered: