Column-wise DataFrame-like operations #538
Hi,
I am new to Tablesaw. I am exploring options for recreating some Python DataFrame operations in Tablesaw. For example, I have a DataFrame object called data1, and I use existing columns of this data frame to create new ones (and update existing ones). Here are a couple of lines of Python code:

data1['days'] = data1['buyDate'].apply(lambda x: (today - x).days)
...
data1['CAGR'] = ((data1['curValue'] / data1['bookValue']) ** (1.0 / data1['nYears']) - 1.0) * 100

Is there a way to implement something similar using Tablesaw?
Thanks a lot in advance.
@lwhite1 updated the answer below slightly, for posterity.
This:

data1['days'] = data1['buyDate'].apply(lambda x: (today - x).days)

could also be written in tablesaw as:
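(A sketch of the idea, not the exact snippet; it assumes a table data1 with a DateColumn named buyDate. The daysUntil call mentioned just below is the more columnar alternative.)

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

LocalDate today = LocalDate.now();
IntColumn days = IntColumn.create("days");
for (Row row : data1) {
    // the equivalent of (today - x).days in the Python
    days.append((int) ChronoUnit.DAYS.between(row.getDate("buyDate"), today));
}
data1.addColumns(days);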
(If the sign on the days value isn't what you want, you can call .neg() on the result of daysUntil.) The second one is probably as good as it could be, given the tablesaw functionality at the time. I'm very impressed by how concise the Python code is. I wasn't aware that today is a function or that you can divide a number by a column. The wonders of operator overloading :)
Thanks. I will work through the steps. One quick question: is there a .power(double n) method to get x to the power n? Or, the opposite would be .root() to get the nth root? The .multiply() is not the same as .power().
There is a power function: .power().
There is a square root function: .sqrt(),
and a cube root function: .cubeRoot().
There is no general root function currently.
You can find these functions in tech.tablesaw.columns.numbers.NumberMapFunctions.
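For example, since .power() takes any exponent, an nth root can be written as power(1.0 / n) (a sketch, assuming a DoubleColumn named x):

DoubleColumn x = DoubleColumn.create("x", new double[] {8.0, 27.0, 64.0});
DoubleColumn squares = x.power(2);          // x^2
DoubleColumn roots = x.sqrt();              // square roots
DoubleColumn cubeRoots = x.cubeRoot();      // cube roots
DoubleColumn fifthRoots = x.power(1.0 / 5); // an nth root via a fractional exponent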
Great! I tried the second line first. Everything works except this part (my test data frame is called x):
It looks like you have to do
Thanks again. I do have one more question.
You could probably do something like:
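(A sketch reconstructed from the CAGR example in the question; it assumes the data1 table and the column names from the Python.)

DoubleColumn cagr = DoubleColumn.create("CAGR");
for (Row row : data1) {
    // ((curValue / bookValue) ^ (1 / nYears) - 1) * 100, as in the Python
    double value = (Math.pow(
        row.getDouble("curValue") / row.getDouble("bookValue"),
        1.0 / row.getDouble("nYears")) - 1.0) * 100;
    cagr.append(value);
}
data1.addColumns(cagr);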
It's a little verbose. We should probably add a method to let you stream and map table rows, which would simplify things.
Wow! It is indeed a bit verbose. It would be really useful to have the scope of the method expanded, given that pandas allows this as a simple method call. Thanks for your help again.
Just noticed that when you add a column that involves a transformation of columnar data, the name of the newly created column is auto-generated (the generated names do show the operations involved, but they are more like expressions than names). Can we replace these names with something simpler that we can work with later?
What would you expect the default column name to be?
Well, the default is fine (it does indicate how the column was created), but there could be an option to override that and assign a user-defined name.
You can. Just call setName(String) on the column:
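(A sketch, reusing the column names from the question and the table x from above.)

DoubleColumn ratio = x.doubleColumn("curValue").divide(x.doubleColumn("bookValue"));
ratio.setName("ratio"); // replaces the auto-generated, expression-style name
x.addColumns(ratio);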
I sent #540, which makes a calculation based on an arbitrary number of columns easier:
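(A sketch of the kind of thing it enables, assuming #540 exposes the table's rows as a Stream<Row>; jan/feb/mar are hypothetical column names.)

Double[] totals = t.stream()
    .map(row -> row.getDouble("jan") + row.getDouble("feb") + row.getDouble("mar"))
    .toArray(Double[]::new);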
I wonder if we can make this easier still. Perhaps by adding a custom Collector.
And then make it easier to add computed columns leveraging this new functionality:
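(Perhaps along these lines; a sketch of the proposed API, where the toDoubleColumn collector is the new piece being suggested, not an existing method.)

DoubleColumn result = t.stream()
    .map(row -> Math.pow(row.getInt("a"), row.getInt("b")))
    .collect(DoubleColumn.toDoubleColumn("result"));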
These are all good suggestions! I will try them out and see how it goes. As a new user of Tablesaw, I do think that at some point the documentation needs to be updated to make it more detailed, with examples. I know some people who wanted to use Tablesaw but got frustrated with what they thought to be "sparse" documentation. Please don't get me wrong - I greatly appreciate your prompt responses. In addition, thorough documentation will be a great help to new developers.
We've had a hard time keeping the code examples in the documentation working (#457). We'd welcome help on improving it.
I would like to send a PR to help make these kinds of DataFrame operations easier. Here are two options that I started to play around with. Both are relatively easy to implement, and I am happy to send a PR for whichever we decide on.

Option 1: Create Custom Collectors (as Ben suggested above)

Map a single column:

IntColumn epochDays = IntColumn.create("epochDays", new int[] {1, 2, 3});
DateColumn dates = epochDays.stream().map(d -> LocalDate.ofEpochDay(d).plusDays(1))
    .collect(DateColumn.toDateColumn("colName"));

Map an entire Table:

table.stream().map(r -> Math.pow(r.getInt("a"), r.getInt("b")))
    .collect(DoubleColumn.toDoubleColumn("colName"));

Notice how the toDoubleColumn Collector is a static method on DoubleColumn. This is how Guava does it with toImmutableList.

Handling of nulls/missing values: when mapping a single column, it is easy to get an NPE when there is missing data.

// throws NPE
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.stream().map(c -> c.plusDays(1)).collect(DateColumn.toDateColumn("colName"));

Users can explicitly pass nulls through.

// This works. No NPE.
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.stream().map(c -> (c == null) ? null : c.plusDays(1)).collect(DateColumn.toDateColumn("colName"));

Note: in order to make this work we will have to add an append method that accepts nulls.

Option 2: Add mapToColumn methods to Table and to every concrete Column type (DoubleColumn, IntColumn, ...)

Map a single column:

IntColumn epochDays = IntColumn.create("epochDays", new int[] {1, 2, 3});
DateColumn dates = epochDays.mapToDateColumn("colName", (d) -> LocalDate.ofEpochDay(d).plusDays(1));

Map an entire Table:

table.mapToDoubleColumn("colName", row -> Math.pow(row.getInt("a"), row.getInt("b")));

Handling of nulls/missing values: when mapping a single column, I like ignoring nulls/missing values. This is what Pandas does. This is also what Column.map does.

// Just works. No NPE.
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.mapToDateColumn("colName", c -> c.plusDays(1));

// Does nothing. This is a possible downside. Could surprise users.
stringColumn.mapToString("colName", s -> s == null ? "" : s);

Nulls when mapping over an entire table:

// This will still throw an NPE. I think that is fine.
DateColumn col1 = DateColumn.create("date", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
Table table = Table.create("table", col1);
table.mapToDateColumn("colName", row -> row.getDate("date").plusDays(1));
I do like that Option 2 is less verbose. The main downside is that we've tried to avoid creating a cross product of column methods, i.e. where we have a variant of a method on each column type for every other column type. This is a common enough operation that perhaps it might be worth it, but I also wonder if we should add these methods only to the Table? Or maybe do a hybrid approach where we go with Option 1 for the columns and Option 2 for the Table?

One thing to think about with Option 2 is whether we could also support custom columns (i.e. if a user has built their own column type that's not in the core library). Maybe we could have a generic method that takes the column type that the others call.

In Option 1, could we also make it skip nulls? It seems like whether we skip nulls or not should be independent of the collectors vs. mapToColType decision.

Thanks for volunteering to help with this! This is a fairly core API, so I'm very interested in getting @lwhite1's thoughts as well. (I think Larry said he was on vacation in another thread, so he might be slow to respond though.)
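(That generic method might look roughly like this; a sketch on Table, with all names hypothetical.)

// Hypothetical generic core that the typed mapToXColumn variants could delegate to.
public <T, C extends Column<T>> C mapToColumn(C destination, Function<? super Row, ? extends T> fn) {
    for (Row row : this) {
        destination.append(fn.apply(row));
    }
    return destination;
}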
Thanks for the response, Ben. I understand this is an important change and, as such, the need to go slow and make sure we get it right. Here are a few thoughts.

For Option 2:
It is possible to add the mapToColType methods to Table and AbstractColumn (or maybe some other super type). This is what I did when I was experimenting. That means two methods per type, not a cross product. Below is a possible implementation.

// Proposed implementation.
class AbstractColumn<T> {
    public DateColumn mapToDateColumn(String name, Function<? super T, ? extends LocalDate> fun) {
        DateColumn col = DateColumn.create(name, size());
        mapInto(fun, col);
        return col;
    }
    ...
}

Can mapInto work here? It already handles custom column types today:

// Current functionality.
Column<YearMonth> customYearMonthColumn = table.dateColumn("date")
    .mapInto((x) -> YearMonth.from(x), YearMonthColumn.create("month", table.rowCount()));

For Option 1:
I am a little out of my depth here, but I do not see how. If we filtered out the nulls prior to the map step, the resulting column would be the wrong size, and I cannot think of a way to "re-insert" the nulls. I think this is a limitation of the Streams API.
Good call on adding to AbstractColumn. The other thing I wonder a bit about is whether the methods on the table (e.g. mapToDoubleColumn) should skip rows with missing values.
I could use some help figuring out a good way to handle missing values for the proposed mapToType methods on Table (e.g. mapToDoubleColumn).

Issue: this kind of operation is complicated by the fact that short, integer, and long columns use their MIN_VALUE as a missing-value indicator. A naive implementation that looped through all rows, whether or not they are missing, and applied the mapping function would result in something like this:

// Issue: missing long values
LongColumn first = LongColumn.create("c1", new long[]{1L});
first.appendMissing();
Table t = Table.create("table", first);
LongColumn actual = t.mapToLongColumn("c3", (row) -> row.getLong("c1") + 1L);
assertArrayEquals(new long[]{2L, Long.MIN_VALUE}, actual.asLongArray()); // fails

Test result: instead of skipping the missing row, it adds 1 to Long.MIN_VALUE. This is not what we want.

Possible solution: force users to pass a Selection where the map will be applied.

Selection notMissing = t.column("c1").isNotMissing();
t.mapToLongColumn("c3", notMissing, (row) -> row.getLong("c1") + 1L);

We could create a convenience method so the call becomes:

t.mapToLongColumn("c3", t.notMissing("c1"), (row) -> row.getLong("c1") + 1L);

Downsides:

What do you think? Is there a better way? Is this worth adding at all?
The ColumnType classes have missingValueIndicator() methods, so the implementation could check for missing values itself rather than requiring callers to pass a Selection.
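(For example, inside the proposed table method; a sketch, where mapToLongColumn is still hypothetical and LongColumnType.missingValueIndicator() is the sentinel in question.)

LongColumn result = LongColumn.create("c3");
for (Row row : t) {
    long value = row.getLong("c1");
    if (value == LongColumnType.missingValueIndicator()) {
        result.appendMissing(); // skip the mapping function for missing rows
    } else {
        result.append(value + 1L);
    }
}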
I am happy to pick this back up and send a PR, but since this is a pretty big API change I would like to get some more feedback before I do. @benmccann What do you think about the function signature below for a mapToDouble method on Table? It is slightly different from what I proposed above.

/**
 * Maps the supplied function across the rows in the table, appending the results to a new DoubleColumn.
 *
 * If you do not want the supplied function applied to rows with missing values, you can supply a
 * list of columns to check for missing values. For each row, it will check for a missing value in
 * any of the listed columns. If there is a missing value, a missing value will be appended to the
 * results; otherwise the result of the function will be appended.
 *
 * @param name the name of the new column
 * @param function to map a table row to a double
 * @param skipIfMissingColNames the column names to check for missing values. If any of the
 *     columns contain a missing value (for a particular row), a missing value will be appended.
 * @return a DoubleColumn with the results appended
 */
public DoubleColumn mapToDoubleColumn(String name, Function<? super Row, ? extends Double> function, String... skipIfMissingColNames) {}
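Usage would then look something like this (a sketch of the proposed method against the OP's CAGR example):

DoubleColumn cagr = t.mapToDoubleColumn("CAGR",
    row -> (Math.pow(row.getDouble("curValue") / row.getDouble("bookValue"),
        1.0 / row.getDouble("nYears")) - 1.0) * 100,
    "curValue", "bookValue", "nYears");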
#545 added mapToType methods to AbstractColumn. Those methods are for mapping a single column to another column. Above is a proposal to add mapToType methods to Table. These methods would map a table to a column. It would help with transformations such as the one listed by the OP:

data1['CAGR'] = ((data1['curValue'] / data1['bookValue']) ** (1.0 / data1['nYears']) - 1.0) * 100

I am not totally convinced these methods should be added, but I think it is worth discussing. The best alternative is probably a simple for loop.
This is cool and general purpose and standard Java, so all good. Is it much easier than writing the equivalent code using the existing row-stepping methods with n = 1 and declaring the destination column outside the consumer? I guess you're handling the missing values in a systematic way, which saves effort and trouble.

This all made me think it would be cool to have a way to reuse all the math functions we have for aggregating numeric columns and make them available to aggregate rows instead, e.g. a row-wise sum. We're currently lacking a good way to do multi-column (n > 2) math. It kinda sucks to have to write a function to sum all your monthly data into an annual total. Maybe that's too special purpose for table. Some people have asked about it though, and it kinda seems handy.
I agree that would be handy, and is probably too special purpose for Table. With #579 this is now possible:

t1.mapToDoubleColumn("col3",
    r -> r.getDouble("jan") + r.getDouble("feb") + r.getDouble("mar"), "jan", "feb", "mar");
I realized one shortcoming of the proposed mapToType methods.

I probably wouldn't use them myself. Finally, I wonder if we really want to add them at all.
We decided not to move forward with the mapToType columns. If anyone is following along, these are my two recommended approaches.

Consumer:

Table data1 = Table.create("table");
DoubleColumn cagr = DoubleColumn.create("CAGR");
Consumer<Row> consumer = row -> cagr.append((Math.pow(
    row.getDouble("curValue") / row.getDouble("bookValue"),
    1.0 / row.getDouble("nYears")) - 1) * 100);
data1.apply(consumer);
data1.addColumns(cagr);

For loop:

Table data1 = Table.create("table");
DoubleColumn cagr = DoubleColumn.create("CAGR");
for (Row row : data1) {
    double value = (Math.pow(
        row.getDouble("curValue") / row.getDouble("bookValue"),
        1.0 / row.getDouble("nYears")) - 1) * 100;
    cagr.append(value);
}
data1.addColumns(cagr);
I'll add my current favorite way as well:

Stream:

Table data1 = Table.create("table");
data1.addColumns(DoubleColumn.create("CAGR", data1.stream().map(row -> {
    return (Math.pow(
        row.getDouble("curValue") / row.getDouble("bookValue"),
        1.0 / row.getDouble("nYears")) - 1) * 100;
}).toArray(Double[]::new)));
I've sent #634 which would make this a bit cleaner:

Table data1 = Table.create("table");
data1.addColumns(DoubleColumn.create("CAGR", data1.stream().mapToDouble(row -> {
    return (Math.pow(
        row.getDouble("curValue") / row.getDouble("bookValue"),
        1.0 / row.getDouble("nYears")) - 1) * 100;
})));
For amusement only, I give you my two favorites.

In Kotlin, with a tablesaw wrapper that's mostly vapor but works:

And this is a thing I needed to allow non-programmers to calc new columns (because they're losing their spreadsheet). It uses a subclass of DoubleColumn. For a long time I've thought about columns that hold a function and compute their values on demand (in my mind, "spreadsheet columns"). I used a math parser lib for the math part. CalcColumn just loops using rows, substitutes for the column names, and handles missing values.
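(A toy sketch of that "spreadsheet column" idea; entirely hypothetical, not the actual CalcColumn.)

import java.util.function.IntToDoubleFunction;

// Hypothetical lazily computed column: holds a formula and computes values on demand.
class ComputedColumn {
    private final int size;
    private final IntToDoubleFunction formula; // row index -> value

    ComputedColumn(int size, IntToDoubleFunction formula) {
        this.size = size;
        this.formula = formula;
    }

    double get(int rowIndex) {
        return formula.applyAsDouble(rowIndex); // computed each time, never stored
    }

    int size() {
        return size;
    }
}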