
Column-wise DataFrame-like operations #538

Closed
imfaisalmb opened this issue Jun 18, 2019 · 31 comments · Fixed by #583
Labels
core in core sub-project enhancement

Comments

@imfaisalmb

Hi,
I am new to Tablesaw. I am exploring options for recreating some Pyhton DataFrame operations in Tablesaw. For example, I have a DataFrame object called data1 and I use existing columns of this data frame to create new ones (and update existing ones). Here is a couple of lines of Pyhton codes:

data1['days'] = data1['buyDate'].apply(lambda x: (today - x).days)
...
data1['CAGR'] = ((data1['curValue'] / data1['bookValue']) ** (1.0 / data1['nYears']) - 1.0) * 100

Is there a way to implement something similar using Tablesaw?

Thanks a lot in advance.
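[Editor's note: for readers comparing against the pandas snippet above, the per-row arithmetic of the CAGR line can be sketched in plain Java, with no Tablesaw; the class and method names are just for this illustration.]

```java
public class CagrSketch {
    // Per-row arithmetic of the pandas line above:
    // CAGR = ((curValue / bookValue) ^ (1 / nYears) - 1) * 100
    static double cagr(double curValue, double bookValue, double nYears) {
        return (Math.pow(curValue / bookValue, 1.0 / nYears) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // doubling your money over 2 years is a CAGR of ~41.42%
        System.out.println(cagr(200.0, 100.0, 2.0));
    }
}
```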

@benmccann
Collaborator

benmccann commented Jun 18, 2019

@lwhite1 updated the answer below slightly, for posterity.

This:

data1['days'] = data1['buyDate'].apply(lambda x: (today - x).days)

could also be written in tablesaw as:

data1.addColumns(
    data1.dateColumn("buyDate")
        .daysUntil(LocalDate.now())
        .setName("days"));

(If the sign on the days value isn't what you want, you can call .neg() on the result of daysUntil.)

The second one is probably as good as it could be, given the tablesaw functionality at the time. I'm very impressed by how concise the python code is. I wasn't aware that today is a function or that you can divide a number by a column. The wonders of operator overloading :)

I haven't tested this code, so it probably has some minor issues, but you should be able to do something roughly like:

data1.addColumns(
    data1.dateColumn("buyDate")
        .mapInto(x -> (int) ChronoUnit.DAYS.between(x, LocalDate.now()),
                 IntColumn.create("days", data1.rowCount())));
data1.addColumns(
    data1.nCol("curValue").divide(data1.nCol("bookValue"))
        .multiply(DoubleColumn.create("tmp", data1.rowCount()).fillWith(1.0)
            .divide(data1.nCol("nYears").subtract(1.0).multiply(100)))
        .setName("CAGR"));

@imfaisalmb
Author

Thanks. I will work through the steps. One quick question: is there a .power(double n) method to raise x to the power n? Or, conversely, a .root() to get the nth root? .multiply() is not the same as .power().

@lwhite1
Collaborator

lwhite1 commented Jun 18, 2019 via email

@imfaisalmb
Author

Great! I tried the second line first. Everything works except this part (my test data frame is called x):

x.addColumns(
    ...
    DoubleColumn.create("tmp", x.rowCount()).fillWith(1.0)
        .divide(x.nCol("nyears"))
    ...
);

I get a "...no suitable method found for fillWith(double)" error.

@benmccann
Collaborator

It looks like you have to do fillWith(() -> 1.0). I'm not sure why you need to provide a function instead of just the value, so I've sent a PR to propose simplifying that: #539

@imfaisalmb
Author

Thanks again. I do have one more question.
.power() seems to take only a number (e.g., a double) as an argument. What I need is to apply the values of one column (stored in different rows) as the arguments of .power() to transform another column. I want this to work like .add() or .subtract(), where one element of a column is added to or subtracted from the corresponding element of a different column. Is that possible?

@benmccann
Collaborator

You could probably do something like:

DoubleColumn.create("result", IntStream.range(0, data.rowCount()).mapToDouble(i ->
  Math.pow(data.nCol("col1").getDouble(i), data.nCol("col2").getDouble(i))
).toArray());

It's a little verbose. We should probably add a method to let you stream and map table rows, which would simplify this.

@imfaisalmb
Author

Wow! It is indeed a bit verbose. It would be really useful to have the scope of the method expanded, given that pandas allows this as a simple method call. Thanks for your help again.

@imfaisalmb
Author

Just noticed that when you add a column that involves a transformation of columnar data, the name of the newly created column is auto-generated (it shows the operations involved, but reads more like an expression than a name). Can we replace these names with something simpler that we can work with later?

@benmccann
Collaborator

What would you expect the default column name to be?

@imfaisalmb
Author

Well, the default is fine (it does indicate how the column was created) but there could be an option to override that and assign a user-defined name.

@benmccann
Collaborator

You can. Just call setName on the resulting column.

@benmccann
Collaborator

benmccann commented Jun 19, 2019

I sent #540 which makes a calculation based on arbitrary number of columns easier:

DoubleColumn.create("colName", table.stream().map(row -> Math.pow(row.getDouble("a"), row.getDouble("b"))).toArray(Double[]::new));

I wonder if we can make this easier still. Perhaps by adding a custom Collector to allow you to do something like:

table.stream().map(row -> Math.pow(row.getDouble("a"), row.getDouble("b"))).collect(TSCollectors.toDoubleCol("colName"));

And then make it easier to add computed columns leveraging this new functionality:

table.addColumn(row -> Math.pow(row.getDouble("a"), row.getDouble("b")), TSCollectors.toDoubleCol("colName"));
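[Editor's note: the collector half of that proposal can be sketched with nothing but the JDK. TSCollectors and toDoubleCol are hypothetical names from the comment above, and a bare double[] stands in for a Tablesaw DoubleColumn.]

```java
import java.util.ArrayList;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class ColCollectorSketch {
    // JDK-only sketch of a "collect into a double column" collector.
    // A double[] stands in for a DoubleColumn; this is not an existing Tablesaw API.
    static Collector<Double, ?, double[]> toDoubleCol() {
        return Collector.<Double, ArrayList<Double>, double[]>of(
                ArrayList::new,
                ArrayList::add,
                (a, b) -> { a.addAll(b); return a; },
                list -> list.stream().mapToDouble(Double::doubleValue).toArray());
    }

    public static void main(String[] args) {
        double[] col = Stream.of(1.0, 4.0, 9.0).map(Math::sqrt).collect(toDoubleCol());
        System.out.println(col.length); // 3
    }
}
```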

@imfaisalmb
Author

These are all good suggestions! I will try them out and see how it goes. As a new user of Tablesaw, I do think that at some point the documentation needs to be updated with more detail and examples. I know some people who wanted to use Tablesaw but got frustrated with what they thought was "sparse" documentation. Please don't get me wrong - I greatly appreciate your prompt responses. Thorough documentation would also be a great help to new developers.

@benmccann
Collaborator

We've had a hard time keeping the code examples in the documentation working (#457). We'd welcome help on improving it.

@ryancerf
Collaborator

ryancerf commented Jul 2, 2019

I would like to send a PR to help make these kinds of DataFrame operations easier. Here are two options that I started to play around with. Both are relatively easy to implement and I am happy to send a PR for whatever we decide.

Option 1: Create Custom Collectors (as Ben suggested above)

Map single column:

IntColumn epochDays = IntColumn.create("epochDays", new int[] {1, 2, 3});
DateColumn dates = epochDays.stream().map(d -> LocalDate.ofEpochDay(d).plusDays(1))
  .collect(DateColumn.toDateColumn("colName"));

Map entire Table:

table.stream().map(r -> Math.pow(r.getInt("a"), r.getInt("b"))) 
 .collect(DoubleColumn.toDoubleColumn("colName"));

Notice how the toDoubleColumn Collector is a static method on DoubleColumn. This is how Guava does it with toImmutableList.

Handling of nulls/missing values when mapping a single column:

It is easy to get an NPE when there is missing data.

// throws NPE
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.stream().map(c -> c.plusDays(1)).collect(DateColumn.toDateColumn("colName"));

Users can explicitly pass nulls through.

// This works. No NPE.
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.stream().map(c -> (c == null) ? null : c.plusDays(1)).collect(DateColumn.toDateColumn("colName"));

Note: In order to make this work we will have to add an append method that accepts nulls.

Option 2: Add mapToColumn methods to Table and to every concrete Column type (DoubleColumn, IntColumn...).

Map single column:

IntColumn epochDays = IntColumn.create("epochDays", new int[] {1, 2, 3});
DateColumn dates = epochDays.mapToDateColumn("colName", (d) -> LocalDate.ofEpochDay(d).plusDays(1));

Map entire Table:

table.mapToDoubleColumn("colName", row -> Math.pow(row.getInt("a"), row.getInt("b")));

Handling of nulls/missing values when mapping a single column:

I like ignoring nulls/missing values. This is what Pandas does. This is also what Column.map does.

// Just works. No NPE.
DateColumn col1 = DateColumn.create("col1", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
col1.mapToDateColumn("colName", c -> c.plusDays(1));
// Does Nothing. This is a possible downside. Could surprise users.
stringColumn.mapToString("colName", s -> s == null ? "" : s);

Nulls when mapping over entire table.

// This will still throw an NPE. I think that is fine.
DateColumn col1 = DateColumn.create("date", new LocalDate[] {LocalDate.of(2000, 1, 1), null});
Table table = Table.create("table", col1);
table.mapToDateColumn("colName", row -> row.getDate("date").plusDays(1));

@benmccann
Collaborator

I do like that Option 2 is less verbose. The main downside is that we've tried to avoid creating a cross product of column methods. I.e. where we have a variant of a method on each column type for every other column type. This is a common enough operation that perhaps it might be worth it, but I also wonder if we should add these methods only to the Table? Or maybe do a hybrid approach where we go with Option 1 for the columns and Option 2 for the Table?

One thing to think about with Option 2 is whether we could also support custom columns (i.e. if a user has built their own column type that's not in the core library). Maybe we could have a generic method that takes the column type that the others call.

In Option 1 could we also make it skip nulls? It seems like whether we skip nulls or not should be independent of the collectors vs mapToColType decision.

Thanks for volunteering to help with this! This is a fairly core API, so I'm very interested in getting @lwhite1's thoughts as well. (I think Larry said he was on vacation in another thread, so he might be slow to respond though)

@ryancerf
Collaborator

ryancerf commented Jul 3, 2019

Thanks for the response Ben. I understand this is an important change and as such the need to go slow and make sure we get it right. Here are a few thoughts.

For Option 2.

The main downside is that we've tried to avoid creating a cross product of column methods

It is possible to add the mapToColType methods to Table and AbstractColumn (or perhaps some other supertype). This is what I did when I was experimenting. That means two methods per type, not a cross product. Below is a possible implementation.

// Proposed implementation.
abstract class AbstractColumn<T> {
  public DateColumn mapToDateColumn(String name, Function<? super T, ? extends LocalDate> fun) {
      DateColumn col = DateColumn.create(name, size());
      mapInto(fun, col);
      return col;
  }
  ...
}

One thing to think about with Option 2 is whether we could also support custom columns

Can mapInto work here?

// Current functionality.
Column<YearMonth> customYearMonthColumn = table.dateColumn("date")
            .mapInto((x) -> YearMonth.from(x), YearMonthColumn.create("month", table.rowCount()));

For Option 1:

could we also make it skip nulls?

I am a little out of my depth here, but I do not see how. If we filtered out the nulls prior to the map step, the resulting column would be the wrong size, and I cannot think of a way to "re-insert" the nulls. I think this is a limitation of the Streams API.
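[Editor's note: a JDK-only illustration of the trade-off. Filtering nulls before the map shrinks the result, while passing them through (as in the DateColumn example earlier in the thread) keeps the element count aligned with the source, which is what a column needs.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NullPassthrough {
    public static void main(String[] args) {
        List<Integer> src = Arrays.asList(1, null, 3);

        // Filtering first: 2 elements, no longer row-aligned with src.
        List<Integer> filtered = src.stream()
                .filter(Objects::nonNull)
                .map(x -> x + 1)
                .collect(Collectors.toList());

        // Passing nulls through: all 3 positions preserved.
        List<Integer> passedThrough = src.stream()
                .map(x -> x == null ? null : x + 1)
                .collect(Collectors.toList());

        System.out.println(filtered.size() + " vs " + passedThrough.size()); // 2 vs 3
    }
}
```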

@benmccann
Collaborator

Good call on adding to AbstractColumn for Option 2. I like that option

The other thing I wonder a bit about is whether the methods on the table (e.g. table.mapToDateColumn) should add the column to the table. It seems like that's usually the next thing we do with the column. Potentially we could have variants both for creating only and for creating and adding.

@ryancerf ryancerf mentioned this issue Jul 6, 2019
@ryancerf
Collaborator

ryancerf commented Jul 6, 2019

I could use some help figuring out a good way to handle missing values for the proposed mapToType methods on Table. (e.g. table.mapToDateColumn)

Issue

This kind of operation is complicated by the fact that short, integer, and long use their MIN_VALUE as a missing-value indicator.

A naive implementation that looped through all rows, whether or not they are missing, and applied the mapping function would result in something like this:

// Issue missing Long values 
LongColumn first = LongColumn.create("c1", new long[]{1L});
first.appendMissing();
Table t = Table.create("table", first);
LongColumn actual = t.mapToLongColumn("c3", (row) -> row.getLong("c1") + 1L );
assertArrayEquals(new long[]{2L, Long.MIN_VALUE}, actual.asLongArray());

test result:
array contents differ at index [1], expected: <-9223372036854775808> but was: <-9223372036854775807>

Instead of skipping the missing row, it adds 1 to Long.MIN_VALUE. This is not what we want.
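[Editor's note: the off-by-one in the failing assertion is exactly the sentinel being mapped; a two-line JDK check makes it concrete.]

```java
public class SentinelShift {
    public static void main(String[] args) {
        // Long.MIN_VALUE doubles as the missing-value indicator for longs.
        // A naive row-wise "+ 1" maps the sentinel itself, shifting it one away
        // from "missing" -- the exact diff shown in the test output above.
        long missing = Long.MIN_VALUE; // -9223372036854775808
        long mapped = missing + 1L;    // -9223372036854775807
        System.out.println(missing + " -> " + mapped);
    }
}
```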

Possible Solution

Force users to pass a Selection where the map will be applied.

Selection notMissing =  t.column("c1").isNotMissing();
t.mapToLongColumn("c3", notMissing, (row) -> row.getLong("c1") + 1L );

We could create a notMissing(String... columns) helper method that would make this a little more concise.

t.mapToLongColumn("c3", t.notMissing("c1"), (row) -> row.getLong("c1") + 1L );

Downsides:

  1. The obvious implementation uses contains on a BitmapBackedSelection, which is O(log n) and makes the entire mapping operation O(n log n); ideally it would be O(n). It also takes extra time to compute the selection.
  2. Forcing users to provide the Selection feels like we are asking for redundant information.

What do you think? Is there a better way? Is this worth adding at all?

@benmccann
Collaborator

The ColumnType classes have isMissingValue and missingValueIndicator methods. Does that help? I'm running around and didn't read your post super closely...

@ryancerf
Collaborator

ryancerf commented Aug 3, 2019

I am happy to pick this back up and send a PR, but since this is a pretty big API change I would like to get some more feedback before I do.

@benmccann What do you think about the below function signature for a mapToDouble method on Table? It is slightly different than what I proposed above.

    /**
     * Maps the supplied function across the rows of the table, appending the results to a new DoubleColumn.
     *
     * If you do not want the function applied to rows with missing values, you can supply a
     * list of columns to check for missing values. For each row, if any of the listed columns
     * contains a missing value, a missing value is appended to the result; otherwise the
     * result of the function is appended.
     *
     * @param name the name of the new column
     * @param function maps a table row to a double
     * @param skipIfMissingColNames the column names to check for missing values; if any of them
     *     contains a missing value for a particular row, a missing value is appended
     * @return a DoubleColumn with the results appended
     */
    public DoubleColumn mapToDoubleColumn(String name, Function<? super Row, ? extends Double> function, String... skipIfMissingColNames) {}

@lwhite1
Collaborator

lwhite1 commented Aug 3, 2019

@ryancerf How does this improve on PR #545 Add mapToType Columns?

@ryancerf
Collaborator

ryancerf commented Aug 4, 2019

#545 added mapToType methods to AbstractColumn. Those methods map a single column to another column. Above is a proposal to add mapToType methods to Table. These methods would map a table to a column, helping with transformations such as the one posted by the OP:

data1['CAGR'] = ((data1['curValue'] / data1['bookValue']) ** (1.0 / data1['nYears']) - 1.0) * 100

I am not totally convinced these methods should be added, but I think it is worth discussing. The best alternative is probably a simple for loop.

@lwhite1 lwhite1 added the core in core sub-project label Aug 4, 2019
@lwhite1
Collaborator

lwhite1 commented Aug 4, 2019

This is cool and general purpose and standard java, so all good.

Is it much easier than writing the equivalent code using

rollWithRows(Consumer<Row[]> rowConsumer, int n)

with n = 1 and declaring the destination column outside the consumer?

I guess you're handling the missing values in a systematic way, which saves effort and trouble.

This all made me think it would be cool to have a way to reuse all the math functions we have for aggregating numeric columns and make them available to aggregate rows instead.

e.g.

DoubleColumn calc(AggregateFunctions function, NumericColumn ... columns);
 
col = t.calc(sum, "jan", "feb", ...); 

We're currently lacking a good way to do multi-column (n > 2) math. It kinda sucks to have to write a function to sum all your monthly data into an annual total.

Maybe that's too special-purpose for Table. Some people have asked about it, though, and it kinda seems handy.
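[Editor's note: the row-wise sum that calc(sum, "jan", "feb", ...) would express can be sketched with plain arrays; calc itself is only a proposal here, so this sketch uses JDK streams only, with double[]s standing in for numeric columns.]

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class RowWiseSum {
    // Row-wise sum across several "columns" (plain double[] stand-ins) -- the
    // multi-column math a calc(sum, "jan", "feb", ...) method would express.
    static double[] sumRows(double[]... cols) {
        int rows = cols[0].length;
        return IntStream.range(0, rows)
                .mapToDouble(i -> Arrays.stream(cols).mapToDouble(c -> c[i]).sum())
                .toArray();
    }

    public static void main(String[] args) {
        double[] jan = {1, 2}, feb = {10, 20}, mar = {100, 200};
        System.out.println(Arrays.toString(sumRows(jan, feb, mar))); // [111.0, 222.0]
    }
}
```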

@ryancerf
Collaborator

ryancerf commented Aug 8, 2019

I agree that would be handy, and it is probably too special-purpose for Table.

With #579 this is now possible.

t1.mapToDoubleColumn("col3",
  r -> r.getDouble("jan") + r.getDouble("feb") + r.getDouble("mar"), "jan", "feb", "mar");

@benmccann
Collaborator

benmccann commented Aug 9, 2019

I realized one shortcoming of the mapToType interface is that half the time I do this, my functions also access data from the previous row. I wonder if we should tweak the mapToType interface to take a number of rows. Though we then might end up with twice as many methods (for single row and for row array). I can still get the previous row with the current interface with extra verbosity by making the first few lines of the function:

if (row.getRowNumber() < 2) {
   return null;
}
Row previousRow = table.row(row.getRowNumber() - 1);

rollWithRows(consumer, 2) (or the equivalent stream(2).forEach(consumer) as proposed in #581) actually handles the multiple-rows part a bit better, but overall I would much rather use stream().map or mapToType. The thing I don't like about rollWithRows is that it takes multiple lines to create the column, and I have to call column.set(rows[rows.length - 1].getRowNumber(), result) instead of just returning the result. That can be really burdensome when you have a complex function with three or four different return statements.

DoubleColumn column = DoubleColumn.create("column", table.rowCount());
table.doWithRows(rows -> {
  if (rows[1].getDouble("x") > rows[0].getDouble("x")) {
    column.set(rows[1].getRowNumber(), calculation1(rows));
  } else {
    column.set(rows[1].getRowNumber(), calculation2(rows));
  }
}, 2);
table.addColumns(column);

I probably wouldn't use stream(n).map because it would create table.rowCount() + 1 - n rows, which isn't really what you want for creating a new column. I think you'd want to create table.rowCount() rows with the first n - 1 rows being missing. So you might actually need a method like table.mapRows(n) that handles this.
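[Editor's note: a JDK-only sketch of those proposed mapRows(n) semantics; mapRows is hypothetical and NaN stands in for a missing value. The output has the same length as the input, the first n - 1 entries are missing, and each remaining entry is computed from the n-row window ending at that row.]

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

public class WindowMapSketch {
    // Hypothetical mapRows(n): same length as the input, first n - 1 entries
    // "missing" (NaN here), remaining entries computed from a trailing window.
    static double[] mapRows(double[] col, int n, ToDoubleFunction<double[]> f) {
        double[] out = new double[col.length];
        Arrays.fill(out, 0, Math.min(n - 1, col.length), Double.NaN);
        for (int i = n - 1; i < col.length; i++) {
            out[i] = f.applyAsDouble(Arrays.copyOfRange(col, i - n + 1, i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // two-row window: change from the previous row; the first entry is missing
        double[] diffs = mapRows(new double[]{1, 4, 9}, 2, w -> w[1] - w[0]);
        System.out.println(Arrays.toString(diffs)); // [NaN, 3.0, 5.0]
    }
}
```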

Finally, I wonder if we really want mapToType on a column (#545) and not on a table. I'm not convinced yet one way or the other that it should be on the table or column, but I do think we should probably be consistent. If anything, I would actually probably use it much more on the table than on the column.

@ryancerf
Collaborator

ryancerf commented Aug 9, 2019

We decided not to move forward with the mapToType methods.

If anyone is following along these are my two recommended approaches.

Consumer

    Table data1 = Table.create("table");
    DoubleColumn cagr = DoubleColumn.create("CAGR");
    Consumer<Row> consumer = row -> cagr.append(
        (Math.pow(
            row.getDouble("curValue") / row.getDouble("bookValue"),
            1.0 / row.getDouble("nYears")) - 1.0) * 100);
    data1.apply(consumer);
    data1.addColumns(cagr);

For loop

    Table data1 = Table.create("table");
    DoubleColumn cagr = DoubleColumn.create("CAGR");
    for (Row row : data1) {
      double value = (Math.pow(
          row.getDouble("curValue") / row.getDouble("bookValue"),
          1.0 / row.getDouble("nYears")) - 1.0) * 100;
      cagr.append(value);
    }
    data1.addColumns(cagr);

@benmccann
Collaborator

benmccann commented Aug 9, 2019

I'll add my current favorite way as well:

Stream

    Table data1 = Table.create("table");
    data1.addColumns(DoubleColumn.create("CAGR", data1.stream().map(row ->
        (Math.pow(
            row.getDouble("curValue") / row.getDouble("bookValue"),
            1.0 / row.getDouble("nYears")) - 1.0) * 100
    ).toArray(Double[]::new)));

@benmccann
Collaborator

I've sent #634 which would make this a bit cleaner:

    Table data1 = Table.create("table");
    data1.addColumns(DoubleColumn.create("CAGR", data1.stream().mapToDouble(row ->
        (Math.pow(
            row.getDouble("curValue") / row.getDouble("bookValue"),
            1.0 / row.getDouble("nYears")) - 1.0) * 100
    )));

@lwhite1
Collaborator

lwhite1 commented Aug 25, 2019

For amusement only, I give you my two favorites:

In Kotlin, with a tablesaw wrapper that's mostly vapor but works:

data1["CAGR"] = (((data1["curValue"] / data1["bookValue"]) pow(1.0 / data1["years"])) - 1.0) * 100

And this is a thing I needed to allow non-programmers to calc new columns (because they're losing their spreadsheet).

CalcColumn cagr = new CalcColumn("CAGR",  "(curValue / bookValue) ^ ((1 / years) - 1)  * 100");
cagr.calculate(data1);  

It uses a subclass of DoubleColumn. For a long time I've thought about columns that hold a function and compute their values on demand (in my mind, "spreadsheet columns"). I used a math parser lib for the math part. CalcColumn just loops over rows, substitutes values for the column names, and handles missing values.
