BUG: DataFrame.loc silently drops non-existent elements when using MultiIndex #10549

Closed
tgarc opened this issue Jul 11, 2015 · 8 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@tgarc

tgarc commented Jul 11, 2015

So here's my setup (using pandas 0.16.2):

>>> import numpy as np
>>> import pandas as pd
>>> midx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randint(10, size=(8, 8)), index=midx)

>>> df 
              0  1  2  3  4  5  6  7
first second                        
bar   one     0  5  5  5  6  2  6  8
      two     2  6  9  0  3  6  7  9
baz   one     9  0  9  9  2  5  7  4
      two     4  8  1  2  9  2  8  1
foo   one     2  7  3  6  5  5  5  2
      two     3  4  6  2  7  7  1  2
qux   one     0  8  5  9  5  5  7  3
      two     7  4  0  7  3  6  8  6

I recently found that I can select multiple levels by indexing with a tuple of tuples

>>> df.loc[( ('bar','baz'),  ), :]
              0  1  2  3  4  5  6  7
first second                        
bar   one     0  5  5  5  6  2  6  8
      two     2  6  9  0  3  6  7  9
baz   one     9  0  9  9  2  5  7  4
      two     4  8  1  2  9  2  8  1

Or even select at multiple depths of levels

>>> df.loc[( ('bar','baz'), ('one',) ), :]
              0  1  2  3  4  5  6  7
first second                        
bar   one     0  5  5  5  6  2  6  8
baz   one     9  0  9  9  2  5  7  4

The issue is this: if I add any labels to the index tuple that don't exist in the dataframe, pandas drops them silently:

>>> df.loc[( ('bar','baz','xyz'), ('one',) ), :]
              0  1  2  3  4  5  6  7
first second                        
bar   one     0  5  5  5  6  2  6  8
baz   one     9  0  9  9  2  5  7  4

It seems to me like this should raise an exception since

  1. The shape of the dataframe that is returned in this instance is not what you'd expect
  2. There's no way to unambiguously fill the returned dataframe with NaNs where a level didn't exist (as is done in the case where there is only a single level index)
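For anyone who wants the strict behaviour described above regardless of what `.loc` does, here is a minimal sketch. `strict_loc` is a hypothetical helper name, not a pandas API; it only assumes that each requested label can be checked against its index level before the lookup.

```python
import numpy as np
import pandas as pd


def strict_loc(df, keys_per_level):
    """Hypothetical helper: select rows like df.loc on a MultiIndex,
    but raise KeyError if any requested label is absent from its level."""
    for level, keys in enumerate(keys_per_level):
        level_values = df.index.get_level_values(level)
        missing = [k for k in keys if k not in level_values]
        if missing:
            raise KeyError(f"labels {missing} not found in index level {level}")
    # Every label exists in its level, so the lookup has nothing to drop.
    return df.loc[tuple(list(keys) for keys in keys_per_level), :]


midx = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']
)
df = pd.DataFrame(np.random.randint(10, size=(8, 8)), index=midx)

subset = strict_loc(df, [('bar', 'baz'), ('one',)])

try:
    strict_loc(df, [('bar', 'baz', 'xyz'), ('one',)])
except KeyError as err:
    message = str(err)  # mentions 'xyz'
```

Note that this checks each label against its own level only; it does not check that every combination of labels exists in the index.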
@tgarc tgarc changed the title .loc silently drops non-existent elements when using MultiIndex bug: .loc silently drops non-existent elements when using MultiIndex Jul 11, 2015
@tgarc tgarc changed the title bug: .loc silently drops non-existent elements when using MultiIndex BUG: DataFrame.loc silently drops non-existent elements when using MultiIndex Jul 11, 2015
@jreback
Contributor

jreback commented Jul 12, 2015

xref to #6699

I am not sure how you can know when to raise an error. This comes back to the question of whether this is a reindex or a lookup.

IMHO it would be unexpected for a user who is looking up a long list of values to have a KeyError raised, which is what you are suggesting. Further, this would then be inconsistent with the current semantics of reindex/loc being on the same footing.

As discussed, we are currently in a consistent state (meaning for getting & setting). So this comes down to: are .loc and .reindex the same (for getting), or is .loc strict in ALL inputs (and thus differs from .reindex by having a different getting policy)?

Further, though I think this could be communicated to users, how disruptive would this be?

We certainly don't want a "there they go again, changing things" reaction, where things become even more unpredictable. All that said, if it's the right change, then it should be done.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design MultiIndex labels Jul 12, 2015
@jreback jreback added this to the 0.17.0 milestone Jul 12, 2015
@jreback jreback added Difficulty Advanced Needs Discussion Requires discussion from core team before further action labels Jul 12, 2015
@jreback
Contributor

jreback commented Jul 12, 2015

@shoyer
Member

shoyer commented Jul 12, 2015

This came out of our discussions at the SciPy sprints.

IMO this is different than reindexing and filling with NaN. For reindexing, we don't fail silently -- we insert NaN. If we can't do that, it is better to raise.

I do agree that this is part of a larger discussion about how to handle indexing fallbacks. I think removing indexing fallbacks should be a top priority for pandas 1.0. Currently the indexing code is nigh unmaintainable.

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015
@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016
@jreback jreback modified the milestones: Next Major Release, 0.18.1 Apr 26, 2016
@toobaz
Member

toobaz commented Feb 22, 2017

I don't think this is a bug. The behaviour is undocumented, true, but it is coherent with the behaviour on lists of labels, which instead is clearly documented ("raise only if no label is found").

Moreover, there is no specific "shape of the dataframe that is returned" that you would expect without knowing what you're indexing on. You seem to be implying that, for instance,

pd.Series(range(5), index=pd.MultiIndex.from_arrays([[1,1,2,2,2], ['a', 'b', 'a', 'b', 'c']])).loc[[1, 2], ['c']]

should be returning

1  c    NaN
2  c    4.0
dtype: float64

rather than the current

2  c    4
dtype: int64

(all labels are present in the index, but not all their combinations), but this would be really unexpected - why should I get a NaN if I try to access a label which is there?!

The desired behaviour can easily be obtained with .reindex by passing the desired index.
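A minimal sketch of that `.reindex` route, using the same Series as above. The names `wanted` and `result` are illustrative; the target index is built explicitly with `MultiIndex.from_product`.

```python
import pandas as pd

s = pd.Series(
    range(5),
    index=pd.MultiIndex.from_arrays([[1, 1, 2, 2, 2], ['a', 'b', 'a', 'b', 'c']]),
)

# Spell out every combination that should appear in the result.
wanted = pd.MultiIndex.from_product([[1, 2], ['c']])

# reindex keeps the requested shape and fills combinations that are
# absent from the original index, here (1, 'c'), with NaN.
result = s.reindex(wanted)
```

Because a NaN is inserted, the result is upcast to a float dtype.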

@phofl
Member

phofl commented Nov 29, 2020

This is still not raising. Was this done on purpose, that get_locs is not raising when keys are partially not found?

import pandas as pd
midx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])
midx.get_locs((['bar', 'baz', 'xyz'], slice(None)))

cc @jbrockmendel
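If dropping missing keys is what you actually want, it can be spelled out with `MultiIndex.isin` instead of relying on how `get_locs` treats them. A minimal sketch on the same index; `mask` and `locs` are illustrative names.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']
)

# Boolean mask: True where the 'first' level is one of the requested labels.
# 'xyz' is not in the index, so it simply matches nothing.
mask = midx.isin(['bar', 'baz', 'xyz'], level='first')

# Integer positions of the matching rows, comparable to what get_locs returns.
locs = np.flatnonzero(mask)
```

The mask can also be used directly for selection, for example `df[mask]` on a frame indexed by `midx`.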

@toobaz
Member

toobaz commented Nov 30, 2020

> Was this done on purpose, that get_locs is not raising when keys are partially not found?

Yes, it is intended - although other cases are still waiting for a fix: #20916 (comment)

By the way: #20916 is probably a duplicate of this - that is explicitly about partial indexing, but this one also has partial indexing as examples, and #20770 already solved non-partial indexing.

@jbrockmendel
Member

@toobaz is this closed by #42351?

@jbrockmendel
Member

Closed by #42351.
