Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT: remove SettingWithCopy warning, and use copy-on-write, #10954 #10973

Closed
wants to merge 4 commits into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Sep 2, 2015

closes #10954

  • deprecates the option mode.chained_assignment as its now unused (it shows the deprecation if you are explicity using it, more friendly this way).
  • deprecate NDFrame.is_copy property
  • remove SettingWithCopyWarning entirely

TODO:

  • some systematic testing of the chaining might help (e.g. iterate thru the iloc/loc/[], chain, and set various values including dtype changes
  • huge note explaining the rationale & change structure

semantics:

Intermediate assignment trigger copy-on-write and do not propogate.

In [1]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [2]: intermediate = df.loc[1:1,]

In [3]: intermediate['col1'] = -99

In [4]: intermediate
Out[4]: 
   col1  col2
1   -99     4

In [5]: df
Out[5]: 
   col1  col2
0     1     3
1     2     4

Chained assignments always work!

In [6]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [7]: df.loc[1:1,]['col1'] = -99

In [8]: df
Out[8]: 
   col1  col2
0     1     3
1   -99     4

Even true with cross-sections that change dtype

In [1]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [2]: df.loc[1]['col2'] = 'foo'

In [3]: df
Out[3]: 
   col1 col2
0     1    3
1     2  foo

except for a really egregious case, which will raise a SettingWithCopyError (maybe I could even fix this....)

In [10]: df.loc[1:1]['col2'].replace(10,5,inplace=True)
SettingWithCopyError: chained indexing detected, you can fix this ......

This is an invalid assignment. I suppose we could make it work, but we currently havent' allowed the .dt to be used as a setitem accessor

In [9]: s = Series(pd.date_range('20130101',periods=3))

In [12]: s.dt.hour[0] = 5
SettingImmutableError: modifications to a property of a datetimelike object are not supported and are discarded. Change values on the original.

cc @nickeubank
cc @JanSchulz
cc @ellisonbg
cc @CarstVaartjes
@shoyer

special thanks to @JanSchulz for some reference checking code :)

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design Compat pandas objects compatability with Numpy or Python functions labels Sep 2, 2015
@shoyer
Copy link
Member

shoyer commented Sep 2, 2015

It looks like this does deal with @nickeubank's most recent issue, where changes on the original dataframe don't change the intermediate, even if it was created with a view? How do you do that?

In [1]: from pandas import DataFrame

In [2]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [3]: intermediate = df.loc[1:1,]

In [4]: intermediate
Out[4]:
   col1  col2
1     2     4

In [5]: df
Out[5]:
   col1  col2
0     1     3
1     2     4

In [6]: df.iloc[0, 0] = -99

In [7]: intermediate
Out[7]:
   col1  col2
1     2     4

In [8]: df
Out[8]:
   col1  col2
0   -99     3
1     2     4

It looks like currently accessing a column as a series returns a view, which is good. But is also seems like accessing a row as a series can sometimes also return a view, which is not what we want (that should always be a copy or copy-on-write):

In [14]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

# this is the first row
In [15]: s2 = df.loc[0]

In [16]: s2
Out[16]:
col1    1
col2    3
Name: 0, dtype: int64

In [17]: s2.iloc[0] = -99

In [18]: df
Out[18]:
   col1  col2
0   -99     3
1     2     4


# we have a chained assignment
# assign back to the original
self.is_copy().loc[self.index,key] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better name for is_copy() if it returns the parent, then parent()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is a confusing name. What is the value of this variable? Maybe get_parent() would be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh I can prob just change that. I left for now. Its an attribute that is None or a weakref

@jreback
Copy link
Contributor Author

jreback commented Sep 2, 2015

@shoyer

In [1]: df = DataFrame({'col1':[1,2], 'col2':[3,4]})

In [2]: s2 = df.loc[0]

In [3]: s2.iloc[0] = -99

In [4]: s2
Out[4]: 
col1   -99
col2     3
Name: 0, dtype: int64

In [5]: df
Out[5]: 
   col1  col2
0     1     3
1     2     4

also deprecated is_copy -> _parent

the implementation is this:

every type of operation that returns something which is not a 'view', IOW, not a column or a slice of a column (for a DataFrame). sets _parent to a weakref back to the parent. So in the above case (after [2] but before [3]).

In [8]: s2._parent
Out[8]: <weakref at 0x106820cb0; to 'DataFrame' at 0x1068588d0>

Then [3] will do a check (so all actual set operations, including inplace) will see that we have a ref. So do this funky gc.get_referrers to eliminate cases like df = df[...]. but a case like above still has s2 in the namspace. So then you can determinate if its a chain indexing operation, if so raise SettingWithCopyError, else you copy the data (e.g. copy-on-write), and reset ._parent=None.

Now we don't actually see the SettingWithCopyErrors why, because these are caught by the actual indexing operation (e.g. __setitem__ on a frame), at which point you just df.loc[self.index,key] = value and you are done (similarly with Series).

the hard part was actually putting in place all of the checks on when to record the ._parent, which is already existing.

@shoyer
Copy link
Member

shoyer commented Sep 2, 2015

This solution is pretty dark magic. Even though it is (sort of) technically feasible, I'm not sure it's a good idea.

People certainly do put DataFrames in built-in data structures like lists, so @JanSchulz's example where this breaks will assuredly come up:

l = [df[...]]
l[0][...] = "x"

Maybe another approach would be to try to make views/copies more predictable, by, for example, disabling automatic block consolidation?

Another possibility would be to make creating a view turn both the original object AND the new selection into copy-on-write. This would eliminate the need to look at the garbage collector, though it could make repeated modification/indexing of a dataframe very inefficient, e.g.,

df = pd.DataFrame({'A': np.random.randn(10000)})
for i in df.index:
   # this assignment now does a complete copy every time!
   df.loc[i, 'B'] = df.loc[i:(i + 1), 'A'].sum() > 0

On the other hand, R users seem to deal with copy-on-write like this just fine....

@jreback
Copy link
Contributor Author

jreback commented Sep 2, 2015

@shoyer consolidation are on the agenda, but certainly not in the next week. they are a whole can of worms.

@nickeubank
Copy link
Contributor

@jreback wow, this is amazing. I can't believe you implemented this in a day! I'll leave the issues of internal implementation to you more experienced developers, but will be sure to help with the doc updates when we settle on final behaviors!

def _get_is_copy(self):
warnings.warn("is_copy is deprecated will be removed in a future release",
FutureWarning)
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but if this really internal property is basically now meaningless, then IMO it would be better to cleanly break code which depends on this (=remove the property without warning) than just change the API contract of this property...

@jankatins
Copy link
Contributor

Here is a bit more magic (as in "I don't really understand all what's happening here, but it seems to work...") which gets even references to df slices in lists/dicts or attributes in objects:

def get_all_names_for_obj(__really_unused_name__342424__, exclude=None, __cache_123__=None, depth=2):
    """Returns True if the argument is referenced from a container"""
    removals = ["__really_unused_name__342424__", "__really_unused_name__xxxxx__", "__v__", "self", 
                "__globals__", "func_globals", "__self__", "im_self", "cell_contents", '__dict__'
                "user_ns", 'namespace', 'user_global']
    if exclude:
        removals.extend(exclude)
    if __cache_123__ is None:
        __cache_123__ = []
        __cache_123__.append(__cache_123__)

    gc.collect(2)
    refs = gc.get_referrers(__really_unused_name__342424__)
    __cache_123__.append(refs)

    names = []
    container_names = []
    for ref in refs:
        if ref in __cache_123__:
            continue
        __cache_123__.append(ref)
        if id(ref) == globals().get("__ID_D__", None):
            print(inspect.isframe(ref), isinstance(ref, collections.Mapping), 
                  isinstance(ref, collections.Sequence), isinstance(ref, object),
                  ref)
            pass
        try:
            if inspect.isframe(ref):
                for name, __really_unused_name__xxxxx__ in ref.f_locals.iteritems():
                    if __really_unused_name__xxxxx__ is __really_unused_name__342424__:
                        names.append(name)
            elif depth < 1:
                # don't try to get too deep into container ref cycles...
                continue
            elif isinstance(ref, collections.Mapping):
                _names = list(get_all_names_for_obj(ref, exclude={"ref", "refs"}, __cache_123__=__cache_123__, depth=depth-1))
                if len(_names) == 0:
                    continue
                for _name in _names:
                    name_in_map = "<unknown>"
                    for k, __v__ in ref.items():
                        if __v__ is __really_unused_name__342424__:
                            # we don't want to catch whatever['namespace'] and similar things
                            if name_in_map not in removals:
                                name_in_map = k
                                # only catch the first instance
                                break
                    if name_in_map != "<unknown>":
                        # check that we don't have class backing dict (xxx.__dict__)  
                        # -> this should end up as xxx.<name_in_map>
                        if _name[-9:] == ".__dict__":
                            container_names.append("%s.%s" % (_name[:-9], name_in_map))
                        else:
                            container_names.append("%s['%s']" % (_name, name_in_map))
            elif isinstance(ref, collections.Sequence):
                _names = list(get_all_names_for_obj(ref, exclude={"ref", "refs"} ,__cache_123__=__cache_123__, depth=depth-1))
                if len(_names) == 0:
                    continue
                for _name in _names:
                    item_pos = ref.index(__really_unused_name__342424__)
                    container_names.append("%s[%s]" % (_name, item_pos))
            elif isinstance(ref, object) or isinstance(old(), types.InstanceType):
                _names = list(get_all_names_for_obj(ref, exclude={"ref", "refs"}, __cache_123__=__cache_123__, depth=depth-1))
                if len(_names) == 0:
                    continue
                for _name in _names:
                    for n in dir(ref):
                        if getattr(ref, n, None) is __really_unused_name__342424__:
                            if n in removals:
                                continue
                            container_names.append("%s.%s" % (_name, n))
            else:
                _names = list(get_all_names_for_obj(ref, exclude={"ref"}, __cache_123__=__cache_123__, depth=depth-1))
                container_names.append("<included in item of type %s with name(s) %s>" % (type(ref), _names))
        finally:
            __cache_123__.remove(ref)
            del ref
    __cache_123__.remove(refs)
    for name, __really_unused_name__xxxxx__ in globals().iteritems():
        if __really_unused_name__xxxxx__ is __really_unused_name__342424__:
            names.append(name)

    return list((set(names) | set(container_names)) - set(removals))

Examples:

class whatever(object):
    def is_item_in_refs(self, item):
        import gc
        refs = gc.get_referrers(self)
        for ref in refs:
            if item is ref:
                print(ref)

    def print_names(self):
        print(",".join(get_all_names_for_obj(self)))

    def __getitem__(self, val):
        import gc
        ref = gc.get_referrers(self)[0]
        double_slice = getattr(ref,'im_func',None) is not None
        print("__getitem__: %s, %s" % (double_slice, get_all_names_for_obj(self)))
        ww = whatever()
        #return w2
        return ww

    def __setitem__(self, key, val):
        import gc
        ref = gc.get_referrers(self)[0]
        double_slice = getattr(ref,'im_func',None) is not None
        print("__setitem__: %s, %s" % (double_slice, get_all_names_for_obj(self)))

w = whatever()
l=[w]
d={"w":w}
class c(object):
    pass
o = c()
setattr(o, "w", w)
o.w
w.print_names()

wxw3x = whatever()
wxw3x[1] = 3
wxw3x[1][1] = 3
w2 = wxw3x[1]
w3 = w2
w2[1] = 3

d = {"w" : whatever()}
d["w"][0] = 1
d["w"][0][0] = 1
d["x"] = d["w"][0]
l = [d["x"]]
d["x"][1] = 1

The performance penalty of get_all_names_for_obj(self) is probably very measureable (at least it's not instant in the notebook, haven't use timeit yet)...

The three additional lines in each method in whatever is what @jreback started with (but I couldn't really get to work here). But if it gets really well tested it might be worth to get back to it as it is much simpler and if the tests scream when python changes the implementation details of gc.get_referrers(self)[0] or changes in pandas code result in "wrong" results, then this is IMO better...

@TomAugspurger
Copy link
Contributor

@jreback is this planned for 0.17? My first-take reaction to skimming the implementation is worry that it's fragile and will be hard to maintain (not that I'll really be the on maintaining it 😄).

@jreback
Copy link
Contributor Author

jreback commented Sep 3, 2015

@TomAugspurger I didn't tag it. Still needs some work. Its not much more fragile that the current SettingWithCopyWarning, so not sure that's really all that big of a deal.

Let me see how much can fix for next week.

@jorisvandenbossche
Copy link
Member

I would think it's not yet for 0.17? Certainly if we want to get a release candidate out soon to have a release the coming month. For such a change, I think we should take some more time.

@jreback
Copy link
Contributor Author

jreback commented Sep 3, 2015

yes 0.17 is already quite full and has lots of depreciations and such

not sure of philosophy for changes in releases
eg is it better to put lots of pain in 1 or just spread it out (pain as in adjusting to some API changes )

@shoyer
Copy link
Member

shoyer commented Sep 4, 2015

As I noted before, as much as I want this to work, I have serious reservations about the approach here. As frustrating as the status quo is, if we can't do copy-on-write right, maybe we shouldn't do it at all...

@shoyer
Copy link
Member

shoyer commented Sep 10, 2015

Explicitly calling gc.collect(2) can have serious (and surprising) performance implications: #11045

It seems like a bad idea to bake this into every __setitem__ call.

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2015

thrn kindly shoe another way to do this

correctness is the most important thing here
this is a known effect - this is user error

@nickeubank
Copy link
Contributor

Would it be possible to simplify the implementation of we didn't try to
also get single-line chain indexing to work?
On Thu, Sep 10, 2015 at 2:41 AM Jeff Reback notifications@github.com
wrote:

thrn kindly shoe another way to do this

correctness is the most important thing here
this is a known effect - this is user error


Reply to this email directly or view it on GitHub
#10973 (comment).

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2015

maybe, put this aside for bit. as going to do the release candidate for 0.17.0 shortly.

@nickeubank fee free to give a go!

@nickeubank
Copy link
Contributor

I think that makes sense -- not something to rush!

@jreback will definitely take a look! This will definitely be the farthest I've gone into the pandas internals, but worth a shot. :)

@nickeubank
Copy link
Contributor

@shoyer If we dropped the gc.collect() call, how would you feel about this? If there are still fragility issues that worry you, would you mind detailing them a little?

Re: gc.collect: unless I'm wrong, we're only using that to detect super-egregious chained indexing. If that's the constraint, we could either (a) not worry about that warning, or (b) pull back on all chained-indexing issues, and leave df.loc[1:1,]['col1'] = -99 to always fail. That would give us a good, consistent behavior without resort to gc.collect().

Also relatedly: weren't we using gc.collect on every setitem on a copy previously to run the SettingOnACopy error? (line 1276 of core/generic.py)?

@shoyer
Copy link
Member

shoyer commented Sep 17, 2015

gc.collect can have unfortunate performance implications, but gc.get_referrers is even worse -- it doesn't even work reliably! See @JanSchulz's notes about objects in lists or dicts. I was initially enthusiastic about copy-on-write, but it's a very tough fit for Python.

My inclination is that a better approach would be to come up with a simpler, more consistent set of rules that determine whether we make copies or views. This is closer to the approach of NumPy and xray. We'll hurt performance when we turn some cases that currently use views into copies, but it will be a net win for predictability.

@jankatins
Copy link
Contributor

How about adding two (additional) entries: one for setting, one for getting? Setting fails if a copy must be made, getting always returns a copy? This is kind of like the .ix vs .loc/.iloc thing...

@nickeubank
Copy link
Contributor

@shoyer Ah, I see -- thank you for clarifying! Helpful to know your specific concerns.

I agree we don't want to do anything that will introduce new fragilities, but I'm not quite ready to abandon copy-on-write just because our first attempt at implementation failed. As @jreback, there are some pretty huge performance gains from views, and if we can find a way to keep them when possible it'd be nice. Now that I understand the source of the concern, let me give a little more thought to alternatives -- if our only major concern is the reliance on gc.get_referrers(), maybe we can find a work-around.

Out of curiosity, as we're getting pretty deep into python internals, is there a forum where we might be able to seek guidance from people who do work on the language itself, rather than higher-level tools like pandas? We only have 4 people on this thread -- maybe if we widened the pool someone might have a suggestion of a more robust tool.

Similarly, any idea if this gets better in Python 3? I recognize Python 2 is likely to hang around for a while, so even if the gc analogue in Python 3 were better we would still have issues to address, just wondering if we need different solutions for the two environments.

@JanSchulz I think different indexers is a reasonable solution down the road, btw! Not quite ready to go there yet, but a good thought!

@nickeubank
Copy link
Contributor

@JanSchulz @shoyer @jreback

Brainstorming a little: the primary role of gc.get_referrers() seems to be to find all the variables that point to a view so they can be redirected when the view is converted to a copy.

Rather than trying to accomplish the redirect from view to copy by changing variable pointers, might it be possible to make the redirect happen at the other end, either by putting the new copy at the same memory address as the view, or more realistically (since I assume you can't guarantee space at that address) by leaving a redirect at the memory address of the view? In other words, do the redirect in the heap rather than the stack?

Relatedly: this seems perfectly analogous to the behavior provided by inplace. Could we leverage that machinery?

@kawochen
Copy link
Contributor

CMIIW, gc.get_referrers is used here to differentiate between chained indexing and chained indexing in two steps. This is very difficult in Python because assignment is just name binding. To solve this you'd like some global analogue of class attributes without any naming restriction, so that some action is triggered during assignment. The solution in this PR requires inspecting the surrounding scopes and frames is really going out of our way to see how the code is actually written. Even without the garbage collection calls I think calling inspect is probably not too performant.

I don't have a better solution to make all that work, and there probably isn't one. @nickeubank Assignment in Python has no modification hooks, so I think even if the C API was exposing more functionalities, this would still not be easy to pull off (assignment still happens in Python). AFAIK P3 is no different, and we shouldn't be relying on language internals anyways.

If we want df.loc[1:1,]['col1'] = -99 and a=df.loc[1:1,]; a['col1]=-99 to be different (without introspection), then a needs to be something like a class attribute, not a free name. pandas could provide that container, but probably very few would use that (instead of a=df.loc[1:1,], we'd have a.df = df.loc[1:1,]).

@nickeubank
Copy link
Contributor

@kawochen Thanks for the followup!

So would we be ok (i.e. not need gc) if we DIDN'T care about making df.loc[1:1,]['col1'] = -99 and a=df.loc[1:1,]; a['col1]=-99 different?

This has come up before, and I think most people are ok with df.loc[1:1,]['col1'] = -99 failing, since it would be consistent with our general memory model (in the same way that df.column.replace() would fail it it didn't come with any assignment).

EDIT for clarity: By "ok", I mean can we do copy-on-write safely without replying on the gc library if we're ok with df.loc[1:1,]['col1'] = -99 "failing" (i.e. not leading to any persistent changes to df)? (In principle at least -- obviously still need to implement. :)).

@kawochen
Copy link
Contributor

OK lemme read all the discussion in the original issue before polluting this PR with my nonsense--I don't seem to fully understand what the end goal is.

@nickeubank
Copy link
Contributor

:) @kawochen haha - ok! The issue that gave rise to this (#10954) may be more helpful than this PR. This PR is amazing, but also did more than I think was required in that it would support chain indexing, but if we can support copy-on-write if we drop that, that'd be great!

(Do not take confusion on my part as a sign you don't know what's going on -- I'm very interested in this from a design perspective, but am a little out of my depth with the machinery)

@nickeubank
Copy link
Contributor

@kawochen did you ever have a chance to look this over and figure out if you'd found a solution?

@nickeubank
Copy link
Contributor

@shoyer @jreback Reading this over more carefully, I'm pretty sure that we only use gc to make it possible to do chained indexing on a single line (i.e. df.loc[1:1,]['col1'] = -99). @shoyer, if that's your main objection, then I think we can still figure this out if we just let chained-indexing fail. As we've said before, failure is consistent with other pandas behaviors so doesn't necessarily need a warning as long as the behavior is consistent.

In light of that, might I suggest a way forward?

  • we stop supporting single-lined chained indexing so we don't need to rely on gc to test for the weird chained-indexing cases. (If we don't worry about that, perhaps we can do something as simple as add a if self._is_view: self.copy() in the setter?)
  • we think about a way to guard against forward propagation of change through views, which this implementation doesn't yet cover.

By the latter, I mean preventing the following:

df = pd.DataFrame({'col1':[1,2], 'col2':[3,4]})
intermediate = df.loc[1:1,]

df['col1'] = -99
intermediate

Out[2]: 
   col1  col2
1   -99       4

I've been trying to do a little implementation myself, but I'm not sure I'm quite capable yet. Nevertheless, here are some thoughts I've been wrestling with:

  • It seems like the key to avoiding forward propagation is to create a property that's analogous to @jreback 's _parent property for the opposite direction (might I suggest we call it _children?). Essentially, it would be a list, associated with each Frame, of references to any views that refer back to the Frame. Doing this is one of the places I've really struggled with, but maybe @JanSchulz has a hack for this since he seems to be the reference wizard?
  • Then, when setting values, the first thing to do is check if a frame has any _children objects. If so, before setting values, one converts the _children views into copies, and resets _children to an empty list.
  • One option is to have views remove themselves from the _children list managed by their _parent whenever they stop being views to keep that list clean, but there's little harm in not doing that -- provided that when one sets values, one checks whether children are views before copying, there wouldn't be much of a performance hit to the _children list being populated with references to items that used to be views but are now copies.

@kawochen
Copy link
Contributor

@nickeubank I'm still not sure how to reconcile the various 'desired' features with how Python works. Forbidding chained indexing sounds like a huge change.

@nickeubank
Copy link
Contributor

@kawochen I'm not sure I'd characterize it as a huge change since we don't really support it now -- it sometimes works, but not consistently. In fact, that's part of what started this discussion (see #10954 )

I think the consensus from the prior discussion (@jreback @shoyer can weigh in if they want) is:

  • The current unpredictability of behavior is a problem.
  • Copy-on-write seems like a good solution, if feasible. It would make behavior predictable, but still allow pandas to use views to maximize performance when views and copies behave in the same manner (like during read-only operations)
  • Fully supporting single-lined chained-indexing would be great if possible, but if we are unable to do so, that is an acceptable change. The reason is that df[1:2,]['col1']=-99 not making any real changes is entirely consist with other behaviors, like the fact that df.replace('z','a') (without reassignment) also fails.

@jreback made this implementation that preserved chained-indexing, but it seems to rely on gc.get_referrers(), which as the official docs point out should probably not be used except for debugging, and makes @shoyer uncomfortable, so in light of the consensus something needs to change, and copy-on-write seems desirable, I'm suggesting letting go of chained-indexing might be a good solution.

@shoyer
Copy link
Member

shoyer commented Sep 29, 2015

Fully supporting single-lined chained-indexing would be great if possible, but if we are unable to do so, that is an acceptable change. The reason is that df[1:2,]['col1']=-99 not making any real changes is entirely consist with other behaviors, like the fact that df.replace('z','a') (without reassignment) also fails.

It's simply not possible to make chained indexing always work if indexing can ever return a copy instead of a view. This is the limitation of how Python works. Nor is it really possible or desirable to make chained indexing always fail, unless we remove all use of views (which is a non-starter). Copy on write isn't really feasible with mainstream methods in Python.

The most important thing is that chained indexing should fail or succeed predictably. It doesn't need to always fail, but the few cases where it will succeed should be obviously identifiable.

@CarstVaartjes
Copy link

Hi! I understand this is a very fundamental discussion, but even semi-chaining triggers the warning at the moment:

df['a'] = df['a'].str.lstrip('0')

The results are still good though (it did save the result). Maybe there can be some way where a user can enforce a copy (as at least a temporary solution?)

BR Carst

@shoyer
Copy link
Member

shoyer commented Sep 30, 2015

@CarstVaartjes Yes, that is exactly the sort of useless warning we would like to remove. If you type df = df.copy() above that line, it should go away.

pass
p = self._parent[0]()
if p is not None:
names = get_names_for_obj(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback If the fragile get_names_for_obj is merely for prohibiting chained indexing, then perhaps we can get rid of it altogether? With this change, chained indexing will always fail (due to copy-on-write), so I don't think we actually need this anymore.

@BrenBarn
Copy link

Is there a summary somewhere of exactly what the semantics of the proposed new behavior are?

Currently I think there are some situations where you get SettingWithCopy warning, but the value is in fact set on the "original" object. What worries me is the possibility that with this change, those will now work by copying the original data. Is the idea that this fix will get smarter and not copy anything in those false-alarm cases, and only copy stuff when it really needs to?

Also, when working with large DataFrames in memory, a silent copy-on-write could bring things to a screeching halt. There should be some way for people to say "don't copy anything unless I tell you to".

@nickeubank
Copy link
Contributor

Hi @BrenBarn

The discussions are currently spread across one issue #10954 and two PRs (#10973 and another implementation underway #11207).

The basic consensus is that we'd love to preserve views wherever possible, but only in situation where we can consistently offer access to views. The problem at the moment is that pandas is sometimes offering views and sometimes not (as you say, there are some false alarms -- full discussion at top of #10954 ), but it's effectively unpredictable when those occur. Preserving full column slices as views, for example, we can always do, and we're working to preserve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Views, Copies, and the SettingWithCopyWarning Issue
9 participants