Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Facet value of boolean mapped fields should be "true" or "false" #2462

Closed
kroepke opened this issue Dec 4, 2012 · 25 comments
Closed

Facet value of boolean mapped fields should be "true" or "false" #2462

kroepke opened this issue Dec 4, 2012 · 25 comments
Assignees

Comments

@kroepke
Copy link

kroepke commented Dec 4, 2012

When performing the following index and search operations I would expect the facet values of boolean fields to be "true" or "false" and not "T" and "F", since "T" and "F" are implementation detail of the BooleanFieldMapper.

To reproduce (using any version I tested, including current master):

curl -XPUT http://localhost:9200/boolfacet
curl -XPUT http://localhost:9200/boolfacet/auto/1 -d '{"foo":true}'
curl -XPOST http://localhost:9200/boolfacet/auto/_search?pretty=true -d '{"query":{"match_all":{}}, "facets":{"bar":{"terms":{"field":"foo"}}}}' 
{
   "took" : 1,
   "timed_out" : false,
   "_shards" : {
     "total" : 5,
    "successful" : 5,
     "failed" : 0
   },
   "hits" : {
     "total" : 1,
     "max_score" : 1.0,
     "hits" : [ {
       "_index" : "boolfacet",
       "_type" : "auto",
       "_id" : "1",
       "_score" : 1.0, "_source" : {"foo":true}
     } ]
   },
   "facets" : {
     "bar" : {
       "_type" : "terms",
       "missing" : 0,
       "total" : 1,
       "other" : 0,
       "terms" : [ {
         "term" : "T",
         "count" : 1
       } ]
     }
   }
 }

Note the "term": "T" line in the facets.
A preliminary patch follows shortly.

@pcomte
Copy link

pcomte commented Dec 5, 2012

Hi,

Any idea when this will be included in the master branch?

Thanks!

@martijnvg
Copy link
Member

@kroepke This makes sense. We will get this in after the lucene4 refactorings.

@kroepke
Copy link
Author

kroepke commented Dec 6, 2012

@martijnvg ok, great. Until then we will build our own version with that patch then.
Thx!
Let me know if you want me to clean anything up when this comes up again.

@kroepke
Copy link
Author

kroepke commented Dec 7, 2012

A workaround for this problem is using the following script in the term fact.
Not pretty, but it solves the immediate problem for boolean mappers.

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "bla": {
      "terms": {
        "term": true,
        "size": 10,
        "script": "doc['foo'].value == 'T' ? true : false"
      }
    }
  }
}

Results in:

{
  "facets": {
     "bla": {
        "_type": "terms",
        "missing": 0,
        "total": 1,
        "other": 0,
        "terms": [
             {
                 "term": "true",
                 "count": 1
              }
         ]
    }
}

@karmi
Copy link
Contributor

karmi commented Dec 7, 2012

@kroepke Just interested, how does using script affect performance, do you have any numbers or a guess?

@kroepke
Copy link
Author

kroepke commented Dec 10, 2012

@karmi We haven't done any benchmarks yet, we hope to get this deployed the coming week.

@kroepke
Copy link
Author

kroepke commented Dec 13, 2012

@karmi I've just done a very unscientific test, using the queries above.
The index has 213198 documents and is 1.9gb in size.
The value distribution is extremely uneven towards "false" (upwards of 99% false), but that shouldn't actually matter too much.
It's running on a single server, all data fits into memory, nothing else is using that machine. Unfortunately I forgot to change the shard count, which is the default 5, so this is 5 shards on one box.
Without the script response times around between 8 and 9 ms, with the script field it's consistently between 48 and 50 ms, so the overhead is considerable.
It may or may not be feasible for people to use the workaround, but we will have to go with a patched version until this gets fixed upstream.

@karmi
Copy link
Contributor

karmi commented Dec 13, 2012

@kroepke Thanks for the numbers! That is still quite a good performance, considering it has to scan all the matching documents and aggregate the counts.

@s1monw
Copy link
Contributor

s1monw commented May 3, 2013

this seems to be open for a long time... I will look into that soonish

@lmenezes
Copy link
Contributor

lmenezes commented Jun 4, 2013

@s1monw
Any updates on that?
I understand that changing this at this point could lead to problems to people who count on this values being returned the way it is.
Perhaps the mapping could be configured to either map to true/false or not, and defaulting to the current behavior?

@kimchy
Copy link
Member

kimchy commented Jun 4, 2013

our plan is to look into this for the 1.0 release. Your solution is definitely a possible way for us to support it.

@kroepke
Copy link
Author

kroepke commented Jun 5, 2013

The following request illustrates the new behavior:

curl -XPOST http://localhost:9200/boolfacet/auto/_search?pretty=true -d '{"query":{"match_all":{}}, "facets":{"bar":{"terms":{"field""foo", "map_boolean_values":"true"}}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "boolfacet",
      "_type" : "auto",
      "_id" : "1",
      "_score" : 1.0, "_source" : {"foo":true}
    } ]
  },
  "facets" : {
    "bar" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 1,
      "other" : 0,
      "terms" : [ {
        "term" : "true",
        "count" : 1
      } ]
    }
  }
}

Without the flag it's back to 'T' and 'F', also of course if it's set to 'false'.
Since you don't merge pull request here, feel free to pick that commit apart. Also I'm not sure what else needs to be done to properly support it in all code paths.
If you tell me what to test with I'd be very happy to implement the missing bits.

@s1monw
Copy link
Contributor

s1monw commented Jun 6, 2013

hey there, I just briefly looked at the pull request and I wonder if we can make this a little less intrusive. At the end of the day what counts here is what kind of representation we have to "True" and "False" so we could just load this through FieldData adding a feature next to the FieldDataFilter. I'd call it a FieldDataTransformer that you can use to transform `"T" to "True"' when field data is loaded then your facets would just naturally come back as a full boolean string?

@kroepke
Copy link
Author

kroepke commented Jun 6, 2013

I'm not sure I can follow you there :)

From what I can see you still need to have the field mapper to do that for you, since that's what it does, no?
Or do you suggest moving the responsibility for resolving the conversion somewhere else?

I'm asking because this problem affects every field that stores values in a different representation, e.g. IPs. For those you get back the int based representation, which I actually find worse than the boolean case...

@clintongormley
Copy link

Hiya @kroepke

Been talking to @s1monw about this and two features seem to make sense:

FieldData Transformer:

You would be able to specify a script on a field which allows you to transform each value as it is loaded into the fielddata cache. This would be useful for, eg:

  1. converting "T"/"F" to "true"/"false"
  2. truncating not_analyzed strings to eg 15 characters for sorting purposes (on the fly, as opposed to having to reindex with the truncate token filter
  3. capitalizing tokens for display
  4. rounding datetimes to (eg) seconds instead of milliseconds

The benefit here is that there would be a one-time cost of conversion (just when loading the fielddata) and the same values would be exposed for facets, scripting and sorting.

Facet Value Transformer:

A script which is applied just before the serialization of the facet responses into JSON which could, eg:

  1. convert "T"/"F" to "true"/"false"
  2. convert longs representing IPv4 addresses into their string representations
  3. capitalize tokens for display
  4. format datetime longs as date strings

These would be more efficient than the current facet scripts because they would only be run on the return values, not on every value. However, you would still have the raw longs loaded into memory which are more useful for calculations (eg easier to calc netmasks on a long rather than on a string IPv4 address)

I believe that the value transformers are planned for the new facets that we are planning for 1.0 (@uri?)

@lmenezes
Copy link
Contributor

lmenezes commented Jun 6, 2013

The idea of having to specify a script for executing these mappings feel awkward, imho.

I think that everybody using elastic search will eventually run into this issue, and the script solution seems more like a work around rather than an "option" of something that should be transparent in the first place.

I could see this feature as a generic transformer(which i like) for the facets, but for elastic search builtin types(boolean, ip..) I think this should be done without the need of defining anything(at most a "flag", saying you want it to be done, and even that is just because of backward compatibility).

If internally, this "automatic" transformation for the builtin types is done using the script implementation you just described, cool. but having everybody defining its own script for mappings that most likely everybody will do, feels wrong.

that's my 2 cents anyway...

@clintongormley maybe the wrong uri :D @uboness

@clintongormley
Copy link

(heh, too many nicks in too many places)

Yes, I understand what you mean about the core types in ES. I think boolean and ip are the only two types that this really applies to. "true" and "false" are less surprising than "T" and "F", and IP addresses expressed as longs are pretty much useless in facets.

Perhaps for these two types, this should be the default return value in the new aggregations rewrite. We don't have to maintain backward compatibility here because this is completely new functionality. The old facets will continue to work as before.

For any other "label" transformations, the only generically useful way of doing them is via a script, as everybody will have their own particular requirements which would result in an explosion of options.

@lmenezes
Copy link
Contributor

lmenezes commented Jun 6, 2013

@clintongormley Agreed. This way makes complete sense and I like the script approach as a generic way of doing this. Looking forward to it.

@kroepke
Copy link
Author

kroepke commented Jun 6, 2013

@clintongormley In fact this applies to dates as well, if you use terms facets (even if you might argue that terms facets on dates could be done with the date histogram facet instead).
While I see the value of what @s1monw is proposing, I have to agree with @lmenezes that it feels wrong to have to fix this with specifying it for every query just to get back what I put in.
The use case of doing netmask calculations is a valid one, but right now I consider terms facets to be broken because they leak internal representation. I doubt that it's useful to keep this for backwards compatibility reasons beyond 1.0, especially since most of the code has been touched in the last 7 months anyway.
(Sorry for sounding a little bitter, but this has been a continuous problem for us.)

@clintongormley
Copy link

@kroepke As I understand it, there are also plans for converting datetimes into strings, by specifying formats (rather than scripts). However, for (eg) charting libraries, it is usually more efficient to hand them millis-since-the-epoch instead of date strings, so the default return value will probably remain as longs.

@gpopovic
Copy link

gpopovic commented Jan 4, 2014

Any news on this?

@uboness
Copy link
Contributor

uboness commented Jan 10, 2014

we'll work on this one post 1.0

@kevinkluge kevinkluge removed the v1.0.1 label Feb 11, 2014
@s1monw s1monw added v1.2.0 and removed v1.1.0 labels Mar 12, 2014
@jpountz jpountz added v1.3.0 and removed v1.2.0 labels May 7, 2014
@s1monw
Copy link
Contributor

s1monw commented Jul 4, 2014

We deprecated facets already in 1.3 we decided to close this. We will fix this in aggregations instead.

@s1monw s1monw closed this as completed Jul 4, 2014
@s1monw s1monw removed v2.0.0 labels Jul 4, 2014
@aaneja
Copy link

aaneja commented Oct 8, 2014

Any updates on which ES release will have this ?

@clintongormley
Copy link

@aaneja will probably be 2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests