Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPRangeQuery to search for IPs which are members of a subnet. #1546

Merged
merged 8 commits into from
Nov 3, 2021

Conversation

amnonbc
Copy link
Contributor

@amnonbc amnonbc commented Jan 17, 2021

I have closed #1536
and squashed the history, to make it easier to review.

The review is under 600 lines of code, about half of which are tests.

This PR adds an IP field, and an IPRangeQuery which can be used to match all ip's in a network.

bleve.NewIPRangeQuery("1.2.3.0/24")

will return all documents whose IP fields contain IPs within the 1.2.3.X network.

The IP data is normalised as a big endian ipv6 format, which is a slice of 16 bytes,
and this is what is stored in the indexes.

We support searching for a specific IP, and searching for membership of a network.

The search expressions and fields are always converted to ipv6.

If we are searching for a specific IP, we just hand the work over to a TermSearcher.

For network membership, we calculate the minimum and maximum IP addresses in
the network and using indexReader.FieldDictRange() and return the indexes in this range.

The tests in https://github.com/blevesearch/bleve/blob/8bcceaf6405e9fbb1a5e08ca94db4ae2eef858da/test/ip_field_test.go
illustrate how the IP Range query can be used.

@mschoch
Copy link
Contributor

mschoch commented Jan 19, 2021

OK, so the first thing that I would ask is that you not take any immediate action on my comments. Often times it seems people assume if they respond to my feedback immediately we'll merge it and be done. In practice, I will also be asking others to review this as well, and sometimes their feedback alters my opinion. So making changes at this stage is premature.

@mschoch
Copy link
Contributor

mschoch commented Jan 19, 2021

First the good news, I think this PR has basically the right form/shape to make its way into the code base. However, there is one possibly big issue for us to think about. As this is currently implemented, there is basically no real new capability for end users. It is in essence a convenience wrapper around indexing terms in a particular format, and using term and term range queries to match them. To be clear, that is not a bad thing necessarily. The boolean field type does the exact same thing (indexes 'T' or 'F' and has a searcher that knows about this encoding). And the date/geo types are really just wrappers around the technique used by the numeric field type.

So, then what is the problem with this. Well, the implementation uses term range queries to implement CIDR netmask queries, but this is not very efficient. If the range expands to a large number of addresses, it just doesn't work. I don't know exactly how this will be used in practice, but since ipv6 addresses are 128bit, net masks can quite easily expand to very large numbers of addresses. So, while there is nothing wrong with this as an initial implementation, I'm not sure it will be useful in practice.

What other alternatives are there? Well, again the good news is that we already have one option. The numeric indexing format we have today addresses this exact problem. When you do a numeric range search from -∞ to ∞, we do not perform a disjunction query over 2^64 terms. Instead we use a technique to index multiple terms for each number, not just a single term. These terms have bits shifted off, so that multiple numeric values map to the same term. The trade-off here is additional space in the index allows for more efficient querying. I'm pretty sure a similar technique could be used here. The actual implementation we use for numeric fields has some hard-coded limit at 64-bits.

The problem is a bit deeper here because the current technique trades space for query speed, but it can really bloat indexes. One of the major items we expect to focus on this year is improving the numeric index (most likely using approach similar to lucene's bkd-tree). Obviously supporting > 64bits is one of our requirements already, so hopefully this field type will eventually map to that implementation.

That leaves us with 3 choices that I can see:

  1. Leave this PR as is, and simply document the limitations of the current implementation.
  2. Adapt it to use a technique like our current numeric fields.
  3. Wait for the newer/better numeric field type that is planned.

In a perfect world I would wait for 3, because if we do anything else, we have to break compatibility for some users or have some plan to migrate things in the future. But, arguably we've been waiting for 3 for too long now, and it's time to do something. We have to have some compatibility story for numeric/date/geo anyway, so this isn't really contributing to making that worse.

That leaves option 1 or 2. Again, I would prefer 2, but I also get that it is more work that seems like it would be thrown away in the near future. So I could be persuaded to go ahead with option 1, but I would like to hear more about your use case, and how well this will work for it.

I have some minor nits with naming and code formatting in some places, but wanted to start with this big top-level item first.

I'm adding @abhinavdangeti and @sreekanth-cb to review as well.

@amnonbc
Copy link
Contributor Author

amnonbc commented Jan 19, 2021

So, then what is the problem with this. Well, the implementation uses term range queries to implement CIDR netmask queries, but this is not very efficient. If the range expands to a large number of addresses, it just doesn't work. I don't know exactly how this will be used in practice, but since ipv6 addresses are 128bit, net masks can quite easily expand to very large numbers of addresses. So, while there is nothing wrong with this as an initial implementation, I'm not sure it will be useful in practice.

Hi Marty,
Thanks for looking at the PR and the feedback.
In our testing we have been pleased with the speed of these searches. But I'll add some benchmarks for the worst case queries to the PR tomorrow. This should help me understand the nature of the problem.

A previous version of this PR used a FieldDictPrefix query to narrow down the search. This required some post-filtering if the mask length was not an exact number of bytes. But would this be a better approach?

@mschoch
Copy link
Contributor

mschoch commented Jan 20, 2021

So, I don't think benchmarks will necessarily show the problem, as it depends on data loaded into the index, which for tests/benchmarks is usually small. Consider this example:

User does an IPRangeQuery for 96.255.6.131/8, which should match addresses from 96.0.0.0 to 96.255.255.255. The term dictionary can efficiently find all the terms (encoded IP addrs) which are in this range AND used by a document in the index. But, the next phase is we actually have to find all those documents, and doing this creates a disjunction query over all the terms. So, if you're dataset is small, and only 2 addresses in that range exist in your index, you're still fine, we perform a disjunction of 2 terms. But, if you're the NSA, maybe your index has information about all 16,777,216 possible hosts in that range. In that case, you will get the following error:

return fmt.Errorf("TooManyClauses over field: `%s` [%d > maxClauseCount,"+

You can increase the limit, but at some point you will either run out of memory or find the query takes too long to execute.

@amnonbc
Copy link
Contributor Author

amnonbc commented Jan 20, 2021

User does an IPRangeQuery for 96.255.6.131/8, which should match addresses from 96.0.0.0 to 96.255.255.255. The term dictionary can efficiently find all the terms (encoded IP addrs) which are in this range AND used by a document in the index. But, the next phase is we actually have to find all those documents, and doing this creates a disjunction query over all the terms. So, if you're dataset is small, and only 2 addresses in that range exist in your index, you're still fine, we perform a disjunction of 2 terms. But, if you're the NSA, maybe your index has information about all 16,777,216 possible hosts in that range. In that case, you will get the following error:

return fmt.Errorf("TooManyClauses over field: `%s` [%d > maxClauseCount,"+

You can increase the limit, but at some point you will either run out of memory or find the query takes too long to execute.

I would argue that returning an TooManyClauses error is the most reasonable behaviour when faced with a pathological search which matches an astronomical number of IPs. We would expect to see similar behaviour when searching for a regexp which matches a vast number of distinct items. If we searched for the regular expression .* in a hundred character text field, then this could match 26^100 distinct values (even if we limited the character set to lower case english letters). If our index actually contains all of these items, then we might expect bleve to struggle with the search. Of course if we tried
to construct such an index, we would run out of disk space, and atoms in the universe, well before we got round to running any searches.

In our specific use case, our IP addresses correspond to actual physical computers, and the IPRangeQuery would usually be a CIDR corresponding to a specific network. We would not expect these searches to match a particularly large number of IPs and setting DisjunctionMaxClauseCount to a sensible limit will protect our system from users entering pathological searches.

Our very old implementation represented addresses as numeric fields and this did bloat our indexes to
a prohibitive level. So we don't want to go back down that route.

So for these reasons, option (1) would be the best solution for our use case.

@amnonbc amnonbc mentioned this pull request Jan 21, 2021
@amnonbc
Copy link
Contributor Author

amnonbc commented Jan 26, 2021

Hello again @mschoch + team...
Any more thoughts on this PR?
Is the current approach something you would want to merge in Bleve?
If not, I can switch move IP-Range search outside of Bleve using Bleve's term and range queries.
But either way, it would be useful to know which way you are moving on this.

@mschoch
Copy link
Contributor

mschoch commented Jan 27, 2021

@amnonbc I have provided my review and asked for @abhinavdangeti and @sreekanth-cb to review this, and so far they have not. They do not work for me, I cannot make them do it. As this PR changes the project's API (in an additive way), it must meet the highest standard of review, as once we add something, the expectation is that we'll support it going forward. So, our default posture for changes like this is to not accept them without a consensus to include them.

In this case, as we've already discussed, neither the index nor the searching requires any new capability here. If you are in a hurry to use this approach, you can do it already using existing text fields and term range searches.

I expect that when they have time, they will provide their review, and we'll decide how to move the project forward with this at that time.

@amnonbc
Copy link
Contributor Author

amnonbc commented Jan 27, 2021

@amnonbc I have provided my review and asked for @abhinavdangeti and @sreekanth-cb to review this, and so far they have not. They do not work for me, I cannot make them do it. As this PR changes the project's API (in an additive way), it must meet the highest standard of review, as once we add something, the expectation is that we'll support it going forward. So, our default posture for changes like this is to not accept them without a consensus to include them.

In this case, as we've already discussed, neither the index nor the searching requires any new capability here. If you are in a hurry to use this approach, you can do it already using existing text fields and term range searches.

I expect that when they have time, they will provide their review, and we'll decide how to move the project forward with this at that time.

No rush.

We are happy there is a chance it might be merged and more than happy to support that if it happens

@abhinavdangeti
Copy link
Member

Hello @mschoch, @amnonbc. Apologies on the delay in getting to this.

I believe this PR does add value (even if it's just more of a wrapper, we do this already with some others).
Also, I do agree with Marty that waiting for the newer/better numeric indexing may not be the best idea for we’re not certain on the timeline for it.

So, for the moment - I don't object the current approach.

Copy link
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new files added here need the copyright stub.
Otherwise, I'm ok with getting this change in right now.

@@ -0,0 +1,67 @@
// Copyright (c) 2014 Couchbase, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2021*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool...
@abhinavdangeti
Which copyright should I copy>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amnonbc This is the bit (from search/searcher/search_ip_range.go), that's missing in some files ..

//  Copyright (c) 2021 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// 		http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
  • document/field_ip.go
  • document/field_ip_test.go
  • search/query/ip_range.go
  • search/searcher/search_ip_range_test.go
  • test/ip_field_test.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a unit test failure when running against 1.17.
This will need a bit of work and some coffee to figure out.
I'll do it tomorrow. (It is nearly 11pm in London now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I had some coffee and fixed the breaking test.
The standard library in Go 1.17 has become stricter on parsing IP addresses, and was rejected IP input which
contained gratuitions leading zeros.

@iredmail
Copy link
Contributor

iredmail commented Nov 3, 2021

Is it better use “IP” (all upper cases) instead of “Ip”?

@amnonbc
Copy link
Contributor Author

amnonbc commented Nov 3, 2021

Is it better use “IP” (all upper cases) instead of “Ip”?
Good point.
IP is exported, whereas ip isn't.
But I should perhaps change the occurrences of Ip to IP to match the standard Library?

@abhinavdangeti abhinavdangeti merged commit 6ab7a11 into blevesearch:master Nov 3, 2021
@abhinavdangeti abhinavdangeti added this to the v2.3.0 milestone Nov 22, 2021
abhinavdangeti added a commit that referenced this pull request Oct 5, 2022
abhinavdangeti added a commit that referenced this pull request Oct 10, 2022
* MB-40604: Support IPRange query

Related: #1546

* Update IPs used in TestIPRangeQuery
ns-codereview pushed a commit to couchbase/cbft that referenced this pull request Jan 18, 2023
+ IP field support introduced with:
    blevesearch/bleve#1546

+ Also, requires:
    blevesearch/bleve#1735

Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9
Reviewed-on: https://review.couchbase.org/c/cbft/+/179981
Well-Formed: Build Bot <build@couchbase.com>
Tested-by: Abhi Dangeti <abhinav@couchbase.com>
Reviewed-by: <thejas.orkombu@couchbase.com>
abhinavdangeti added a commit that referenced this pull request Jan 19, 2023
* MB-40604: Support IPRange query

Related: #1546

* Update IPs used in TestIPRangeQuery
ns-codereview pushed a commit to couchbase/cbft that referenced this pull request Jan 20, 2023
+ Uses `7.2-couchbase` branch of blevesearch/bleve
  which includes the following commit over v2.3.3:
    - blevesearch/bleve#1735

+ IP field and RangeQuery support added with:
    - blevesearch/bleve#1546
    - blevesearch/bleve#1735

Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9
Reviewed-on: https://review.couchbase.org/c/cbft/+/185273
Well-Formed: Restriction Checker
Tested-by: Abhi Dangeti <abhinav@couchbase.com>
Reviewed-by: Abhi Dangeti <abhinav@couchbase.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants