IPRangeQuery to search for IPs which are members of a subnet. #1546

amnonbc · 2021-01-17T10:35:34Z

I have closed #1536
and squashed the history, to make it easier to review.

The review is under 600 lines of code, about half of which are tests.

This PR adds an IP field, and an IPRangeQuery which can be used to match all ip's in a network.

bleve.NewIPRangeQuery("1.2.3.0/24")

will return all documents whose IP fields contain IPs within the 1.2.3.X network.

The IP data is normalised as a big endian ipv6 format, which is a slice of 16 bytes,
and this is what is stored in the indexes.

We support searching for a specific IP, and searching for membership of a network.

The search expressions and fields are always converted to ipv6.

If we are searching for a specific IP, we just hand the work over to a TermSearcher.

For network membership, we calculate the minimum and maximum IP addresses in
the network and using indexReader.FieldDictRange() and return the indexes in this range.

The tests in https://github.com/blevesearch/bleve/blob/8bcceaf6405e9fbb1a5e08ca94db4ae2eef858da/test/ip_field_test.go
illustrate how the IP Range query can be used.

update from upstream v1.0.14

upgrade to blevesearch/bleve v2

upgrade to v2.0.1

…operly formatter CIDR or IP

mschoch · 2021-01-19T21:31:36Z

OK, so the first thing that I would ask is that you not take any immediate action on my comments. Often times it seems people assume if they respond to my feedback immediately we'll merge it and be done. In practice, I will also be asking others to review this as well, and sometimes their feedback alters my opinion. So making changes at this stage is premature.

mschoch · 2021-01-19T22:03:05Z

First the good news, I think this PR has basically the right form/shape to make its way into the code base. However, there is one possibly big issue for us to think about. As this is currently implemented, there is basically no real new capability for end users. It is in essence a convenience wrapper around indexing terms in a particular format, and using term and term range queries to match them. To be clear, that is not a bad thing necessarily. The boolean field type does the exact same thing (indexes 'T' or 'F' and has a searcher that knows about this encoding). And the date/geo types are really just wrappers around the technique used by the numeric field type.

So, then what is the problem with this. Well, the implementation uses term range queries to implement CIDR netmask queries, but this is not very efficient. If the range expands to a large number of addresses, it just doesn't work. I don't know exactly how this will be used in practice, but since ipv6 addresses are 128bit, net masks can quite easily expand to very large numbers of addresses. So, while there is nothing wrong with this as an initial implementation, I'm not sure it will be useful in practice.

What other alternatives are there? Well, again the good news is that we already have one option. The numeric indexing format we have today addresses this exact problem. When you do a numeric range search from -∞ to ∞, we do not perform a disjunction query over 2^64 terms. Instead we use a technique to index multiple terms for each number, not just a single term. These terms have bits shifted off, so that multiple numeric values map to the same term. The trade-off here is additional space in the index allows for more efficient querying. I'm pretty sure a similar technique could be used here. The actual implementation we use for numeric fields has some hard-coded limit at 64-bits.

The problem is a bit deeper here because the current technique trades space for query speed, but it can really bloat indexes. One of the major items we expect to focus on this year is improving the numeric index (most likely using approach similar to lucene's bkd-tree). Obviously supporting > 64bits is one of our requirements already, so hopefully this field type will eventually map to that implementation.

That leaves us with 3 choices that I can see:

Leave this PR as is, and simply document the limitations of the current implementation.
Adapt it to use a technique like our current numeric fields.
Wait for the newer/better numeric field type that is planned.

In a perfect world I would wait for 3, because if we do anything else, we have to break compatibility for some users or have some plan to migrate things in the future. But, arguably we've been waiting for 3 for too long now, and it's time to do something. We have to have some compatibility story for numeric/date/geo anyway, so this isn't really contributing to making that worse.

That leaves option 1 or 2. Again, I would prefer 2, but I also get that it is more work that seems like it would be thrown away in the near future. So I could be persuaded to go ahead with option 1, but I would like to hear more about your use case, and how well this will work for it.

I have some minor nits with naming and code formatting in some places, but wanted to start with this big top-level item first.

I'm adding @abhinavdangeti and @sreekanth-cb to review as well.

amnonbc · 2021-01-19T23:05:53Z

So, then what is the problem with this. Well, the implementation uses term range queries to implement CIDR netmask queries, but this is not very efficient. If the range expands to a large number of addresses, it just doesn't work. I don't know exactly how this will be used in practice, but since ipv6 addresses are 128bit, net masks can quite easily expand to very large numbers of addresses. So, while there is nothing wrong with this as an initial implementation, I'm not sure it will be useful in practice.

Hi Marty,
Thanks for looking at the PR and the feedback.
In our testing we have been pleased with the speed of these searches. But I'll add some benchmarks for the worst case queries to the PR tomorrow. This should help me understand the nature of the problem.

A previous version of this PR used a FieldDictPrefix query to narrow down the search. This required some post-filtering if the mask length was not an exact number of bytes. But would this be a better approach?

mschoch · 2021-01-20T00:26:15Z

So, I don't think benchmarks will necessarily show the problem, as it depends on data loaded into the index, which for tests/benchmarks is usually small. Consider this example:

User does an IPRangeQuery for 96.255.6.131/8, which should match addresses from 96.0.0.0 to 96.255.255.255. The term dictionary can efficiently find all the terms (encoded IP addrs) which are in this range AND used by a document in the index. But, the next phase is we actually have to find all those documents, and doing this creates a disjunction query over all the terms. So, if you're dataset is small, and only 2 addresses in that range exist in your index, you're still fine, we perform a disjunction of 2 terms. But, if you're the NSA, maybe your index has information about all 16,777,216 possible hosts in that range. In that case, you will get the following error:

bleve/search/searcher/search_disjunction.go

Line 111 in 2522dc2

return fmt.Errorf("TooManyClauses over field: `%s` [%d > maxClauseCount,"+

You can increase the limit, but at some point you will either run out of memory or find the query takes too long to execute.

amnonbc · 2021-01-20T09:25:18Z

User does an IPRangeQuery for 96.255.6.131/8, which should match addresses from 96.0.0.0 to 96.255.255.255. The term dictionary can efficiently find all the terms (encoded IP addrs) which are in this range AND used by a document in the index. But, the next phase is we actually have to find all those documents, and doing this creates a disjunction query over all the terms. So, if you're dataset is small, and only 2 addresses in that range exist in your index, you're still fine, we perform a disjunction of 2 terms. But, if you're the NSA, maybe your index has information about all 16,777,216 possible hosts in that range. In that case, you will get the following error:

bleve/search/searcher/search_disjunction.go

Line 111 in 2522dc2

return fmt.Errorf("TooManyClauses over field: `%s` [%d > maxClauseCount,"+

You can increase the limit, but at some point you will either run out of memory or find the query takes too long to execute.

I would argue that returning an TooManyClauses error is the most reasonable behaviour when faced with a pathological search which matches an astronomical number of IPs. We would expect to see similar behaviour when searching for a regexp which matches a vast number of distinct items. If we searched for the regular expression .* in a hundred character text field, then this could match 26^100 distinct values (even if we limited the character set to lower case english letters). If our index actually contains all of these items, then we might expect bleve to struggle with the search. Of course if we tried
to construct such an index, we would run out of disk space, and atoms in the universe, well before we got round to running any searches.

In our specific use case, our IP addresses correspond to actual physical computers, and the IPRangeQuery would usually be a CIDR corresponding to a specific network. We would not expect these searches to match a particularly large number of IPs and setting DisjunctionMaxClauseCount to a sensible limit will protect our system from users entering pathological searches.

Our very old implementation represented addresses as numeric fields and this did bloat our indexes to
a prohibitive level. So we don't want to go back down that route.

So for these reasons, option (1) would be the best solution for our use case.

amnonbc · 2021-01-26T21:59:31Z

Hello again @mschoch + team...
Any more thoughts on this PR?
Is the current approach something you would want to merge in Bleve?
If not, I can switch move IP-Range search outside of Bleve using Bleve's term and range queries.
But either way, it would be useful to know which way you are moving on this.

mschoch · 2021-01-27T15:51:28Z

@amnonbc I have provided my review and asked for @abhinavdangeti and @sreekanth-cb to review this, and so far they have not. They do not work for me, I cannot make them do it. As this PR changes the project's API (in an additive way), it must meet the highest standard of review, as once we add something, the expectation is that we'll support it going forward. So, our default posture for changes like this is to not accept them without a consensus to include them.

In this case, as we've already discussed, neither the index nor the searching requires any new capability here. If you are in a hurry to use this approach, you can do it already using existing text fields and term range searches.

I expect that when they have time, they will provide their review, and we'll decide how to move the project forward with this at that time.

amnonbc · 2021-01-27T17:44:33Z

@amnonbc I have provided my review and asked for @abhinavdangeti and @sreekanth-cb to review this, and so far they have not. They do not work for me, I cannot make them do it. As this PR changes the project's API (in an additive way), it must meet the highest standard of review, as once we add something, the expectation is that we'll support it going forward. So, our default posture for changes like this is to not accept them without a consensus to include them.

In this case, as we've already discussed, neither the index nor the searching requires any new capability here. If you are in a hurry to use this approach, you can do it already using existing text fields and term range searches.

I expect that when they have time, they will provide their review, and we'll decide how to move the project forward with this at that time.

No rush.

We are happy there is a chance it might be merged and more than happy to support that if it happens

abhinavdangeti · 2021-02-02T21:26:26Z

Hello @mschoch, @amnonbc. Apologies on the delay in getting to this.

I believe this PR does add value (even if it's just more of a wrapper, we do this already with some others).
Also, I do agree with Marty that waiting for the newer/better numeric indexing may not be the best idea for we’re not certain on the timeline for it.

So, for the moment - I don't object the current approach.

abhinavdangeti

The new files added here need the copyright stub.
Otherwise, I'm ok with getting this change in right now.

abhinavdangeti · 2021-11-02T19:17:43Z

search/searcher/search_ip_range.go

@@ -0,0 +1,67 @@
+//  Copyright (c) 2014 Couchbase, Inc.


cool...
@abhinavdangeti
Which copyright should I copy>

@amnonbc This is the bit (from search/searcher/search_ip_range.go), that's missing in some files ..

// Copyright (c) 2021 Couchbase, Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License.

document/field_ip.go

document/field_ip_test.go

search/query/ip_range.go

search/searcher/search_ip_range_test.go

test/ip_field_test.go

I get a unit test failure when running against 1.17.
This will need a bit of work and some coffee to figure out.
I'll do it tomorrow. (It is nearly 11pm in London now).

OK. I had some coffee and fixed the breaking test.
The standard library in Go 1.17 has become stricter on parsing IP addresses, and was rejected IP input which
contained gratuitions leading zeros.

Because of golang/go@d3e3d03 which rejects leading zeros in dot-decimal notation.

iredmail · 2021-11-03T06:39:25Z

Is it better use “IP” (all upper cases) instead of “Ip”?

amnonbc · 2021-11-03T06:50:22Z

Is it better use “IP” (all upper cases) instead of “Ip”?
Good point.
IP is exported, whereas ip isn't.
But I should perhaps change the occurrences of Ip to IP to match the standard Library?

Related: #1546

* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery

+ IP field support introduced with: blevesearch/bleve#1546 + Also, requires: blevesearch/bleve#1735 Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9 Reviewed-on: https://review.couchbase.org/c/cbft/+/179981 Well-Formed: Build Bot <build@couchbase.com> Tested-by: Abhi Dangeti <abhinav@couchbase.com> Reviewed-by: <thejas.orkombu@couchbase.com>

* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery

+ Uses `7.2-couchbase` branch of blevesearch/bleve which includes the following commit over v2.3.3: - blevesearch/bleve#1735 + IP field and RangeQuery support added with: - blevesearch/bleve#1546 - blevesearch/bleve#1735 Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9 Reviewed-on: https://review.couchbase.org/c/cbft/+/185273 Well-Formed: Restriction Checker Tested-by: Abhi Dangeti <abhinav@couchbase.com> Reviewed-by: Abhi Dangeti <abhinav@couchbase.com>

amnonbc and others added 4 commits December 22, 2020 15:17

Merge pull request #2 from blevesearch/master

0f9a2dd

update from upstream v1.0.14

Merge pull request #5 from blevesearch/master

cf77da1

upgrade to blevesearch/bleve v2

Merge pull request #6 from blevesearch/master

34122a1

upgrade to v2.0.1

Support for IPRangeQuery

8bcceaf

amnonbc mentioned this pull request Jan 17, 2021

Native IP Field type and associated searchers #1536

Closed

make sure we return an error if argument to IPRangeSearch is not a pr…

c775b08

…operly formatter CIDR or IP

mschoch requested review from abhinavdangeti and sreekanth-cb January 19, 2021 22:03

amnonbc mentioned this pull request Jan 21, 2021

New ip field redsift/bleve#4

Closed

abhinavdangeti approved these changes Nov 2, 2021

View reviewed changes

amnonbc added 2 commits November 2, 2021 22:36

Add missing copyright stubs to new files

afbd8ea

Fix test which broke in Go 1.17

38247ee

Because of golang/go@d3e3d03 which rejects leading zeros in dot-decimal notation.

rename all instances of Ip to IP to conform with stdlib

82d591a

abhinavdangeti approved these changes Nov 3, 2021

View reviewed changes

abhinavdangeti merged commit 6ab7a11 into blevesearch:master Nov 3, 2021

abhinavdangeti added this to the v2.3.0 milestone Nov 22, 2021

abhinavdangeti added a commit that referenced this pull request Oct 5, 2022

MB-40604: Support IPRange query

6eae057

Related: #1546

abhinavdangeti mentioned this pull request Oct 5, 2022

MB-40604: Support IPRange query #1735

Merged

abhinavdangeti added a commit that referenced this pull request Oct 10, 2022

MB-40604: Support IPRange query (#1735)

1a8fefb

* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery

abhinavdangeti added a commit that referenced this pull request Jan 19, 2023

[BP] MB-40604: Support IPRange query (#1735)

f8e57fb

* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IPRangeQuery to search for IPs which are members of a subnet. #1546

IPRangeQuery to search for IPs which are members of a subnet. #1546

amnonbc commented Jan 17, 2021 •

edited

Loading

mschoch commented Jan 19, 2021

mschoch commented Jan 19, 2021

amnonbc commented Jan 19, 2021 •

edited

Loading

mschoch commented Jan 20, 2021

amnonbc commented Jan 20, 2021 •

edited

Loading

amnonbc commented Jan 26, 2021

mschoch commented Jan 27, 2021 •

edited

Loading

amnonbc commented Jan 27, 2021

abhinavdangeti commented Feb 2, 2021

abhinavdangeti left a comment

abhinavdangeti Nov 2, 2021

amnonbc Nov 2, 2021

abhinavdangeti Nov 2, 2021

amnonbc Nov 2, 2021

amnonbc Nov 2, 2021

amnonbc Nov 3, 2021

iredmail commented Nov 3, 2021

amnonbc commented Nov 3, 2021

IPRangeQuery to search for IPs which are members of a subnet. #1546

IPRangeQuery to search for IPs which are members of a subnet. #1546

Conversation

amnonbc commented Jan 17, 2021 • edited Loading

mschoch commented Jan 19, 2021

mschoch commented Jan 19, 2021

amnonbc commented Jan 19, 2021 • edited Loading

mschoch commented Jan 20, 2021

amnonbc commented Jan 20, 2021 • edited Loading

amnonbc commented Jan 26, 2021

mschoch commented Jan 27, 2021 • edited Loading

amnonbc commented Jan 27, 2021

abhinavdangeti commented Feb 2, 2021

abhinavdangeti left a comment

Choose a reason for hiding this comment

abhinavdangeti Nov 2, 2021

Choose a reason for hiding this comment

amnonbc Nov 2, 2021

Choose a reason for hiding this comment

abhinavdangeti Nov 2, 2021

Choose a reason for hiding this comment

amnonbc Nov 2, 2021

Choose a reason for hiding this comment

amnonbc Nov 2, 2021

Choose a reason for hiding this comment

amnonbc Nov 3, 2021

Choose a reason for hiding this comment

iredmail commented Nov 3, 2021

amnonbc commented Nov 3, 2021

amnonbc commented Jan 17, 2021 •

edited

Loading

amnonbc commented Jan 19, 2021 •

edited

Loading

amnonbc commented Jan 20, 2021 •

edited

Loading

mschoch commented Jan 27, 2021 •

edited

Loading