Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: ecs-ification of BRO_ patterns #286

Merged
merged 21 commits into from
Dec 22, 2020
Merged

Conversation

kares
Copy link
Contributor

@kares kares commented Oct 13, 2020

For context somewhere ~3 years ago Bro was rebranded as Zeek.
LS' bro patterns were matching an old version of logs, some patterns got updated along the way several times.

There's an attempt to handle parsing of old Bro logs as well as 'new' Zeek format:

  1. if an (old) Bro pattern changed (before the Zeek rebranding) make sure in ECS mode we always match
  2. introduce ZEEK_ variants for all BRO_ patterns for improved matching of recent Zeek logs (these usually do not match the old BRO_ variants)

Each pattern has an example spec of the old matching and the new one, linking those here (instead of pasting sample jsons as they would get a bit long with 4 patterns) :

NOTE: Beats supports the full Zeek pattern set (we only support 4 here) since 7.6.0, we strive here for maximum compatibility with field naming in Beats.

TODO: consider moving ZEEK_ patterns into a separate file, for specs it makes sense to have them in one place.

BRO_DNS %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{INT:[dns][id]:int})\t(?:-|%{BRO_DATA:[dns][question][name]})\t(?:-|%{INT:[zeek][dns][qclass]:int})\t(?:-|%{BRO_DATA:[zeek][dns][qclass_name]})\t(?:-|%{INT:[zeek][dns][qtype]:int})\t(?:-|%{BRO_DATA:[dns][question][type]})\t(?:-|%{INT:[zeek][dns][rcode]:int})\t(?:-|%{BRO_DATA:[dns][response_code]})\t(?:-|%{BRO_BOOL:[zeek][dns][AA]})\t(?:-|%{BRO_BOOL:[zeek][dns][TC]})\t(?:-|%{BRO_BOOL:[zeek][dns][RD]})\t(?:-|%{BRO_BOOL:[zeek][dns][RA]})\t(?:-|%{NONNEGINT:[zeek][dns][Z]:int})\t(?:-|%{BRO_DATA:[zeek][dns][answers]})\t(?:-|%{DATA:[zeek][dns][TTLs]})\t(?:-|%{BRO_BOOL:[zeek][dns][rejected]})

# dns.log - the 'new' format (added *zeek.dns.rtt*)
ZEEK_DNS %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{INT:[dns][id]:int})\t(?:-|%{NUMBER:[zeek][dns][rtt]:float})\t(?:-|%{BRO_DATA:[dns][question][name]})\t(?:-|%{INT:[zeek][dns][qclass]:int})\t(?:-|%{BRO_DATA:[zeek][dns][qclass_name]})\t(?:-|%{INT:[zeek][dns][qtype]:int})\t(?:-|%{BRO_DATA:[dns][question][type]})\t(?:-|%{INT:[zeek][dns][rcode]:int})\t(?:-|%{BRO_DATA:[dns][response_code]})\t%{BRO_BOOL:[zeek][dns][AA]}\t%{BRO_BOOL:[zeek][dns][TC]}\t%{BRO_BOOL:[zeek][dns][RD]}\t%{BRO_BOOL:[zeek][dns][RA]}\t%{NONNEGINT:[zeek][dns][Z]:int}\t(?:-|%{BRO_DATA:[zeek][dns][answers]})\t(?:-|%{DATA:[zeek][dns][TTLs]})\t(?:-|%{BRO_BOOL:[zeek][dns][rejected]})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dns.id field is being type-casted %{INT:[dns][id]:int} (for Beats compatibility).
ECS prescribes the type for dns.id as keyword, thus it should stay a string, any preferences?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the field is type keyword in the index template mapping, I don't believe it being cast to an integer will be a problem. It'll still be indexed as a keyword.

BRO_DNS %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{INT:[dns][id]:int})\t(?:-|%{BRO_DATA:[dns][question][name]})\t(?:-|%{INT:[zeek][dns][qclass]:int})\t(?:-|%{BRO_DATA:[zeek][dns][qclass_name]})\t(?:-|%{INT:[zeek][dns][qtype]:int})\t(?:-|%{BRO_DATA:[dns][question][type]})\t(?:-|%{INT:[zeek][dns][rcode]:int})\t(?:-|%{BRO_DATA:[dns][response_code]})\t(?:-|%{BRO_BOOL:[zeek][dns][AA]})\t(?:-|%{BRO_BOOL:[zeek][dns][TC]})\t(?:-|%{BRO_BOOL:[zeek][dns][RD]})\t(?:-|%{BRO_BOOL:[zeek][dns][RA]})\t(?:-|%{NONNEGINT:[zeek][dns][Z]:int})\t(?:-|%{BRO_DATA:[zeek][dns][answers]})\t(?:-|%{DATA:[zeek][dns][TTLs]})\t(?:-|%{BRO_BOOL:[zeek][dns][rejected]})

# dns.log - the 'new' format (added *zeek.dns.rtt*)
ZEEK_DNS %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{INT:[dns][id]:int})\t(?:-|%{NUMBER:[zeek][dns][rtt]:float})\t(?:-|%{BRO_DATA:[dns][question][name]})\t(?:-|%{INT:[zeek][dns][qclass]:int})\t(?:-|%{BRO_DATA:[zeek][dns][qclass_name]})\t(?:-|%{INT:[zeek][dns][qtype]:int})\t(?:-|%{BRO_DATA:[dns][question][type]})\t(?:-|%{INT:[zeek][dns][rcode]:int})\t(?:-|%{BRO_DATA:[dns][response_code]})\t%{BRO_BOOL:[zeek][dns][AA]}\t%{BRO_BOOL:[zeek][dns][TC]}\t%{BRO_BOOL:[zeek][dns][RD]}\t%{BRO_BOOL:[zeek][dns][RA]}\t%{NONNEGINT:[zeek][dns][Z]:int}\t(?:-|%{BRO_DATA:[zeek][dns][answers]})\t(?:-|%{DATA:[zeek][dns][TTLs]})\t(?:-|%{BRO_BOOL:[zeek][dns][rejected]})
Copy link
Contributor Author

@kares kares Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're keeping source/destination naming convention even for DNS logs (as opposed to client/server),
e.g. source.ip + source.port

(except for the ZEEK_FILES pattern where we match the tx_host - host that transferred the file - as server.ip)

BRO_FILES %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][files][fuid]}\t(?:-|%{IP:[zeek][files][tx_host]})\t(?:-|%{IP:[zeek][files][rx_host]})\t(?:-|%{BRO_DATA:[zeek][files][session_ids]})\t(?:-|%{BRO_DATA:[zeek][files][source]})\t(?:-|%{INT:[zeek][files][depth]:int})\t(?:-|%{BRO_DATA:[zeek][files][analyzers]})\t(?:-|%{BRO_DATA:[file][mime_type]})\t(?:-|%{BRO_DATA:[file][name]})\t(?:-|%{NUMBER:[zeek][files][duration]:float})\t(?:-|%{BRO_DATA:[zeek][files][local_orig]})\t(?:-|%{BRO_BOOL:[zeek][files][is_orig]})\t(?:-|%{INT:[zeek][files][seen_bytes]:int})\t(?:-|%{INT:[file][size]:int})\t(?:-|%{INT:[zeek][files][missing_bytes]:int})\t(?:-|%{INT:[zeek][files][overflow_bytes]:int})\t(?:-|%{BRO_BOOL:[zeek][files][timedout]})\t(?:-|%{BRO_DATA:[zeek][files][parent_fuid]})\t(?:-|%{BRO_DATA:[file][hash][md5]})\t(?:-|%{BRO_DATA:[file][hash][sha1]})\t(?:-|%{BRO_DATA:[file][hash][sha256]})\t(?:-|%{BRO_DATA:[zeek][files][extracted]})

# files.log - new format (2 new fields added at the end)
ZEEK_FILES_TX_HOSTS (?:-|%{IP:[server][ip]})|(?<[zeek][files][tx_hosts]>%{IP:[server][ip]}(?:[\s,]%{IP})+)
Copy link
Contributor Author

@kares kares Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zeek.files.tx_host field -> server.ip (Beats copies the field while also keeping zeek.files.tx_hosts)


# files.log - new format (2 new fields added at the end)
ZEEK_FILES_TX_HOSTS (?:-|%{IP:[server][ip]})|(?<[zeek][files][tx_hosts]>%{IP:[server][ip]}(?:[\s,]%{IP})+)
ZEEK_FILES_RX_HOSTS (?:-|%{IP:[client][ip]})|(?<[zeek][files][rx_hosts]>%{IP:[client][ip]}(?:[\s,]%{IP})+)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're handling cases when multiple IPs are present: the first IP in the set ends up as client.ip,
while the whole field is also matched as rx_hosts see the test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and when a single IP is present, we intentionally only capture [client][ip] (not [zeek][files][rx_hosts])?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct - that was the intention here: to not have "duplication" for the common case.
we always want [client][ip] filled and if there are multiple IP fields than also capture them into the _hosts field.
logs having multiple IPs seems to be very rare - almost not worth the hustle.

@kares
Copy link
Contributor Author

kares commented Oct 13, 2020

left potential comments to discuss/clarify as comments inline, please take a look as/if needed // cc @ebeahan @webmat

@kares kares marked this pull request as ready for review October 13, 2020 15:14
@kares kares requested a review from yaauie October 13, 2020 15:16
@kares kares mentioned this pull request Oct 14, 2020
21 tasks
Copy link

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noted two minor things you may want to address in comments below. Other than these, this looks good to me 👍

# conn.log
BRO_CONN %{NUMBER:ts}\t%{NOTSPACE:uid}\t%{IP:orig_h}\t%{INT:orig_p}\t%{IP:resp_h}\t%{INT:resp_p}\t%{WORD:proto}\t%{GREEDYDATA:service}\t%{NUMBER:duration}\t%{NUMBER:orig_bytes}\t%{NUMBER:resp_bytes}\t%{GREEDYDATA:conn_state}\t%{GREEDYDATA:local_orig}\t%{GREEDYDATA:missed_bytes}\t%{GREEDYDATA:history}\t%{GREEDYDATA:orig_pkts}\t%{GREEDYDATA:orig_ip_bytes}\t%{GREEDYDATA:resp_pkts}\t%{GREEDYDATA:resp_ip_bytes}\t%{GREEDYDATA:tunnel_parents}
# http.log - the 'new' format has *version* and *origin* fields added and *filename* replaced with *orig_filenames* + *resp_filenames*
ZEEK_HTTP %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{INT:[zeek][http][trans_depth]:int}\t(?:-|%{WORD:[http][request][method]})\t(?:-|%{BRO_DATA:[url][domain]})\t(?:-|%{BRO_DATA:[url][original]})\t(?:-|%{BRO_DATA:[http][request][referrer]})\t(?:-|%{NUMBER:[http][version]})\t(?:-|%{BRO_DATA:[user_agent][original]})\t(?:-|%{BRO_DATA:[zeek][http][origin]})\t(?:-|%{NUMBER:[http][request][body][bytes]:int})\t(?:-|%{NUMBER:[http][response][body][bytes]:int})\t(?:-|%{POSINT:[http][response][status_code]:int})\t(?:-|%{DATA:[zeek][http][status_msg]})\t(?:-|%{POSINT:[zeek][http][info_code]:int})\t(?:-|%{DATA:[zeek][http][info_msg]})\t(?:\(empty\)|%{BRO_DATA:[zeek][http][tags]})\t(?:-|%{BRO_DATA:[url][username]})\t(?:-|%{BRO_DATA:[url][password]})\t(?:-|%{BRO_DATA:[zeek][http][proxied]})\t(?:-|%{BRO_DATA:[zeek][http][orig_fuids]})\t(?:-|%{BRO_DATA:[zeek][http][orig_filenames]})\t(?:-|%{BRO_DATA:[zeek][http][orig_mime_types]})\t(?:-|%{BRO_DATA:[zeek][http][resp_fuids]})\t(?:-|%{BRO_DATA:[zeek][http][resp_filenames]})\t(?:-|%{BRO_DATA:[zeek][http][resp_mime_types]})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see mime_types here as well 👀

ZEEK_DNS %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{INT:[dns][id]:int})\t(?:-|%{NUMBER:[zeek][dns][rtt]:float})\t(?:-|%{BRO_DATA:[dns][question][name]})\t(?:-|%{INT:[zeek][dns][qclass]:int})\t(?:-|%{BRO_DATA:[zeek][dns][qclass_name]})\t(?:-|%{INT:[zeek][dns][qtype]:int})\t(?:-|%{BRO_DATA:[dns][question][type]})\t(?:-|%{INT:[zeek][dns][rcode]:int})\t(?:-|%{BRO_DATA:[dns][response_code]})\t%{BRO_BOOL:[zeek][dns][AA]}\t%{BRO_BOOL:[zeek][dns][TC]}\t%{BRO_BOOL:[zeek][dns][RD]}\t%{BRO_BOOL:[zeek][dns][RA]}\t%{NONNEGINT:[zeek][dns][Z]:int}\t(?:-|%{BRO_DATA:[zeek][dns][answers]})\t(?:-|%{DATA:[zeek][dns][TTLs]})\t(?:-|%{BRO_BOOL:[zeek][dns][rejected]})

# conn.log - old bro, also supports 'newer' format (optional *zeek.connection.local_resp* flag) compared to non-ecs mode
BRO_CONN %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{WORD:[network][transport]}\t(?:-|%{BRO_DATA:[network][protocol]})\t(?:-|%{NUMBER:[zeek][connection][duration]:float})\t(?:-|%{INT:[zeek][connection][orig_bytes]:int})\t(?:-|%{INT:[zeek][connection][resp_bytes]:int})\t(?:-|%{BRO_DATA:[zeek][connection][state]})\t(?:-|%{BRO_BOOL:[zeek][connection][local_orig]})\t(?:(?:-|%{BRO_BOOL:[zeek][connection][local_resp]})\t)?(?:-|%{INT:[zeek][connection][missed_bytes]:int})\t(?:-|%{BRO_DATA:[zeek][connection][history]})\t(?:-|%{INT:[source][packets]:int})\t(?:-|%{INT:[source][bytes]:int})\t(?:-|%{INT:[destination][packets]:int})\t(?:-|%{INT:[destination][bytes]:int})\t(?:\(empty\)|%{BRO_DATA:[zeek][connection][tunnel_parents]})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think [zeek][connection][orig_bytes] and [zeek][connection][resp_bytes] could be mapped to source.bytes and destination.bytes here.

Actually I see the ECS fields used later in the same pattern. Were the fields in the beginning overlooked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't overlooked but its definitely a bit confusing, we follow here what Beats is doing - taking the later fields (orig_ip_bytes and resp_ip_bytes) and using those as source/destination.bytes.

According to the Zeek docs the IP bytes are from the IP packet. While orig_bytes and resp_bytes are taken from the TCP sequence no. Beats drops the orig_bytes, resp_bytes fields, for LS we capture them (as [zeek][connection][orig_bytes] and [zeek][connection][resp_bytes]) for the user to decide what to do with them.

expect(grok).to include("server"=>{"ip" => '10.1.1.2'}) # tx_host
expect(grok).to include("client"=>{"ip" => '192.168.1.105'}) # rx_host
expect(grok).to include("zeek"=>{"files" => hash_including("tx_hosts" => '10.1.1.2 10.1.1.1')})
expect(grok).to include("zeek"=>{"files" => hash_including("rx_hosts" => '192.168.1.105,192.168.1.10,192.168.1.11')})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love that it keeps the full list in the custom field 👍

... instead of the rx/tx_host custom fields
Copy link
Contributor

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I am impressed with the thought that went into mapping this all. I've left a couple questions, but overall this looks fantastic 🎉


# files.log - new format (2 new fields added at the end)
ZEEK_FILES_TX_HOSTS (?:-|%{IP:[server][ip]})|(?<[zeek][files][tx_hosts]>%{IP:[server][ip]}(?:[\s,]%{IP})+)
ZEEK_FILES_RX_HOSTS (?:-|%{IP:[client][ip]})|(?<[zeek][files][rx_hosts]>%{IP:[client][ip]}(?:[\s,]%{IP})+)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and when a single IP is present, we intentionally only capture [client][ip] (not [zeek][files][rx_hosts])?

# conn.log
BRO_CONN %{NUMBER:ts}\t%{NOTSPACE:uid}\t%{IP:orig_h}\t%{INT:orig_p}\t%{IP:resp_h}\t%{INT:resp_p}\t%{WORD:proto}\t%{GREEDYDATA:service}\t%{NUMBER:duration}\t%{NUMBER:orig_bytes}\t%{NUMBER:resp_bytes}\t%{GREEDYDATA:conn_state}\t%{GREEDYDATA:local_orig}\t%{GREEDYDATA:missed_bytes}\t%{GREEDYDATA:history}\t%{GREEDYDATA:orig_pkts}\t%{GREEDYDATA:orig_ip_bytes}\t%{GREEDYDATA:resp_pkts}\t%{GREEDYDATA:resp_ip_bytes}\t%{GREEDYDATA:tunnel_parents}
# http.log - the 'new' format has *version* and *origin* fields added and *filename* replaced with *orig_filenames* + *resp_filenames*
ZEEK_HTTP %{NUMBER:timestamp}\t%{NOTSPACE:[zeek][session_id]}\t%{IP:[source][ip]}\t%{INT:[source][port]:int}\t%{IP:[destination][ip]}\t%{INT:[destination][port]:int}\t%{INT:[zeek][http][trans_depth]:int}\t(?:-|%{WORD:[http][request][method]})\t(?:-|%{BRO_DATA:[url][domain]})\t(?:-|%{BRO_DATA:[url][original]})\t(?:-|%{BRO_DATA:[http][request][referrer]})\t(?:-|%{NUMBER:[http][version]})\t(?:-|%{BRO_DATA:[user_agent][original]})\t(?:-|%{BRO_DATA:[zeek][http][origin]})\t(?:-|%{NUMBER:[http][request][body][bytes]:int})\t(?:-|%{NUMBER:[http][response][body][bytes]:int})\t(?:-|%{POSINT:[http][response][status_code]:int})\t(?:-|%{DATA:[zeek][http][status_msg]})\t(?:-|%{POSINT:[zeek][http][info_code]:int})\t(?:-|%{DATA:[zeek][http][info_msg]})\t(?:\(empty\)|%{BRO_DATA:[zeek][http][tags]})\t(?:-|%{BRO_DATA:[url][username]})\t(?:-|%{BRO_DATA:[url][password]})\t(?:-|%{BRO_DATA:[zeek][http][proxied]})\t(?:-|%{BRO_DATA:[zeek][http][orig_fuids]})\t(?:-|%{BRO_DATA:[zeek][http][orig_filenames]})\t(?:-|%{BRO_DATA:[http][request][mime_type]})\t(?:-|%{BRO_DATA:[zeek][http][resp_fuids]})\t(?:-|%{BRO_DATA:[zeek][http][resp_filenames]})\t(?:-|%{BRO_DATA:[http][response][mime_type]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there value in referencing the BRO_BOOL and BRO_DATA types from within the ZEEK_* patterns, or should we define them separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - that's actually a good catch.
makes little sense, we should consider spliting this out into separate files - having ecs-v1/bro as well as ecs-v1/zeek and introduce ZEEK_BOOL/ZEEK_DATA helper patterns for the later

Copy link
Contributor Author

@kares kares Dec 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern definitions are now split-ed into standalone patterns/ecs-v1/bro and patterns/ecs-v1/zeek file-sets

@kares kares merged commit 84477d0 into logstash-plugins:ecs-wip Dec 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants