Improve the algorithm for extracting multifields #48756

Closed
przemekwitek opened this issue Oct 31, 2019 · 1 comment · Fixed by #48799
Labels
:ml Machine learning

Comments

Contributor

przemekwitek commented Oct 31, 2019

When extracting a multi-field, only one of its variants (i.e. the first aggregatable one) should be extracted.

This should help fix issues raised by the data team, such as:
https://github.com/elastic/ml-team/issues/235
https://github.com/elastic/ml-team/issues/239
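
To make the proposed rule concrete, here is a minimal Python sketch of "pick the first aggregatable variant". It is not the Elasticsearch implementation: the function name `first_aggregatable`, the `(name, aggregatable)` tuple shape, the example field names, and the choice to return `None` when no variant is aggregatable are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def first_aggregatable(variants: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the name of the first aggregatable variant of a multi-field.

    `variants` lists the parent field first, followed by its sub-fields,
    each paired with a flag saying whether it is aggregatable.
    Returning None when nothing is aggregatable is an assumption of this
    sketch; the issue does not say what should happen in that case.
    """
    for name, aggregatable in variants:
        if aggregatable:
            return name
    return None


# Hypothetical multi-field: a text parent with two keyword-like sub-fields.
title_variants = [("title", False), ("title.keyword", True), ("title.raw", True)]
selected = first_aggregatable(title_variants)
```

Here `selected` is `"title.keyword"`: the parent is skipped because it is not aggregatable, and `title.raw` is never considered because an earlier variant already matched.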

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Oct 31, 2019
Aggregatable multi-fields are currently wrongly mapped
as normal doc_value fields and are thus treated as supporting
fetching from source. However, they do not exist in the source,
so extracting such fields fails.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates elastic#48756
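
The failure described in this commit message can be illustrated with a small Python sketch. It assumes a search hit shaped like an Elasticsearch response, with `_source` holding the original document and `fields` holding doc values as lists; the helper `get_from_source` and the field names are hypothetical and are not taken from the actual extraction code.

```python
from typing import Any, Optional


def get_from_source(source: dict, path: str) -> Optional[Any]:
    """Walk a dotted path through a nested _source dict; None if absent."""
    current: Any = source
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Hypothetical hit: the sub-field `title.keyword` is defined only in the
# mapping, so it appears under doc values but never inside `_source`.
hit = {
    "_source": {"title": "Quick brown fox"},
    "fields": {"title.keyword": ["Quick brown fox"]},
}

from_source = get_from_source(hit["_source"], "title.keyword")  # not found
from_doc_values = hit["fields"].get("title.keyword")            # found
```

The lookup in `_source` returns `None` because `title` holds a plain string rather than an object with a `keyword` key, while the doc-values lookup returns the value. An extractor that assumes every doc_value field can also be read from source would therefore fail on such a field.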
dimitris-athanasiou added a commit that referenced this issue Nov 1, 2019
Relates #48756
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Nov 1, 2019
Relates elastic#48756

Backport of elastic#48770
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Nov 1, 2019
When multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worst consequence of this is when the
dependent variable (for regression or classification) is also
duplicated, which means we train a model on the dependent variable
itself.

Now that elastic#48770 is merged, this commit adds logic to
select only one variant of each multi-field.

Closes elastic#48756
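
The leakage problem described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code from the commit: `dedupe_multifields`, the `parent_of` mapping, and the field names are hypothetical, and the sketch simply keeps the first variant listed for each multi-field rather than applying any particular selection rule.

```python
from typing import Dict, List


def dedupe_multifields(fields: List[str], parent_of: Dict[str, str]) -> List[str]:
    """Keep one field per multi-field family, preserving input order.

    `parent_of` maps a sub-field name to its parent field; fields that are
    not sub-fields form their own family.
    """
    seen = set()
    kept = []
    for field in fields:
        family = parent_of.get(field, field)
        if family not in seen:
            seen.add(family)
            kept.append(field)
    return kept


# Hypothetical index where `category` is a multi-field and also the
# dependent variable of a classification job.
fields = ["price", "category", "category.keyword", "rooms"]
parent_of = {"category.keyword": "category"}
dependent_variable = "category"

# Without deduplication, a copy of the dependent variable stays a feature.
naive_features = [f for f in fields if f != dependent_variable]

# With deduplication, only one variant survives, and it is the target.
selected = dedupe_multifields(fields, parent_of)
features = [f for f in selected if f != dependent_variable]
```

In the naive case `category.keyword` remains among the features, so the model would be trained on a copy of its own target. After deduplication the features are only `price` and `rooms`.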
dimitris-athanasiou added a commit that referenced this issue Nov 1, 2019
Closes #48756
dimitris-athanasiou added a commit that referenced this issue Nov 1, 2019
Relates #48756

Backport of #48770
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Nov 1, 2019
…48799)

Closes elastic#48756

Backport of elastic#48799
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Nov 1, 2019
…48799)

Closes elastic#48756

Backport of elastic#48799
dimitris-athanasiou added a commit that referenced this issue Nov 1, 2019
…48806)

Closes #48756

Backport of #48799
dimitris-athanasiou added a commit that referenced this issue Nov 1, 2019
…48807)

Closes #48756

Backport of #48799
3 participants