
[FeaturesRefactor] Features::view()/preprocess() #3970


Contributor

@micmn micmn commented Aug 23, 2017

[Work in progress]

Continues #3968, only the last commit is relevant.

The design is outlined in this gist.

I might split this into a few PRs that build on the fundamental changes [1-2-3]:
[4] cross validation, [5] add/remove_subset() replacement, [6] preprocessors.

Anyway, what's in here:

  1. Base methods to construct new features Features::preprocess()/view().

  2. Base method to construct new labels, Labels::view(). This requires a shallow-copy 'ctor
    as in the Features classes (duplicate() + copy 'ctors),
    implemented for (Dense/Binary/Multiclass/Regression)Labels.

  3. DenseFeatures:

  • on-the-fly evaluation through get_feature_vector() method;
  • eager evaluation through eval() method;
  • get_feature_matrix() constness.
  4. CrossValidation: use view() instead of add/remove_subset().

  5. Replace add/remove_subset() with view(), and fix unit tests failing due to the DenseFeatures changes [see below]:

  • StochasticGBMachine: refactor get_subset() and related functions
    to use a new labels object instead of adding the subset to m_labels,
    refactor unit tests and add case that covers subset_frac < 1 branch;
  • LMNNImpl: refactor to use preprocess() instead of apply_to_feature_matrix();
  • CARTree: fix feature matrix constness, minor changes to unit test (some);
  6. Preprocessors:
  • NormOne::apply_to_feature_vector() uses linalg;
  • rewrite PruneVarSubMean::init() to use DotFeatures::get_mean()/get_std()
    (the latter was added to DotFeatures), add unit test;
  • RandomFourierGaussPreproc: fix feature matrix constness;
  • rewrite MultipleProcessors unit test with eval case;
  • rewrite LeastAngleRegression.ols_equivalence unit test to use preprocess()

[Misc] Initializer list constructor for SGMatrix/Vector
(not a big breakthrough but anyway useful for unit tests with fixed data...)
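
For illustration, an initializer-list constructor on a column-major matrix could look roughly like the sketch below. `ToyMatrix` is a hypothetical stand-in, not SGMatrix, and the convention that each inner list is one row is an assumption made for this example rather than something stated in the PR.

```cpp
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <vector>

// Toy column-major matrix; a hypothetical stand-in, NOT SGMatrix itself.
template <class T>
struct ToyMatrix
{
    std::size_t num_rows = 0;
    std::size_t num_cols = 0;
    std::vector<T> data; // element (r, c) lives at data[c * num_rows + r]

    // Assumption: each inner list is one row, written as it reads on paper.
    ToyMatrix(std::initializer_list<std::initializer_list<T>> rows)
        : num_rows(rows.size()),
          num_cols(rows.size() ? rows.begin()->size() : 0),
          data(num_rows * num_cols)
    {
        std::size_t r = 0;
        for (const auto& row : rows)
        {
            if (row.size() != num_cols)
                throw std::invalid_argument("ragged initializer list");
            std::size_t c = 0;
            for (const T& value : row)
                data[c++ * num_rows + r] = value; // transpose into columns
            ++r;
        }
    }

    const T& operator()(std::size_t r, std::size_t c) const
    {
        return data[c * num_rows + r];
    }
};
```

With this in place a unit test can write fixed data inline, e.g. `ToyMatrix<double> m{{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};`, instead of filling the matrix element by element.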


Unit tests:

  • 27 - unit-StochasticGBMachine (Fixed)
  • 32 - unit-LeastAngleRegression (Fixed)
  • 46 - unit-LMNNImpl (Fixed)
  • 47 - unit-LMNN (Fixed)
  • 167 - unit-CARTree (Fixed)

  • 58 - unit-Block (Failed)
  • 74 - unit-StreamingDenseFeaturesTest (Failed)
  • 80 - unit-StreamingHashedDenseFeaturesTest (Failed)
  • 122 - unit-LogPlusOne (Failed)
  • 123 - unit-RescaleFeatures (Failed)
  • 163 - unit-RandomCARTree (OTHER_FAULT)
  • 165 - unit-C45ClassifierTree (Failed)
  • 166 - unit-ID3ClassifierTree (Failed)
  • 168 - unit-RandomForest (Failed)
  • 170 - unit-CHAIDTree (SEGFAULT)
  • 178 - unit-BaggingMachine (Failed)
  • 185 - unit-GaussianProcessClassification (SEGFAULT)
  • 227 - unit-KMeans (Failed)

micmn added 3 commits August 18, 2017 11:44
	- add element-wise product between vectors (Eigen)
	- fix add() parameters constness
	- implement DotIterator, SGMatrix and SGVector iterators
	- Perceptron training with iterators + unit test

SGVector: make SGVector(index_t len) 'ctor zero-initialize memory
with SG_CALLOC like SGMatrix(index_t rows, index_t cols) 'ctor.
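
As a rough illustration of what zero-initializing in the length 'ctor means, the sketch below uses plain `std::calloc` in a hypothetical `ToyVector`; it is not SGVector and does not use SG_CALLOC. It relies on the common assumption that all-zero bytes represent 0.0 for `double`, which holds for IEEE 754 platforms.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Toy owning buffer; a hypothetical stand-in, NOT SGVector.
struct ToyVector
{
    double* vector;
    std::size_t vlen;

    // calloc returns zeroed memory, so every element starts at 0.0
    // (assuming all-zero bytes encode 0.0, as on IEEE 754 platforms).
    // A length of 0 still allocates one element to keep the pointer valid.
    explicit ToyVector(std::size_t len)
        : vector(static_cast<double*>(
              std::calloc(len ? len : 1, sizeof(double)))),
          vlen(len)
    {
        if (!vector)
            throw std::bad_alloc();
    }

    ~ToyVector() { std::free(vector); }

    // Non-copyable to keep ownership of the buffer unambiguous.
    ToyVector(const ToyVector&) = delete;
    ToyVector& operator=(const ToyVector&) = delete;
};
```

Code that accumulates into a freshly constructed vector then no longer needs an explicit zeroing pass first.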
Member

@vigsterkr vigsterkr left a comment


nice one! we should talk about some :)

SGMatrix<ST> target(first_vec.vlen, get_num_vectors());
target.set_column(0, first_vec);

for (index_t i = 1; i < get_num_vectors(); ++i)
Member


this whole part above should rather be done with a smart memcpy instead of setting the columns separately; it needs a stride etc.
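
One way to read the memcpy-with-stride suggestion is sketched below. This is an illustration only, not the reviewer's or the PR's code: `copy_columns` and `merge_columns` are hypothetical helpers, matrices are assumed to be column-major `double` buffers, and the stride is the distance in elements between the starts of consecutive columns.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `cols` columns of `rows` elements each between column-major buffers.
// `dst_ld` / `src_ld` are the strides, in elements, between column starts.
void copy_columns(
    double* dst, std::size_t dst_ld, const double* src, std::size_t src_ld,
    std::size_t rows, std::size_t cols)
{
    if (rows == 0 || cols == 0)
        return;
    if (dst_ld == rows && src_ld == rows)
    {
        // Both sides are densely packed: one memcpy covers the whole block.
        std::memcpy(dst, src, rows * cols * sizeof(double));
        return;
    }
    // Otherwise copy column by column, honouring each side's stride.
    for (std::size_t c = 0; c < cols; ++c)
        std::memcpy(dst + c * dst_ld, src + c * src_ld, rows * sizeof(double));
}

// Concatenate densely packed blocks column-wise into one merged matrix.
std::vector<double> merge_columns(
    std::size_t rows, const std::vector<std::vector<double>>& blocks)
{
    if (rows == 0)
        return {};
    std::size_t total = 0;
    for (const auto& block : blocks)
        total += block.size();

    std::vector<double> merged(total);
    std::size_t offset = 0;
    for (const auto& block : blocks)
    {
        copy_columns(
            merged.data() + offset, rows, block.data(), rows, rows,
            block.size() / rows);
        offset += block.size();
    }
    return merged;
}
```

When every block is densely packed, merging costs one memcpy per block rather than one column assignment per feature vector.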

@@ -1067,6 +1048,24 @@ CFeatures* CDenseFeatures<ST>::create_merged_copy(CFeatures* other)
return create_merged_copy(list);
}

template <class ST>
Some<CDenseFeatures<ST>> CDenseFeatures<ST>::view(const SGVector<index_t>& subset)
Member


the problem with this is that it right away makes it inaccessible/unusable from SWIG interfaces... we are having difficulties with exposing Some in SWIG

auto feats_view = wrap(this->duplicate());

auto sg_subset = SGVector<index_t>(subset.size());
std::copy(subset.cbegin(), subset.cend(), sg_subset.data());
Member


sg_memcpy? Both std::vector and SGVector are backed by a contiguous memory chunk.
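
A minimal sketch of the bulk-copy alternative follows. It uses `std::memcpy` as a stand-in, since the exact signature of sg_memcpy is not shown in this thread; `to_raw_buffer` and the 32-bit `index_t` alias are assumptions made for the example. Note that for trivially copyable element types, mainstream standard libraries typically lower `std::copy` to a memmove already, so the difference is mostly one of intent.

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>
#include <vector>

using index_t = std::int32_t; // assumption: a 32-bit index type

// Hypothetical helper: copy a std::vector of indices into a raw buffer
// with a single bulk copy instead of an element-wise loop.
std::unique_ptr<index_t[]> to_raw_buffer(const std::vector<index_t>& subset)
{
    // A raw byte copy is only valid for trivially copyable element types.
    static_assert(
        std::is_trivially_copyable<index_t>::value,
        "memcpy requires a trivially copyable type");

    std::unique_ptr<index_t[]> buffer(new index_t[subset.size()]);
    if (!subset.empty())
        std::memcpy(
            buffer.get(), subset.data(), subset.size() * sizeof(index_t));
    return buffer;
}
```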


stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 26, 2020

stale bot commented Mar 4, 2020

This issue is now being closed due to a lack of activity. Feel free to reopen it.

@stale stale bot closed this Mar 4, 2020