Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Pipeline #1213

Merged
merged 12 commits into from
Mar 21, 2017
Merged

Fix Pipeline #1213

merged 12 commits into from
Mar 21, 2017

Conversation

kris-singh
Copy link
Contributor

@kris-singh kris-singh commented Mar 14, 2017

Solves PR #932

@kris-singh
Copy link
Contributor Author

kris-singh commented Mar 14, 2017

Do we have to add sklearn as dependencies. Ready for review

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a pipeline in tests and ipynb where output is lda is used in logistic regression

@@ -67,5 +68,15 @@ def testCSRMatrixConversion(self):
self.assertTrue(isinstance(v, six.string_types))
self.assertTrue(isinstance(k, int))

def testPipline(self):
model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0))
text_lda = Pipeline([('model', model)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do, you mean to say that we use lda as a feature extractor. And then use it to in the logistic regression. I thought of this and modified the transform function accordingly.

"source": []
"source": [
"def scorer(estimator, X,y=None):\n",
" goodcm = CoherenceModel(model=estimator, texts= texts, dictionary=estimator.id2word, coherence='c_v')\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gridsearch returns exception in the ipynb. Is it possible to have it fixed?

@kris-singh
Copy link
Contributor Author

@tmylk Could you have a look at the Travis . I don't understand why is it failing.

@tmylk
Copy link
Contributor

tmylk commented Mar 17, 2017

Tests fixed by smart_open update

@@ -67,5 +84,21 @@ def testCSRMatrixConversion(self):
self.assertTrue(isinstance(v, six.string_types))
self.assertTrue(isinstance(k, int))

def testPipline(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in name of the function

data = fetch_20newsgroups(subset='train',
categories=cats,
shuffle=True)
text_lda = Pipeline([('features', vec),('model', model)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add logistic regression to the pipeline to analyse output of the lda

Copy link
Contributor Author

@kris-singh kris-singh Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do that in the ipynb example as you had suggested. Also, I am not getting good accuracy using the features from lda transform around 52% which is meaningless for a binary classification task.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add it to the test.
accuracy is not important here. it is about being in compatible format

vec = CountVectorizer(min_df=10, stop_words='english')
rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are smaller datasets in test_data folder. downloading a lot of data makes tests run too long

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk i was not able to find a dataset that the labels. If you know can you please tell me which one to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a tiny 100k subset of newsgroups would be ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the size of the text docs that you are adding?

@kris-singh
Copy link
Contributor Author

@tmylk All changes made. Ready for merge. Please let me know if further changes are required.

@kris-singh
Copy link
Contributor Author

@tmylk any other issues that will help with nmf that i could possibly look at.

@@ -86,19 +92,15 @@ def testCSRMatrixConversion(self):

def testPipline(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in test name

@tmylk
Copy link
Contributor

tmylk commented Mar 20, 2017

Thanks! The PR looks good.
For completeess, could you please remove the section inappropriately called "Using together with Scikit learn's Logistic Regression". That section doesn't use gensim at all so shouldn't be in the notebook. It's an omission by the original author.

Please put your new Pipeline section instead of it so users can find it faster.

@kris-singh
Copy link
Contributor Author

Changes made. Also the size of the test file is around 300 kb.

@tmylk tmylk merged commit 97cd64f into piskvorky:develop Mar 21, 2017
@tmylk
Copy link
Contributor

tmylk commented Mar 21, 2017

Thanks for the new feature!

@piskvorky
Copy link
Owner

piskvorky commented Apr 9, 2017

@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging.

"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn import linear_model"
"def scorer(estimator, X,y=None):\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: space after comma.

},
"outputs": [],
"source": [
"id2word=Dictionary(map(lambda x : x.split(),data.data))\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map is discouraged -- use comprehensions and generators.

Also, PEP8 -- space after comma, spaces around =.

"clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n",
"clf.fit(X,data.target)\n",
"print_features(clf,vocab)"
"model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: spaces around assignment operator =. Other space/formatting/PEP8 issues further down this file, but this is the last comment.

@@ -109,4 +134,4 @@ def partial_fit(self, X):
if sparse.issparse(X):
X = matutils.Sparse2Corpus(X)

self.update(corpus=X)
self.update(corpus=X)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: newline at the end of file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants