Add examples for clustering #222

Merged: 7 commits, May 24, 2018

Ivanidzo4ka
Contributor

Addresses #205.

@Ivanidzo4ka Ivanidzo4ka requested review from GalOshri and codemzs May 24, 2018 00:36
@asthana86
Contributor

This is great! Can we also add this as an E2E sample in dotnet/machinelearning/samples, with a readme.md similar to the ones we are adding for regression, binary, and multi-class classification?

@Ivanidzo4ka
Contributor Author

I don't see a "dotnet/machinelearning/samples" repo. Can you provide a link to it?


In reply to: 391724603 [](ancestors = 391724603)


var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(dataPath).CreateFrom<NewsData>(useHeader: false));
pipeline.Add(new CategoricalOneHotVectorizer("Label"));
Contributor

@zeahmed zeahmed May 24, 2018

I think "Label" is not used in clustering unless the model is being evaluated against true labels. Why is CategoricalOneHotVectorizer being applied to "Label"? #Resolved

string dataPath = GetDataPath(@"external/20newsgroups.txt");

var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(dataPath).CreateFrom<NewsData>(useHeader: false));
Contributor

@zeahmed zeahmed May 24, 2018

Make sure to set allowQuotedStrings and supportSparse properly. The dataset that I have is NOT quoted and is not in sparse format. By default, both of these options are turned on in TextLoader. #Resolved

Contributor Author

The dataset I have actually has quotes inside the mail content, but it's definitely not sparse.


In reply to: 190677490 [](ancestors = 190677490)

pipeline.Add(CollectionDataSource.Create(data));
pipeline.Add(new KMeansPlusPlusClusterer() { K = k });
var model = pipeline.Train<ClusteringData, ClusteringPrediction>();
//validate no initial centers of clusters belong to same cluster.
Contributor

@zeahmed zeahmed May 24, 2018

Just a minor comment on this line:
//validate no initial centers of clusters belong to same cluster.
These don't seem to be initial centers, since they are not passed to the KMeans trainer as initial cluster centers. That is what "initial center" means in k-means and other clustering algorithms.

Rather, these are just data points curated so that they appear to be cluster centers initially. #Pending

Contributor Author

Is the current phrasing better?


In reply to: 190697440 [](ancestors = 190697440)
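
To make the distinction in this thread concrete, here is a minimal, library-free sketch of k-means (plain Lloyd iterations). It is not the KMeansPlusPlusClusterer implementation and does not use the ML.NET API; the class name `KMeansSketch` and the methods `Nearest` and `Run` are hypothetical names chosen for illustration. The point it shows: "initial centers" are the seeds handed to the algorithm, while the curated points are ordinary input data, whatever they happen to look like.

```csharp
using System;
using System.Linq;

// Illustrative only: a tiny k-means, NOT the ML.NET trainer.
public static class KMeansSketch
{
    // Index of the center closest to the point (squared Euclidean distance).
    public static int Nearest(double[] point, double[][] centers)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int c = 0; c < centers.Length; c++)
        {
            double dist = 0;
            for (int d = 0; d < point.Length; d++)
            {
                double diff = point[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // Lloyd iterations that start from explicitly supplied initial centers.
    public static double[][] Run(double[][] data, double[][] initialCenters, int iterations)
    {
        // Copy the seeds so the caller's arrays are not modified.
        var centers = initialCenters.Select(c => (double[])c.Clone()).ToArray();
        for (int it = 0; it < iterations; it++)
        {
            var sums = centers.Select(c => new double[c.Length]).ToArray();
            var counts = new int[centers.Length];
            foreach (var point in data)
            {
                int k = Nearest(point, centers);
                counts[k]++;
                for (int d = 0; d < point.Length; d++) sums[k][d] += point[d];
            }
            for (int k = 0; k < centers.Length; k++)
            {
                if (counts[k] == 0) continue; // an empty cluster keeps its old center
                for (int d = 0; d < centers[k].Length; d++)
                    centers[k][d] = sums[k][d] / counts[k];
            }
        }
        return centers;
    }

    public static void Main()
    {
        // Curated data points: two well-separated groups.
        var data = new[]
        {
            new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 },
            new[] { 10.0, 10.0 }, new[] { 10.0, 11.0 },
        };
        // Initial centers: seeds given to the algorithm, here both taken from one group.
        var seeds = new[] { new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 } };

        foreach (var center in Run(data, seeds, 5))
            Console.WriteLine(string.Join(", ", center));
    }
}
```

In this sketch the seeds both come from the same group, yet the iterations still move the centers apart, one toward each group. A test that only supplies well-separated data points, as in the diff above, is checking where those points land after training, not anything about the seeds the trainer picked.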

Contributor

@TomFinley TomFinley left a comment

Thanks @Ivanidzo4ka. You might want to change the title of #205, since I think it is wrong (that is, the API was there, it just wasn't clear how to use it, which you have, I hope, now addressed).

[Column(ordinal: "0")]
public string Id;

[Column(ordinal: "1", name: "Label")]
Contributor

Should these be using DefaultColumnNames?


pipeline.Add(new KMeansPlusPlusClusterer() { K = 20 });
var model = pipeline.Train<NewsData, ClusteringPrediction>();
var gunResult = model.Predict(new NewsData() { Subject = "Let's disscuss gun control", Content = @"The United States has 88.8 guns per 100 people, or about 270,000,000 guns, which is the highest total and per capita number in the world. 22% of Americans own one or more guns (35% of men and 12% of women). America's pervasive gun culture stems in part from its colonial history, revolutionary roots, frontier expansion, and the Second Amendment, which states: ""A well regulated militia,
Contributor

Oh good I'm glad we didn't decide to write anything controversial here. 😄

@justinormont
Contributor

@asthana86

This is great! Can we also add this as an E2E sample in dotnet/machinelearning/samples, with a readme.md similar to the ones we are adding for regression, binary, and multi-class classification?

@zyw400 may have some samples for clustering which we can move to the repo.

@justinormont justinormont merged commit b1bbceb into dotnet:master May 24, 2018
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* example

* add Clusters tests

* cleanup

* address comments

* bring clustering reference back

* rephrasing
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
5 participants