Machine Learning based projects #4660
Great idea.
I am highly in favour of automating our services. The main problem is the absence of ML libraries for Rails. We can base most of the above on Isolation Forest, Naive Bayes, BBN, CNN, ANN, etc., which are heavily implemented in Python, not in Rails. Writing those libraries from scratch does not make sense at all. |
I would love to participate in any of these projects! I've worked with Recommendation Systems and Sentiment Analysis during my graduation. But I don't know of any such libraries for Rails. I've only worked with libraries in Python and R before. |
Same here; I would love to work on these projects. Some of them are in my current semester curriculum, but they are heavily based on Python and R.
|
We could make a Flask server.
|
Hi everyone, just dropping in here to say that making a Flask server for the data science stuff is the correct approach here. Essentially, you would need a separate server crunching the numbers and acting as an interface to the models. This Flask server would need to run in a separate container, and I volunteer to make the appropriate changes to the docker-compose config to make sure this floats. Looking forward to assisting people in implementing the above cool features in the website. |
*Tag Prediction*
Suggest tags based on the content of posts published on the Public Lab website.
*1. Real World / Business Objectives and Constraints*
1.1 Predict as many labels as possible correctly.
1.2 No strict latency constraint.
1.3 The cost of errors is a bad user experience.
*2. Machine Learning Problem*
*2.1 Data*
Training the model requires a lot of data, which can be collected via the API.
*Data Field Explanation*
Id - unique identifier for each question
Title - the question's title
Body - the body of the question
Tags - the tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')
*2.2 Mapping the real-world problem to a Machine Learning Problem*
*2.2.1 Type of Machine Learning Problem*
It is a multilabel classification problem.
Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.
Credit: http://scikit-learn.org/stable/modules/multiclass.html
*2.2.2 Performance metric*
*Micro-Averaged F1 Score (Mean F Score)*: The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.
*Micro F1 score*: calculate metrics globally by counting the total true positives, false negatives and false positives.
*Macro F1 score*: calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
*2.2.3 Machine Learning Objectives and Constraints*
1. Maximize the micro-averaged F1 score.
2. Try out multiple strategies for multi-label classification.
*3. Exploratory Data Analysis*
3.1 Use Pandas with SQLite to load the data
3.2 Analysis of tags
3.3 Cleaning and preprocessing:
1. Sample data points
2. Separate code from body
3. Remove special characters from the question title and description
4. Remove stop words
5. Remove HTML tags
6. Convert all characters to lowercase
7. Use SnowballStemmer to stem the words
*4. Machine Learning Models*
4.1 Convert tags for the multilabel problem
4.2 Split the data into train and test sets (80:20)
4.3 Featurize the data with a TF-IDF vectorizer
4.4 Apply Logistic Regression / SVM with a OneVsRest classifier (see the sketch after the sources below)
*5. Testing the model*
*Sources / useful links*
YouTube: https://youtu.be/nNDqbUhtIRg
Research paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper: https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL
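As an illustration of steps 4.1-4.4 and the micro-F1 metric above, here is a minimal Python/scikit-learn sketch of the TF-IDF + one-vs-rest logistic regression pipeline. The tiny `posts`/`tags` lists are made-up placeholders standing in for data pulled from the Public Lab API, so the numbers it prints mean nothing; it only shows the shape of the approach.

```python
# Minimal sketch of sections 4.1-4.4: TF-IDF features + one-vs-rest logistic regression
# for multilabel tag prediction. The tiny `posts`/`tags` lists below are placeholders;
# real data would come from the Public Lab API and be split 80:20 before evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

posts = [
    "how to build a diy spectrometer",
    "calibrating my spectrometer with a laser pointer",
    "balloon mapping an oil spill near the coast",
    "stitching aerial balloon photos into a map",
]
tags = [
    ["spectrometry"],
    ["spectrometry"],
    ["balloon-mapping", "oil-spill"],
    ["balloon-mapping"],
]

# One binary indicator column per tag.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# TF-IDF featurization of the cleaned post text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(posts)

# One independent logistic regression per tag (one-vs-rest).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Predict tags for a new post.
new_post = ["my balloon photos of the spill"]
pred = clf.predict(vectorizer.transform(new_post))
print(mlb.inverse_transform(pred))

# Micro-averaged F1 (computed on the training set here only because the toy set is too small to split).
print(f1_score(y, clf.predict(X), average="micro"))
```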
|
I really love your research, but it's important to take input from @jywarren
on whether or not the organisation is aiming to bring ML into its current projects.
Sooner or later we will need to enable ML, but it depends on the core mission projects too.
So Jeff can best guide us on whether these ideas should be discussed further now or taken care of later.
Thanks everyone.
|
Hello everyone!
Please let me know if I should start working on this, since it will take a lot of time commitment and effort on my part.
Or if you want me to work on something else, please let me know.
|
Hi, thanks to everyone for your input here! I think there are some potential use cases for machine learning across the Public Lab ecosystem! But perhaps we need to do a bit more in-depth brainstorming on individual examples. For example, I'm not sure that running a containerized flask server as part of the main site's setup is necessary -- could such a separate server for data analysis access data via the API?
Of the brainstormed applications, I'm hesitant on the spam one -- I like the basic premise, but to me it seems more sustainable and less 'reinvent the wheel' to look at an existing library or service for spam identification, like Akismet or something. I'm sure others have worked on this problem, and I am less sure we could provide something unique that would be competitive. On the other hand, I'd love to think about places in the PL ecosystem where machine learning would present a really unique benefit that supports our overall mission. Would Spectral Workbench be one of those places?
On MapKnitter, would it be plausible to scan images and try to identify features and tag them accordingly? The Vision API at Google Cloud can do some pretty interesting things there: https://cloud.google.com/vision/ -- although in this test it didn't seem to find anything in this aerial photo except that it was an aerial photo 😄.
Perhaps one approach here might be to begin a Zooniverse project using MapKnitter data: https://www.zooniverse.org/lab -- that could then be used as training data to develop a machine learning approach to identifying, say, areas at high risk of spills, pollution, etc. Terrapattern tried doing something kind of like this: https://qz.com/764746/terrapattern-open-source-satellite-photo-search-tool/ http://www.terrapattern.com/about
That could be a really interesting approach, and I like the idea of using the MapKnitter image set to help an ML approach get better at identifying pollution. Note that Terrapattern also uses OpenStreetMap tags to train its model. Perhaps we could correlate MapKnitter images with any OSM tags that overlap with the images shown, although there might not be too many.
Anyhow, these are some ideas that get a bit at the environmental mission of Public Lab, and might make for an interesting set of possible projects that wouldn't necessarily live IN the main codebase. |
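For anyone curious what that Vision API call looks like in practice, here is a rough sketch using the `google-cloud-vision` Python client for label detection only. It assumes a GCP project with the Vision API enabled and credentials already configured, and `map.jpg` is just a placeholder filename; as noted above, the labels it returns for aerial imagery may be quite generic.

```python
# Rough sketch: ask the Google Cloud Vision API for labels on a MapKnitter image.
# Assumes the google-cloud-vision package (2.x+) is installed and credentials are set
# via GOOGLE_APPLICATION_CREDENTIALS; "map.jpg" is a placeholder filename.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("map.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# label_detection returns broad labels such as "aerial photography".
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2f}")
```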
This is a really great example of using machine learning to identify environmental issues: https://skytruth.org/2019/02/using-machine-learning-to-map-the-footprint-of-fracking-in-central-appalachia/ -- it also gets at some of the challenges, and discusses how to use existing manually categorized datasets as a training set, OR how to use existing databases to correlate with imagery to train a model. Great work, @SkyTruth! |
Hey everyone, and thanks @jywarren for your wonderful input; your proposed ideas are very cool and interesting. |
Hey everyone, we can host a Flask server in this way:
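A minimal sketch of what such a Flask prediction service could look like is below. The `model.pkl` file, the `/predict` route, and the request/response shape are illustrative assumptions, not an agreed interface; the idea is just that the Rails app (or a background job) POSTs post text here and gets suggested labels back.

```python
# Minimal sketch of a separate Flask service that loads a trained model and
# exposes a /predict endpoint; model.pkl and the request/response shape are
# placeholders, not an agreed-upon interface.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained pipeline (e.g., a TF-IDF + classifier pipeline) once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"text": "post body to tag"}.
    text = request.get_json(force=True).get("text", "")
    labels = model.predict([text])
    return jsonify({"labels": labels.tolist()})

if __name__ == "__main__":
    # In production this would sit behind gunicorn in its own container,
    # matching the docker-compose approach discussed earlier in the thread.
    app.run(host="0.0.0.0", port=5000)
```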
Goal: automatically label aerial imagery.
Implementing the machine learning model in simple steps:
The lengthy task: collecting pairs of aerial images and labels. One important yet rarely discussed aspect of using machine learning for aerial image interpretation is the source of the data. For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. This type of data can act as a source of noisy labels, which are correct with very high probability when they indicate the presence of an object and with lower, but still substantially high, probability when they indicate the absence of an object. Training a classifier on large amounts of this type of noisy data with a robust loss function can potentially produce a much better detector than using a much smaller set of accurate labels. At present, there seem to be no applications of robust estimators to aerial image data with noisy labels.
In a classification task, small translations or rotations can be applied to the input images, but in order to apply the same idea to image labelling one must be able to realistically transform both the image and the labels. On a road detection task, applying rotations to each training case before it is processed has been shown to help prevent overfitting. So we need to start making our own dataset for better results. The most important part is the data: a larger and more accurate sample will lead to better results.
Tagging: it is almost the same task as I suggested earlier for text; the difference is that the dataset is now images, so we need to use a CNN. There is a great blog post by Adit on how CNNs actually work for image classification: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
The machine learning model: Residual Network (ResNet), a major breakthrough in CNNs. Why does it work?
To understand it more deeply you can go through a great intuitive blog: https://wiseodd.github.io/techblog/2016/10/13/residual-net/
Our approach to making the model better:
1. Instead of softmax, use the sigmoid activation function.
2. Optimize the tag threshold to maximize the F2 score.
Often the optimal threshold for the F2 score is found by trial and error, but instead we can find the best threshold with a brute-force search on a local validation set, which can net really good results on the leaderboard without much overfitting in the local score. Basically, you try every possible threshold on a local validation set, take the best-performing threshold, and apply it to the test set (see the sketch below).
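Here is a small sketch of that brute-force threshold search, assuming a validation set of predicted tag probabilities (`probs`) and true labels (`y_val`); both are random placeholders below, and scikit-learn's `fbeta_score` with `beta=2` stands in for the F2 metric.

```python
# Brute-force search for the probability threshold that maximizes F2 on a validation set.
# `probs` (n_samples x n_tags predicted probabilities) and `y_val` (true binary labels)
# are random placeholders for real validation data.
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
probs = rng.random((100, 5))                        # placeholder predicted probabilities
y_val = (rng.random((100, 5)) > 0.7).astype(int)    # placeholder true labels

best_threshold, best_f2 = 0.5, -1.0
for threshold in np.arange(0.05, 0.95, 0.05):
    preds = (probs >= threshold).astype(int)
    f2 = fbeta_score(y_val, preds, beta=2, average="micro", zero_division=0)
    if f2 > best_f2:
        best_threshold, best_f2 = threshold, f2

print(f"best threshold {best_threshold:.2f} -> F2 {best_f2:.3f}")
# The chosen threshold is then applied unchanged to the test set.
```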
A very common trick in ML is transfer learning: instead of training your model from a random initialization, we can initialize the parameters from another, similar model that has already been trained on a different dataset, which is a great head start. Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on the other problem as a starting point. For example, if you want to build a self-driving car, you can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify images. A pre-trained model may not be 100% accurate for your application, but it saves the huge effort required to reinvent the wheel. Let me show this with an example.
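As one possible sketch (not the author's original example), this is roughly what reusing a pre-trained ResNet50 for multilabel aerial-image tagging could look like in Keras, with the sigmoid head suggested above; `NUM_TAGS` and the input size are placeholders.

```python
# Sketch: transfer learning with a pre-trained ResNet50 for multilabel aerial-image tagging.
# NUM_TAGS is a placeholder; the sigmoid head + binary cross-entropy follow the
# "sigmoid instead of softmax" suggestion above.
import tensorflow as tf

NUM_TAGS = 10  # placeholder: number of possible image tags

# ImageNet-pretrained backbone, without the original 1000-class classifier head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone at first; fine-tune later if needed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),  # independent per-tag probabilities
])

# Binary cross-entropy treats each tag as an independent yes/no decision.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")

# model.fit(train_images, train_tag_matrix, validation_data=..., epochs=...)
```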
The more data the better, so we can, for example, rotate our images by 90 degrees left and right, which effectively increases the size of our dataset.
Or we could also do well with "ConNets101".
Basically, "semantic segmentation" attempts to partition the image into semantically meaningful parts and to classify each part into one of the predetermined classes. You can also achieve the same goal by classifying each pixel (rather than the entire image/segment); in that case you are doing pixel-wise classification, which leads to the same end result via a slightly different path.
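To make the pixel-wise framing concrete, here is a tiny sketch: whatever network produces them, per-pixel class scores of shape (height, width, num_classes) become a segmentation label map by taking the argmax over the class axis. The class names and random scores below are placeholders.

```python
# Tiny sketch: turning per-pixel class scores into a semantic segmentation label map.
# `scores` stands in for the output of any segmentation network; class names are placeholders.
import numpy as np

CLASSES = ["background", "water", "vegetation", "building"]  # placeholder classes

rng = np.random.default_rng(0)
scores = rng.random((256, 256, len(CLASSES)))  # placeholder per-pixel class scores

label_map = scores.argmax(axis=-1)             # (256, 256) array of class indices
print(label_map.shape, CLASSES[label_map[0, 0]])

# Per-class area, e.g. the fraction of pixels classified as "water".
water_fraction = (label_map == CLASSES.index("water")).mean()
print(f"water covers {water_fraction:.1%} of the image")
```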
Other ideas for future work:
1. Detection of oil spills. Detecting oil spills accurately with a CNN is a very tough task, because some natural phenomena look similar from space and a small sample size does not help. We need SAR images to detect oil spills correctly, because in a SAR image an oil spill shows up as a dark formation that can be detected more easily. The following can prove to be useful:
2. Detection and mapping of plastic. We could detect plastic with our trained model using object detection. While labelling the data, we need to make a specific plastic / no-plastic label, so that our CNN can learn from thousands of examples of labelled plastic pieces and finally be able to tell what is a piece of plastic and what is not. We could detect different types of plastic, like rope, toys, etc.
When somebody uploads a geotagged image on MapKnitter, we can look up the PM2.5 level and the air quality using the following link: https://aqicn.org/map/india/#@g/19.9884/80.5078/5z. Predicting future air pollution patterns is itself a major machine learning task, though. PM2.5 refers to the tiny particles in the air that reduce visibility and cause the air to appear hazy; it is affected by meteorological and traffic factors, the burning of fossil fuels, and industrial parameters such as power plant emissions, which play a significant role in air pollution. The required dataset would need to capture these factors.
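A rough sketch of that geotag-to-air-quality lookup is below. It assumes the REST API behind aqicn.org (the World Air Quality Index project's api.waqi.info, which needs a free token); the endpoint shape and response fields are assumptions that should be checked against their documentation.

```python
# Rough sketch: look up air quality near a geotagged MapKnitter image.
# ASSUMPTION: uses the WAQI REST API behind aqicn.org; the endpoint shape,
# token requirement, and response fields should be verified against their docs.
import requests

WAQI_TOKEN = "YOUR_TOKEN_HERE"  # free token from the WAQI project (placeholder)

def air_quality_for_image(lat, lon):
    url = f"https://api.waqi.info/feed/geo:{lat};{lon}/?token={WAQI_TOKEN}"
    data = requests.get(url, timeout=10).json()
    if data.get("status") != "ok":
        return None
    # "aqi" is the overall index; PM2.5 detail may appear under data["iaqi"].
    return data["data"].get("aqi")

print(air_quality_for_image(19.9884, 80.5078))  # coordinates from the map link above
```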
Our system does two tasks:
Since our plan is quite extensive, I'd like to begin working on it as soon as possible. I'd like to invite your input on it, primarily: should I start the project on Zooniverse, or should I start labelling it manually? Thanks, everyone |
Hi! This is a lot of information - thanks for compiling it! I wanted to ask a few things first --
I hope this helps! |
Oh, and also, starting a Zooniverse project would be GREAT! @Zengirl2 may be interested in this too. |
Thanks @jywarren for the great input and for making things clearer and more interesting.
|
Zooniverse sounds great! I guess you should create a team first and add me (and @Zengirl2 or anyone else who is interested too) and I could then flesh out the rest of the project. Hope this sounds good? |
oh very cool, yes that sounds good! Can you email me with your email or
Zooniverse username at jeff@publiclab.org?
|
@SKashyapD Hey there--I do have a strong interest in Zooniverse, but I'm still behind on a fan project I'm working on. So, you can include me, but I won't be able to do much right now. |
Thanks @jywarren for creating the Zooniverse project. Should I start working on the semantic segmentation part? Thanks everyone |
@SKashyapD I originally had interest in using Zooniverse to go through possible pollution from hurricanes. They have started to do projects for hurricanes (although not with the pollution I would like). I was at the point of having conversations with two people from Zooniverse about learning to use their content system. I believe I may even have a video tutorial that they sent me. |
@SKashyapD Here are the links to some helpful info about setting up projects on Zooniverse (this was based on a specific example of flood/hurricane mapping I had been asking about): Doc Explanation / Youtube video |
There are lots of Hurricane Harvey images linked to from posts on this page: https://publiclab.org/wiki/harvey#Questions -- i hope that helps! |
@jywarren and @SKashyapD - when I log into Zooniverse it is not showing that I'm connected to any projects. Jeff, I remember seeing where you said you were going to invite me, but I don't remember getting any email about it. Can you see what name you used to add me? @SKashyapD what I was talking about as far as whether this is a Zooniverse project or private project is listed under lab policies. |
Please check your email; you should have received the invitation, because your username is the same in the project, @Zengirl2.
As for categorization of the project, @jywarren can probably say more about it. |
@SKashyapD Hey, just got the email today. Will look at the project tonight when I get home 🦄 |
@SKashyapD I had a chance to look at the project and it is coming along fine. I noticed that when I chose to mark an image, it did not give me another image once I had completed it. Was this because it is not yet live? Or have you not attached a file of images yet? Anyway, here are my comments:
|
@Zengirl2 I have fixed that problem; now it is working properly. Please take a look at it again.
Thank you! |
Hey @SKashyapD--your images are working correctly now 🎉
|
I saw machine learning and I wanted to chime in. Of the original list that @SidharthBansal compiled from the different sources of requests, I wanted to add that we had been discussing tag recommendation tangentially on the website (the code part of the conversation has moved to GitHub). At any rate, regarding @jywarren's comment above about not 'reinventing the wheel', there are some recommendation engines in Ruby that I recommended (har har) in my comment here. |
Sorry for the delay, @Zengirl2.
Can you please provide me with a few resources for more images, @jywarren @Zengirl2? Thanks everyone! |
Thanks @skilfullycurled for taking the initiative. You can check this out, it may help: recommendify |
Hi @SKashyapD -- can you help me find your SoC proposal? Did it get posted? |
hey, @jywarren Thank you! |
No, no, thank you for taking the initiative on an ML thread, @SKashyapD! I'm not sure I'll be able to take that much more initiative on the implementation of a tag recommendation system, since I don't have a lot of experience programming in Ruby; however, I really want to second your idea of having a server for this. One thing I would like to do is piggyback on your initiative and eventually start a conversation about how to grow a community around ML and data science, now that the stats downloads page is coming along. More on that later; I have to actually get back to my own data science project! |
@skilfullycurled that would be great |
Hello everyone
Not now.
I will start working on this after summer break.
Thank you
…On Mon, 1 Jul, 2019, 12:09 PM Gaurav Sachdeva ***@***.***> wrote:
Hi @jywarren @SKashyapD @Zengirl2, can we close this issue, or does anyone want to update it? Thanks!
|
Cool! |
@skilfullycurled @jywarren what are some other small projects we can start in order to grow the community around data science? Thanks! |
I am a newbie in the ML area. |
@SKashyapD, thank you for keeping the conversation alive! I'll rejoin as soon as I can. I just started school in a new program, so although PL GitHub conversations are one of my favorite ways to feel like I'm doing programming work while actually just avoiding it (not a joke), I should probably finish my work first. Still, I was too happy about the conversation not to join in. : )
One project comes immediately to mind: SPAM. There are currently two problems. The first is the spam we currently get from sign-ups and postings, and the second is the spam accounts that were made before there was a more robust moderation system. There's a period of time (can't recall what it is) where there are literally ~300,000 accounts. I think Public Lab is awesome, but that seems a tad inflated. ; ) Additionally, I believe that when users are moderated as spam, they are not removed from the database. Spam isn't the most exciting task, but it'd have a real impact: A) moderating spam is a huge resource drain; B) unless we're able to filter out spam accounts, there really can't be good data science, because the data won't be good.
This project has two quasi-FTOs. They aren't FTOs according to the actual definition, but the problem contains some "hello worlds" of data science. The first is for someone who is comfortable with Ruby (I don't think you have to be awesome at it; I hardly knew Python in my first data science class) but wants to get started in data science. The second is for someone who is comfortable with the fundamental exploratory data analysis tasks and wants to try a simple ML exercise. I've been collecting a dataset of spam.
Project 1: Exploratory data analysis. I started #5450 to discuss non-ML ways to detect spam, and I came up with some guidelines simply by exploring the data. These guidelines could become more robust with more exploratory analysis of a larger dataset. This would be a good way to get familiar with the SciRuby library collection and the fundamentals of data science (using Ruby notebooks, dataframes, selecting data, aggregating results, plotting, etc.). As I said, I've been collecting a dataset of spam, but we also need a way to identify past spam, because I'm sure the markers have changed over time.
Project 2: Creating a spam/ham classifier (see the sketch below). This is actually why I started the collection, so that we'd have enough for the spam part. The harder thing is collecting data for people who are in the ham category. So that's sort of in the Project 1 category, but after we have enough of both, there are plenty of tutorials for someone to have a nice learning experience. |
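For whoever eventually picks up Project 2, here is a minimal sketch of a spam/ham text classifier. It uses Python/scikit-learn rather than the SciRuby stack discussed in this thread, purely because it is the shortest way to show the shape of the task, and the example texts are made up; the real training data would be the spam and ham collections mentioned above.

```python
# Minimal sketch of Project 2: a spam/ham classifier over post text.
# Python/scikit-learn is used here just to show the shape of the task;
# the example texts are made up, and real data would come from the collected datasets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "cheap watches best price click here",
    "buy followers now limited offer",
    "calibrating my spectrometer with a fluorescent bulb",
    "notes from our balloon mapping flight over the wetland",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF features + multinomial Naive Bayes, a classic spam-filtering baseline.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["free followers, best price, click here"]))         # likely "spam"
print(clf.predict(["aerial photos of the oil spill near the canal"]))  # likely "ham"
```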
My pleasure @budema6, I'm excited about developing a community so it's really thanks to you for your interest! |
Update: I now have enough spam if anyone ever wants to take on training a spam/ham classifier for the site. If I recall correctly, I've seen a number of Jupyter notebooks that do this in Ruby. Of course, the data has to be parsed, and we need a ham dataset as well. In any event... |
Hey! This topic really interests me and I have made some Natural Language Processing projects with Python and the spaCy library. I'd love to help out and try applying NLP to spam detection. I'm no expert, but I think I could help 😄 |
@Uzay-G, thanks for reviving this thread. I'm not sure when/how but I'm thinking it might be a good idea to try to have a call. It just seems like there's enough interest in general, and it might be good to just meet each other and see if we can organize ourselves. I'd sort of like to see this become a tool topic just like balloon mapping or spectrometry. And, perhaps at some point even have a separate PL repo for projects the same way mapknitter does. Anyone at @publiclab/connectors, how are we handling developer open calls these days? |
Hi 😄, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label 🎉 . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! 💯 Thank you for your contributions 🙌 🎈. |
Sorry about the stalebot message here, it was a mistake! 😅 Can't seem to delete due to a GitHub API issue... strange. Carry on! |
Currently, our spam system is completely manual. Instead of reviewing similar content/posts by hand, I think we can use machine learning algorithms to ease the task.