Machine Learning based projects #4660

NeuralMonk · 2019-01-18T21:23:03Z

Currently, our spam system is completely manual.
Instead of manually reviewing similar content/posts, I think we can use
machine learning algorithms to ease the task.

SidharthBansal · 2019-01-19T06:23:42Z

Great idea.
@jywarren I want to add a couple more ideas. I know they are not core mission-driven projects, and we must focus on those before addressing these less important issues. But just to brainstorm a little:

Content-based tag recommendation system (suggested by Jeff)
Anomalous spam detection system (as suggested by @SKashyapD)
Recommendation system for posts (@Saurabh19126848_twitter suggestion on gitter chat)
Sentiment analysis (@Saurabh19126848_twitter suggestion on gitter chat)
Tag suggestions by natural language processing on nodes (suggested by me)

I am highly in favour of automating our services. The main problem is the absence of ML libraries for Rails. Most of the above can be built on algorithms such as Isolation Forest, Naive Bayes, BBN, CNN, ANN, etc., which are widely implemented in Python but not in Rails. Writing libraries from scratch does not make sense at all.
So, we also need to think of these considerations.

milaaraujo · 2019-01-19T07:53:32Z

I would love to participate in any of these projects! I've worked with recommendation systems and sentiment analysis during my graduation. But I don't know of any Rails libraries for this. I've only worked with libraries in Python and R before.

SidharthBansal · 2019-01-19T07:55:58Z

Same here. I would love to work on these projects. Some are in my current semester curriculum, but they are heavily based on Python and R.


NeuralMonk · 2019-01-19T11:24:26Z

We could make a Flask server.


ryzokuken · 2019-01-20T21:12:01Z

Hi everyone, just dropping in here to say that making a Flask server for the data science stuff is the correct approach. Essentially, you would need a separate server crunching the numbers and acting as an interface to the models. This Flask server would need to run in a separate container, and I volunteer to make the appropriate changes to the docker-compose config to make sure this floats. Looking forward to assisting people in implementing the cool features above on the website.
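As a rough sketch of what "a separate server acting as an interface to the models" could look like: Flask applications are WSGI applications, so the example below uses a bare WSGI callable from the standard library instead of Flask itself, purely so that it runs without extra dependencies. The `/predict` route, the JSON shape, and the keyword table standing in for a trained model are all assumptions made for illustration, not anything agreed in this thread.

```python
import io
import json

# Hypothetical keyword table standing in for a trained model.
KEYWORD_TAGS = {
    "water": "water-quality",
    "spectrometer": "spectrometry",
    "balloon": "balloon-mapping",
}


def predict_tags(text):
    """Placeholder 'model': return the tags whose keyword occurs in the text."""
    lowered = text.lower()
    return sorted(tag for word, tag in KEYWORD_TAGS.items() if word in lowered)


def _json_response(start_response, status, payload):
    """Serialize payload as JSON and start the WSGI response."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """WSGI app: POST /predict with {"text": "..."} returns {"tags": [...]}."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/predict"):
        return _json_response(start_response, "404 Not Found",
                              {"error": "not found"})
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        payload = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return _json_response(start_response, "400 Bad Request",
                              {"error": "invalid JSON"})
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    return _json_response(start_response, "200 OK",
                          {"tags": predict_tags(str(text))})


def post_json(wsgi_app, path, payload):
    """Call the app in-process, like a test client; returns (status, body)."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], json.loads(body)
```

To try it locally, the same `app` can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The Rails side would then only need to make an HTTP request, which keeps the Python dependencies out of the main codebase.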

NeuralMonk · 2019-01-25T12:49:46Z

*Tag Prediction*

Suggest tags based on the content of posts published on the Public Lab website.

*1. Real World / Business Objectives and Constraints*

1.1 Predict as many labels as possible correctly.
1.2 No strict latency constraint.
1.3 Cost of errors would be a bad customer experience.

*2. Machine Learning problem*

*2.1 Data*

Training the machine learning model requires a lot of data, which can be collected through the API.

*Data Field Explanation*

Id - Unique identifier for each question
Title - The question's title
Body - The body of the question
Tags - The tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')

*2.2 Mapping the real-world problem to a Machine Learning Problem*

*2.2.1 Type of Machine Learning Problem*

It is a multilabel classification problem.

Multilabel classification: Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.

Credit: http://scikit-learn.org/stable/modules/multiclass.html

*2.2.2 Performance metric*

*Micro-Averaged F1-Score (Mean F Score)*: The F1 score can be interpreted as the harmonic mean of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class, with the weighting depending on the averaging method.

*'micro f1 score'*: Calculate metrics globally by counting the total true positives, false negatives and false positives.

*'macro f1 score'*: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

*2.2.3 Machine Learning Objectives and Constraints*

1. Maximize the micro-averaged F1 score.
2. Try out multiple strategies for multilabel classification.

*3. Exploratory Data Analysis*

3.1 Using Pandas with SQLite to load the data
3.2 Analysis of tags
3.3 Cleaning and preprocessing
1. Sample data points
2. Separate code from body
3. Remove special characters from question title and description
4. Remove stop words
5. Remove HTML tags
6. Convert all characters to lowercase
7. Use SnowballStemmer to stem the words

*4. Machine Learning Models*

4.1 Converting tags for multilabel problems
4.2 Split the data into train and test (80:20)
4.3 Featurizing data with a TF-IDF vectorizer
4.4 Applying Logistic Regression/SVM with a OneVsRest classifier

*5. Testing the model*

*Sources / useful links*

YouTube: https://youtu.be/nNDqbUhtIRg
Research paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper: https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL
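To make the difference between the micro and macro averages in 2.2.2 concrete, here is a minimal sketch in plain Python. The tag names and the three toy posts are made up for illustration; in a real pipeline a library metric would normally be used instead of hand-written counting.

```python
def label_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} counted over all samples.

    y_true and y_pred are lists of sets of tags, one set per post.
    """
    labels = set().union(*y_true, *y_pred)
    counts = {}
    for label in sorted(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        counts[label] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    """F1 = 2PR / (P + R), written directly in terms of the counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts of every label first, then compute a single F1."""
    counts = label_counts(y_true, y_pred).values()
    return f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))


def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    counts = label_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0


# Toy data (hypothetical tags): true tags vs. predicted tags for three posts.
toy_true = [{"water", "spectrometer"}, {"water"}, {"air"}]
toy_pred = [{"water"}, {"water", "air"}, {"air"}]
print(micro_f1(toy_true, toy_pred), macro_f1(toy_true, toy_pred))
```

On this toy data the rare tag that is never predicted pulls the macro average well below the micro average, which is the label-imbalance effect mentioned above.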


SidharthBansal · 2019-01-25T13:30:07Z

I really love your research, but it's important to get input from @jywarren on whether the organisation is aiming to bring ML into current projects. Sooner or later we will need to enable ML, but it depends on the core mission projects too. So Jeff is best placed to guide us on whether these should be discussed further now or taken up later. Thanks everyone.


NeuralMonk · 2019-02-05T12:46:10Z

Hello everyone! Please let me know if I should start working on it, since it will take a lot of time and effort on my part. Or, if you want me to work on something else, please let me know.


jywarren · 2019-02-06T23:47:55Z

Hi, thanks to everyone for your input here! I think there are some potential use cases for machine learning across the Public Lab ecosystem! But perhaps we need to do a bit more in-detail brainstorming on individual examples. For example, I'm not sure that running a containerized flask server as part of the plots2 codebase makes sense because it dramatically expands the setup complexity of the project (we had an issue with this in a previous project to run a Solr container), but perhaps it could make sense to develop in a separate repository?

Could such a separate server for data analysis access data via the API?

Of the brainstormed applications, I'm hesitant on the spam one -- I like the basic premise, but to me, it seems more sustainable and less 'reinvent the wheel' to look at an existing library or service for spam identification, like Akismet or something. I'm sure others have worked on this problem and am less sure we could provide something unique that would be competitive.

On the other hand, I'd love to think about places in the PL ecosystem where machine learning would present a really unique benefit that supports our overall mission.

Would Spectral Workbench be one of those places?

I note a mention of neural networks for trying to solve an issue here: Port "capture" interface into this library from main spectral-workbench project spectral-workbench.js#56 (comment) (although seems that should be broken into its own issue)
@Lucaszw emailed me some time back with the idea of using machine learning to apply appropriate tags to spectra in SpectralWorkbench. That also seems interesting!

On MapKnitter, would it be plausible to scan images and try to identify features and tag accordingly?

The Vision API at Google Cloud can do some pretty interesting things there: https://cloud.google.com/vision/

Although in this test it didn't seem to find anything in this aerial photo except that it was an aerial photo 😄

Perhaps one approach here might be to begin a Zooniverse project using MapKnitter data: https://www.zooniverse.org/lab

Then that could be used as training data to develop a machine learning approach to identifying, say, areas of high risk of spills, pollution, etc.

Terrapattern tried doing something kind of like this: https://qz.com/764746/terrapattern-open-source-satellite-photo-search-tool/

http://www.terrapattern.com/about

That could be a really interesting approach, and I like the idea of using the MapKnitter image set to help an ML approach get better at identifying pollution.

Note that Terrapattern also uses OpenStreetMap tags to train its model. Perhaps we could correlate MapKnitter images with any OSM tags which overlap with the images shown, although there might not be too many.

Anyhow, these are some ideas that get a bit at the environmental mission of Public Lab, and might make for an interesting set of possible projects that wouldn't necessarily live IN the plots2 codebase, but could be really powerful tools for our community.

jywarren · 2019-02-14T16:28:41Z

This is a really great example of using machine learning to identify environmental issues: https://skytruth.org/2019/02/using-machine-learning-to-map-the-footprint-of-fracking-in-central-appalachia/

It also gets at some of the challenges, as well as discussing how to use existing manually categorized datasets as a training set, OR to use existing databases to correlate with imagery to train a model. Great work, @SkyTruth!

NeuralMonk · 2019-02-14T18:36:06Z

Hey everyone, and thanks @jywarren for your wonderful inputs. Your proposed ideas are very cool and interesting.
I have already started reading and researching about them. It will take me about a week to find out how things are supposed to be done.
Thanks, everyone.

NeuralMonk · 2019-02-27T21:14:19Z

Hey everyone,
I have done my research on given ideas and devised the following plan:
@jywarren, it is definitely a good idea to create a new repository for machine learning based projects, with a separate data-analysis server that accesses data via the API.

We can host a Flask server in this way:

It will take the screenshot of the image,
Feed it to the input of the model,
Take the output of the model to show it on the web page.

Goal: Automatically label aerial imagery

Tagging,
Semantic segmentation.

Implementing the Machine learning model in simple steps:

Collect pairs of images and labels,
Write a program that predicts labels for given images (the model),
Let the computer automatically tune parameters to mimic the examples (learning).

The lengthy task: collecting pairs of aerial images and labels

One important yet rarely discussed aspect of using machine learning for aerial image interpretation is the source of the data.
Since labelling images is a very time-consuming process, the datasets have been small in both aerial image applications and general image labelling work. Hence, obtaining good sources of accurately labelled data is important for both evaluating existing approaches and training systems that are likely to work under varying conditions.
In some domains, hand-labelling data in order to train a classifier is not necessary because the label information is often readily available. For example, in the case of road detection (Semantic segmentation), the locations of existing roads are typically known because they are useful for navigation and not just as target labels in a machine learning task.
The abundance of accurately labelled data for road detection makes it a very good candidate for evaluating existing aerial image interpretation systems as well as the application of machine learning techniques.

For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. This type of data can act as a source of noisy labels, which are correct with very high probability when they indicate the presence of an object and with lower, but still substantially high, probability when they indicate the absence of an object. Training a classifier on large amounts of this type of noisy data with a robust loss function can potentially produce a much better detector than by using a much smaller set of accurate labels. At present, there seem to be no applications of robust estimators to aerial image data with noisy labels.
For object classes such as cars, or for areas where Google Maps has neither accurate nor complete map information, the options seem to be hand-labelling the data or using crowdsourcing tools like Zooniverse (https://www.zooniverse.org/), which can help us build the dataset.

In a classification task, small translations or rotations can be applied to the input images, but in order to apply the same idea to image labelling one must be able to realistically transform both the image and the labels. On a road detection task, applying rotations to each training case before it is processed has been shown to help prevent overfitting.

So we need to start making our own dataset to get better results.
We can do it manually, and I'd like to volunteer to do so with the help of a Python script.
Alternatively, platforms like Zooniverse can be used to create the dataset:
https://help.zooniverse.org/getting-started/

The most important part is the data. A larger and more accurate sample will lead to better results.
The primary obstacle is imbalance in the dataset, which makes detecting rare labels a difficult task.

Tagging

This is almost the same task as the one I suggested earlier for text. The difference is that the dataset now consists of images, so we need to use a CNN. There is a great blog post by Adit on how CNNs actually work for image classification: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

The machine learning model

Residual Network (ResNet), which is a major breakthrough in CNNs:
1. It allows training models with hundreds of layers for greater accuracy.
2. Layers compute the residual (delta) between input and output.

Why does it work?

Each layer has less work to do (no copying of its input).
Skip connections allow the gradient to flow more easily.
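The gradient argument can be written out in one line. This is the standard residual-block formulation rather than anything specific to this thread; the symbols $F$ (the layers inside the block), $W$ (their weights) and $\mathcal{L}$ (the loss) are notation introduced here for illustration.

```latex
y = x + F(x; W)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
```

Because of the identity term $I$, part of the gradient reaches $x$ unchanged even when $\partial F / \partial x$ is small, which is why stacking many such blocks does not make the gradient vanish as quickly as in a plain deep network.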

To understand this more deeply, you can go through this great, intuitive blog post: https://wiseodd.github.io/techblog/2016/10/13/residual-net/

Our approach to making our model better

1. Instead of softmax, use the sigmoid activation function. An image can carry several tags at once, so each label needs its own independent probability rather than a share of a total that sums to 1.

2. Optimize the tag threshold to maximize the F2 score.

Often we try to find the optimal threshold for the F2 score by trial and error. Instead, a brute-force search on a local validation set can give really good results on the leaderboard (LB) without much overfitting to the local score. Basically, you try every candidate threshold on a local validation set, take the best-performing one, and apply it to the test set.
We also know that the best threshold differs greatly between classes, so we can get a further big improvement by setting a different threshold for each class.
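A minimal sketch of the per-class brute-force threshold search described above, in plain Python. The grid of 99 candidate thresholds and the validation numbers are assumptions chosen for illustration; the F-beta formula itself is the standard one, with beta = 2 weighting recall more heavily than precision.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from counts; beta=2 gives the F2 score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(probs, truths, beta=2.0, steps=99):
    """Try thresholds 0.01, 0.02, ..., 0.99 for ONE class on validation data.

    Returns (threshold, score) for the first threshold reaching the best score.
    """
    best_t, best_score = 0.5, -1.0
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        tp = sum(1 for p, y in zip(probs, truths) if p >= t and y)
        fp = sum(1 for p, y in zip(probs, truths) if p >= t and not y)
        fn = sum(1 for p, y in zip(probs, truths) if p < t and y)
        score = fbeta(tp, fp, fn, beta)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


def per_class_thresholds(prob_columns, truth_columns, beta=2.0):
    """Run the search independently for every class (one column per class)."""
    return [best_threshold(p, y, beta)[0]
            for p, y in zip(prob_columns, truth_columns)]


# Hypothetical validation predictions for two tags, four images each.
val_probs = [[0.9, 0.4, 0.35, 0.1], [0.8, 0.7, 0.6, 0.2]]
val_truth = [[1, 1, 0, 0], [1, 0, 1, 0]]
print(per_class_thresholds(val_probs, val_truth))
```

The thresholds found this way are then frozen and applied unchanged to the test set; re-tuning them on test data would leak information.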

Using a pre-trained model

This is a very common trick in ML, also known as transfer learning: instead of training your model from a random initialization, you initialize it with the parameters of a similar model that has already been trained on a different dataset. That is basically a great head start.

Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on the other problem as a starting point.

For example, suppose you want to build a self-learning car. You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify what is in those pictures.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to reinvent the wheel.

Augment the labelled dataset using lossless image transformations.

The more data the better, so we can, for example, rotate each image by 90 degrees left and right, which increases the size of our dataset.
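A small sketch of the 90-degree rotation idea, using nested lists in place of real image arrays so it runs without any dependencies. For image-level tags the labels stay the same after rotation; for per-pixel labels (segmentation) the mask has to be rotated together with the image, which is what this sketch assumes.

```python
def rot90(grid):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise, losslessly."""
    return [list(row) for row in zip(*grid[::-1])]


def augment(image, mask):
    """Return the four 90-degree rotations of an (image, mask) pair.

    The same rotation is applied to both, so pixels and labels stay aligned.
    """
    pairs = [(image, mask)]
    for _ in range(3):
        image, mask = rot90(image), rot90(mask)
        pairs.append((image, mask))
    return pairs


# Toy 2x2 "image" and its per-pixel label mask (hypothetical values).
toy_image = [[1, 2],
             [3, 4]]
toy_mask = [["road", "road"],
            ["tree", "water"]]
for img, msk in augment(toy_image, toy_mask):
    print(img, msk)
```

Each source pair yields four training pairs, with no interpolation or information loss because pixels are only moved, never resampled.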

Tune the learning rate (LR) manually
It is very important to find which LR gives the best performance.
Ensembling of 3 model architectures (optional)
ResNet 5x
Inception 5x
DenseNet 5x

Or we can also do well with a single "ConvNet101".
It depends on what resources we have.
Ensembling is a good ML approach, but it gives only a small boost in F2 score and takes about 15 times more computation than ConvNet101.
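One simple way to combine the models, sketched below, is to average their per-tag probabilities before thresholding. Averaging is an assumption here (it is the rule mentioned later for the segmentation models); weighted averaging or voting would be alternatives. The probability values are made up.

```python
def ensemble_average(predictions):
    """Average per-tag probabilities across models.

    predictions: one probability vector per model, all of equal length.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


# Hypothetical outputs of three models for the same image and two tags.
model_outputs = [
    [0.9, 0.2],
    [0.7, 0.4],
    [0.8, 0.0],
]
print(ensemble_average(model_outputs))
```

With 3 architectures trained 5 times each, every prediction needs 15 forward passes, which is where the roughly 15x compute cost comes from.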

Semantic segmentation

Basically, "semantic segmentation" attempts to partition the image into semantically meaningful parts and to classify each part into one of the pre-determined classes. You can also achieve the same goal by classifying each pixel (rather than the entire image/segment). In that case, you are doing pixel-wise classification, which leads to the same end result by a slightly different path.
To understand it more deeply, check out this very insightful blog post:
https://www.jeremyjordan.me/semantic-segmentation/

ResNet-based FCN architecture
Fine-tune a pre-trained model
Use IR-R-G images as input
Make predictions using a sliding window, because the network can only handle 256x256 inputs
Average an ensemble of five models
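The sliding-window step can be sketched as follows: compute the tile boxes that cover a large aerial image with 256x256 windows. The stride and the choice to push the last tile flush against the image edge are assumptions; images smaller than one tile would need padding, which is not shown.

```python
def window_origins(size, tile, stride):
    """Start offsets along one axis so that tiles cover [0, size)."""
    if size <= tile:
        return [0]  # a single tile; the caller must pad the image
    origins = list(range(0, size - tile + 1, stride))
    if origins[-1] != size - tile:
        origins.append(size - tile)  # final tile flush with the edge
    return origins


def sliding_windows(height, width, tile=256, stride=256):
    """Return (top, left, bottom, right) boxes covering the whole image."""
    return [
        (top, left, top + tile, left + tile)
        for top in window_origins(height, tile, stride)
        for left in window_origins(width, tile, stride)
    ]


# A hypothetical 600x512 image needs 3 rows x 2 columns of tiles.
for box in sliding_windows(600, 512):
    print(box)
```

Where tiles overlap (here the last row overlaps the previous one), the per-pixel predictions from both tiles can be averaged when stitching the full-size mask back together.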

Other ideas for future work

1. Detection of oil spills

Detecting oil spills accurately using a CNN is a very tough task because some natural phenomena look similar from space, and a small sample size does not help. We need SAR images to detect oil spills correctly, because in a SAR image an oil spill appears as a dark formation, which can be detected easily. The following may prove useful:

Fully convolutional network
FCN-GoogLeNet
FCN-ResNet
Deep neural autoencoder

2. Detection and mapping of plastic

We can detect plastic with our trained model using object detection. While labelling the data, we need to make a specific label for plastic or no-plastic, so that our CNN can learn from thousands of examples of labelled plastic pieces and finally tell what is a piece of plastic and what is not. We may also be able to detect different types of plastic, like rope, toys, etc.

3. Air pollution

When somebody uploads a geotagged image to MapKnitter, we can look up the PM2.5 level and the air quality for that location using the following link: https://aqicn.org/map/india/#@g/19.9884/80.5078/5z
So we can classify whether or not the air is polluted in the given region.

But predicting future air pollution patterns is itself a major machine learning task.

PM2.5 refers to the tiny particles in the air that reduce visibility and cause the air to appear hazy. It is affected by meteorological and traffic factors and by the burning of fossil fuels; industrial sources such as power plant emissions also play a significant role in air pollution.

The required dataset

Temperature
Wind speed
Dew point
Pressure
PM2.5 concentration
Classified data samples (polluted or not)

Our system does two tasks:

1. Detect the level of PM2.5 at a given location
2. Predict the PM2.5 value for a particular date
2.1) Logistic regression to predict whether or not the air is polluted
2.2) Autoregression to predict a future value of PM2.5 based on previous PM2.5 readings
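For the autoregression step (2.2), the simplest version is a first-order model, x_t = a + b * x_(t-1), fitted by ordinary least squares on consecutive readings. The sketch below uses only the standard library; the order of 1 and the synthetic readings are assumptions for illustration, and a real model would likely use more lags plus the weather features listed above.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = a + b * x_(t-1); returns (a, b)."""
    if len(series) < 3:
        raise ValueError("need at least three readings")
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    var_prev = sum((x - mean_prev) ** 2 for x in prev)
    if var_prev == 0:
        raise ValueError("readings are constant; slope is undefined")
    slope = sum((x - mean_prev) * (y - mean_curr)
                for x, y in zip(prev, curr)) / var_prev
    intercept = mean_curr - slope * mean_prev
    return intercept, slope


def forecast(series, steps):
    """Roll the fitted AR(1) model forward from the last reading."""
    a, b = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = a + b * last
        predictions.append(last)
    return predictions


# Synthetic PM2.5 readings generated by x_t = 10 + 0.5 * x_(t-1).
readings = [100.0, 60.0, 40.0, 30.0, 25.0]
print(fit_ar1(readings))      # approximately (10.0, 0.5)
print(forecast(readings, 2))  # approximately [22.5, 21.25]
```

Forecasts further ahead feed each prediction back in as the next input, so errors compound; this is one reason long-range air quality prediction is a hard problem.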

Since our plan is quite extensive, I'd like to begin working on it as soon as possible. I'd welcome your input on this; primarily, should I start the project on Zooniverse, or should I start labelling manually?

Thanks, everyone.

jywarren · 2019-02-27T22:19:20Z

Hi! This is a lot of information - thanks for compiling it! I wanted to ask a few things first --

With such a complex system, perhaps we should do some diagramming to show what the parts of the system are, and what are the potential ways to fulfill each part -- we could start with a diagram template like the one linked here, that was used to generate the plots2 data model: https://github.com/publiclab/plots2/blob/master/doc/DATA_MODEL.md
I'm really interested in good integration with existing efforts -- what portions of systems like Terrapattern and others are re-usable, or could we at least remain compatible with? https://github.com/CreativeInquiry/terrapattern
For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. -- I'd even prefer OpenStreetMap, which Terrapattern uses, and is an open source data source which we could also encourage people to contribute to in order to improve the training! See how to query here: Landfill, mine/quarry map data via OpenStreetMap leaflet-environmental-layers#50 and also a lot about more data sources to draw from in https://github.com/publiclab/leaflet-environmental-layers/ !
For the PM air quality data, do you think perhaps it's possible that there is no visible sign of air quality issues in MapKnitter images? or if you're not using images to correlate, but just data, there may be other models to look to first.

I hope this helps!

jywarren · 2019-02-27T22:20:25Z

Oh, and also, starting a Zooniverse project would be GREAT! @Zengirl2 may be interested in this too.

NeuralMonk · 2019-03-06T11:02:49Z

Thanks @jywarren for the great input and for making things clearer and more interesting.

Yes, it is a little complex, and I will try to break things down in a simpler way. I have started working on this and will try to complete it as soon as possible.
For now we can do the semantic segmentation part, which can help the model predict tags like ROAD, BUILDING, WATER, TREES, VEGETATION, because there is freely available data, like
e.g. https://project.inria.fr/aerialimagelabeling/
and we can use OpenStreetMap: http://openstreetmapdata.com/
so we can start doing this.
Using open source is always fun.
Using images alone, we can only find out whether or not the image is hazy, but with the location of the image we can find the PM2.5 value for that particular location.
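
As a rough illustration of the location-based lookup, the sketch below picks the nearest monitoring station to an image's coordinates using the haversine (great-circle) distance. Everything here is an assumption for the sake of the example: the station list, its field names and the readings are made up, and fetching real readings from an air quality service is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return the station record closest to the given image location."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))


# Hypothetical station records; names and PM2.5 readings are invented for illustration.
stations = [
    {"name": "station_a", "lat": 20.0, "lon": 80.5, "pm25": 42.0},
    {"name": "station_b", "lat": 28.6, "lon": 77.2, "pm25": 150.0},
]

# Location of a geotagged image (hypothetical).
closest = nearest_station(19.9884, 80.5078, stations)
print(closest["name"], closest["pm25"])
```

The PM2.5 value attached to the closest station would then be used as the reading for the image, ideally together with a maximum distance beyond which no reading is assigned.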

NeuralMonk · 2019-03-06T11:50:21Z

Zooniverse sounds great! I guess you should create a team first and add me (and @Zengirl2 or anyone else who is interested too) and I could then flesh out the rest of the project.

Hope this sounds good?

jywarren · 2019-03-06T15:46:52Z

oh very cool, yes that sounds good! Can you email me with your email or Zooniverse username at jeff@publiclab.org?

Zengirl2 · 2019-03-06T21:05:53Z

@SKashyapD Hey there--I do have a strong interest in Zooniverse, but I'm still behind on a fan project I'm working on. So, you can include me, but I won't be able to do much right now.

NeuralMonk · 2019-03-06T21:13:20Z

This is the simplest way to show how things are going to work; each block has its own technical details. Please create a repository and I will explain every technical detail in it.

Thanks @jywarren for creating the Zooniverse project.
The Zooniverse project looks great. I have started working on it, but I need to know a few things first to make it better and clearer.
- What are we specifically looking for (core mission)?
- What labels are we going to use to create our database?
- Is there anything important you want to mention?

Should I start working on the semantic segmentation part?

thanks everyone

NeuralMonk · 2019-03-06T21:26:17Z

Thanks @Zengirl2 for showing interest. Any kind of contribution will be great.
@jywarren please add @Zengirl2 to our Zooniverse project.

Zengirl2 · 2019-03-06T21:42:56Z

@SKashyapD I originally had interest in using Zooniverse to go through possible pollution from hurricanes. They have started to do projects for hurricanes (although not with the pollution I would like). I was at the point of having conversations with two people from Zooniverse about learning to use their content system. I believe I may even have a video tutorial that they sent me.

NeuralMonk · 2019-03-09T18:26:22Z

I am really excited to complete the Zooniverse project and the semantic segmentation part. @jywarren please give me some input so that I can start working, and I will try to complete all this as soon as possible.
@Zengirl2 please send me that tutorial video, it will help me a lot.

Zengirl2 · 2019-03-11T19:54:47Z

@SKashyapD Here's the links for some helpful info about setting up projects on Zooniverse (this was based on a specific example of flood/hurricane I had been asking about).

Doc Explanation
https://docs.google.com/document/d/1W5y5Iq6WY5OpP6P4kcHrE6od0tGBFhO0huXvXHJJCzs/edit?usp=sharing

Youtube video
https://www.youtube.com/watch?v=_bcu5tJDjPY

NeuralMonk · 2019-03-14T20:02:11Z

Thanks @Zengirl2 for providing the resources.
@jywarren please let me know when you are finished. I am already working on some prerequisites that will help us in the future.

jywarren · 2019-03-15T18:50:31Z

I think @Zengirl2's idea for core mission is great -- identify specific types of pollution from aerial photos -- and we can start with whatever is a good initial training set.

I added @Zengirl2 to the zooniverse! Thank you!

jywarren · 2019-03-15T18:52:26Z

There are lots of Hurricane Harvey images linked to from posts on this page: https://publiclab.org/wiki/harvey#Questions -- i hope that helps!

NeuralMonk · 2019-03-30T19:34:52Z

@Zengirl2 to edit the project you can go through this link too.
What are the criteria to get selected as a Zooniverse project?
Which type of categorization are you talking about? Categorization of the dataset?

Zengirl2 · 2019-03-30T23:26:50Z

@jywarren and @SKashyapD - when I log into Zooniverse it is not showing that I'm connected to any projects. Jeff, I remember seeing where you said you were going to invite me, but I don't remember getting any email about it. Can you see what name you used to add me?

@SKashyapD what I was talking about as far as whether this is a Zooniverse project or private project is listed under lab policies.

NeuralMonk · 2019-04-01T18:39:46Z

@Zengirl2 please check your email; you may have received the invitation email, because your username is the same in the project.

NeuralMonk · 2019-04-01T20:21:30Z

As for categorization of the project, @jywarren may be able to say more about it.
Do we have enough volunteers for the classification task?

Zengirl2 · 2019-04-01T21:01:47Z

@SKashyapD Hey, just got the email today. Will look at the project tonight when I get home 🦄

Zengirl2 · 2019-04-02T20:09:39Z

@SKashyapD I had a chance to look at the project and it is coming along fine. I noticed that when I chose to mark an image, that it did not give me another image once I had completed. Was this because it is not yet live? Or have you not attached a file of images yet? Anyway, here's my comments:

If this is just a test, it is fine that it is not a full blown Zooniverse project. Just sending the link to the Public Lab community once this is live is good.
Usually a Zooniverse project only takes on marking an image for one or two things. We are asking more by having many types of pollution. I know just trying to identify oil sheen from an image is difficult, so we probably need to develop a tutorial. Also, a gas company flare--would that be considered pollution? These are some of the things a tutorial can make more understandable :). In fact, the original image you used as an example earlier before you sent the link for the project was great--perhaps that can be used for the tutorial.
We should probably make it more clear why we are trying to do this work, so maybe filling out the field guide section would be a good idea as well.

NeuralMonk · 2019-04-06T17:40:50Z

@Zengirl2 I have fixed that problem and it is working properly now. Please have a look at it again.

I will make a tutorial as soon as possible, and I will add more images too.
Can you point me to an example tutorial, or anything else that can help make the tutorial better?
I did some research on why we are doing this while writing my Summer of Code proposal, so can I add a few things @jywarren?
Can you provide me with your bio or something that can help me create the Team section @jywarren @Zengirl2? It will help make our project look good.

Thank you!

Zengirl2 · 2019-04-07T22:02:04Z

Hey @SKashyapD--your images are working correctly now 🎉

Great example of a similar project and tutorial (it has already completed but you can still view it)
Tutorial Details - I know some important things we were talking about identifying was sheen on water from oil spills, damaged infrastructure (like large oil tanks that get ripped open or toppled from hurricanes), flares (the flames from stacks from gas companies) and I'm wondering if we can identify tar on beaches? Maybe that counts as oil spill, too.
Drawing "Mining" - I was having difficulties using this--do you need to make more than two points? It said "2 of 0 required drawn" when I tried it.
Classification section - This seems to be a summary of the places identified by the symbols/drawings, but not sure where/how I'm supposed to input any information (like for instance if I knew there was a gas plant in a location).
Pretty Stuff - The hurricane project example I gave you earlier helps to show how to make a project attractive/needed. I'm thinking we may be able to get a photo for the front page that looks more like hurricane devastation. I believe we have images already on Public Lab's site that could be useful, so I'll try to find one. This also affects the message on the top of the project...maybe something like "We need your help recognizing pollution from aerial images so we can prepare for future disasters". Also, where you have the quote about "destroying oceans" maybe we can give more detail about how hurricanes and other disasters cause pollution of air, water and soil for living things in surrounding areas long after the initial event. Also, the ability to identify pollution from aerial images helps to hold companies accountable for preparation and remediation. Think Skytruth :)
My bio (you can use my pic from Github--let me know if you need it larger)- Leslie is a user and educator of open source hardware and volunteers with Public Lab to help others investigate their environmental concerns. She is currently working on a Master's of Environmental Studies with a focus on Conservation Tech at University of Pennsylvania.

skilfullycurled · 2019-04-10T17:41:34Z

I saw machine learning and I wanted to chime in. Of the original list that @SidharthBansal compiled from the different sources of requests, I wanted to add that we had been discussing the tag recommendation tangentially on the website (the code part of that conversation has moved to GitHub). At any rate, to @jywarren's comment above regarding not 'reinventing the wheel', there are some recommendation engines in Ruby that I recommended (har har) in my comment here.

NeuralMonk · 2019-04-17T19:16:33Z

Sorry for the delay @Zengirl2

I have started working on the tutorial, and thanks for the resources.
For drawing mining I selected polygon because it will help us map the mining area better (you can draw any required shape).
In the classification section I will add an extra section for notes like this.
I will update a few sections of the project to make it more appealing.
Thanks for the bio @Zengirl2.

Can you please point me to a few resources for more images @jywarren @Zengirl2?

Thanks everyone!

NeuralMonk · 2019-04-17T19:27:48Z

Thanks @skilfullycurled for taking the initiative. You can check this out, it may help: recommendify

jywarren · 2019-04-18T00:12:08Z

Hi @SKashyapD -- can you help me find your SoC proposal? Did it get posted?

NeuralMonk · 2019-04-18T01:06:38Z

Hey @jywarren,
Yes, it got posted on the Public Lab website, and you have also reviewed it: SoC proposal
Is there any trouble or something?

Thank you!

skilfullycurled · 2019-04-18T01:48:19Z

No, no, thank you for taking the initiative on an ML thread, @SKashyapD! I'm not sure I'll be able to take that much more initiative on the implementation of a tag recommendation system since I don't have lots of experience in programming with Ruby, however, I really want to second your idea of having a server for this.

One thing I would like to do is to piggy back on your initiative and eventually start a conversation about how to grow a community around ML and data science now that the stats downloads page is coming along. More on that later, I have to actually get back to my own data science project!

NeuralMonk · 2019-04-30T18:03:04Z

@skilfullycurled it will be great

grvsachdeva · 2019-07-01T06:39:42Z

Hi @jywarren @SKashyapD @Zengirl2, can we close this issue, or does anyone want to update it? Thanks!

NeuralMonk · 2019-07-01T08:10:07Z

Hello everyone. Not now; I will start working on this after the summer break. Thank you.

grvsachdeva · 2019-07-01T08:27:22Z

Cool!

NeuralMonk · 2019-08-27T14:22:08Z

@skilfullycurled @jywarren what other small projects can we start to grow the community around data science?

thanks!

budema6 · 2019-08-27T16:18:15Z

I am a newbie in the ML area.
I was going through the problem statement, as it was interesting.
Thanks for posting in detail for better understanding.
Thanks @skilfullycurled @jywarren @SKashyapD @Zengirl2

skilfullycurled · 2019-08-27T16:39:52Z

@SKashyapD, thank you for keeping the conversation alive! I'll rejoin as soon as I can. I just started school in a new program so although PL GitHub conversations are one of my favorite ways to feel like I'm doing programming work but actually just avoiding it (not a joke), I should probably finish my work first. Still, I was too happy about the conversation to not join in. : )

One project comes immediately to mind:

SPAM: There are currently two problems. The first is the spam we currently get from sign-ups and postings, and the second, spam accounts that were made before there was a more robust moderation system. There's a period of time (can't recall what it is) where there are literally ~300,000 accounts. I think Public Lab is awesome, but that seems a tad inflated. ; ) Additionally, I believe that when users are moderated as spam, they are not removed from the database.

Spam isn't the most exciting task but it'd have a real impact. A) Moderating spam is a huge resource drain. B) Unless we're able to filter out spam accounts, there really can't be good data science because the data won't be good.

This project has two quasi-FTO's. They aren't FTO's according to the actual definition, but the problem contains some "hello worlds" of data science. The first would be good for someone who is comfortable with Ruby (I don't think you have to be awesome at it, I hardly knew Python in my first data science class) but wants to get started in data science. And the second is for someone who is comfortable with the fundamental exploratory data analysis tasks and wants to try a simple ML exercise.

I've been collecting a data set of spam.

Project 1: Exploratory data analysis. I started #5450 to discuss non-ML ways to detect spam, and I came up with some guidelines simply by exploring the data. These guidelines could become more robust with more exploratory analysis of a larger dataset. This would be a good way to get familiar with the SciRuby library collection and the fundamentals of data science (using Ruby notebooks, dataframes, selecting data, aggregating results, plotting etc.) As I said, I've been collecting a dataset of spam, but we also need a way to identify past spam because I'm sure the markers have changed over time.

Project 2: Creating a spam/ham classifier. This is why I started the collection actually, so that we'd have enough for the spam part. The harder thing is collecting data for people who are in the ham category. So that's sort of in the Project 1 category, but after we have enough of both, then there are plenty of tutorials for someone to have a nice learning experience.
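
For anyone curious what Project 2 might look like at its simplest, here is a minimal sketch of a multinomial Naive Bayes spam/ham classifier with add-one (Laplace) smoothing. It is written in plain Python only to keep it self-contained; the same idea carries over to Ruby. The class name, the whitespace tokenizer and the example posts are all made up for illustration and are not taken from the actual spam dataset.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        # Log prior; assumes at least one training example per label.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocabulary)
        for token in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Invented training examples, purely for illustration.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap loans click now", "spam")
spam_filter.train("balloon mapping kit question", "ham")
spam_filter.train("spectrometer calibration question", "ham")

print(spam_filter.classify("cheap pills"))
print(spam_filter.classify("calibration question"))
```

With a real dataset, the interesting work is in the parts this sketch skips: parsing the posts, building a representative ham set, and holding out data to measure false positives before anything gets auto-moderated.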

skilfullycurled · 2019-08-27T16:42:37Z

My pleasure @budema6, I'm excited about developing a community so it's really thanks to you for your interest!

skilfullycurled · 2019-12-13T02:06:50Z

Update: I now have enough spam if ever anyone wants to take on training a spam/ham classifier for the site. If I recall, I've seen a number of Jupyter notebooks that do this in Ruby. Of course, the data has to be parsed, and we need a ham dataset as well. In any event...

Uzay-G · 2020-01-11T21:55:39Z

Hey! This topic really interests me and I have made some Natural Language Processing projects with Python and the spaCy library. I'd love to help out and try applying NLP to spam detection. I'm no expert, but I think I could help 😄

skilfullycurled · 2020-01-12T00:23:49Z

@Uzay-G, thanks for reviving this thread. I'm not sure when/how but I'm thinking it might be a good idea to try to have a call. It just seems like there's enough interest in general, and it might be good to just meet each other and see if we can organize ourselves. I'd sort of like to see this become a tool topic just like balloon mapping or spectrometry. And, perhaps at some point even have a separate PL repo for projects the same way mapknitter does.

Anyone at @publiclab/connectors, how are we handling developer open calls these days?

stale · 2020-10-07T06:11:13Z

Hi 😄, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label 🎉 . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! 💯 Thank you for your contributions 🙌 🎈.

jywarren · 2020-10-08T15:09:02Z

Sorry about the stalebot message here, it was a mistake! 😅 Can't seem to delete due to a GitHub API issue... strange. Carry on!

SidharthBansal changed the title ~~Machine Learning based spam detection~~ Machine Learning based projects Jan 19, 2019

grvsachdeva added discussion planning Planning issues! labels Jan 20, 2019

jywarren mentioned this issue Jan 27, 2019

Port "capture" interface into this library from main spectral-workbench project publiclab/spectral-workbench.js#56

Closed

skilfullycurled mentioned this issue Jan 12, 2020

Discussion of automated spam detection techniques (formerly spam dashboard discussion) #2377

Open

stale bot added the stale label Oct 7, 2020

jywarren removed the stale label Oct 8, 2020
