Machine Learning based projects #4660

Open
NeuralMonk opened this issue Jan 18, 2019 · 66 comments
Labels
discussion planning Planning issues!

Comments

@NeuralMonk
Contributor

Currently, our spam moderation is completely manual, but I think that, instead of manually reviewing similar content/posts, we can use machine learning algorithms to ease the task.

@SidharthBansal changed the title from "Machine Learning based spam detection" to "Machine Learning based projects" on Jan 19, 2019
@SidharthBansal
Member

Great idea.
@jywarren I want to add a couple more ideas. I know they are not Core Mission Driven Projects, and we must focus on those before addressing these less important issues. But just to brainstorm a little:

  • Content-based tag recommendation system (suggested by Jeff)
  • Anomalous spam detection system (as suggested by @SKashyapD)
  • Recommendation system for posts (@Saurabh19126848_twitter's suggestion on Gitter chat)
  • Sentiment analysis (@Saurabh19126848_twitter's suggestion on Gitter chat)
  • Tag suggestions by natural language processing on nodes (suggested by me)

I am highly in favour of automating our services. The main problem is the absence of ML libraries for Rails. Most of the above can be built on algorithms such as Isolation Forest, Naive Bayes, BBN, CNN, ANN, etc., which are widely implemented in Python, not in Rails. Writing libraries from scratch does not make sense at all.
So we also need to think about these considerations.

@milaaraujo
Collaborator

milaaraujo commented Jan 19, 2019

I would love to participate in any of these projects! I worked with recommendation systems and sentiment analysis during my degree. I don't know of any libraries for Rails, though. I've only worked with libraries in Python and R before.

@SidharthBansal
Member

SidharthBansal commented Jan 19, 2019 via email

@NeuralMonk
Contributor Author

NeuralMonk commented Jan 19, 2019 via email

@ryzokuken
Member

Hi everyone, just dropping in here to say that making a Flask server for the data science stuff is the correct approach here. Essentially, you would need a separate server crunching the numbers and acting as an interface to the models. This Flask server would need to run in a separate container, and I volunteer to make the appropriate changes to the docker-compose config to make sure this floats. Looking forward to assisting people in implementing the cool features above in the website.

@grvsachdeva added the discussion and planning (Planning issues!) labels on Jan 20, 2019
@NeuralMonk
Contributor Author

NeuralMonk commented Jan 25, 2019 via email

@SidharthBansal
Member

SidharthBansal commented Jan 25, 2019 via email

@NeuralMonk
Contributor Author

NeuralMonk commented Feb 5, 2019 via email

@jywarren
Member

jywarren commented Feb 6, 2019

Hi, thanks to everyone for your input here! I think there are some potential use cases for machine learning across the Public Lab ecosystem! But perhaps we need to do a bit more in-detail brainstorming on individual examples. For example, I'm not sure that running a containerized flask server as part of the plots2 codebase makes sense because it dramatically expands the setup complexity of the project (we had an issue with this in a previous project to run a Solr container), but perhaps it could make sense to develop in a separate repository?

Could such a separate server for data analysis access data via the API?

Of the brainstormed applications, I'm hesitant on the spam one -- I like the basic premise, but to me, it seems more sustainable and less 'reinvent the wheel' to look at an existing library or service for spam identification, like Akismet or something. I'm sure others have worked on this problem, and I am less sure we could provide something unique that would be competitive.

On the other hand, I'd love to think about places in the PL ecosystem where machine learning would present a really unique benefit that supports our overall mission.

Would Spectral Workbench be one of those places?

  1. I note a mention of neural networks for trying to solve an issue here: Port "capture" interface into this library from main spectral-workbench project spectral-workbench.js#56 (comment) (although seems that should be broken into its own issue)
  2. @Lucaszw emailed me some time back with the idea of using machine learning to apply appropriate tags to spectra in SpectralWorkbench. That also seems interesting!

On MapKnitter, would it be plausible to scan images and try to identify features and tag accordingly?

The Vision API at Google Cloud can do some pretty interesting things there: https://cloud.google.com/vision/

Although in this test it didn't seem to find anything in this aerial photo except that it was an aerial photo 😄 :

image

Perhaps one approach here might be to begin a Zooniverse project using MapKnitter data: https://www.zooniverse.org/lab

Then that could be used as training data to develop a machine learning approach to identifying, say, areas of high risk of spills, pollution, etc.

Terrapattern tried doing something kind of like this: https://qz.com/764746/terrapattern-open-source-satellite-photo-search-tool/

http://www.terrapattern.com/about

image

That could be a really interesting approach, and I like the idea of using the MapKnitter image set to help an ML approach get better at identifying pollution.

Note that Terrapattern also uses OpenStreetMap tags to train its model. Perhaps we could correlate MapKnitter images with any OSM tags that overlap with the images shown, although there might not be too many.

Anyhow, these are some ideas that get a bit at the environmental mission of Public Lab, and might make for an interesting set of possible projects that wouldn't necessarily live IN the plots2 codebase, but could be really powerful tools for our community.

@jywarren
Member

This is a really great example of using machine learning to identify environmental issues: https://skytruth.org/2019/02/using-machine-learning-to-map-the-footprint-of-fracking-in-central-appalachia/

It also gets at some of the challenges, and discusses how to use existing manually categorized datasets as a training set, OR to use existing databases to correlate with imagery to train a model. Great work, @SkyTruth!

@NeuralMonk
Contributor Author

Hey everyone, and thanks @jywarren for your wonderful input; your proposed ideas are very cool and interesting.
I have already started reading and researching them. It will take me about a week to find out how things are supposed to be done.
Thanks, everyone.

@NeuralMonk
Contributor Author

Hey everyone,
I have done my research on the given ideas and devised the following plan.
@jywarren, it is definitely a good idea to create a new repository for machine learning based projects, with a separate data analysis server that accesses data via the API.

We can host a Flask server that works this way:

  1. It takes a screenshot of the image,
  2. Feeds it to the input of the model,
  3. Takes the output of the model and shows it on the web page.
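
To make the three steps concrete, here is a minimal sketch of that request flow. Flask applications are WSGI applications, so this sketch uses only Python's standard-library WSGI interface to stay dependency-free; a Flask route would follow the same three steps. `predict_tags` is a hypothetical placeholder for the trained model, and the JSON response shape is an assumption, not something decided in this thread.

```python
import json


def predict_tags(image_bytes):
    """Hypothetical stand-in for the trained model (not a real model)."""
    # A real implementation would decode the image and run inference here.
    return {"size_bytes": len(image_bytes), "tags": ["placeholder"]}


def app(environ, start_response):
    """WSGI entry point: POST raw image bytes, get JSON predictions back."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed",
                       [("Content-Type", "text/plain")])
        return [b"POST an image to this endpoint\n"]

    # Step 1: receive the image.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    image_bytes = environ["wsgi.input"].read(length)

    # Step 2: feed it to the model.
    result = predict_tags(image_bytes)

    # Step 3: return the model output so the web page can display it.
    body = json.dumps(result).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

For local experiments, `wsgiref.simple_server.make_server("", 8000, app)` from the standard library can serve this callable.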

Goal: Automatically label aerial imagery

  1. Tagging,
  2. Semantic segmentation.
    screenshot from 2019-02-20 02-33-39

Implementing the machine learning model in simple steps:

  1. Collect pairs of images and labels,
  2. Write a program that predicts labels for given images (the model),
  3. Let the computer automatically tune parameters to mimic the examples (learning).

The lengthy task: collecting pairs of aerial images and labels

One important yet rarely discussed aspect of using machine learning for aerial image interpretation is the source of the data.
Since labelling images is a very time-consuming process, the datasets have been small in both aerial image applications and general image labelling work. Hence, obtaining good sources of accurately labelled data is important for both evaluating existing approaches and training systems that are likely to work under varying conditions.
In some domains, hand-labelling data in order to train a classifier is not necessary because the label information is often readily available. For example, in the case of road detection (Semantic segmentation), the locations of existing roads are typically known because they are useful for navigation and not just as target labels in a machine learning task.
The abundance of accurately labelled data for road detection makes it a very good candidate for evaluating existing aerial image interpretation systems as well as the application of machine learning techniques.

For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. This type of data can act as a source of noisy labels, which are correct with very high probability when they indicate the presence of an object and with lower, but still substantially high, probability when they indicate the absence of an object. Training a classifier on large amounts of this type of noisy data with a robust loss function can potentially produce a much better detector than by using a much smaller set of accurate labels. At present, there seem to be no applications of robust estimators to aerial image data with noisy labels.
For object classes such as cars, or for areas where Google Maps has neither accurate nor complete map information, hand-labelling the data seems to be the only option, or we can use crowdsourcing tools like Zooniverse (https://www.zooniverse.org/) to help us build the dataset.

In a classification task, small translations or rotations can be applied to the input images, but to apply the same idea to image labelling one must be able to realistically transform both the image and the labels. On a road detection task, applying rotations to each training case before it is processed has been shown to help prevent overfitting.
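
As a small illustration of transforming the image and its labels together, here is a sketch using plain nested lists and 90-degree rotations only. The pixel values and the "road" mask are toy data invented for the example; a real pipeline would use an image library and finer rotations.

```python
def rotate90(grid):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_pair(image, mask):
    """Return the four 90-degree rotations of an (image, mask) pair.

    The same rotation is applied to both grids, so every pixel keeps
    the label it had before the transformation.
    """
    pairs = []
    for _ in range(4):
        pairs.append((image, mask))
        image, mask = rotate90(image), rotate90(mask)
    return pairs


# Toy 2x2 example: the pixel with value 20 is the only "road" pixel.
image = [[10, 20],
         [30, 40]]
mask = [[0, 1],
        [0, 0]]

pairs = augment_pair(image, mask)
for img, msk in pairs:
    print(img, msk)
```

In every rotated pair, the mask's `1` sits at the same position as the pixel value `20`, which is the property that matters for segmentation labels.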

So we need to start building our own dataset for better results.
We can do it manually, and I'd like to volunteer myself to do so using a Python script.
Alternatively, platforms like Zooniverse can be used to create the dataset:
https://help.zooniverse.org/getting-started/

The most important part is the data. A larger and more accurately labelled dataset will lead to better results.
The primary obstacle is class imbalance in the dataset, which makes detecting rare labels a difficult task.

Tagging:

This is almost the same task as the one I suggested earlier for text; the difference is that the dataset now consists of images, so we need to use a CNN. There is a great blog post by Adit on how CNNs actually work for image classification: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
screenshot from 2019-02-20 02-33-09

The machine learning model

Residual Network (ResNet), which is a major breakthrough in CNNs:

  1. It allows training models with hundreds of layers for greater accuracy.
  2. Layers compute the residual (delta) between input and output.

Why does it work?

  1. Each layer has less work to do (no copying).
  2. Gradients flow more easily thanks to skip connections.
    screenshot from 2019-02-20 17-59-46
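
One way to write down the two "why does it work" points (a sketch in the usual ResNet notation, which is not taken from this thread): a residual block with input $x$ and learned layers $F$ with weights $W$ outputs

$$ y = x + F(x; W) $$

so the layers only have to learn the residual $F(x; W) = y - x$ rather than copy $x$ forward. Differentiating a loss $\mathcal{L}$ through the block gives

$$ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right) $$

and the identity term $I$ is what lets the gradient pass through the skip connection even when $\partial F / \partial x$ is small.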

To understand this more deeply, you can go through this great, intuitive blog post: https://wiseodd.github.io/techblog/2016/10/13/residual-net/

Our approach to making our model better

1. Instead of softmax, use the sigmoid activation function

2. Optimize the tag threshold to maximize the F2 score

Often we try to find the optimal threshold for the F2 score by trial and error. Instead, a brute-force search for the best threshold on a local validation set can net really good results on the leaderboard (LB) without much overfitting to the local score. Basically, you try every possible threshold on a local validation set, take the best-performing one, and apply it to the test set.
We also know that the best threshold is vastly different for each class, which means we can get a big improvement by setting a different threshold for each class.
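
Here is a minimal sketch of both points in pure Python: sigmoid scores each tag independently (so an image can carry several tags), and a brute-force search picks one threshold per class that maximizes F2 on a validation set. The logits and labels are made-up toy values, and the 0.01 to 0.99 threshold grid is an assumption.

```python
import math


def sigmoid(z):
    """Independent per-tag probability; unlike softmax, tags do not compete."""
    return 1.0 / (1.0 + math.exp(-z))


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score; beta=2 weights recall higher than precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def best_threshold(y_true, probs, beta=2.0):
    """Brute-force search over a fixed grid; ties keep the lowest threshold."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: f_beta(y_true, [p >= t for p in probs], beta))


def per_class_thresholds(labels, probs):
    """One threshold per class, each tuned on that class's column."""
    n_classes = len(labels[0])
    return [best_threshold([row[c] for row in labels],
                           [row[c] for row in probs])
            for c in range(n_classes)]


# Toy validation set: 4 images, 2 tags (values invented for illustration).
logits = [[2.0, -1.0], [0.5, -2.0], [-0.5, -0.8], [-2.0, -3.0]]
labels = [[1, 1], [1, 0], [0, 1], [0, 0]]
probs = [[sigmoid(z) for z in row] for row in logits]

thresholds = per_class_thresholds(labels, probs)
print(thresholds)
```

In this toy data the second tag never reaches a probability of 0.5, so a single fixed 0.5 cut-off would miss it entirely, while its own tuned threshold recovers it.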

3. Use a pretrained model

This is a very common trick in ML, also known as transfer learning: instead of training your model from random initialization, you initialize it with the parameters of a similar model that was already trained on a different dataset, which is basically a great head start.

Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on the other problem as a starting point.

For example, suppose you want to build a self-learning car. You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify objects in those pictures.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to re-invent the wheel.

4. Augment the labelled dataset using lossless image transformations.

The more data the better, so we can, for example, rotate our images by 90 degrees left and right, which increases the size of our dataset.

5. Tune the learning rate (LR) manually.
    It is very important to find which LR gives the best performance.

6. Ensemble three model architectures (optional):

  • ResNet 5x
  • Inception 5x
  • DenseNet 5x

Or we can also do well with a single "ConvNet101";
it depends on the resources we have.
Ensembling is a good ML approach, but it gives only a small boost in F2 score and takes about 15 times more computation than ConvNet101.
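
For reference, the simplest form of ensembling is averaging the per-tag probabilities of the individual models. This is a sketch of that idea only; the three probability lists are invented numbers, and simple unweighted averaging is an assumption about how the ensemble would be combined.

```python
def ensemble_average(model_probs):
    """Average per-tag probabilities across models.

    model_probs holds one list of tag probabilities per model,
    all in the same tag order.
    """
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


# Made-up outputs of three models for one image and three tags.
resnet = [0.9, 0.2, 0.6]
inception = [0.7, 0.4, 0.3]
densenet = [0.8, 0.3, 0.6]

averaged = ensemble_average([resnet, inception, densenet])
print(averaged)
```

Every model in the ensemble has to run on every image, which is where the extra computation comes from.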

Semantic segmentation:

Basically, "semantic segmentation" attempts to partition the image into semantically meaningful parts and to classify each part into one of the pre-determined classes. You can also achieve the same goal by classifying each pixel (rather than the entire image/segment). In that case, you are doing pixel-wise classification, which leads to the same end result by a slightly different path.
To understand it more deeply, check this very insightful blog post:
https://www.jeremyjordan.me/semantic-segmentation/
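
To illustrate the pixel-wise view: if a network outputs one score per class for every pixel, the segmentation map is just the highest-scoring class at each pixel. The class names and scores below are illustrative assumptions, not outputs of any real model.

```python
CLASSES = ["road", "building", "water"]  # illustrative class names


def label_map(scores):
    """Turn per-pixel class scores into a per-pixel class index.

    scores[row][col] is a list with one score per class.
    """
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in scores]


# Made-up scores for a 2x2 image.
scores = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
]

labels = label_map(scores)
names = [[CLASSES[index] for index in row] for row in labels]
print(names)
```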

  1. ResNet-based FCN architecture

  2. Fine-tune a pre-trained model

  3. Use IR-R-G images as input
    screenshot from 2019-02-21 01-29-51

  4. Make predictions using a sliding window, because the network can only handle 256×256 inputs

  5. Ensemble by averaging five models
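
The sliding-window step can be sketched as follows: cut a large image into 256×256 windows that the network can handle, with the last window in each direction placed flush against the image edge. The stride of 128 (50% overlap) is an assumption, and the sketch only computes window coordinates; running the model and merging overlapping predictions (for example by averaging) is left out.

```python
def tile_origins(length, tile=256, stride=128):
    """Start offsets of windows covering [0, length) along one axis.

    Assumes stride <= tile. If length < tile, the single window is
    larger than the image and the caller would need to pad.
    """
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last window is flush with the edge
    return origins


def sliding_windows(height, width, tile=256, stride=128):
    """Return (top, left, bottom, right) boxes covering the whole image."""
    return [(top, left, top + tile, left + tile)
            for top in tile_origins(height, tile, stride)
            for left in tile_origins(width, tile, stride)]


windows = sliding_windows(600, 512)
print(len(windows), windows[0], windows[-1])
```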

Other ideas for future work

1. Detection of oil spills

Detecting oil spills accurately using a CNN is a very tough task, because some natural phenomena look similar from space, and a small sample size does not help. We need SAR images to detect oil spills correctly, because in SAR images an oil spill appears as a dark formation, which can be detected easily. The following could prove useful:

  1. Fully convolutional network (FCN)
  2. FCN-GoogLeNet
  3. FCN-ResNets
  4. Deep neural autoencoder

2. Detection and mapping of plastic

We could detect plastic with our trained model using object detection. While labelling the data, we need to make specific labels for plastic and no-plastic, so that our CNN can use thousands of labelled examples of plastic pieces and finally be able to tell what is a piece of plastic and what is not. We could also detect different types of plastic, like rope, toys, etc.

3. Air pollution

When somebody uploads a geotagged image to MapKnitter, we can look up the PM2.5 level and the air quality for that location using the following link: https://aqicn.org/map/india/#@g/19.9884/80.5078/5z
So we can classify whether the air is polluted or not in the given region.

But predicting future air pollution patterns is itself a major machine learning task.

PM2.5 refers to the tiny particles in the air that reduce visibility and cause the air to appear hazy. It is affected by meteorological and traffic factors and by the burning of fossil fuels; industrial parameters such as power plant emissions also play a significant role in air pollution.

The required dataset

  1. Temperature
  2. Wind speed
  3. Dew point
  4. Pressure
  5. PM2.5 concentration
  6. Classified data samples (polluted or not)

Our system does two tasks:

  1. Detect the level of PM2.5 at a given location
  2. Predict the PM2.5 value for a particular date
    2.1) Logistic regression to predict whether the air is polluted or not
    2.2) Autoregression to predict a future PM2.5 value based on previous PM2.5 readings
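
A minimal sketch of the autoregression idea, assuming the simplest case: an order-1 model `x[t] = a + b * x[t-1]` fitted by ordinary least squares. The readings are made-up numbers, not measurements, and a real model would likely use more lags plus the weather variables listed above.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    previous, current = series[:-1], series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_cur = sum(current) / n
    var_prev = sum((p - mean_prev) ** 2 for p in previous)
    cov = sum((p - mean_prev) * (c - mean_cur)
              for p, c in zip(previous, current))
    b = cov / var_prev
    a = mean_cur - b * mean_prev
    return a, b


def forecast(series, steps, a, b):
    """Roll the fitted model forward from the last observed reading."""
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = a + b * last
        predictions.append(last)
    return predictions


# Made-up daily PM2.5 readings (not real data).
readings = [100.0, 90.0, 82.0, 75.6, 70.48]
a, b = fit_ar1(readings)
print(a, b, forecast(readings, 2, a, b))
```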

Since our plan is quite extensive, I'd like to begin working on it as soon as possible. I'd like to invite your input on it; primarily, should I start the project on Zooniverse, or should I start labelling manually?

Thanks, everyone.

@jywarren
Member

Hi! This is a lot of information - thanks for compiling it! I wanted to ask a few things first --

  1. With such a complex system, perhaps we should do some diagramming to show what the parts of the system are, and what are the potential ways to fulfill each part -- we could start with a diagram template like the one linked here, that was used to generate the plots2 data model: https://github.com/publiclab/plots2/blob/master/doc/DATA_MODEL.md
  2. I'm really interested in good integration with existing efforts -- what portions of systems like Terrapattern and others are re-usable, or could we at least remain compatible with? https://github.com/CreativeInquiry/terrapattern
  3. "For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city." -- I'd even prefer OpenStreetMap, which Terrapattern uses, and is an open source data source which we could also encourage people to contribute to in order to improve the training! See how to query here: Landfill, mine/quarry map data via OpenStreetMap leaflet-environmental-layers#50 and also a lot about more data sources to draw from in https://github.com/publiclab/leaflet-environmental-layers/ !
  4. For the PM air quality data, do you think perhaps it's possible that there is no visible sign of air quality issues in MapKnitter images? or if you're not using images to correlate, but just data, there may be other models to look to first.

I hope this helps!

@jywarren
Member

Oh, and also, starting a Zooniverse project would be GREAT! @Zengirl2 may be interested in this too.

@NeuralMonk
Contributor Author

Thanks @jywarren for the great input and for making things clearer and more interesting.

  1. Yes, it is a little complex, and I will try to break things down in a simpler way. I have started working on this and will try to complete it as soon as possible.
  2. For now we can do the semantic segmentation part, which can help the model predict tags like ROAD, BUILDING, WATER, TREES, VEGETATION, because there is freely available data,
    e.g. https://project.inria.fr/aerialimagelabeling/
    and we can use OpenStreetMap: http://openstreetmapdata.com/
    So we can start doing this.
  3. Using open source is always fun.
  4. Using images alone, we can only find out whether or not the image is hazy, but with the location of the image we can look up the PM2.5 value at that particular location.

@NeuralMonk
Contributor Author

Zooniverse sounds great! I guess you should create a team first and add me (and @Zengirl2 or anyone else who is interested too) and I could then flesh out the rest of the project.

Hope this sounds good?

@jywarren
Member

jywarren commented Mar 6, 2019 via email

@Zengirl2

Zengirl2 commented Mar 6, 2019

@SKashyapD Hey there--I do have a strong interest in Zooniverse, but I'm still behind on a fan project I'm working on. So, you can include me, but I won't be able to do much right now.

@NeuralMonk
Contributor Author

untitled diagram
This is the simplest way to show how things are going to work; each block has its own technical details. Please create a repository and I will explain every technical detail in it.

Thanks @jywarren for creating the Zooniverse project.
The Zooniverse project looks great. I have started working on it, but I need to know a few things first to make it better and clearer:
- What are we specifically looking for (core mission)?
- What labels are we going to use to create our database?
- Is there anything important you want to mention?

Should I start working on the semantic segmentation part?

Thanks, everyone.

@NeuralMonk
Contributor Author

Thanks @Zengirl2 for showing interest. Any kind of contribution will be great.
@jywarren please add @Zengirl2 to our Zooniverse project.

@Zengirl2

Zengirl2 commented Mar 6, 2019

@SKashyapD I originally had interest in using Zooniverse to go through possible pollution from hurricanes. They have started to do projects for hurricanes (although not with the pollution I would like). I was at the point of having conversations with two people from Zooniverse about learning to use their content system. I believe I may even have a video tutorial that they sent me.

@NeuralMonk
Contributor Author

I am really excited to complete the Zooniverse project and the semantic segmentation part. @jywarren, please give me some input so that I can start working, and I will try to complete all this as soon as possible.
@Zengirl2, please share that tutorial video; it will help me a lot.

@Zengirl2

Zengirl2 commented Mar 11, 2019

@SKashyapD Here's the links for some helpful info about setting up projects on Zooniverse (this was based on a specific example of flood/hurricane I had been asking about).

Doc Explanation
https://docs.google.com/document/d/1W5y5Iq6WY5OpP6P4kcHrE6od0tGBFhO0huXvXHJJCzs/edit?usp=sharing

Youtube video
https://www.youtube.com/watch?v=_bcu5tJDjPY

@NeuralMonk
Contributor Author

Thanks @Zengirl2 for providing the resources.
@jywarren, please let me know when you are finished. I am already working on some prerequisites that will help us in the future.

@jywarren
Member

I think @Zengirl2's idea for core mission is great -- identify specific types of pollution from aerial photos -- and we can start with whatever is a good initial training set.

I added @Zengirl2 to the zooniverse! Thank you!

@jywarren
Member

There are lots of Hurricane Harvey images linked to from posts on this page: https://publiclab.org/wiki/harvey#Questions -- i hope that helps!

@NeuralMonk
Contributor Author

@Zengirl2, to edit the project you can go through this link too.
What are the criteria to get selected as a Zooniverse project?
Which type of categorization are you talking about? Categorization of the dataset?

@Zengirl2

@jywarren and @SKashyapD - when I log into Zooniverse it is not showing that I'm connected to any projects. Jeff, I remember seeing where you said you were going to invite me, but I don't remember getting any email about it. Can you see what name you used to add me?

@SKashyapD what I was talking about as far as whether this is a Zooniverse project or private project is listed under lab policies.

@NeuralMonk
Contributor Author

Please check your email; you may have received the invitation email, because your username is the same in the project, @Zengirl2.

@NeuralMonk
Contributor Author

As for the categorization of the project, @jywarren may be better able to say.
Do we have enough volunteers for the classification task?

@Zengirl2

Zengirl2 commented Apr 1, 2019

@SKashyapD Hey, just got the email today. Will look at the project tonight when I get home 🦄

@Zengirl2

Zengirl2 commented Apr 2, 2019

@SKashyapD I had a chance to look at the project and it is coming along fine. I noticed that when I chose to mark an image, that it did not give me another image once I had completed. Was this because it is not yet live? Or have you not attached a file of images yet? Anyway, here's my comments:

  • If this is just a test, it is fine that it is not a full blown Zooniverse project. Just sending the link to the Public Lab community once this is live is good.

  • Usually a Zooniverse project only takes on marking an image for one or two things. We are asking more by having many types of pollution. I know just trying to identify oil sheen from an image is difficult, so we probably need to develop a tutorial. Also, a gas company flare--would that be considered pollution? These are some of the things a tutorial can make more understandable :). In fact, the original image you used as an example earlier before you sent the link for the project was great--perhaps that can be used for the tutorial.

  • We should probably make it more clear why we are trying to do this work, so maybe filling out the field guide section would be a good idea as well.

@NeuralMonk
Contributor Author

@Zengirl2 I have fixed that problem and it is now working properly. Please take a look at it again.

  1. I will make a tutorial as soon as possible, and I will add more images too.
    Can you point me to an example tutorial, or anything else that can help make the tutorial better?
  2. I did some research on why we are doing this while writing my Summer of Code proposal, so can I add a few things, @jywarren?
  3. Can you provide me with your bio or something similar to help me create the Team section, @jywarren @Zengirl2? It will help make our project look good.

Thank you!

@Zengirl2

Zengirl2 commented Apr 7, 2019

Hey @SKashyapD--your images are working correctly now 🎉

  1. Great example of a similar project and tutorial (it has already completed but you can still view it)

  2. Tutorial Details - I know some important things we were talking about identifying was sheen on water from oil spills, damaged infrastructure (like large oil tanks that get ripped open or toppled from hurricanes), flares (the flames from stacks from gas companies) and I'm wondering if we can identify tar on beaches? Maybe that counts as oil spill, too.

  3. Drawing "Mining" - I was having difficulties using this--do you need to make more than two points? It said "2 of 0 required drawn" when I tried it.

  4. Classification section - This seems to be a summary of the places identified by the symbols/drawings, but not sure where/how I'm supposed to input any information (like for instance if I knew there was a gas plant in a location).

  5. Pretty Stuff - The hurricane project example I gave you earlier helps to show how to make a project attractive/needed. I'm thinking we may be able to get a photo for the front page that looks more like hurricane devastation. I believe we have images already on Public Lab's site that could be useful, so I'll try to find one. This also affects the message on the top of the project...maybe something like "We need your help recognizing pollution from aerial images so we can prepare for future disasters". Also, where you have the quote about "destroying oceans" maybe we can give more detail about how hurricanes and other disasters cause pollution of air, water and soil for living things in surrounding areas long after the initial event. Also, the ability to identify pollution from aerial images helps to hold companies accountable for preparation and remediation. Think Skytruth :)

  6. My bio (you can use my pic from Github--let me know if you need it larger)- Leslie is a user and educator of open source hardware and volunteers with Public Lab to help others investigate their environmental concerns. She is currently working on a Master's of Environmental Studies with a focus on Conservation Tech at University of Pennsylvania.

@skilfullycurled
Contributor

I saw machine learning and I wanted to chime in. Of the original list that @SidharthBansal compiled from the different sources of requests, I wanted to add that we had been discussing the tag recommendation tangentially on the website (the code part of that conversation has moved to GitHub). At any rate, to @jywarren's comment above regarding not 'reinventing the wheel', there are some recommendation engines in Ruby that I recommended (har har) in my comment here.

@NeuralMonk
Contributor Author

Sorry for the delay, @Zengirl2.

  1. I have started working on the tutorial, and thanks for the resources.
  2. For drawing mining areas I selected the polygon tool because it will help us map mining areas better (you can draw any required shape).
  3. In the classification section I will add an extra section for notes like this.
  4. I will update a few sections of the project to make it more appealing.
  5. Thanks for the bio, @Zengirl2.

Can you please point me to a few sources for more images, @jywarren @Zengirl2?

Thanks, everyone!

@NeuralMonk
Contributor Author

Thanks @skilfullycurled for taking the initiative. You can check this out; it may help: recommendify

@jywarren
Member

Hi @SKashyapD -- can you help me find your SoC proposal? Did it get posted?

@NeuralMonk
Contributor Author

NeuralMonk commented Apr 18, 2019

Hey @jywarren,
Yes, it was posted on the Public Lab website, and you have also reviewed it: SoC proposal
Is there any trouble or something?

Thank you!

@skilfullycurled
Contributor

No, no, thank you for taking the initiative on an ML thread, @SKashyapD! I'm not sure I'll be able to take that much more initiative on the implementation of a tag recommendation system since I don't have lots of experience in programming with Ruby, however, I really want to second your idea of having a server for this.

One thing I would like to do is to piggy back on your initiative and eventually start a conversation about how to grow a community around ML and data science now that the stats downloads page is coming along. More on that later, I have to actually get back to my own data science project!

@NeuralMonk
Contributor Author

@skilfullycurled it will be great

@grvsachdeva
Member

Hi @jywarren @SKashyapD @Zengirl2, can we close this issue or anyone want to update it? Thanks!

@NeuralMonk
Contributor Author

NeuralMonk commented Jul 1, 2019 via email

@grvsachdeva
Member

Cool!

@NeuralMonk
Contributor Author

@skilfullycurled @jywarren what other small projects can we start to grow the community around data science?

thanks!

@budema6

budema6 commented Aug 27, 2019

I am a newbie in the ML area.
I was going through the problem statement as it was interesting.
Thanks for posting in detail for better understanding.
Thanks @skilfullycurled @jywarren @SKashyapD @Zengirl2

@skilfullycurled
Contributor

@SKashyapD, thank you for keeping the conversation alive! I'll rejoin as soon as I can. I just started school in a new program so although PL GitHub conversations are one of my favorite ways to feel like I'm doing programming work but actually just avoiding it (not a joke), I should probably finish my work first. Still, I was too happy about the conversation to not join in. : )

One project comes immediately to mind:

SPAM: There's currently two problems. The first is the spam we currently get from sign-ups and postings, and the second, spam accounts that were made before there was a more robust moderation system. There's a period of time (can't recall what it is) where there are literally ~300,000 accounts. I think Public Lab is awesome, but that seems a tad inflated. ; ) Additionally, I believe that when users are moderated as spam, they are not removed from the database.

Spam isn't the most exciting task, but it'd have a real impact. A) Moderating spam is a huge resource drain. B) Unless we're able to filter out spam accounts, there really can't be good data science because the data won't be good.

This project has two quasi-FTO's. They aren't FTO's according to the actual definition, but the problem contains some "hello worlds" of data science. The first would be good for someone who is comfortable with Ruby (I don't think you have to be awesome at it, I hardly knew Python in my first data science class) but wants to get started in data science. The second is for someone who is comfortable with the fundamental exploratory data analysis tasks and wants to try a simple ML exercise.

I've been collecting a data set of spam.

Project 1: Exploratory data analysis. I started #5450 to discuss non-ML ways to detect spam, and I came up with some guidelines simply by exploring the data. These guidelines could become more robust with more exploratory analysis of a larger dataset. This would be a good way to get familiar with the SciRuby library collection and the fundamentals of data science (using Ruby notebooks, dataframes, selecting data, aggregating results, plotting, etc.). As I said, I've been collecting a dataset of spam, but we also need a way to identify past spam because I'm sure the markers have changed over time.
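
As a rough illustration of what a first exploratory pass could look like, here is a minimal sketch in plain Python (chosen only for brevity; the same steps map onto Ruby dataframes). Everything in it is an assumption made for the example: the records, the field names (`username`, `bio`, `is_spam`), and the choice of "bio contains a link" as a marker are hypothetical and are not taken from the real dataset or from #5450.

```python
import re
from collections import Counter

# Hypothetical records: field names and values are invented for illustration
# and do not reflect the actual site schema or the collected spam dataset.
accounts = [
    {"username": "cheap_pills_4u", "bio": "Buy now at http://example.com", "is_spam": True},
    {"username": "river_watcher", "bio": "I monitor water quality near my town.", "is_spam": False},
    {"username": "best-seo-2019", "bio": "SEO services http://example.org http://example.net", "is_spam": True},
    {"username": "kite_mapper", "bio": "Balloon and kite mapping hobbyist.", "is_spam": False},
]

URL_PATTERN = re.compile(r"https?://\S+")


def link_count(text):
    """Count the http(s) links that appear in a piece of text."""
    return len(URL_PATTERN.findall(text))


def summarize(records):
    """Return, per label, the share of accounts whose bio contains a link."""
    totals = Counter()
    with_links = Counter()
    for record in records:
        label = "spam" if record["is_spam"] else "ham"
        totals[label] += 1
        if link_count(record["bio"]) > 0:
            with_links[label] += 1
    return {label: with_links[label] / totals[label] for label in totals}


summary = summarize(accounts)
print(summary)
```

A marker that separates the two groups this cleanly on toy data would of course need to be checked against a much larger sample before it became a moderation guideline.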

Project 2: Creating a spam/ham classifier. This is why I started the collection actually, so that we'd have enough for the spam part. The harder thing is collecting data for people who are in the ham category. So that's sort of in the Project 1 category, but after we have enough of both, then there are plenty of tutorials for someone to have a nice learning experience.
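
To make the spam/ham exercise concrete, here is a minimal sketch of one common tutorial approach, a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in dependency-free Python. It is an assumption that this technique would suit the site's data; the class name, the tokenizer, and the four training sentences are all hypothetical and invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Toy multinomial Naive Bayes over the two labels 'spam' and 'ham'."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # Log prior; assumes each label has at least one training example.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = self.word_counts[label][token] + 1
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Hypothetical training examples, made up for illustration only.
classifier = NaiveBayes()
classifier.train("Buy cheap pills online now", "spam")
classifier.train("Cheap SEO services, click here now", "spam")
classifier.train("We measured water quality at the river", "ham")
classifier.train("Balloon mapping workshop notes and photos", "ham")

print(classifier.predict("cheap pills, click now"))
print(classifier.predict("water quality at the river workshop"))
```

With real data, the interesting work is in the parts this sketch skips: parsing the posts, balancing the spam and ham sets, and holding out examples to measure how often the classifier is wrong.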

@skilfullycurled
Contributor

My pleasure @budema6, I'm excited about developing a community so it's really thanks to you for your interest!

@skilfullycurled
Contributor

Update: I now have enough spam if anyone ever wants to take on training a spam/ham classifier for the site. If I recall, I've seen a number of Jupyter notebooks that do this in Ruby. Of course, the data has to be parsed, and we need a ham dataset as well. In any event...

@Uzay-G
Member

Uzay-G commented Jan 11, 2020

Hey! This topic really interests me, and I have made some Natural Language Processing projects with Python and the spaCy library. I'd love to help out and try applying NLP to spam detection. I'm no expert, but I think I could help 😄

@skilfullycurled
Contributor

@Uzay-G, thanks for reviving this thread. I'm not sure when/how but I'm thinking it might be a good idea to try to have a call. It just seems like there's enough interest in general, and it might be good to just meet each other and see if we can organize ourselves. I'd sort of like to see this become a tool topic just like balloon mapping or spectrometry. And, perhaps at some point even have a separate PL repo for projects the same way mapknitter does.

Anyone at @publiclab/connectors, how are we handling developer open calls these days?

@stale

stale bot commented Oct 7, 2020

Hi 😄, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label 🎉 . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! 💯 Thank you for your contributions 🙌 🎈.

@stale stale bot added the stale label Oct 7, 2020
@jywarren jywarren removed the stale label Oct 8, 2020
@jywarren
Member

jywarren commented Oct 8, 2020

Sorry about the stalebot message here, it was a mistake! 😅 Can't seem to delete due to a GitHub API issue... strange. Carry on!


10 participants