-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathcapstone.Rmd
91 lines (65 loc) · 11.6 KB
/
capstone.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
title: "Capstone Project"
linkcolor: red
urlcolor: red
---
This is information about the semester-long capstone project that you will be required. This will be a significant portion of your grade. The general idea of the project is to allow you to develop a deeper understanding of a machine learning technique and how to present that to an external audience in a digestable format.
## Guidelines:
* You may work individually or in a group of two.
* You must submit a project proposal before 4/12/17.
* The final project will be due on 5/3/17.
* The project will be graded based on the rubric found below, with project-specific modifications for parts that are not applicable.
## Resources For Project Ideas and Sample Projects
Here are some resources to help you choose your project and to understand what the final output ought to look like.
### External Resources
This type of project is common in machine learning courses, and so there are already a great deal of example projects for you to draw on. Please do not copy any of the ideas here exactly, though I will not expect that all project ideas are completely original -- you are welcome to flesh out a variant of one of these projects, provided that you are doing some original work. Please see me if you have concerns about this.
* From CS229 at Stanford:
- [2013](http://cs229.stanford.edu/projects2014.html)
- [2014](http://cs229.stanford.edu/projects2014.html)
- [2015](http://cs229.stanford.edu/projects2014.html)
- [2016](http://cs229.stanford.edu/projects2014.html)
* From CMU: [Project Ideas](http://www.cs.cmu.edu/~10701/projects.html);
* A [Quora post](https://www.quora.com/Machine-Learning-What-are-some-small-project-ideas-to-practice-the-following-ML-algorithms) on this topic.
### Suggestions
Here are a collection of random project ideas that I've come up with that might be interesting. I've got more, so please feel free to chat with me about it.
#### Data Journalism
"Data journalism" has become quite a thing over the past few years, starting mostly with the rise of Nate Silver (of [538](https://fivethirtyeight.com/) fame) after his success at being a reasonable pundit in the 2008 election. As in science, however, the reproducibility of such data-backed articles is important -- being able to verify the conclusions of articles that have their groundings in some sort of data analysis keeps those doing the analysis responsible for not "fudging the numbers". To this end, many of the data sets underlying 538 articles are now available in their R package, and a project which reproduced and enhanced some of their analyses with some more substantial machine learning would be a good way to learn the presentation/plotting/document generation part of R as well as provide an opportunity to get some writing in.
#### Gerrymandering
[Gerrymandering](https://en.wikipedia.org/wiki/Gerrymandering_in_the_United_States) is the redistricting of congressional districts that is meant to favor a political party or candidate. This is an obvious problem and clearly happens (see this [insanity](https://en.wikipedia.org/wiki/Texas's_35th_congressional_district)). There are a number of statistical techinques that are used to try and determine whether or not a given districting system is "fair", and indeed a summer school being planned to train mathematicians to be witnesses in court cases on gerrymandering (see [here](https://sites.tufts.edu/gerrymandr/project/)). A project around this would be to obtain data that is relevant to identifying gerrymandering, apply some statistical and/or machine learning techniques to understanding the phenomenon, and produce some useful visualizations/reports that describe your findings.
## Rubric
This is the basic structure of the rubric by which projects will be graded. Depending on the precise structure of your chosen project, however, this may vary, and we will discuss this at the time of the project proposal. For projects that follow a more standard structure of "choose a problem and data set -> choose a machine learning technique to solve that problem -> implement that solution", however, this will be the complete rubric.
### Project Definition
| Criteria | Specification |
|-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Overview | Provides an understandable, high-level overview of the project. The history of the problem the project is related to and context around prior work is given. |
| Problem Statement | The problem is clearly-defined and a strategy for solving the problem is laid out. |
| Metrics | The metrics used to describe the performance of the models used in the project are clearly-defined. They are reasonable, based on the problem.
### Analysis
| Criteria | Specification |
|---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Exploratory Data Analysis | If there is a dataset associated to the problem, there is some initial exploratory data analysis included. Features and relevant statistics are calculated. If there is no data set, a description of the input space and data has been made. Abnormalities or characteristics about the data or input that need to be addressed have been identified. |
| Exploratory Visualization | A visualization has been provided that summarizes a relevant characteristic or feature to the data, with some discussion. |
| Algorithms and Techniques | All of the algorithms and techniques that are used in the project are thoroughly discussed and justified based on the problem. |
| Benchmark | There is a clearly-defined benchmark result or threshhold that provides a comparison of performance of solutions in the project.
### Methodology
| Criteria | Specification |
|--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data Preprocessing | All preprocessing steps have been clearly documented. Abnormalities or characteristics about the data or input that needed to be addressed have been corrected. If no data preprocessing is necessary, it has been clearly justified |
| Implementation | The process for which metrics, algorithms, and techniques were implemented with the given datasets or input data has been thoroughly documented. Complications that occurred during the coding process are discussed. |
| Refinement | The process of improving upon the algorithms and techniques used is clearly documented. Both the initial and final solutions are reported, along with intermediate solutions, if necessary. |
### Results
| Criteria | Specification |
|------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Model Evaluation | The final model’s qualities — such as parameters — are evaluated in detail. Some type of analysis is used to validate the robustness of the model’s solution. |
| Justification | The final results are compared to the benchmark result or threshold with some type of statistical analysis. Justification is made as to whether the final model and solution is significant enough to have adequately solved the problem. |
### Conclusion
| Criteria | Specification |
|---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Visualization | A visualization has been provided that emphasizes an important quality about the project with thorough discussion. Visual cues are clearly defined. |
| Reflection | Student adequately summarizes the end-to-end problem solution and discusses one or two particular aspects of the project they found interesting or difficult. |
| Improvement | Discussion is made as to how one aspect of the implementation could be improved. Potential solutions resulting from these improvements are considered and compared/contrasted to the current solution. |
### Overall Quality
| Criteria | Specification |
|---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Presentation | Project report follows a well-organized structure and would be readily understood by its intended audience. Each section is written in a clear, concise and specific manner. Few grammatical and spelling mistakes are present. All resources used to complete the project are cited and referenced. |
| Functionality | Code is formatted neatly with comments that effectively explain complex implementations. Output produces similar results and solutions as to those discussed in the project. |