This program is a school project for an Artificial Intelligence class. It allows the user to add a classification layer on top of a NASNetMobile convolutional neural network using their own images of their hand making the gestures "rock", "paper", "scissors", and "nothing"; the resulting model can then be tested and used in a simple "RPS" game.
Python 3.9.9
opencv-python tensorflow numpy scikit-learn scipy
Install the correct version of Python and ensure that you can run Python files (this may require configuring OS or shell environment variables, e.g. adding python to your PATH). Install the Python package installer "pip" with the command "python -m ensurepip --upgrade"; note that the Python command may be named "python3" on some Linux distributions. Then use pip to install the dependencies listed in the provided "requirements.txt" file with "python -m pip install -r requirements.txt", or simply "pip install -r requirements.txt" if pip is also on your PATH. This program was tested on Windows 11 and Fedora Linux 35 and needed no further configuration there. We cannot guarantee compatibility of the libraries with any particular hardware/operating system configuration.
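After installation, a quick way to confirm that the packages resolved correctly is to import them from a Python prompt and print their versions. This is only an optional sanity check, not part of the project itself:

    # Optional sanity check: confirm the dependencies import and report their versions.
    import cv2
    import tensorflow as tf
    import numpy as np
    import sklearn   # scikit-learn installs under the import name "sklearn"
    import scipy

    print("OpenCV:      ", cv2.__version__)
    print("TensorFlow:  ", tf.__version__)
    print("NumPy:       ", np.__version__)
    print("scikit-learn:", sklearn.__version__)
    print("SciPy:       ", scipy.__version__)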
Run the program using "python main.py". Interaction with the program occurs both in the terminal and in graphical windows that appear at different stages. Warnings in the terminal regarding Nvidia GPU drivers are likely and can be ignored.
When the program begins, the user will be prompted at the terminal as to whether they would like to load a previously saved model. If the user answers "n" for "no", or if the model files (which would reside in the ./model directory) do not exist, the user will be shown a screen for gathering training-data images used to construct the model. Answering "y" for "yes" when a saved model is present allows the user to bypass this routine. There is currently no mechanism to save multiple models and select between them when loading; creating a model always writes to the ./model directory, overwriting any previously saved model.
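For reference, the save/load behavior described above matches the Keras SavedModel directory format. The following is a minimal sketch of the load path under that assumption; the prompt text and function name are illustrative and may not match main.py exactly:

    import os
    import tensorflow as tf

    MODEL_DIR = "./model"

    def maybe_load_saved_model():
        # Ask the user whether to reuse a previously saved model.
        answer = input("Load previously saved model? (y/n): ").strip().lower()
        if answer == "y" and os.path.isdir(MODEL_DIR):
            # Keras restores a model saved in the SavedModel directory format.
            return tf.keras.models.load_model(MODEL_DIR)
        return None  # Fall through to the data-gathering and training routine.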
After going through the training process once, the user will have the option to load that saved model instead of repeating this relatively time-consuming process. When the training screen appears, the user will see the default video camera feed, assuming the operating system exposes the camera in a way that OpenCV detects by default. There are currently no options for selecting a custom input device: the program requires that the user's operating system / shell environment is configured such that the camera is detected and opened by OpenCV's VideoCapture function when called with an argument of 0 for the "index" parameter, which denotes the enumerated video device index.
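The camera requirement boils down to OpenCV being able to open device index 0. A standalone check of that assumption (not part of the project code) is:

    import cv2

    # Index 0 refers to the OS default camera; the program assumes this device works.
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        raise RuntimeError("OpenCV could not open video device 0; check the camera/OS configuration.")
    ok, frame = cap.read()
    print("Captured a frame:", ok, "shape:", frame.shape if ok else None)
    cap.release()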
During the training process, the user should hold their hand inside the box in the camera feed, make a "rock", "paper", or "scissors" gesture, and press "r", "p", or "s", respectively. When the key is pressed, the program begins capturing a set of 100 images for that gesture type. The user should move their hand slightly during the capture to introduce variation into the dataset, which improves recognition accuracy. The user must also press "n" with their hand out of the box to create a set of images labeled "nothing", which serves as the baseline "no gesture" class. After gathering all of the image sets, press "q" to close the window and move on to the next stage.
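Conceptually, the capture stage crops the boxed region of each frame and saves a burst of labeled images when one of the keys above is pressed. The sketch below illustrates that idea; the box coordinates, directory layout, and key handling are assumptions rather than the project's exact code:

    import os
    import cv2

    LABELS = {ord("r"): "rock", ord("p"): "paper", ord("s"): "scissors", ord("n"): "nothing"}
    BOX = (100, 100, 324, 324)      # Illustrative region of interest: x1, y1, x2, y2.
    IMAGES_PER_CLASS = 100

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x1, y1, x2, y2 = BOX
        cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 255, 255), 2)
        cv2.imshow("Gather training data", frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord("q"):
            break
        if key in LABELS:
            label = LABELS[key]
            os.makedirs(f"images/{label}", exist_ok=True)
            # Capture a burst of cropped images for this label.
            for i in range(IMAGES_PER_CLASS):
                ok, frame = cap.read()
                if not ok:
                    break
                cv2.imwrite(f"images/{label}/{label}_{i}.jpg", frame[y1:y2, x1:x2])
    cap.release()
    cv2.destroyAllWindows()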
At this stage, the console will display output from the TensorFlow training process as each epoch executes. This can take several minutes and is likely to use upwards of 8 GB of RAM. There is currently no mechanism to enforce memory limits beyond anything built into TensorFlow itself, so it is entirely possible to run into issues from exhausting memory and/or swap capacity.
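The training step follows a standard transfer-learning pattern: a frozen NASNetMobile base with a small classification head fitted to the gathered images. A hedged sketch of that pattern follows; the input size, optimizer, and epoch count are assumptions and may differ from main.py:

    import tensorflow as tf

    NUM_CLASSES = 4                      # rock, paper, scissors, nothing
    INPUT_SHAPE = (224, 224, 3)          # Assumed input size for NASNetMobile.

    # Pretrained NASNetMobile base without its ImageNet classification head.
    base = tf.keras.applications.NASNetMobile(include_top=False,
                                               weights="imagenet",
                                               input_shape=INPUT_SHAPE,
                                               pooling="avg")
    base.trainable = False               # Freeze the pretrained weights; only the new layer is trained.

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: preprocessed gesture images, y_train: one-hot labels built from the captured image sets.
    # model.fit(x_train, y_train, epochs=10, batch_size=16)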
After the training process is complete, the user will be prompted in the console to choose whether they would like to test the model for accuracy. If the response is "y" for "yes", a window appears with the camera feed and the boxed detection region. This window also displays the prediction generated by the model for the current frame along with a confidence value as a percentage. The default confidence threshold used by the game-playing functionality is 70%, so ideally the displayed confidence should consistently exceed this threshold, with the correct category label, whenever the user makes the intended gesture in the box. After reviewing the accuracy of the model's predictions, the user can exit the testing screen by pressing "q".
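Per-frame prediction during testing is a single forward pass on the cropped box: the class with the highest softmax probability becomes the label, and that probability (as a percentage) is the displayed confidence. A simplified sketch under assumed class ordering and preprocessing:

    import cv2
    import numpy as np

    CLASS_NAMES = ["rock", "paper", "scissors", "nothing"]   # Assumed label order.
    CONFIDENCE_THRESHOLD = 70.0                               # Percent, as used by the game stage.

    def predict_gesture(model, roi_bgr):
        # Resize the boxed region to the network's input size and scale pixel values to [0, 1].
        img = cv2.resize(roi_bgr, (224, 224)).astype("float32") / 255.0
        probs = model.predict(np.expand_dims(img, axis=0), verbose=0)[0]
        best = int(np.argmax(probs))
        return CLASS_NAMES[best], float(probs[best]) * 100.0

    # Example: label, confidence = predict_gesture(model, frame[y1:y2, x1:x2])
    # The game stage only acts on a prediction when confidence >= CONFIDENCE_THRESHOLD.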
After the testing phase, another window will appear with the same camera feed and image box, but with a text overlay showing the score for both the Player and the opposing Computer agent. Whenever the user places a hand gesture into the box, the model classifies the gesture and compares it to a move chosen randomly on behalf of the Computer agent. Each time a new gesture is detected, the box turns green if the Player's move beats the Computer's, red if the Computer's move beats the Player's, and white on a tie. The game is won by whoever has the higher score after 5 matches (the default). At that point the camera feed is suspended, the winner is displayed, and the user can press Enter to play again or any other key to exit the program.
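The match outcome itself is ordinary rock-paper-scissors logic against a randomly chosen computer move. A hedged sketch of that comparison (function and variable names are illustrative):

    import random

    BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

    def play_round(player_move):
        # The computer picks uniformly at random; a "nothing" prediction never starts a round.
        computer_move = random.choice(list(BEATS))
        if player_move == computer_move:
            return computer_move, "tie"          # Box drawn white.
        if BEATS.get(player_move) == computer_move:
            return computer_move, "player"       # Box drawn green.
        return computer_move, "computer"         # Box drawn red.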