Repository to demonstrate how to get started with the Data Science Research Infrastructure (DSRI) using Python.
Start a JupyterLab workspace from the template in the DSRI web UI catalog.
Optionally, provide this repository's URL when asked for a git repository to clone into the JupyterLab workspace. The required packages will then be installed automatically when JupyterLab starts, from the requirements.txt and packages.txt files present at the root of the repository. Otherwise, you can simply clone the repository after deploying the workspace.
Clone the repository:
git clone https://github.com/MaastrichtU-IDS/dsri-demo.git
cd dsri-demo
- Start the MySQL database from the template in the DSRI web UI catalog
- Access the JupyterLab web UI, and open the terminal
- Run the mysql.ipynb notebook to load and query data in the MySQL database
- Start the PostgreSQL database from the template in the DSRI web UI catalog
- Access the JupyterLab web UI, and open the terminal
- Run the postgresql.ipynb notebook to load and query data in the PostgreSQL database
Long-running tasks should not be run via the JupyterLab web UI: the connection might be lost, and JupyterLab is designed to visualize data, not to run and manage long-running tasks.
Multiple options are available. The easiest is to run your job as a plain Python script, with no additional library needed: just copy/paste the code blocks of the notebooks into a .py file.
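As a rough illustration of that copy/paste step, notebook cells could be gathered into a script shaped like the one below. The `fetch_rows` function and its sample data are hypothetical placeholders, not code taken from the notebooks in this repository; replace them with the actual cell contents.

```python
"""Sketch of a standalone script assembled from notebook cells."""
import time


def fetch_rows(n_batches=3):
    # Hypothetical stand-in for the code copied from the notebook cells.
    rows = []
    for batch in range(n_batches):
        rows.append({"batch": batch, "value": batch * 10})
        # flush=True makes progress visible immediately, even when the
        # output is redirected to a file.
        print(f"processed batch {batch}", flush=True)
        time.sleep(0.1)  # placeholder for real work
    return rows


def main():
    rows = fetch_rows()
    print(f"done, {len(rows)} rows collected", flush=True)


if __name__ == "__main__":
    main()
```

Keeping the work inside functions and calling them from the `__main__` guard means the file can be run directly with `python` or imported elsewhere without side effects.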
Start the script via the ZSH terminal in detached mode (it will continue to run even if you close the terminal and the session):
python script_get_data.py &
Output from print() calls will be shown in the terminal session where you started the script, but the script will not be stopped if you leave the window or close the terminal session.
You can also send the logs to a file:
python script_get_data.py > script.log &
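If you would like a timestamp on each line, the standard `logging` module can write to a file from inside the script instead of relying on shell redirection. This is a minimal sketch, assuming a log file named `script.log` (chosen here to match the command above); the `setup_logger` and `run_job` helpers are hypothetical names, not part of this repository.

```python
import logging


def setup_logger(log_path="script.log"):
    # Create a named logger that appends timestamped lines to log_path.
    logger = logging.getLogger("script_get_data")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def run_job(logger, steps=3):
    # Hypothetical job: replace the loop body with the real work.
    for step in range(steps):
        logger.info("finished step %d", step)
    return steps
```

A script would call `logger = setup_logger()` once at startup and then use `logger.info(...)` wherever it previously used `print()`.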
You can also start the script from the Bash terminal in detached mode:
bash
nohup python script_get_data.py &
You can see the output generated by your Python script in the nohup.out file, in the folder where you started the script (instead of it being printed directly to the terminal).
Use this command in the terminal to show all processes running and see if your script is still running:
ps aux
You can also filter the output to see only your script:
ps aux | grep script_get_data
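The same check can be done from Python if the script records its process ID at startup. This is an alternative sketch rather than something the repository does: the `script.pid` file name and both helper functions are hypothetical, and using `os.kill(pid, 0)` as an existence probe assumes a POSIX system such as Linux.

```python
import os


def write_pid_file(path="script.pid"):
    # Call this at the start of the long-running script.
    with open(path, "w") as f:
        f.write(str(os.getpid()))


def is_running(pid):
    # Signal 0 sends nothing; it only checks that the process exists.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

From another terminal or notebook, reading the number stored in the PID file and passing it to `is_running` tells you whether the script is still alive.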
To do.
MongoDB to be tested
git clone https://github.com/pedrohserrano/twitter-covid-scam
- Download json_covid.zip (to be hosted somewhere, currently on a USB stick in Pedro's moving boxes)
- Run Dump2Mongo.ipynb
- Run Analysis.ipynb