Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider how to prevent access to feedback from other students #1202

Closed
jhamrick opened this issue Aug 26, 2019 · 20 comments
Closed

Consider how to prevent access to feedback from other students #1202

jhamrick opened this issue Aug 26, 2019 · 20 comments
Milestone

Comments

@jhamrick
Copy link
Member

@rkdarst noted in #1200:

One idea I had was to, at submission time, generate a random token and put it into the submitted notebook and some cache. Then, there is less coupling between a hash algorithm and retrieving the notebook later (which is probably a good thing)...

On the other hand, tracking tokens separately from the notebook makes more work. They could be embedded in the notebook metadata, but you have the same problem as the next paragraph.

All this actually this makes me think: if a student A shares their notebook with student B, is B then able to retrieve A's results and grade? It would be best to avoid this if possible...

As I responded there, I'm not really sure of a way around this access issue, barring having better access controls (which we should have once hubshare exists...). But I'm opening an issue here to at least track that this is a problem, and maybe brainstorm ways to deal with it.

@jhamrick jhamrick added this to the hubshare milestone Aug 26, 2019
@jhamrick
Copy link
Member Author

Also moving discussion from #1120 over here since that thread is already so long 🙂Ping @rkdarst and @Sefriol .

I guess with @Sefriol 's PR (#1186) the concern with being able to list and read other students' feedback is alleviated (with the exception of #1202), now that the feedback directory in the exchange is not listable and the hash is computed from the cache.

@Sefriol brought up the additional concern that:

currently fetch assignment is meant for students only (according to documentation), but it accepts student ID or no ID specified as a parameter

I think with #1200 this should be ok now, since the hash is now computed from the student ID and the original notebook. You can try to override the student id if you want, but it won't work unless you actually have the original notebook submitted by that student.

@Sefriol
Copy link
Contributor

Sefriol commented Aug 26, 2019

hubshare seems pretty dead project to me or is there any other alternatives on jupyterlab side?

Anyway, I think using cache and random string seems like the best option to me, unless we really want to store additional information into the hash.

@rkdarst
Copy link
Contributor

rkdarst commented Aug 26, 2019

One interesting thing is that, if grading is a function of only the input state, knowing the notebook (plus the tests from your own notebook) is the same as knowing the grade. Except manual grading. And I doubt data protection authorities would be impressed by that argument. Anyway I don't consider this a top-priority problem, but maybe I should...

We can almost use the random string added to the filename when it's submitted as a token to retrieve it (either directly or part of the nb hash). But it's not stored in the student cache (per #1104, if it was students could more easily find their submitted assignments and modify them).

One could add two random strings, one to be used when retrieving and one not saved...

@rkdarst
Copy link
Contributor

rkdarst commented Aug 26, 2019

Wait... the latest algorithm hashes the timestamp, right? Which is stored only in the cache, not in student dir. So student would have to share more than just the notebook in order to have someone else able to compete the hash... there's not a lot of entropy but it's not small either. So actually it's not that bad. (actually almost good!)

Thinking a bit more, if we process a timestamp.txt in the submission dir, it's not that far off to add a feedbacktoken.txt file there, too, which gets treated the same way as the timestamp, and since the name is obvious it's up to student to not reveal it.

@jhamrick
Copy link
Member Author

hubshare seems pretty dead project to me or is there any other alternatives on jupyterlab side?

It's not dead, just delayed... I inquired about the status a few months ago and there are still plans to work on it. There's nothing separate on the JuptyerLab side; hubshare is the official Jupyter plan for how to do filesharing on JupyterHub.

One interesting thing is that, if grading is a function of only the input state, knowing the notebook (plus the tests from your own notebook) is the same as knowing the grade. Except manual grading.

Well, if there are hidden tests then the grades also wouldn't be a function of just the input state.

the latest algorithm hashes the timestamp, right? Which is stored only in the cache, not in student dir. So student would have to share more than just the notebook in order to have someone else able to compete the hash

Yes, it definitely adds an additional barrier to entry. And I guess you're right, if we added a secret token it would act basically like a shared secret between the student and the instructor. Other student's wouldn't be able to (easily) guess it and so wouldn't be able to read any feedback files without explicitly having the file, the exact time it was submitted, and the token (I guess they could try to brute force it, but if we make it long enough it shouldn't be an issue). So the main way someone's feedback could be leaked is if they explicitly gave the token away too.

I think that's a reasonable compromise for the moment.

If you're happy with that solution, would you like to have a go at implementing it? (I'm unfortunately not going to have much time to work on things for the next few weeks, though I might be able to next weekend)

@perllaghu
Copy link
Contributor

This problem becomes significantly less if we move to an API-based exchange, with back-end database:
Such an exchange will already have to distinguish between students & instructors for releasing assignments & grading assignments, so the exchange will be able to return the feedback for any submission (as it will be able to match a piece of feedback to a specific submission), and restrict access to that feedback to only that student [plus any instructor on the course, I presume]

having said all that, we've not yet looked at implementing the feedback code in our local external exchange code yet :oops:

@danielmaitre
Copy link
Contributor

Hi, I think the grading should be deterministic, even if there are hidden tests. If the code between two notebooks is exactly the same, character for character, surely it would have the same outcome for all tests, hidden or not?

Something I suggested in the hackathon in Edinburgh was to add encryption to the submission and feedback mechanism (but I have not implemented yet). The way it could work is that the instructor includes a public key in the metadata of the notebook in the "generate" step, the student uses this key to encrypt the notebook before submitting it, and before that adds a feedback public key to the metadata of the notebook. The instructor decrypts the submitted notebook, grades and generate the feedback, this feedback is then encrypted with the public key of the student and send out.

The advantages of this is that

  • students cannot see (decrypt) other student's feedback
  • students cannot copy other student's work from the exchange
  • students cannot just copy another student's submission and place it in the exchange: they would both have the same feedback public key and this could be detected.

Unfortunately I don't have time this week to have a look at this.

@Sefriol
Copy link
Contributor

Sefriol commented Aug 27, 2019

API-like system would have many advantages, allow many additional features (for example realtime grading) and possibility to use already existing solutions for python-code grading (i.e. fighting against plagiarism etc), but that would require pretty much rewriting the whole thing from scratch (not quite, but not far off. I don't think that there is much to reuse. Even UI things should be moved to JupyterLab).

Encryption has it's positives and negatives. I currently like how instructors can debug the system quite easily by going over certain directories. Doing that at the same time with encryption is quite a time consuming task since you would have to make a solution for dealing with private/public keys at the same time (most likely using database entries).

How would you handle the public/private key storage? Store them in the database and provide API for students to receive their own public key?

Other than that it would be quite nice solution, since privacy and data security are quite lacking in the current solution.

@perllaghu
Copy link
Contributor

API-like system would have many advantages [snip] but that would require pretty much rewriting the whole thing from scratch......

Actually, we're already using one.
The hard part is getting nbgrader to use it.

There is a plan to push the code into the public domain, however our time keeps getting distracted by people wanting things done

@danielmaitre
Copy link
Contributor

I agree debugging would be made more difficult, but it would not be too bad: to publish assignment there is only a change in metadata in the notebook. One step is added when collecting, first we collect, say in a folder "collected_encrypted" then we decrypt into the "collected" folder and from there it follows the normal path, fully decrypted. THe feedback generation is the same as before but with the encryption step at the very end, so we still have a chance to look at the feedback generated before it is sent off to the students.

The key management is a more tricky issue. The encryption key for submission could be part of the assignment data in the nbgrader database. For the students key one could use the same for all submission and have it stored somewhere in the student's home directory. I think the same key could be used for all submissions.

@danielmaitre
Copy link
Contributor

One more thought about students accessing other student's feedback: if students can access other feedback, they still cannot know who the feedback was for. So as long as the number of students is large enough there would still be some anonymity. One could even argue for publishing all feedbacks, so students can learn from other people's mistakes or solutions...

@Sefriol
Copy link
Contributor

Sefriol commented Aug 27, 2019

Actually, we're already using one.
The hard part is getting nbgrader to use it.

Pretty much my point. nbgrader was not designed API-service in mind. In general I don't think it would be that complicated to implement (time consuming, most likely), but in the end that would lead into redoing the whole extension (especially when combined with the JupyterLab translation).

But in the end, I would say that it might be a good solution choice either way. For example, we are already using external grading server for autograding the notebooks, since it takes too much resources compared to normal instructor instance.

@rkdarst
Copy link
Contributor

rkdarst commented Aug 28, 2019

This thread has moved to a whole other topic, it seems. If nbgrader is used with separate, feedback is already designed to be secure (just could use a bit more entropy, but it's OK for now). And if a student doesn't send an exact copy of their notebook file AND timestamp file, there is chance of other feedback being found. Having dedicated randomness for the hash would be in line with [Kerckhoff's principle(https://en.wikipedia.org/wiki/Kerckhoffs%27s_principle).

Other than that it would be quite nice solution, since privacy and data security are quite lacking in the current solution.

Our setup using kubernetes+actual users has no known security problems using current nbgrader (except the minor entropy consideration above). In a certain way I actually quite like the current exchange for its elegant use of standards.

If there is a setup where every user has the same filesystem user ID, then yes, everything above is not true. It's nice to discuss encryption in theory, but if you ask me that's trying a bit too much to work around the problem. Even if there is encryption, as long as student has same uid as instructors, then students can also delete files, alter the released files, release their own assignments, and so on (removing formgrader/config is security by obscurity). I think as long as there is a filesystem-based exchange, it can only be considered secure as long as instructors have different UIDs from students (and then everything works nicely and securely). If not true, nothing will work.

So in short, the world (cloud deployments) has sort of moved beyond the concept of users with their own uid, so there's just a bit of a culture mismatch now, not some fundamental flaw with the old generation. I don't think there is need to do such big changes to the filesystem-based exchange, and instead wait for an API-based exchange (modern solution to a modern world). But of course I'm not the one doing anything about it!

... perhaps we could add some instructions for changing UIDs of the formgrader service notebooks?

@perllaghu
Copy link
Contributor

perllaghu commented Aug 28, 2019

The hard part is getting nbgrader to use it.

Pretty much my point. nbgrader was not designed API-service in mind.

Again, not true.... the fix is to fork nbgrader, and modify nbgrader/exchange/__init__.py to pull in our external-exchange code.... the formgrader & assignments-list code all remains the same.

There is a piece of work to make the exchange service a plugin - which I believe was started, but may have fallen by the wayside.
The rest is pretty easy - it's just Python, and all you need to do is replicate the methods in the original exchange service & pass back data-structures in the right form

..... however, as @rkdarst says, we're drifting off target here

@rkdarst
Copy link
Contributor

rkdarst commented Sep 9, 2019

One hackish idea I had to solve this problem was to add 50 extra decimal digits to the end of the timestamp microseconds. That provides enough entropy that it could be considered secure, and requires no other changes (assuming that all the parsers can handle the excess digits and ignore them). They should probably be written only to timestamp.txt and not used as part of the filenames.

I don't really like the idea since it's repurposing something which is not obviously security related into something security related. But the simplicity is nice.

@Sefriol
Copy link
Contributor

Sefriol commented Sep 9, 2019

One hackish idea I had to solve this problem was to add 50 extra decimal digits to the end of the timestamp microseconds.

There are actually libraries which use processor cycles to create timestamps (in addition to normal timestamp). I don't know how this works however when there multiple processors and kubernetes in the mix.

@perllaghu
Copy link
Contributor

One hackish idea I had to solve this problem was to add 50 extra decimal digits to the end of the timestamp microseconds.

There are actually libraries which use processor cycles to create timestamps (in addition to normal timestamp). I don't know how this works however when there multiple processors and kubernetes in the mix.

Once you push the exchange through an API, you can do stuff like check user IDs [feedback is for a notebook belonging to user <userID>, only said user can download it...

@Sefriol
Copy link
Contributor

Sefriol commented Nov 9, 2019

#1083 (comment)

One of the problems with current feedback system if instructor is ever required to change the contents of the notebook.

@perllaghu
Copy link
Contributor

This is a problem anyway... If a lecturer releases an updated assignment after a student has fetched it, the update won't show.

..... and even if it did, you now have the problem that there's the original released version, the student worked-on version, and now a changed-instructor version - how do you resolve the differences?
Simple: you don't - if you mess up tje assignment, make a new one.... just as you would if this was a paper exercise.

@Sefriol
Copy link
Contributor

Sefriol commented Nov 9, 2019

Yeah, basically to roll updates to the students, we would need a system where:

  • Student has a websocket connection to the server, which would then transmit changes to the notebook (i.e. student subscribes to a channel which is dedicated to the assignment notebook and it sends updates with cell IDs. Some kind of version check for the notebook so client would be notified that there are changes).
  • Non-student solution cells would be updated without student intervention
  • For solution cells Git Diff/Merge -styled window opens and student can choose which parts he includes.

@jupyter jupyter locked and limited conversation to collaborators Jul 13, 2022
@jhamrick jhamrick converted this issue into discussion #1638 Jul 13, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants