Skip to content
This repository has been archived by the owner on Jan 24, 2018. It is now read-only.

Peer service #1556

Merged
merged 31 commits into from
Mar 6, 2017
Merged

Peer service #1556

merged 31 commits into from
Mar 6, 2017

Conversation

david4096
Copy link
Member

@david4096 david4096 commented Feb 7, 2017

This PR implements the peer service ga4gh/ga4gh-schemas#806 . It does not add any new dependencies, but makes explicit the dependency the server has on the client for performing the announce request (switched to using requests, which was already a dependency). This could easily be replaced by using the requests module directly.

It is meant to present the minimum functionality to support implementing the schemas PR. Coordinated testing amongst nodes is not addressed. Atlhough interesting, making peer to peer connections and hostname resolution is not addressed.

Peers can be announced to a server by sending a AnnouncePeerRequest to /announce that looks like {peer: {uri: "http://localhost"}}. These will be added as a timestamped row in the registry so they can be evaluated for being added as potential peers.

sqlite> select * from announcement;
1|{"attr": {"version": {"values": [{"int32Value": 32}]}}}|http://1kgenomes.ga4gh.org|127.0.0.1|HTTPie/0.9.6|2017-02-28 16:48:47.515380
2|{}|http://1kgenomes.ga4gh.org|127.0.0.1|python-requests/2.7.0 CPython/2.7.12+ Linux/4.8.0-39-generic|2017-02-28 16:49:33.037886
3|{}|http://1kgenomes.ga4gh.org|127.0.0.1|python-requests/2.7.0 CPython/2.7.12+ Linux/4.8.0-39-generic|2017-02-28 16:49:43.286461
sqlite> 

Peers are managed via the repo manager, ga4gh_repo add-peer http://1kgenomes.ga4gh.org will add that peer to the registry. These peers are exposed via the peers/list endpoint (without rebooting the server).

In order to support the feature of not reloading the server, some design decisions that expose the SQL of the data repository were made. My hope is that with this PR will help begin to form an opinion on how to separate the messages that require some compound ID (variant), with others that can use auto-incrementing IDs (dataset, peer, referenceset, and so on).

To gather more details about the AnnouncePeerRequest before logging it, the request is handled directly instead of via handleFlaskPostRequest breaking some design patterns.

TODO

  • Add repo manager CLI methods for listing and clearning announcements.
  • Repo manager documentation
  • Configuration setting documentation
  • Add an initial announce to peers at start. Tested against 1kgenomes, will send an announce packet at server start for any initial peers listed. Setting the URL to announce is still a challenge because of the different configurations we work under.
  • Log the user-agent of the AnnouncePeerRequest. A JSON with a smattering of extra fields about the announce request are stored to help detect spam. The submitter IP and user agent are stored in the database along with the announce request.
  • Protocol version should be set automatically.
  • Make peer and announcement datamodel objects.
  • Add end to end tests that use the client (send announcement, list peers, get info).
  • Add "initialpeers" file that is read at startup.

@david4096 david4096 changed the title 1507 Peer service (WIP) Peer service (WIP) Feb 7, 2017
Allows peers to be added by CLI
Lists peers at peers/list
Modifies models to have tables for announcements and peers
Stub out some tests
Each announcement's timestamp is recorded, as well as the host it came from
@david4096 david4096 force-pushed the 1507_peers branch 3 times, most recently from 4923fe9 to 6a66b93 Compare February 8, 2017 04:34
@codecov-io
Copy link

codecov-io commented Feb 8, 2017

Codecov Report

Merging #1556 into master will decrease coverage by -0.03%.
The diff coverage is 87.53%.

@@            Coverage Diff             @@
##           master    #1556      +/-   ##
==========================================
- Coverage   85.95%   85.93%   -0.03%     
==========================================
  Files          34       36       +2     
  Lines        7519     7783     +264     
  Branches      947      967      +20     
==========================================
+ Hits         6463     6688     +225     
- Misses        858      890      +32     
- Partials      198      205       +7
Impacted Files Coverage Δ
ga4gh/server/exceptions.py 98.63% <100%> (+0.03%)
ga4gh/server/serverconfig.py 100% <100%> (ø)
ga4gh/server/frontend.py 91.57% <100%> (+0.21%)
ga4gh/server/repo/models.py 99.49% <100%> (+0.03%)
ga4gh/server/datamodel/peers.py 100% <100%> (ø)
ga4gh/server/network/init.py 70.27% <70.27%> (ø)
ga4gh/server/backend.py 91.92% <75.86%> (-0.98%)
ga4gh/server/paging.py 90.31% <85.71%> (-0.27%)
ga4gh/server/datarepo.py 89.78% <86.71%> (-1.34%)
ga4gh/server/cli/repomanager.py 88.47% <89.74%> (+0.07%)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e94bf86...7cb30fa. Read the comment docs.

Remove peer adding from compliance as its part of initial peers list
Flake fixes
Fix tests
Handle empty page token
@david4096 david4096 closed this Feb 28, 2017
@david4096 david4096 reopened this Feb 28, 2017
@david4096
Copy link
Member Author

S3 is down and so Travis is not working...

Add build to test data
Fix peers list paging
Turn down client verbosity for initial announce
@david4096
Copy link
Member Author

david4096 commented Mar 1, 2017

This code is ready for review @dcolligan @ejacox . I can tell you the ugliest part is probably how announcements are handled. As we continue to develop the use cases on how we will work with announcements I expect to clean this up.

For now, you can add a peer ga4gh_repo add-peer registry.db http://mypeer.org and then call ga4gh_client list-peers http://localhost:8000 to see your peer has been added. You can then remove that peer using ga4gh_repo remove-peer registry.db http://mypeer.org.

To test the announcement functionality, ga4gh_client announce http://localhost:8000 http://mypeer.org. Then open the registry using sqlite3 ga4gh-example-data/registry.db and select * from announcement;. You should see a timestamped message that includes the user agent that made the announcement.

Finally, you can observe the protocol version the service provides using ga4gh_client get-info http://localhost:8000.

For a quick demo of using a client to discover an unknown peer: https://gist.github.com/david4096/8b5cdbfb3bfb00538ad0d4d4b302020e

@kozbo
Copy link
Contributor

kozbo commented Mar 1, 2017

Exciting!

Copy link
Collaborator

@ejacox ejacox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a message from the server; "Couldn't announce to initial peer init() got an unexpected keyword argument 'verbosity'".

Are the peer functions in the abstract data repo for testing only? Can you mark them as such if so.

Why is the initialization done in the frontend? Should this initialization code be put in the Peer class?

Otherwise, it works for me and the functionality that is there looks good.

@david4096
Copy link
Member Author

david4096 commented Mar 1, 2017

Thanks @ejacox good questions.

First, that first one is a bug, the announce on init isn't currently working since I was trying to suppress its output incorrectly.

Second, those methods in AbstractDataRepository are only used during testing, yes, I'll make that more clear. I hope we can remove this class entirely as it's in-memory model isn't useful.

Third, it does look a bit ugly to have that code in frontend and I don't like the idea of using frontend as a catch-all for operations. In a couple of cases I needed to have the app around to get at the config, and I like isolating code that touches app to the configuration functions. I don't think we want the datamodel code to act directly on configuration values.

Considering that, I would like to move it to another module and isolate how app is used from frontend to reduce the LOC. Sound ok? It's really only a few dozen lines, but frontend already has too much in it.

@ejacox
Copy link
Collaborator

ejacox commented Mar 1, 2017

I would say move what you can and otherwise mark it as being in the wrong place. We can refactor it later.

@david4096
Copy link
Member Author

david4096 commented Mar 1, 2017

Updated

Warning: Because of flask's use of the SimpleHTTPServer when running in debug mode, if you set the service itself as an initial peer it will result in a broken pipe.

There isn't a great way to set the host the announcement is coming from, as this requires one to have found the public internet address of the service. To do that you need to make a request out, look at the response and then set the host. We could do this by including in the GetInfo response a field that echoed the remote address/hostname making the request.

Then, when initiating a peer connection one first makes a GetInfo request to learn the public IP originating the request. This allows the potential peer to make a sensible announcement using other network nodes, and to ensure that the hostname/IP will at least resolve in the current peer to peer network context. The hostname/IP might not be portable to other peers, though.

If two peers run behind a NAT and peer using their local IPs (192.168.1.1 and 2), and one peer exposes its service to the wider internet at a public IP (72.X.X.X), then a peer on the public internet might get a peer that doesn't resolve for them when asking for its peer list (192.168.1.1). This is both a feature and a problem. It makes it easy to create private networks and for systems designers to create network architectures that suit their use case. It also makes it easy to create network architectures that quickly break assumptions of the overall network topology.

A further complication is that though the request will originate from some host, the service itself might be running at a different port or path. When a potential peer is formulating its announce request, it will have to be able to sensibly identify the path and port it is running from. When running under Apache or nginx, at least, we don't have good way of determining the path a WSGI application might be running from.

I don't think it's a good idea to require that the API runs at the top-level of a path. We want folks to be able to flexibly add the software at any path in their existing infrastructure. I think to support dynamic peer resolution under Apache/nginx, we'll have to be able to receive configuration details about the publicly visible port and path.

@david4096
Copy link
Member Author

david4096 commented Mar 2, 2017

Although 1kgenomes is on a released version and so doesn't have the announce endpoint, you can see the requests being logged when the server is started from a server running this PR.

X.X.X.X - - [01/Mar/2017:16:35:38 -0800] "POST /announce HTTP/1.1" 404 271 "-" "python-requests/2.7.0 CPython/2.7.12+ Linux/4.8.0-39-generic"

Compliance PR here ga4gh/compliance#247

@david4096 david4096 changed the title Peer service (WIP) Peer service Mar 2, 2017
Copy link
Member

@dcolligan dcolligan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, I have a few minor things that I want to see changed, but overall I agree with the design.

import ga4gh.schemas.protocol as protocol


def isUrl(urlString):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there isn't some way to do this with urllib or urlparse or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and if there isn't, this is general enough to be pushed down into a lower level... at least put a TODO, we don't need to put it in, say, common presently...)

Copy link
Member Author

@david4096 david4096 Mar 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, instead of a regex you can just catch an exception from urllib, I'll do that.

http://grokbase.com/t/python/python-list/018fex83ty/check-url-simply

On closer inspection I found that the urlparse method was too lenient for valid addresses and that in practice some regex is needed. I added it and kept the regex for now. We could definitely move it out, added an issue for it here #1594

return regex.match(urlString)


class Peer:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should always use new-style classes, which have object as a superclass

import ga4gh.server.exceptions as exceptions

import repo.models as m
import ga4gh.server.repo.models as m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use single-letter variable names; models here is fine instead

def removePeer(self, url):
"""
Remove peers by URL.
:return:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete :return:

@@ -0,0 +1,76 @@
"""
Provides methods to initialize the server's peer to peer connections.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete blank line

import os
import requests

from ga4gh.server import datamodel, exceptions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the canonical import style: datamodel and exceptions should be on their own line, and in the idiom import x.y as z

Copy link
Member Author

@david4096 david4096 Mar 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had you in mind when I wrote that line @dcolligan :). I do like that we have uniform import statements, I just find this syntax nicer to read. We'd have to do it all at once of course. I'll change it back but continue to try to convince you that we could clean up the code base by switching to this style.

582 characters on 10 lines

import ga4gh.server.datamodel.datasets as datasets
import ga4gh.server.datamodel.ontologies as ontologies
import ga4gh.server.datamodel.reads as reads
import ga4gh.server.datamodel.references as references
import ga4gh.server.datamodel.variants as variants
import ga4gh.server.datamodel.sequence_annotations as sequence_annotations
import ga4gh.server.datamodel.continuous as continuous
import ga4gh.server.datamodel.bio_metadata as biodata
import ga4gh.server.datamodel.genotype_phenotype as genotype_phenotype
import ga4gh.server.datamodel.rna_quantification as rna_quantification

Might become 169 characters on 3 lines

from ga4gh.server.datamodel import ontologies, reads, \
    references, variants, sequence_annotations, continuous, \
    biodata, genotype_phenotype, rna_quantification

I think it might actually be more PEPpy?

http://stackoverflow.com/questions/15011367/python-module-import-single-line-vs-multi-line

This style also is highly readable but doesn't earn you any LOC, 10 lines 199 characters.

from ga4gh.server.datamodel import (
    ontologies, 
    reads, 
    references,
    variants,
    sequence_annotations,
    continuous, 
    biodata,
    genotype_phenotype,
    rna_quantification)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another consideration is searching through the code... it is much easier to grep for 'import ga4gh.server.datamodel.variants' to find all the places it is imported than... I'm not even sure what regex one would write for the alternatives.

"file named 'initial_peers.txt' "
"to {}".format(os.getcwd()))
ret = []
# Remove lines that start with a hash or are empty.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps another future candidate for a common method... we do this a lot with our config files

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""

try:
textFile = open(filePath, 'r')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

easier to do:

with open(filePath) as textFile:
    lines = textFile.readlines()

@@ -88,7 +88,7 @@ def testDirectory(self):
with self.assertRaises(exceptions.RepoInvalidDatabaseException):
repo.open(datarepo.MODE_READ)

def testNonexistantFile(self):
def testNonexistentFile(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always friggin spell this wrong. Thanks.

@david4096
Copy link
Member Author

david4096 commented Mar 2, 2017

Thanks @dcolligan for your review! All good suggestions, I'll get to work on them. I am also going to add functionality for listing and clearing announcements using the CLI.

Add positive check in test of https
Use readline and with in place of try, except
Change import statement
@ejacox ejacox merged commit 83a9417 into ga4gh:master Mar 6, 2017
@david4096 david4096 deleted the 1507_peers branch March 8, 2017 22:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants