Database implementation #92
base: main
Conversation
* Create tfidf method for search engine
* Pylint fixes
* Add testing
* Fix linting
* Add numpy to requirements.txt
* Adapt to articles/keyword object return
* Change naming to agreed db convention
* Disable numpy import error

Co-authored-by: Domene99 <jadomene99@gmail.com>
Overall pretty impressive. I can't pinpoint where the code is becoming so slow; going from 1 to 10 minutes is more than just communication delay.
```python
if doc is not None:
    return doc.to_dict()
else:
    return None
```
Maybe we should return some info for debugging purposes
Something like your last PR where you added more error handling?
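One possible shape for that error handling, as a minimal sketch: instead of returning a bare `None`, return a small payload describing the failed lookup. The helper name and payload fields below are illustrative, and the Firestore snapshot is assumed to already be in hand from `doc_ref.get()`.

```python
def doc_to_dict_or_debug_info(doc, doc_id):
    """Return the document's data, or a small debug payload when missing.

    Hypothetical helper: `doc` is a Firestore document snapshot (or None)
    and `doc_id` is the ID that was looked up; both names are illustrative.
    """
    if doc is not None:
        return doc.to_dict()
    # Debug payload instead of a silent None; fields are illustrative.
    return {"error": "document not found", "requested_id": doc_id}
```

Callers can then log or surface the `error` field rather than having to special-case `None`.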
I don't know either, but it is probably not on my end since it is the same code as the in-memory implementation.
Would like to note that the PR is still in draft form, so there may be a lot of prints and comments in there; those are being addressed.
Some preliminary feedback to guide your implementation, hope it helps!
```python
from connector import get_article_by_id_db
from connector import get_articles_that_match_keywords_db
```
As @anniefu mentioned, you'll probably want to just have a db_connector.py that defines methods with the same names that you already use in the code, so that you can just change the import statement and not have to change any code.
```python
def save_keywords_in_db(keywords, article):
    """Saves the keywords from an article in memory

    Args:
        keywords (JSON): contains keywords
        article (Article): article object
    """
    for keyword in keywords:
        frequency = article["content"].count(keyword)

        doc_ref = db.collection(u'keywords').where('keyword', '==', keyword)
        doc = doc_ref.get()

        if len(doc) != 0 and doc[0] is not None:
            from_db = doc[0].to_dict()
            print(from_db)
            from_db["matching_articles"][article["id"]] = frequency
            db.collection(u'keywords').document(doc[0].id).set(from_db)
        else:
            to_send = {"keyword": keyword, "matching_articles": {article["id"]: frequency}}
            db.collection(u'keywords').add(to_send)
```
It looks like you're trying to mirror the same structure that you had for the in-memory implementation, which is not necessarily the best implementation. For example, keywords could be a field of the article document that gets queried directly, which eliminates the need to store keywords as a separate collection. See this guide on how to perform different queries against Firestore, particularly the example for array membership.
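To illustrate that data model, here is a sketch with plain dicts standing in for article documents. With `keywords` stored as an array field, Firestore's array-membership operator could query it directly, roughly `db.collection(u'articles').where('keywords', 'array_contains', keyword)`; the filter below mimics what such a query returns.

```python
def articles_matching_keyword(articles, keyword):
    """Filter articles whose `keywords` array field contains `keyword`.

    Mirrors what an `array_contains` query would return; `articles` is a
    list of dicts standing in for Firestore documents.
    """
    return [a for a in articles if keyword in a.get("keywords", [])]
```

This removes the read-modify-write cycle on a separate `keywords` collection entirely.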
```diff
@@ -38,7 +38,7 @@ class Article:
     def __init__(self, number, content):
         self.number = number
         self.content = content
-        self.id = str(number)
+        self.id = 'monterrey'+str(number)
```
Could this ID be generated by Firebase instead?
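For context, Firestore can mint document IDs itself: calling `document()` with no argument reserves an auto-generated ID (available as `doc_ref.id`) before anything is written. A sketch, with a client-side UUID standing in for that auto-generated ID (the helper name is illustrative):

```python
import uuid

def make_article_id():
    # Stand-in for Firestore's auto-generated IDs, e.g.:
    #   doc_ref = db.collection(u'articles').document()  # auto-generated ID
    #   article_id = doc_ref.id
    # A UUID gives similar uniqueness without building 'monterrey' + number.
    return uuid.uuid4().hex
```

Either way, the Article class would no longer need to derive IDs from the article number.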
Started implementing the database without unit testing. It probably won't get merged, but here is the code.