Add tutorial on how to work with auxiliary data #264
# Working with auxiliary data

This tutorial gives an overview of how to work with auxiliary information in recommender systems using Cornac. In our context, auxiliary data stands for information beyond user-item interactions or preferences, which often holds a clue on how users consume items. Examples of such information or modalities are: item textual descriptions, user/item reviews, product images, social networks, etc.

## Modality classes and utilities

In addition to implementing readers and utilities for different types of data, the `cornac.data` module provides modality classes, namely `GraphModality`, `ImageModality` and `TextModality`. The purpose of the latter classes is to make it convenient to work with the corresponding modalities by:
- Offering a number of useful routines for data formatting, representation, manipulation, and transformation.
- Freeing users from the tedious process of aligning auxiliary data with the set of training users, items or ratings.
- Enabling cross-utilization of models designed for one modality to work with a different modality. This topic is covered by the tutorials under the [Cross-Modality](./README.md#Cross-Modality) section.

In the following, we will discover the text modality by going through a concrete example involving textual auxiliary information. The same principles apply to the other modalities when dealing with graph data (e.g., social networks) or visual data (e.g., product images).

### Dataset
We use the well-known MovieLens 100K dataset. It consists of user-movie interactions in the triplet format `(user_id, movie_id, rating)`, as well as movie plots in the format `(movie_id, text)`, which represent our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows:

```python
from cornac.data import Reader
from cornac.datasets import movielens

plots, movie_ids = movielens.load_plot()
rating_data = movielens.load_100k(reader=Reader(item_set=movie_ids, bin_threshold=3))
```
where we have filtered out movies without plots and binarized the integer ratings using the `Reader`.
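
For intuition, here is a toy sketch in plain Python (not Cornac code) of what binarization with `bin_threshold=3` amounts to, under the assumption that ratings at or above the threshold are kept with value 1.0 and the rest are dropped:

```python
# Toy sketch (not Cornac code): binarizing (user, item, rating) triplets.
# Assumption: ratings >= bin_threshold are kept with value 1.0, others dropped.
bin_threshold = 3
raw = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5)]
binarized = [(u, i, 1.0) for (u, i, r) in raw if r >= bin_threshold]
print(binarized)  # [('u1', 'm1', 1.0), ('u2', 'm1', 1.0)]
```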

### The TextModality class
With our dataset in place, the next step is to instantiate a `TextModality`, which will allow us to manipulate and represent our auxiliary data in the desired format.

```python
from cornac.data import TextModality
from cornac.data.text import BaseTokenizer

item_text_modality = TextModality(corpus=plots, ids=movie_ids,
                                  tokenizer=BaseTokenizer(sep='\t', stop_words='english'),
                                  max_vocab=5000, max_doc_freq=0.5)
```
In addition to the movie plots and ids, we have specified a tokenizer to split text, limited the maximum vocabulary size to 5000, and filtered out words occurring in more than 50% of documents (plots in our case) by setting `max_doc_freq=0.5`. For more options/details on the `TextModality`, please refer to the [docs](https://cornac.readthedocs.io/en/latest/data.html#module-cornac.data.text).
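
To build intuition for what the tokenizer, stop words, and `max_doc_freq` do, here is a toy sketch in plain Python (illustrative only, not the `TextModality` implementation):

```python
from collections import Counter

# Toy sketch: tokenize two documents, drop stop words, then drop words that
# appear in more than max_doc_freq (a fraction) of the documents.
docs = ["a dark comedy about a heist",
        "a documentary about the heist of the century"]
stop_words = {"a", "the", "of"}
max_doc_freq = 0.5

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
doc_freq = Counter(w for toks in tokenized for w in set(toks))
vocab = sorted(w for w, df in doc_freq.items() if df / len(docs) <= max_doc_freq)
print(vocab)  # ['century', 'comedy', 'dark', 'documentary']
```

Note how "about" and "heist" are excluded: they occur in 100% of the documents, above the 50% cutoff.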
**Bag-of-words text representation.** CDL assumes the bag-of-words representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such a representation from our raw texts. The `TextModality` class implements the necessary routines to process and output different representations for text data, e.g., sequences, bag of words, tf-idf. That is, to get our auxiliary data under the desired format, all we need inside the CDL [implementation](../cornac/models/cdl/recom_cdl.py) is the following line of code:

```Python
text_feature = self.train_set.item_text.batch_bow(np.arange(n_items))
```
where `self.train_set.item_text` is set to our instantiated text modality `item_text_modality`, `n_items` is the number of training items, and the `batch_bow()` function returns the bag-of-words vectors of the specified item indices; in our case we want the text features for all training items: `np.arange(n_items)`. The rows of `text_feature` correspond to the bag-of-words vectors of the item indices provided to `batch_bow()`.
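
For intuition, a bag-of-words matrix can be sketched as follows (a toy example with a 3-word vocabulary, not Cornac's actual implementation):

```python
import numpy as np

# Toy sketch: each row is an item, each column counts a vocabulary word.
vocab = ["dark", "comedy", "heist"]
plots_tokens = [["dark", "comedy"], ["heist", "heist", "comedy"]]

bow = np.zeros((len(plots_tokens), len(vocab)), dtype=int)
for row, tokens in enumerate(plots_tokens):
    for word in tokens:
        bow[row, vocab.index(word)] += 1
print(bow)
# [[1 1 0]
#  [0 1 2]]
```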

**Aligning auxiliary information with the set of training items.** Another important aspect worth mentioning here is that we don't have to take extra actions to align the set of training movies with their plots; they are already aligned. When passed into an evaluation method, users and items go through an indexing process in which each of them is assigned a unique index, starting from 0. That is, the first row in `text_feature` corresponds to the training item with index 0, the second row to the item with index 1, and so on. This is made possible by passing `item_text_modality` to the evaluation (splitting) method, as we shall see shortly.
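
The following toy sketch illustrates the idea (the item ids and the index assignment are hypothetical, not Cornac's actual indexing code): after indexing, modality rows are re-ordered so that row k holds the auxiliary data of the item with index k.

```python
# Toy sketch: raw item ids mapped to 0-based training indices (hypothetical),
# and plots re-ordered to follow those indices.
plots = {"m42": "a heist", "m7": "a comedy", "m13": "a drama"}
iid_map = {"m7": 0, "m13": 1, "m42": 2}  # assigned during indexing

aligned = [None] * len(iid_map)
for movie_id, index in iid_map.items():
    aligned[index] = plots[movie_id]
print(aligned)  # ['a comedy', 'a drama', 'a heist']
```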

```python
from cornac.eval_methods import RatioSplit

# Split the data, passing the item text modality to the evaluation method.
ratio_split = RatioSplit(data=rating_data, test_size=0.2, exclude_unknowns=True,
                         item_text=item_text_modality, verbose=True,
                         seed=123, rating_threshold=0.5)
```
The item text modality is passed to the evaluation method. As mentioned earlier, this makes it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Next, we instantiate the CDL model and evaluation metrics and run an experiment, which produces results along the following lines:

```
--- + ---------- + --------- + --------
CDL |     0.5494 |   42.1279 |   0.3018
```
Note that one may achieve better results with careful parameter tuning; the purpose here is simply to illustrate how to handle auxiliary data using Cornac.

## Other Modality classes

The usage of the `GraphModality` and `ImageModality`, to deal with graph and visual auxiliary data, follows the same principles as above. The following two examples involve the `GraphModality` class to handle item graph (network) auxiliary data: [c2pf_example](../examples/c2pf_example.py), [mcf_example](../examples/mcf_office.py). For a usage example of the `ImageModality`, one may refer to [vbpr_tradesy](../examples/vbpr_tradesy.py). The `cornac.data` module's [documentation](https://cornac.readthedocs.io/en/latest/data.html) is also a good resource to learn more about the modality classes.
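
As a rough illustration of what graph auxiliary data looks like before it is wrapped in a modality, item networks are commonly expressed as `(source_item, target_item, weight)` triplets, analogous to the `(movie_id, text)` pairs used above (the ids below are made up):

```python
# Toy sketch: an item graph as weighted triplets, grouped into adjacency lists.
item_graph = [("m7", "m13", 1.0), ("m13", "m42", 1.0), ("m7", "m42", 1.0)]

adjacency = {}
for src, dst, weight in item_graph:
    adjacency.setdefault(src, []).append((dst, weight))
print(adjacency)  # {'m7': [('m13', 1.0), ('m42', 1.0)], 'm13': [('m42', 1.0)]}
```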