
Add tutorial on how to work with auxiliary data #264

Merged

saghiles merged 7 commits into PreferredAI:master from multimodality on Dec 20, 2019

Conversation

saghiles (Member)

Description

Related Issues

Checklist:

  • I have added tests.
  • I have updated the documentation accordingly.
  • I have updated README.md (if you are adding a new model).
  • I have updated examples/README.md (if you are adding a new example).

@saghiles saghiles assigned tqtg and unassigned tqtg Nov 25, 2019
@saghiles saghiles requested a review from tqtg November 25, 2019 10:12
@@ -0,0 +1,89 @@
# Working with auxiliary data

This tutorial gives an overview of how to deal with auxiliary information in recommender systems using Cornac. In our context, auxiliary data stands for information beyond user-item interactions or preferences, which often hold a clue on how users consume items. Examples of such information or modalities are: item textual descriptions, user/item reviews, product images, social network, etc.
Member:

This tutorial gives an overview of how to work with auxiliary information in recommender systems using Cornac. In our context, auxiliary data stands for information beyond user-item interactions or preferences, which often holds a clue on how users consume items. Examples of such information or modalities are item textual descriptions, user/item reviews, product images, social networks, etc.


## Modality classes and utilities

In addition to implementing readers and utilities for different types of data, the `cornac.data` module provides modality classes, namely GraphModality, ImageModality and TextModality. The purpose of the latter classes is to make it convinient to work with the corresponding modalities by:
Member:

In addition to implementing readers and utilities for different types of data, cornac.data module provides modality classes, namely GraphModality, ImageModality and TextModality. The purpose of the latter classes is to make it convenient to work with the corresponding modalities by:



- Offering a number of useful routines for data formatting, representation, manipulation and transformation.
Member:

  • Offering a number of useful routines for data formatting, representation, manipulation, and transformation.
  • Freeing users from the tedious process of aligning auxiliary data with the set of training users, items or ratings.
  • Enabling cross-utilization of models designed for one modality to work with a different modality. This topic is covered by the tutorials under the Cross-Modality section.

- Freeing users from the tedious process of aligning auxiliary data with the set of training users, items or ratings.
- Enabling cross-utilization of models designed for one modality to work with a different modality. This topic is covered by the tutorials under the [Cross-Modality](./README.md#Cross-Modality) section.

In the following we will discover the text modality by going through a concrete example involving text auxiliary information. The same principles would apply for the other modalities, when dealing with graph (e.g, social network) or visual (e.g., product images) auxiliary data.
Member:

In the following, we will discover the text modality by going through a concrete example involving text auxiliary information. The same principles would apply for the other modalities when dealing with graph data (e.g., social network) or visual data (e.g., product images).



### Dataset
We use the well know MovieLens ML100K dataset. It consists of user-movie interactions in the triplet format `(user_id, movie_id, rating)`, as well as movie plots in the format: `(movie_id, text)`, which represent our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows,
Member:

We use the well know MovieLens 100K dataset. It consists of user-movie interactions in the triplet format (user_id, movie_id, rating), as well as movie plots in the format (movie_id, text), which represents our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows

```Python
from cornac.data import Reader
from cornac.datasets import movielens

plots, movie_ids = movielens.load_plot()
rating_data = movielens.load_100k(reader=Reader(item_set=movie_ids, bin_threshold=3))
```
where we have filtered out movies without plots and binarized the integer ratings thanks to the `Reader`.
Member:

, where we have filtered out movies without plots and binarized the integer ratings using cornac.data.Reader.

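For orientation, a hedged peek at what the loader returns (the printed values are illustrative): `load_plot` yields the plots and their movie ids aligned by position, and `load_100k` yields rating triplets.

```Python
print(len(plots), len(movie_ids))  # one plot per movie id, aligned by position
print(rating_data[0])              # a (user_id, movie_id, rating) triplet, binarized by Reader
```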

### The Text Modality class
Member:

The TextModality class



With our dataset in place, the next step is to instantiate a TextModality, which will allow us to manipulate and represent our auxiliary data in the desired format.

```Python
from cornac.data import TextModality
from cornac.data.text import BaseTokenizer

item_text_modality = TextModality(corpus=plots, ids=movie_ids,
                                  tokenizer=BaseTokenizer(sep='\t', stop_words='english'),
                                  max_vocab=5000, max_doc_freq=0.5)
```
In addition the movies plots and ids, we have specified a tokenizer to split text, we limited the maximum vocabulary size to 5000, as well as filtered out words occurring in more than 50% of documents (plots in our case) by setting `max_doc_freq = 0.5`. For more options/details please refer to the [docs](https://cornac.readthedocs.io/en/latest/data.html#module-cornac.data.text).
Member:

In addition to the movie plots and ids, we have specified a cornac.data.text.Tokenizer to split text, we limited the maximum vocabulary size to 5000, as well as filtered out words appearing in more than 50% of documents (plots in our case) by setting max_doc_freq=0.5. For more options/details of the TextModality, please refer to the docs.



**Bag-of-Word text representation.** CDL assumes the bag-of-word representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such a representation from our raw texts. The TextModality class implements the necessary routines to process and output different representations for text data, e.g., sequence, bag of word. That is, to get our auxiliary data under the desired format, all we need inside the CLD code `cornac/models/cdl/recom_cld` is the following line of code:
Member:

Bag-of-Word text representation. CDL assumes the bag-of-word representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such representation from our raw texts. The TextModality class implements necessary routines to process and output different representations for textual data, e.g., text sequences, bag of words, tf-idf. That is, to get our auxiliary data under the desired format, all we need inside the CDL model implementation is the following line of code:

```Python
text_feature = self.train_set.item_text.batch_bow(np.arange(n_items))
```
where `self.train_set.item_text` is set to our instantiated text modality `item_text_modality`, `n_items` is number of training items, and the `batch_bow()` function returns the bag-of-word vectors of the specified item ids, in our case we want the text features for all training items: `np.arange(n_items)`.
Member @tqtg (Dec 2, 2019):

, where self.train_set.item_text is our item_text_modality, n_items is the number of training items, and the batch_bow() function returns bag-of-words vectors of the specified item ids. In our case, we want text representations for all the training items thus all the training item ids are passed to the batch_bow() function. The returned text_feature matrix contains rows, which are the corresponding bag-of-words feature vectors of the provided item ids to batch_bow().

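Beyond bag-of-words, the modality exposes other representations in the same batched style; a hedged sketch (`batch_seq` and its `max_length` parameter are assumed from the `cornac.data.text` docs, not shown in this tutorial):

```Python
# token-id sequences instead of bag-of-words vectors, truncated/padded to max_length
text_seq = self.train_set.item_text.batch_seq(np.arange(n_items), max_length=300)
```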

**Aligning auxiliary information with the set of training items.** Another important aspect worth mentioning at this level is that, we don't have to take extra actions to align the set of training movies with their plots. They are already aligned. That is, the first row in `text_feature` corresponds to training `movie_id: 1`, the second row to training `movie_id: 2`, and so on. This is made possible thanks to passing `item_text_modality` through the evaluation (splitting) method as we shall see shortly.
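A hedged illustration of this alignment (`iid_map`, the raw-id to 0-based-index mapping, is assumed from Cornac's train set API):

```Python
# inside a model, self.train_set holds the indexed training data;
# the index assigned to a movie is also its row in text_feature
for raw_id, idx in list(self.train_set.iid_map.items())[:3]:
    print(raw_id, '-> row', idx, 'of text_feature')
```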
Member:

This is expected behavior, and I've already added a sentence to explain it earlier. I think this will confuse the readers. Can we remove it?

Member Author @saghiles (Dec 4, 2019):

Thanks for this comment. I still want to highlight this aspect. I will change it into a note, make it much shorter (1 or 2 sentences), and refer to the previous text.

Member:

In that case, there are two things that we need to make clear and consistent. First, when passing into an evaluation method, users and items will go through the indexing process where each of them will be assigned a unique index, not id (to avoid confusion when referring to their original ids). Second, the user/item indices start counting from 0, not 1.

Member Author:

That's fine. Can you please check the new version and let me know if it's OK? There is no reference to id anymore. I don't want to bring in too many details either, as this can make things harder to follow.

```Python
from cornac.eval_methods import RatioSplit

ratio_split = RatioSplit(data=rating_data,
                         item_text=item_text_modality, verbose=True,
                         seed=123, rating_threshold=0.5)
```
The text modality passed for the evaluation method. As mentioned earlier, this make it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Next, we need to instantiate the CLD model and evaluation metrics,
Member:

The item text modality is passed into the evaluation method. As mentioned earlier, this makes it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Moving forward, we need to instantiate the CDL model and evaluation metrics:
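A hedged sketch of that step (the class and metric names exist in `cornac.models` and `cornac.metrics`; the hyperparameter values are illustrative, not the tutorial's):

```Python
from cornac import Experiment
from cornac.metrics import AUC, Recall
from cornac.models import CDL

cdl = CDL(k=50, max_iter=30, seed=123)  # illustrative settings
Experiment(eval_method=ratio_split, models=[cdl],
           metrics=[AUC(), Recall(k=300)]).run()
```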

```
--- + ---------- + --------- + --------
CDL | 0.5494 | 42.1279 | 0.3018
```
Note that one may achieve higher results by careful parameter tuning. The purpose here is to illustrate how to handle auxiliary data using Cornac.
Member:

Do we need to mention it?

Member Author:

We don't, I'll clear it!


## Other Modality classes

The usage of the GraphModality and ImageModality, to deal with graph and visual auxiliary data, follows the same principles as above. The following two examples involve the GraphModality class to handle item graph (network) auxiliary data: [c2pf_example](../examples/c2pf_example.py), [mcf_example](../examples/mcf_office.py). For a usage example of the ImageModality one may refer to [vbpr_tradesy](../examples/vbpr_tradesy.py). The `cornac.data` module's [documentation](https://cornac.readthedocs.io/en/latest/data.html) is also a good resource to know more about the modality classes.
Member:

The usage of the GraphModality and ImageModality to deal with graph and visual auxiliary data follows the same principles as above. The following two examples, c2pf_example, mcf_example, involve GraphModality to handle item network. For the ImageModality, one may refer to vbpr_tradesy example. The cornac.data module's documentation is also a good resource to know more about the modality classes.

Member Author:

I rephrased this part according to your comments.
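A hedged sketch of the analogous graph case (constructor usage as in the cited c2pf/mcf examples; the triplets here are made up):

```Python
from cornac.data import GraphModality

# an item network given as (item_a, item_b, weight) triplets -- values illustrative
item_graph_modality = GraphModality(data=[('i1', 'i2', 1.0),
                                          ('i2', 'i3', 1.0)])
```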

```Python
text_feature = self.train_set.item_text.batch_bow(np.arange(n_items))
```
where `self.train_set.item_text` is our `item_text_modality`, `n_items` is number of training items, and the `batch_bow()` function returns the bag-of-words vectors of the specified item ids, in our case we want the text features for all training items. In more details, the rows of `text_feature` correspond to the bag-of-words vectors of the provided item ids to `batch_bow()`.
Member:

where self.train_set.item_text is our item_text_modality, n_items is the number of training items, and the batch_bow() function returns the bag-of-words vectors of the specified item indices, in our case we want the text features for all training items. In more details, the rows of text_feature correspond to the bag-of-words vectors of the provided item indices to batch_bow().

In addition to the movie plots and ids, we have specified a `cornac.data.text.Tokenizer` to split text, limited the maximum vocabulary size to 5000, and filtered out words occurring in more than 50% of documents (plots in our case) by setting `max_doc_freq = 0.5`. For more options/details on the `TextModality` please refer to the [docs](https://cornac.readthedocs.io/en/latest/data.html#module-cornac.data.text).


**Bag-of-Words text representation.** CDL assumes the bag-of-words representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such representation from our raw texts. The `TextModality` class implements the necessary routines to process and output different representations for text data, e.g., sequence, bag of words, tf-idf. That is, to get our auxiliary data under the desired format, all we need inside the `CLD` [implementation](../cornac/models/cdl/recom_cld.py) is the following line of code:
Member:

Bag-of-Words text representation. CDL assumes the bag-of-words representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such representation from our raw texts. The TextModality class implements the necessary routines to process and output different representations for text data, e.g., sequence, bag of words, tf-idf. That is, to get our auxiliary data under the desired format, all we need inside the CDL implementation is the following line of code:

Member:

The link to implementation needs to be fixed as well (cld -> cdl)

```Python
ratio_split = RatioSplit(data=rating_data,
                         item_text=item_text_modality, verbose=True,
                         seed=123, rating_threshold=0.5)
```
The item text modality is passed for the evaluation method. As mentioned earlier, this make it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Moving forward, we need to instantiate the CLD model and evaluation metrics:
Member:

The item text modality is passed for the evaluation method. As mentioned earlier, this makes it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Moving forward, we need to instantiate the CDL model and evaluation metrics:



### Dataset
We use the well know MovieLens 100K dataset. It consists of user-movie interactions in the triplet format `(user_id, movie_id, rating)`, as well as movie plots in the format `(movie_id, text)`, which represent our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows,
Member:

We use the well-known MovieLens 100K dataset. It consists of user-movie interactions in the triplet format (user_id, movie_id, rating), as well as movie plots in the format (movie_id, text), which represent our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows,

```Python
ratio_split = RatioSplit(data=rating_data,
                         item_text=item_text_modality, verbose=True,
                         seed=123, rating_threshold=0.5)
```
The item text modality is passed for the evaluation method. As mentioned earlier, this makes it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Moving forward, we need to instantiate the CDL model and evaluation metrics:
Member:

Should it be passed to instead of passed for?

Member Author:

It is passed to, thanks!

@saghiles saghiles changed the title from "Add tutorial on how to work with auxiliary data using Cornac's modality classes" to "Add tutorial on how to work with auxiliary data" on Dec 20, 2019
@saghiles saghiles merged commit 0f2a9e7 into PreferredAI:master Dec 20, 2019
@saghiles saghiles deleted the multimodality branch December 28, 2020 04:20
tqtg pushed a commit to amirj/cornac that referenced this pull request May 22, 2021