Add tutorial on how to work with auxiliary data #264
# Working with auxiliary data

This tutorial gives an overview of how to work with auxiliary information in recommender systems using Cornac. In our context, auxiliary data stands for information beyond user-item interactions or preferences, which often holds a clue on how users consume items. Examples of such information or modalities are: item textual descriptions, user/item reviews, product images, social networks, etc.

## Modality classes and utilities

In addition to implementing readers and utilities for different types of data, the `cornac.data` module provides modality classes, namely `GraphModality`, `ImageModality` and `TextModality`. The purpose of the latter classes is to make it convenient to work with the corresponding modalities by:
- Offering a number of useful routines for data formatting, representation, manipulation, and transformation.
- Freeing users from the tedious process of aligning auxiliary data with the set of training users, items or ratings.
- Enabling cross-utilization of models designed for one modality to work with a different modality. This topic is covered by the tutorials under the [Cross-Modality](./README.md#Cross-Modality) section.

In the following, we will discover the text modality by going through a concrete example involving textual auxiliary information. The same principles apply to the other modalities when dealing with graph data (e.g., social networks) or visual data (e.g., product images).

### Dataset
We use the well-known MovieLens 100K dataset. It consists of user-movie interactions in the triplet format `(user_id, movie_id, rating)`, as well as movie plots in the format `(movie_id, text)`, which represent our textual auxiliary information. This dataset is already accessible from Cornac, and we can load it as follows:

```python
from cornac.data import Reader
from cornac.datasets import movielens

plots, movie_ids = movielens.load_plot()
rating_data = movielens.load_100k(reader=Reader(item_set=movie_ids, bin_threshold=3))
```
where we have filtered out movies without plots and binarized the integer ratings using the `Reader`.
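
For intuition, here is a toy sketch in plain Python (not Cornac code) of what binarization with `bin_threshold=3` amounts to, under the assumption that ratings at or above the threshold are kept with value 1.0 and the rest are dropped:

```python
# Toy sketch (not Cornac code): binarizing (user, item, rating) triplets.
# Assumption: ratings >= bin_threshold are kept with value 1.0, others dropped.
bin_threshold = 3
raw = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5)]
binarized = [(u, i, 1.0) for (u, i, r) in raw if r >= bin_threshold]
print(binarized)  # [('u1', 'm1', 1.0), ('u2', 'm1', 1.0)]
```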

### The TextModality class
With our dataset in place, the next step is to instantiate a `TextModality`, which will allow us to manipulate and represent our auxiliary data in the desired format.

```python
from cornac.data import TextModality
from cornac.data.text import BaseTokenizer

item_text_modality = TextModality(corpus=plots, ids=movie_ids,
                                  tokenizer=BaseTokenizer(sep='\t', stop_words='english'),
                                  max_vocab=5000, max_doc_freq=0.5)
```
In addition to the movie plots and ids, we have specified a tokenizer to split text, limited the maximum vocabulary size to 5000, and filtered out words occurring in more than 50% of documents (plots in our case) by setting `max_doc_freq=0.5`. For more options/details on the `TextModality`, please refer to the [docs](https://cornac.readthedocs.io/en/latest/data.html#module-cornac.data.text).
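
To build intuition for what the tokenizer, stop words, and `max_doc_freq` do, here is a toy sketch in plain Python (illustrative only, not the `TextModality` implementation):

```python
from collections import Counter

# Toy sketch: tokenize two documents, drop stop words, then drop words that
# appear in more than max_doc_freq (a fraction) of the documents.
docs = ["a dark comedy about a heist",
        "a documentary about the heist of the century"]
stop_words = {"a", "the", "of"}
max_doc_freq = 0.5

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
doc_freq = Counter(w for toks in tokenized for w in set(toks))
vocab = sorted(w for w, df in doc_freq.items() if df / len(docs) <= max_doc_freq)
print(vocab)  # ['century', 'comedy', 'dark', 'documentary']
```

Note how "about" and "heist" are excluded: they occur in 100% of the documents, above the 50% cutoff.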
**Bag-of-words text representation.** CDL assumes the bag-of-words representation for text information, i.e., in the form of a document-word matrix. The good news is that we don't have to worry about how to generate such a representation from our raw texts. The `TextModality` class implements the necessary routines to process and output different representations for text data, e.g., sequences, bag of words, tf-idf. That is, to get our auxiliary data under the desired format, all we need inside the CDL [implementation](../cornac/models/cdl/recom_cdl.py) is the following line of code:

```Python
text_feature = self.train_set.item_text.batch_bow(np.arange(n_items))
```
where `self.train_set.item_text` is set to our instantiated text modality `item_text_modality`, `n_items` is the number of training items, and the `batch_bow()` function returns the bag-of-words vectors of the specified item indices; in our case we want the text features for all training items: `np.arange(n_items)`. The rows of `text_feature` correspond to the bag-of-words vectors of the item indices provided to `batch_bow()`.
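
For intuition, a bag-of-words matrix can be sketched as follows (a toy example with a 3-word vocabulary, not Cornac's actual implementation):

```python
import numpy as np

# Toy sketch: each row is an item, each column counts a vocabulary word.
vocab = ["dark", "comedy", "heist"]
plots_tokens = [["dark", "comedy"], ["heist", "heist", "comedy"]]

bow = np.zeros((len(plots_tokens), len(vocab)), dtype=int)
for row, tokens in enumerate(plots_tokens):
    for word in tokens:
        bow[row, vocab.index(word)] += 1
print(bow)
# [[1 1 0]
#  [0 1 2]]
```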

**Aligning auxiliary information with the set of training items.** Another important aspect worth mentioning here is that we don't have to take extra actions to align the set of training movies with their plots; they are already aligned. When passed into an evaluation method, users and items go through an indexing process in which each of them is assigned a unique index, starting from 0. That is, the first row in `text_feature` corresponds to the training item with index 0, the second row to the item with index 1, and so on. This is made possible by passing `item_text_modality` to the evaluation (splitting) method, as we shall see shortly.
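
The following toy sketch illustrates the idea (the item ids and the index assignment are hypothetical, not Cornac's actual indexing code): after indexing, modality rows are re-ordered so that row k holds the auxiliary data of the item with index k.

```python
# Toy sketch: raw item ids mapped to 0-based training indices (hypothetical),
# and plots re-ordered to follow those indices.
plots = {"m42": "a heist", "m7": "a comedy", "m13": "a drama"}
iid_map = {"m7": 0, "m13": 1, "m42": 2}  # assigned during indexing

aligned = [None] * len(iid_map)
for movie_id, index in iid_map.items():
    aligned[index] = plots[movie_id]
print(aligned)  # ['a comedy', 'a drama', 'a heist']
```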

```python
from cornac.eval_methods import RatioSplit

# Split the data, passing the item text modality to the evaluation method.
ratio_split = RatioSplit(data=rating_data, test_size=0.2, exclude_unknowns=True,
                         item_text=item_text_modality, verbose=True,
                         seed=123, rating_threshold=0.5)
```
The item text modality is passed to the evaluation method. As mentioned earlier, this makes it possible to avoid the tedious process of aligning the set of training items with their auxiliary data. Next, we instantiate the CDL model and evaluation metrics and run an experiment, which produces results along the following lines:

```
--- + ---------- + --------- + --------
CDL |     0.5494 |   42.1279 |   0.3018
```
Note that one may achieve better results with careful parameter tuning; the purpose here is simply to illustrate how to handle auxiliary data using Cornac.

## Other Modality classes

The usage of the `GraphModality` and `ImageModality`, to deal with graph and visual auxiliary data, follows the same principles as above. The following two examples involve the `GraphModality` class to handle item graph (network) auxiliary data: [c2pf_example](../examples/c2pf_example.py), [mcf_example](../examples/mcf_office.py). For a usage example of the `ImageModality`, one may refer to [vbpr_tradesy](../examples/vbpr_tradesy.py). The `cornac.data` module's [documentation](https://cornac.readthedocs.io/en/latest/data.html) is also a good resource to learn more about the modality classes.
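
As a rough illustration of what graph auxiliary data looks like before it is wrapped in a modality, item networks are commonly expressed as `(source_item, target_item, weight)` triplets, analogous to the `(movie_id, text)` pairs used above (the ids below are made up):

```python
# Toy sketch: an item graph as weighted triplets, grouped into adjacency lists.
item_graph = [("m7", "m13", 1.0), ("m13", "m42", 1.0), ("m7", "m42", 1.0)]

adjacency = {}
for src, dst, weight in item_graph:
    adjacency.setdefault(src, []).append((dst, weight))
print(adjacency)  # {'m7': [('m13', 1.0), ('m42', 1.0)], 'm13': [('m42', 1.0)]}
```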