Please check the latest news (change log) and keep this package updated.
- Use `\donttest{}` in more examples to avoid unnecessary errors.
- Improved `text_unmask()`, though it has been deprecated.
- Now use "YYYY.M" as package version number.
- Deprecated `text_unmask()` since I have developed a new package FMAT as an integrative toolbox of the Fill-Mask Association Test (FMAT).
- Changed welcome messages by using `packageStartupMessage()` so that the messages can be suppressed.
- Improved `text_unmask()`, but a new package (currently not publicly available) has been developed for a more general purpose of using masked language models to measure conceptual associations. Please wait for the release of this new package and the publication of a related methodological article.
- Fixed problematic `normalized` attribute when using `data_wordvec_load()`.
- New S3 `[` method for `embed`, see new examples in `as_embed()`.
- New S3 `unique()` method to delete duplicate words.
- New S3 `str()` method to print the data structure and attributes.
- New `pattern()` function designed for the S3 `[` method of `embed`: Users can directly use a regular expression like `embed[pattern("^for")]` to extract a subset of the embedding matrix.
- New `plot_network()` function: Visualize a (partial correlation) network graph of words. Very useful for identifying potential semantic clusters from a list of words and even useful for disentangling antonyms from synonyms.
- New `targets` argument of `text_unmask()`: Return specific fill-mask results for certain target words (rather than the top n results).
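The regular-expression subsetting described for `embed[pattern("^for")]` can be illustrated with a minimal base-R sketch. The names `pattern_sketch` and `embed_sketch` are hypothetical stand-ins invented for this example; this is not the package's actual implementation, only one way an S3 `[` method could dispatch on a pattern object.

```r
# Hypothetical stand-ins for illustration only (not PsychWordVec code).
# A "pattern" is just a regular expression tagged with its own class.
pattern_sketch <- function(regex) structure(regex, class = "pattern_sketch")

# S3 `[` method for a matrix-like class whose row names are words.
`[.embed_sketch` <- function(x, i, ...) {
  m <- unclass(x)  # plain matrix, so base subsetting is used below
  if (inherits(i, "pattern_sketch")) {
    # Translate the regular expression into matching row indices
    i <- grep(unclass(i), rownames(m))
  }
  structure(m[i, , drop = FALSE], class = "embed_sketch")
}

# Toy 4-word, 2-dimensional embedding matrix
m <- matrix(1:8, nrow = 4,
            dimnames = list(c("for", "forget", "the", "before"), NULL))
e <- structure(m, class = "embed_sketch")

# Keep only the words that start with "for"
sub <- e[pattern_sketch("^for")]
```

Tagging the regular expression with its own class lets the same `[` call accept either ordinary indices or a pattern without an extra argument.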
- Most functions have been substantially enhanced for faster speed, especially `tab_similarity()`, `most_similar()`, `dict_expand()`, `dict_reliability()`, `test_WEAT()`, `test_RND()`.
- Improved S3 `print()` method for `embed` and `wordvec`.
- `pair_similarity()` has been improved by using the matrix operation `tcrossprod(embed, embed)` to compute cosine similarity, with `embed` normalized.
- `data_wordvec_load()` now has two wrapper functions, `load_wordvec()` and `load_embed()`, for faster use.
- `data_wordvec_normalize()` (deprecated) has been renamed to `normalize()`.
- `get_wordvecs()` (deprecated) has been integrated into `get_wordvec()`.
- `tab_similarity_cross()` (deprecated) has been integrated into `tab_similarity()`.
- `test_WEAT()` and `test_RND()`: Warning if `T1` and `T2` or `A1` and `A2` have duplicate values.
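The `tcrossprod(embed, embed)` approach to cosine similarity can be sketched in base R as follows. The toy matrix and variable names are assumptions made for this example; it shows the general technique (normalize rows to unit length, then take the matrix cross-product), not the package's internal code.

```r
# Toy embedding matrix: one row per word, one column per dimension
embed <- matrix(c(1, 0,
                  0, 2,
                  3, 3),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("a", "b", "c"), NULL))

# Normalize each row to unit length (L2 norm)
embed_norm <- embed / sqrt(rowSums(embed^2))

# For unit-length rows, the dot product equals the cosine similarity,
# so one matrix operation yields all pairwise similarities at once
cos_sim <- tcrossprod(embed_norm, embed_norm)
```

Because every row already has unit length, no per-pair division is needed, which is why the matrix form is faster than looping over word pairs.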
- Fixed the issue of unexpectedly long loading and processing time in 0.2.0, which was related to duplicate words in .RData, too many words in `embed` or `wordvec`, and too many words to be printed to console. All related functions have been substantially improved so that they no longer take an unnecessarily long time.
- Most functions now internally use `embed` (an extended class of matrix) rather than `wordvec` in order to enhance the speed!
- New series of `text_*` functions for contextualized word embeddings! Based on the R package `text` (and using the R package `reticulate` to call functions from the Python module `transformers`), a series of new functions have been developed to (1) download HuggingFace Transformers pre-trained language models (PLM; thousands of options such as GPT, BERT, RoBERTa, DeBERTa, DistilBERT, etc.), (2) extract contextualized token (roughly word) embeddings and text embeddings, and (3) fill in the blank mask(s) in a query (e.g., "Beijing is the [MASK] of China.").
  - `text_init()`: set up a Python environment for PLM
  - `text_model_download()`: download PLMs from HuggingFace to local ".cache" folder
  - `text_model_remove()`: remove PLMs from local ".cache" folder
  - `text_to_vec()`: extract contextualized token and text embeddings
  - `text_unmask()`: fill in the blank mask(s) in a query
- New `orth_procrustes()` function: Orthogonal Procrustes matrix alignment. Users can input either two matrices of word embeddings or two `wordvec` objects as loaded by `data_wordvec_load()` or transformed from matrices by `as_wordvec()`.
- New `dict_expand()` function: Expand a dictionary from the most similar words, based on `most_similar()`.
- New `dict_reliability()` function: Reliability analysis (Cronbach's α) and Principal Component Analysis (PCA) of a dictionary. Note that Cronbach's α may be misleading when the number of items/words is large.
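Orthogonal Procrustes alignment, as named above, has a standard closed-form solution via the singular value decomposition. The base-R sketch below illustrates that textbook solution; the function name `procrustes_rotation` and the toy data are assumptions for this example and do not reproduce the interface or internals of `orth_procrustes()`.

```r
# Find the orthogonal matrix R minimizing ||X %*% R - M|| (Frobenius norm).
# Textbook solution: if t(X) %*% M = U D V', then R = U V'.
procrustes_rotation <- function(M, X) {
  s <- svd(crossprod(X, M))  # crossprod(X, M) is t(X) %*% M
  s$u %*% t(s$v)
}

set.seed(42)
# Toy "embeddings": 10 words in 3 dimensions
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
# A random orthogonal matrix, used to build a rotated copy of X
Q <- qr.Q(qr(matrix(rnorm(9), 3, 3)))
M <- X %*% Q

# Recover the rotation and align X to M
R <- procrustes_rotation(M, X)
X_aligned <- X %*% R
```

In this noise-free toy case the aligned matrix matches `M` up to floating-point error; with real embeddings trained separately, the alignment is only approximate.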
- New `sum_wordvec()` function: Calculate the sum vector of multiple words.
- New `plot_similarity()` function: Visualize cosine similarities between word pairs in a style of correlation matrix plot.
- New `tab_similarity_cross()` function: A wrapper of `tab_similarity()` to tabulate cosine similarities for only n1 * n2 word pairs from two sets of words (arguments: `words1`, `words2`).
- New S3 methods: `print.wordvec()`, `print.embed()`, `rbind.wordvec()`, `rbind.embed()`, `subset.wordvec()`, `subset.embed()`.
- `as_matrix()` has been renamed to `as_embed()`: Now `PsychWordVec` supports two classes of data objects, `wordvec` (data.table) and `embed` (matrix). Most functions now use `embed` (or transform `wordvec` to `embed`) internally so as to enhance the speed. Matrix is much faster!
- Deprecated `data_wordvec_reshape()`: Now use `as_wordvec()` and `as_embed()`.
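The two data layouts mentioned above (a table with one row per word versus a plain matrix) can be illustrated with a minimal base-R sketch. It uses a `data.frame` with a list column as a stand-in for the `data.table`-based `wordvec` class, and the helper names `to_matrix` and `to_table` are hypothetical; the real conversions are done by `as_embed()` and `as_wordvec()`, whose internals are not shown here.

```r
# Stand-in for a "wordvec"-style table: one row per word,
# with each word's vector stored in a list column
wv <- data.frame(word = c("king", "queen"), stringsAsFactors = FALSE)
wv$vec <- list(c(1, 2, 3), c(4, 5, 6))

# Table -> matrix: stack the vectors as rows, words become row names
to_matrix <- function(tbl) {
  m <- do.call(rbind, tbl$vec)
  rownames(m) <- tbl$word
  m
}

# Matrix -> table: split rows back into a list column
to_table <- function(m) {
  tbl <- data.frame(word = rownames(m), stringsAsFactors = FALSE)
  tbl$vec <- lapply(seq_len(nrow(m)), function(i) unname(m[i, ]))
  tbl
}

m <- to_matrix(wv)    # 2 x 3 numeric matrix
back <- to_table(m)   # round trip back to the table layout
```

The matrix layout allows whole-matrix operations (for example, a single cross-product for all pairwise similarities), which is the usual reason it is faster than working row by row on a table.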
- Defaults changed in `data_wordvec_subset()`, `get_wordvecs()`, `tab_similarity()`, and `plot_similarity()`: If neither `words` nor `pattern` is specified (`NULL`), then all words in `data` will be extracted.
- Improved S3 methods `print.weat()` and `print.rnd()`.
- Added permutation test of significance for both `test_WEAT()` and `test_RND()`: Users can specify the number of permutation samples and choose to calculate either a one-sided or a two-sided p value. It can reproduce the results in Caliskan et al.'s (2017) article well.
- Added the `pooled.sd` argument for `test_WEAT()`: Users can choose the method used to calculate the pooled SD for the effect size estimate in WEAT. However, the original approach proposed by Caliskan et al. (2017) is the default and highly suggested.
- Wrapper functions `as_matrix()` and `as_wordvec()` for `data_wordvec_reshape()`, which can make it easier to reshape word embeddings data from `matrix` to "wordvec" `data.table` or vice versa.
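The permutation test and the effect size mentioned above can be sketched in base R. This is a minimal illustration under stated assumptions: the association scores are invented toy numbers, the function names are hypothetical, and the effect size follows the formula published by Caliskan et al. (2017), in which the difference in mean association is divided by the SD over all target words. It is not the code of `test_WEAT()`.

```r
set.seed(1)
# Toy association scores s(w, A1, A2) for each target word
# (invented numbers, for illustration only)
s_T1 <- c(0.30, 0.25, 0.40, 0.35)
s_T2 <- c(-0.10, 0.05, -0.20, 0.00)

# Effect size: difference in means over the SD of all target words
weat_effect_size <- function(s1, s2) {
  (mean(s1) - mean(s2)) / sd(c(s1, s2))
}

# Permutation p value: reshuffle target words between the two sets
# and compare the permuted mean differences with the observed one
perm_p <- function(s1, s2, nperm = 2000, two_sided = FALSE) {
  obs <- mean(s1) - mean(s2)
  all_s <- c(s1, s2)
  n1 <- length(s1)
  perm <- replicate(nperm, {
    idx <- sample(length(all_s), n1)
    mean(all_s[idx]) - mean(all_s[-idx])
  })
  if (two_sided) mean(abs(perm) >= abs(obs)) else mean(perm >= obs)
}

d <- weat_effect_size(s_T1, s_T2)
p_one <- perm_p(s_T1, s_T2)                    # one-sided
p_two <- perm_p(s_T1, s_T2, two_sided = TRUE)  # two-sided
```

With equal-sized target sets, comparing mean differences is equivalent to comparing the sums used in the published test statistic. The p values here are Monte Carlo estimates, so they vary slightly with the seed and the number of permutation samples.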
- Both `test_WEAT()` and `test_RND()` have changed the element names and S3 print method of their returned objects (of new class `weat` and `rnd`, respectively): The elements `$eff.raw`, `$eff.size`, and `$eff.sum` are now deprecated and replaced by `$eff`, which is a `data.table` containing the overall raw/standardized effects and permutation p value. The new S3 print methods `print.weat()` and `print.rnd()` can make a tidy report of the test results when you directly type and print the returned object (see code examples).
- Improved command line interfaces using the `cli` package.
- Improved welcome messages when `library(PsychWordVec)`.
- CRAN initial release.
- Fixed all issues in the CRAN manual inspection.
- Added `wordvec` as the primary class of word vectors data: Now the data classes contain `wordvec`, `data.table`, and `data.frame`, which actually perform as a `data.table`.
- New `train_wordvec()` function: Train word vectors using the Word2Vec, GloVe, or FastText algorithm with multi-threading.
- New `tokenize()` function: Tokenize raw texts for training word vectors.
- New `data_wordvec_reshape()` function: Reshape word vectors data from dense (a `data.table` of new class `wordvec` with two variables `word` and `vec`) to plain (a `matrix` of word vectors) or vice versa.
- New `test_RND()` function, and `tab_WEAT()` is renamed to `test_WEAT()`: These two functions serve as convenient tools of word semantic similarity analysis and conceptual association test.
- New `plot_wordvec_tSNE()` function: Visualize 2-D or 3-D word vectors with dimensionality reduced using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method.
- Enhanced all functions.
- New `data_wordvec_subset()` function.
- Added the `unique` argument for `tab_similarity()`.
- Added support to use regular expression pattern in `test_WEAT()`.
- Initial public release on GitHub with more functions.
- Basic functions and the WordVector_RData.pdf file.