From 43362fd4aea25caedf59f610fb02f3aaa30334ca Mon Sep 17 00:00:00 2001
From: Andrew Goldstone
Date: Sat, 23 Jul 2016 10:30:25 -0400
Subject: [PATCH] generated vignette files

---
 inst/doc/introduction.R    |  4 ++--
 inst/doc/introduction.Rmd  |  4 ++--
 inst/doc/introduction.html | 27 +++++++++++----------------
 3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/inst/doc/introduction.R b/inst/doc/introduction.R
index f0f5dc2..8a3d67a 100644
--- a/inst/doc/introduction.R
+++ b/inst/doc/introduction.R
@@ -66,7 +66,7 @@ m <- train_model(ilist, n_topics=40,
                  # many more parameters...
                  )
 
-## ----message=F-----------------------------------------------------------
+## ----message=F, results="hide"-------------------------------------------
 write_mallet_model(m, "modeling_results")
 
 ## ----eval=F--------------------------------------------------------------
@@ -102,7 +102,7 @@ srs <- topic_series(m, breaks="years")
 head(srs)
 
 ## ------------------------------------------------------------------------
-journal <- factor(metadata(m)$journal)
+journal <- factor(metadata(m)$journaltitle)
 doc_topics(m) %>%
     sum_row_groups(journal) %>%
     normalize_cols()
diff --git a/inst/doc/introduction.Rmd b/inst/doc/introduction.Rmd
index b652939..91579b3 100644
--- a/inst/doc/introduction.Rmd
+++ b/inst/doc/introduction.Rmd
@@ -177,7 +177,7 @@ The metadata supplied here as a parameter to `train_model` is not used in modeli
 Though this `r n_docs(m)`-corpus needs only minutes to model, it often takes hours or more to produce a topic model of even a moderately-sized corpus. You are likely to want to save the results. It is most convenient, I have found, to save both the richest possible MALLET outputs and user-friendlier transformations: many analyses need only the estimated document-topic and topic-word matrices, for example. For this reason, the default `write_mallet_model` function takes the results of `train_model` and outputs a directory of files.
 
-```{r message=F}
+```{r message=F, results="hide"}
 write_mallet_model(m, "modeling_results")
 ```
@@ -272,7 +272,7 @@ This is a "long" data frame suitable for plotting, which we turn to shortly. But
 To make this more general operation a little easier, I have supplied generalized aggregator functions `sum_row_groups` and `sum_col_groups` which take a matrix and a grouping factor. As a simple example, suppose we wanted to tabulate the way topics are split up between the two journals in our corpus:
 
 ```{r}
-journal <- factor(metadata(m)$journal)
+journal <- factor(metadata(m)$journaltitle)
 doc_topics(m) %>%
     sum_row_groups(journal) %>%
     normalize_cols()
diff --git a/inst/doc/introduction.html b/inst/doc/introduction.html
index 25771c9..3905410 100644
--- a/inst/doc/introduction.html
+++ b/inst/doc/introduction.html
@@ -12,7 +12,7 @@
-
+
 Introduction to dfrtopics
@@ -67,14 +67,11 @@
-

This package seeks to provide some help creating and exploring topic models using MALLET from R. It builds on the mallet package. Parts of this package are specialized for working with the metadata and pre-aggregated text data supplied by JSTOR’s Data for Research service; the topic-modeling parts are independent of this, however.

@@ -133,10 +130,9 @@

Tailoring the corpus

group_by(id) %>% summarize(total=sum(weight), stopped=sum(weight[word %in% stoplist]))
-## Source: local data frame [605 x 3]
-## 
+## # A tibble: 605 x 3
 ##                 id total stopped
-##              (chr) (int)   (int)
+##              <chr> <int>   <int>
 ## 1  10.2307/3693731 10073    3028
 ## 2  10.2307/3693732  6855    3798
 ## 3  10.2307/3693733  4791    2554
@@ -147,7 +143,7 @@ 

Tailoring the corpus

 ## 8  10.2307/3693738  7248    4213
 ## 9  10.2307/3693739   739     399
 ## 10  10.2307/432387  5692    3132
-## ..             ...   ...     ...
+## # ... with 595 more rows

As always, Zipf’s law is rather remarkable. In any case, we can now remove stopwords with a simple filter or (equivalently) wordcounts_remove_stopwords; both are shown below.

counts <- counts %>% wordcounts_remove_stopwords(stoplist)
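For comparison, the “simple filter” version of the same operation is an ordinary dplyr anti-filter; a minimal sketch, assuming the word column shown in the summaries above:

counts <- counts %>% filter(!(word %in% stoplist))   # drop every row whose word is on the stoplist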
  • Filter infrequent words. OCR’d text in particular is littered with hapax legomena. The long tail of one-off features means a lot of noise for the modeling process, and you’ll likely want to get rid of these; one way to do so is sketched just below.
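A minimal sketch of that step, assuming the package supplies a wordcounts_remove_rare helper that keeps only the most frequent word types (check the name and semantics in your installed version; the cutoff here is arbitrary):

counts <- counts %>% wordcounts_remove_rare(20000)   # keep only the 20,000 most frequent word types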

    @@ -270,10 +266,9 @@

    Saving and loading the results

    Exploring model results

A good sanity check on a model is to examine the list of the words most frequently assigned to each topic. This is easily obtained from the topic-word matrix, but it is such a common operation that we have a shortcut.

    top_words(m, n=10) # n is the number of words to return for each topic
-## Source: local data frame [400 x 3]
-## 
+## # A tibble: 400 x 3
 ##    topic     word weight
-##    (int)    (chr)  (int)
+##    <int>    <chr>  <int>
 ## 1      1      two   3602
 ## 2      1 evidence   1779
 ## 3      1 original   1472
    @@ -284,7 +279,7 @@ 

    Exploring model results

 ## 8      1     line   1086
 ## 9      1    given   1029
 ## 10     1 question    968
-## ..   ...      ...    ...
+## # ... with 390 more rows

This data frame is in fact saved to disk separately, so it remains available even if the full topic-word matrix is not. It is in essence a sparse representation of the topic-word matrix.5

As even this data frame is too long to read if you have more than a few topics, a conveniently human-readable summary can be generated from

    topic_labels(m, n=8)
    @@ -361,7 +356,7 @@

    Topics, time, metadata

    ## 6 1 1911-01-01 0.07378674

    This is a “long” data frame suitable for plotting, which we turn to shortly. But it is important to underline that topic_series is a special case of the more general operation of combining modeled topic scores for groups of documents. That is, one of the main uses of a topic model is to consider estimated topics as dependent variables, and metadata as independent variables.7

    To make this more general operation a little easier, I have supplied generalized aggregator functions sum_row_groups and sum_col_groups which take a matrix and a grouping factor. As a simple example, suppose we wanted to tabulate the way topics are split up between the two journals in our corpus:

-journal <- factor(metadata(m)$journal)
+journal <- factor(metadata(m)$journaltitle)
 doc_topics(m) %>%
     sum_row_groups(journal) %>%
     normalize_cols()
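sum_col_groups works the same way on columns. A minimal sketch of grouping topics rather than documents, under the assumptions that n_topics is the accessor analogous to n_docs and that normalize_rows is the row-wise counterpart of normalize_cols; the topic grouping itself is made up for illustration:

topic_group <- factor(rep(c("A", "B"), length.out=n_topics(m)))  # arbitrary two-way grouping of topics
doc_topics(m) %>%
    sum_col_groups(topic_group) %>%   # collapse topic columns into one column per group
    normalize_rows()                  # each document's group shares sum to 1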
    @@ -435,7 +430,7 @@

    A more elaborate visualization: a word’s topic assignments

    sampling_state(m)
    ## An object of class "big.matrix"
     ## Slot "address":
    -## <pointer: 0x7feadded4df0>
    +## <pointer: 0x111c64530>
    dim(sampling_state(m))
    ## [1] 693350      4

What we now want is to examine the topic-document matrix conditional on the word “poem”. This is easy to do with the mwhich function from bigmemory, but as a convenience this package provides a function for this particular application (as well as the term-document matrices conditioned on a topic, tdm_topic):
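For orientation, here is what the raw mwhich route looks like; this is a sketch only, not the package’s own helper. It assumes the sampling state’s columns are (doc, word, topic, count) and that the word ids index vocabulary(m) directly; both assumptions, as well as 0- versus 1-based word ids, should be checked against your installed version.

library(bigmemory)
ss <- sampling_state(m)
poem_id <- match("poem", vocabulary(m))               # assumed type-to-id mapping
rows <- mwhich(ss, cols=2, vals=poem_id, comps="eq")  # column 2 assumed to hold word ids
poem_state <- ss[rows, ]                              # every (doc, topic, count) assignment of "poem"
head(poem_state)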