Returns the subset of documents in a tokens object that meet certain conditions, including
direct logical operations on docvars (document-level variables).
tokens_subset() functions identically to subset.data.frame(), using
non-standard evaluation to evaluate conditions based on the docvars in the
tokens.
tokens_subset(
  x,
  subset,
  min_ntoken = NULL,
  max_ntoken = NULL,
  drop_docid = TRUE,
  ...
)
x: tokens object to be subsetted.
subset: logical expression indicating the documents to keep; missing values are taken as false.
min_ntoken, max_ntoken: minimum and maximum lengths (in tokens) of the documents to extract.
drop_docid: if TRUE, docid values for documents are removed as a result of subsetting.
...: not used.
A tokens object containing the subset of documents (and their docvars) selected according to the arguments.
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
                 d3 = "b b c e", d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
toks <- tokens(corp)
# selecting on a docvars condition
tokens_subset(toks, grp > 1)
#> Tokens consisting of 2 documents and 1 docvar.
#> d3 :
#> [1] "b" "b" "c" "e"
#>
#> d4 :
#> [1] "e" "e" "f" "a" "b"
#>
# selecting on a supplied vector
tokens_subset(toks, c(TRUE, FALSE, TRUE, FALSE))
#> Tokens consisting of 2 documents and 1 docvar.
#> d1 :
#> [1] "a" "b" "c" "d"
#>
#> d3 :
#> [1] "b" "b" "c" "e"
#>
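
The examples above do not exercise the min_ntoken and max_ntoken arguments. The following is a minimal sketch of how they might be used, rebuilding the same toy corpus so that it runs on its own. It assumes that both bounds are inclusive, which the argument description does not state explicitly; the object names long_toks and short_grp are illustrative only. Output is omitted, so check the results with docnames() or ntoken() in your own session.

```r
library(quanteda)

# same toy corpus as in the examples above
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
                 d3 = "b b c e", d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
toks <- tokens(corp)

# document lengths in tokens, for reference
ntoken(toks)

# keep only documents with at least 5 tokens
# (assumption: the lower bound is inclusive)
long_toks <- tokens_subset(toks, min_ntoken = 5)
docnames(long_toks)

# combine a docvars condition with an upper bound on length
# (assumption: the upper bound is inclusive)
short_grp <- tokens_subset(toks, grp > 1, max_ntoken = 4)
docnames(short_grp)
```

Supplying a length bound together with a subset expression keeps only the documents that satisfy both, which avoids a separate filtering step on ntoken().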