This function selects or removes features from a dfm or fcm,
based on feature name matches with pattern. The most common usages
are to eliminate features from a dfm already constructed, such as stopwords,
or to select only terms of interest from a dictionary.
dfm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
min_nchar = NULL,
max_nchar = NULL,
padding = FALSE,
verbose = quanteda_options("verbose")
)
dfm_remove(x, ...)
dfm_keep(x, ...)
fcm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
verbose = quanteda_options("verbose"),
...
)
fcm_remove(x, ...)
fcm_keep(x, ...)
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
selection: whether to keep or remove the features
valuetype: the type of pattern matching: "glob" for "glob"-style
wildcard expressions; "regex" for regular expressions; or "fixed" for
exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a
pattern or dictionary values
min_nchar, max_nchar: optional numerics specifying the minimum and
maximum length in characters for tokens to be removed or kept; defaults are
NULL for no limits. These are applied after (and hence, in addition
to) any selection based on pattern matches.
padding: if TRUE, record the number of removed tokens in the first column.
verbose: if TRUE, print a message about how many features were removed
...: used only for passing arguments from dfm_remove or dfm_keep
to dfm_select. Cannot include selection.
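The interaction between valuetype, case_insensitive, the length limits, and padding can be sketched on a toy dfm. This is a minimal illustration, not taken from the package examples: the sentence and the object names (feat_glob, feat_regex, feat_fixed, feat_long, dfmat_pad) are assumptions made up for the sketch, and the comments describe the behaviour as documented above.

```r
library(quanteda)

# Toy dfm (hypothetical text); tolower = FALSE keeps case differences visible
dfmat <- dfm(tokens("Taxes, taxation and a tax plan"), tolower = FALSE)

# "glob": * is a wildcard, and matching ignores case by default
feat_glob <- featnames(dfm_select(dfmat, pattern = "tax*", valuetype = "glob"))

# "regex": a regular expression, here made case-sensitive
feat_regex <- featnames(dfm_select(dfmat, pattern = "^tax", valuetype = "regex",
                                   case_insensitive = FALSE))

# "fixed": the whole feature name must equal the pattern
feat_fixed <- featnames(dfm_select(dfmat, pattern = "tax", valuetype = "fixed"))

# Length limits apply in addition to the pattern match
feat_long <- featnames(dfm_select(dfmat, pattern = "tax*", min_nchar = 4))

# padding = TRUE records the removed tokens instead of dropping them silently
dfmat_pad <- dfm_remove(dfmat, pattern = "tax*", padding = TRUE)
```

Comparing feat_glob with feat_regex shows the effect of case_insensitive, and printing dfmat_pad shows the extra column that holds the count of removed tokens.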
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a
dfm object and selection = "keep", then this will be
equivalent to calling dfm_match(). In this case, the following
settings are always used: case_insensitive = FALSE, and
valuetype = "fixed". This functionality is deprecated, however, and
you should use dfm_match() instead.
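A minimal sketch of the recommended replacement, assuming two small dfm objects invented for illustration (dfmat_train and dfmat_test are hypothetical names): dfm_match() conforms one dfm to the feature set of another, dropping features that are absent from the target set and adding zero-count columns for the missing ones.

```r
library(quanteda)

# Two toy dfm objects with partly overlapping features (hypothetical data)
dfmat_train <- dfm(tokens("tax plan for the new year"))
dfmat_test  <- dfm(tokens("a progressive tax on income"))

# Conform dfmat_test to the features of dfmat_train
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))

# The matched dfm now has the same features, in the same order
featnames(dfmat_matched)
```

Passing featnames() of the target dfm makes the intent explicit, which is likely why this form is preferred over passing a dfm as pattern.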
dfm_remove and fcm_remove are simply convenience
wrappers for calling dfm_select and fcm_select with
selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for
calling dfm_select and fcm_select with selection = "keep".
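The equivalence can be checked directly. The following is a small sketch using a hand-written word list (the vector words is an assumption for illustration, standing in for a stopword list):

```r
library(quanteda)

dfmat <- dfm(tokens("This is a document with lots of stopwords."))
words <- c("this", "is", "a", "with", "of")  # hypothetical word list

# Wrapper and explicit call, removing features
removed_wrapper <- dfm_remove(dfmat, pattern = words)
removed_select  <- dfm_select(dfmat, pattern = words, selection = "remove")

# Wrapper and explicit call, keeping features
kept_wrapper <- dfm_keep(dfmat, pattern = words)
kept_select  <- dfm_select(dfmat, pattern = words, selection = "keep")

# Each pair should contain the same features
identical(featnames(removed_wrapper), featnames(removed_select))
identical(featnames(kept_wrapper), featnames(kept_select))
```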
This function selects features based on their labels. To select
features based on the values of the document-feature matrix, use
dfm_trim().
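To make the distinction concrete, here is a short sketch on an invented two-document dfm (the texts and object names are assumptions for illustration): dfm_select() looks only at feature names, while dfm_trim() looks at the counts.

```r
library(quanteda)

dfmat <- dfm(tokens(c("tax tax plan", "tax reform plan")))

# Selection by label: keep the feature named "tax"
dfmat_label <- dfm_select(dfmat, pattern = "tax")

# Selection by value: keep features that occur at least twice in total
dfmat_value <- dfm_trim(dfmat, min_termfreq = 2)

featnames(dfmat_label)
featnames(dfmat_value)
```

Here "reform" occurs only once, so it is dropped by dfm_trim() regardless of its name.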
dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                  "Does the United_States or Sweden have more progressive taxation?")) |>
    dfm(tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#> features
#> docs My by United_States Sweden
#> text1 1 1 0 0
#> text2 0 0 1 1
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#> features
#> docs by
#> text1 1
#> text2 0
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs My Christmas was by Does United_States
#> text1 1 1 1 1 0 0
#> text2 0 0 0 0 1 1
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#> features
#> docs ruined your opposition tax plan . the or Sweden have
#> text1 1 1 1 1 1 1 0 0 0 0
#> text2 0 0 0 0 0 0 1 1 1 1
#> [ reached max_nfeat ... 4 more features ]
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#> features
#> docs My was by your Does the or have more
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 1 1
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition tax plan . United_States Sweden progressive
#> text1 1 1 1 1 1 1 0 0 0
#> text2 0 0 0 0 0 0 1 1 1
#> features
#> docs taxation
#> text1 0
#> text2 1
#> [ reached max_nfeat ... 1 more feature ]
# select based on character length
dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition United_States Sweden progressive taxation
#> text1 1 1 1 0 0 0 0
#> text2 0 0 0 1 1 1 1
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#> features
#> docs this is a document with lots of stopwords . no
#> text1 1 1 1 1 1 1 1 1 1 0
#> text2 0 0 0 0 0 1 1 1 1 1
#> [ reached max_nfeat ... 8 more features ]
dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#> features
#> docs document lots stopwords . , :
#> text1 1 1 1 1 0 0
#> text2 0 1 1 1 2 1
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#> features
#> features this contains lots of stopwords no if and or but
#> this 0 1 1 1 1 0 0 0 0 0
#> contains 0 0 1 1 1 0 0 0 0 0
#> lots 0 0 0 1 1 1 1 1 1 1
#> of 0 0 0 0 1 0 0 0 0 0
#> stopwords 0 0 0 0 0 0 0 0 0 0
#> no 0 0 0 0 0 0 1 1 1 1
#> if 0 0 0 0 0 0 0 1 1 1
#> and 0 0 0 0 0 0 0 0 1 1
#> or 0 0 0 0 0 0 0 0 0 1
#> but 0 0 0 0 0 0 0 0 0 0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#> features
#> features contains lots stopwords
#> contains 0 1 1
#> lots 0 0 1
#> stopwords 0 0 0