- `textstat_keyness()` now returns a data.frame with p-values as well as the test statistic, and rownames containing the feature. This is more consistent with the other textstat functions (see the example following this list).
- `tokens_lookup()` implements new rules for nested and linked sequences in dictionary values. See #502.
- `tokens_compound()` has a new `join` argument for better handling of nested and linked sequences. See #517.
- `tokens` objects are now significantly faster, due to a reimplementation of the hash table functions in C++. (#510)
- `dfm()` now works with multi-word dictionaries and thesauruses, which previously worked only with `tokens_lookup()`.
- `fcm()` is now parallelized for improved performance on multi-core systems.
- Fixed a bug in `convert(x, to = "lsa")` that transposed row and column names. (#526)
- New `fcm()` method for corpus objects. (#538)
- Fixed a bug causing `dfm` and `tokens` to break on more than 10,000 documents. (#438)
- Fixed a bug in `tokens(x, what = "character", removeSeparators = TRUE)` that returned an empty string.
- Fixed a bug in `corpus.VCorpus` occurring when the VCorpus contains a single document. (#445)
- Fixed a bug in `dfm_compress` in which the function failed on documents that contained zero feature counts. (#467)
- Fixed a bug in `textmodel_NB` that caused the class priors `Pc` to be refactored alphabetically instead of in the order of assignment (#471), also affecting predicted classes (#476).

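A minimal sketch of the new `textstat_keyness()` return format, using the built-in inaugural corpus (the exact column names may differ slightly across versions):

```r
library(quanteda)
dfmat <- dfm(data_corpus_inaugural, remove = stopwords("english"))
tstat <- textstat_keyness(dfmat, target = "2009-Obama")
head(tstat)  # a data.frame with the test statistic and p-value; features in the rownames
```
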
- `textstat_keyness()` discovers words that occur at differential rates between partitions of a dfm (using chi-squared, Fisher's exact test, and the G^2 likelihood ratio test to measure the strength of associations).
- Updates to the inaugural address data objects (`data_corpus_inaugural` and `data_char_inaugural`).
- Improved the `groups` argument in `texts()` (and in `dfm()`, which uses this function), which will now coerce its value to a factor rather than requiring one.
- Two new arguments for `sequences()`: `ordered` and `max_length`, the latter to prevent memory leaks from extremely long sequences.
- `dictionary()` now accepts YAML as an input file format.
- `dfm_lookup` and `tokens_lookup` now accept a `levels` argument to determine which level of a hierarchical dictionary should be applied (see the example following this list).
- Added `min_nchar` and `max_nchar` arguments to `dfm_select`.
- `dictionary()` can now be called directly on key = value arguments, without explicitly wrapping them in `list()`.
- `fcm` now works directly on a dfm object when `context = "documents"`.

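A minimal sketch of the new `levels` argument for hierarchical dictionaries; the dictionary below is hypothetical and used only for illustration:

```r
library(quanteda)
dict <- dictionary(list(
  politics = list(
    domestic = c("tax*", "school*"),
    foreign  = c("treaty", "war")
  )
))
toks <- tokens("The treaty raised taxes for schools during the war.")
tokens_lookup(toks, dict, levels = 1)  # applies the top-level key: politics
tokens_lookup(toks, dict, levels = 2)  # applies the second-level keys: domestic, foreign
```
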
This release has some major changes to the API, described below.
The following data objects have been renamed:

new name | original name | notes |
---|---|---|
data_char_sampletext | exampleString | |
data_char_mobydick | mobydickText | |
data_dfm_LBGexample | LBGexample | |

The following objects have been renamed, but will not affect user-level functionality because they are primarily internal. Their man pages have been moved to a common ?data-internal
man page, hidden from the index, but linked from some of the functions that use them.

new name | original name | notes |
---|---|---|
data_int_syllables | englishSyllables | (used by textcount_syllables()) |
data_char_wordlists | wordlists | (used by readability()) |
data_char_stopwords | .stopwords | (used by stopwords()) |

The following functions will still work, but issue a deprecation warning:

new function | deprecated function | constructs |
---|---|---|
tokens | tokenize() | tokens class object |
corpus_subset | subset.corpus | corpus class object |
corpus_reshape | changeunits | corpus class object |
corpus_sample | sample | corpus class object |
corpus_segment | segment | corpus class object |
dfm_compress | compress | dfm class object |
dfm_lookup | applyDictionary | dfm class object |
dfm_remove | removeFeatures.dfm | dfm class object |
dfm_sample | sample.dfm | dfm class object |
dfm_select | selectFeatures.dfm | dfm class object |
dfm_smooth | smoother | dfm class object |
dfm_sort | sort.dfm | dfm class object |
dfm_trim | trim.dfm | dfm class object |
dfm_weight | weight | dfm class object |
textplot_wordcloud | plot.dfm | (plot) |
textplot_xray | plot.kwic | (plot) |
textstat_readability | readability | data.frame |
textstat_lexdiv | lexdiv | data.frame |
textstat_simil | similarity | dist |
textstat_dist | similarity | dist |
featnames | features | character |
nsyllable | syllables | (named) integer |
nscrabble | scrabble | (named) integer |
tokens_ngrams | ngrams | tokens class object |
tokens_skipgrams | skipgrams | tokens class object |
tokens_toupper | toUpper.tokens, toUpper.tokenizedTexts | tokens, tokenizedTexts |
tokens_tolower | toLower.tokens, toLower.tokenizedTexts | tokens, tokenizedTexts |
char_toupper | toUpper.character | character |
char_tolower | toLower.character | character |
tokens_compound | joinTokens, phrasetotoken | tokens class object |

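For example (assuming the deprecated aliases are still exported in this version), the old names continue to work but warn:

```r
library(quanteda)
char_tolower("Keep CALM and Carry On")  # new name
toLower("Keep CALM and Carry On")       # deprecated alias: same result, plus a deprecation warning
```
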
The following are new to v0.9.9 (and not associated with deprecated functions):

new function | description | output class |
---|---|---|
fcm() | constructor for a feature co-occurrence matrix | fcm |
fcm_select | selects features from an fcm | fcm |
fcm_remove | removes features from an fcm | fcm |
fcm_sort | sorts an fcm in alphabetical order of its features | fcm |
fcm_compress | compacts an fcm | fcm |
fcm_tolower | lowercases the features of an fcm and compacts | fcm |
fcm_toupper | uppercases the features of an fcm and compacts | fcm |
dfm_tolower | lowercases the features of a dfm and compacts | dfm |
dfm_toupper | uppercases the features of a dfm and compacts | dfm |
sequences | experimental collocation detection | sequences |

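A minimal sketch of the new `fcm()` constructor and one of the accompanying `fcm_*` helpers, counting co-occurrences within a +/- 2 token window:

```r
library(quanteda)
toks <- tokens(c("the cat sat on the mat", "the dog sat on the log"))
fcmat <- fcm(toks, context = "window", window = 2)
fcmat
fcm_remove(fcmat, "the")  # drop a feature from the fcm
```
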
The following functions and data objects have been removed or moved to another package:

removed name | reason |
---|---|
encodedTextFiles.zip | moved to the readtext package |
describeTexts | deprecated several versions ago for summary.character |
textfile | moved to package readtext |
encodedTexts | moved to package readtext, as data_char_encodedtexts |
findSequences | replaced by sequences |

- `to = "lsa"` functionality added to `convert()`. (#414)
- Improved how `valuetype` matches work for many functions.
- Added `View` methods for `kwic` objects, based on JavaScript DataTables.
- `kwic` is completely rewritten: it now uses fast hashed index matching in C++ and fully implements vectorized matches (#306) and all `valuetype`s (#307). See the example following this list.
- `tokens_lookup`, `tokens_select`, and `tokens_remove` are faster and use parallelization (based on the TBB library).
- `textstat_dist` and `textstat_simil` add fast, sparse, and parallel computation of many new distance and similarity matrices.
- New `textmodel_wordshoal` fitting function.
- Added `max_docfreq` and `min_docfreq` arguments, and better verbose output, to `dfm_trim`. (#383)
- Changes to `tokens()` for more memory-efficient token hashing when dealing with very large numbers of documents.
- Corpus-level metadata can now be supplied to `corpus()` through the `metacorpus` list argument.
- Added `[`, `[[`, and `$` methods for (hashed) `tokens` objects.
- Improvements to `collocations()` and `kwic()`.
- New `tokens_select()` (formerly `selectFeatures.tokens()`).
- Improved `ngrams()` and `joinTokens()` performance for hashed `tokens` class objects.
- Improved `dfm.character()` by using the new `tokens()` constructor to create hashed tokenized texts by default when creating a dfm, resulting in performance gains. Creating a dfm from a hashed `tokens` object is now 4-5 times faster than from the older `tokenizedTexts` object.
- New (hashed) `tokens` class object.
- Improvements to `textmodel_wordscores` objects.
- New `tokens_lookup()` method (formerly `applyDictionary()`) that also works with dictionaries that have multi-word keys. Addresses but does not entirely yet solve #188.
- New `sparsity()` function to compute the sparsity of a dfm.

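A minimal sketch of the rewritten `kwic()` with vectorized (multiple) patterns and glob-style matching:

```r
library(quanteda)
toks <- tokens("Immigration and asylum policy dominated the debate on immigration.")
kwic(toks, c("immig*", "asylum"), window = 3, valuetype = "glob")
```
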
- Added a feature co-occurrence matrix class (`fcm`).
- `corpus_reshape()` can now go from sentence and paragraph units back to documents (see the example following this list).
- Added a `by =` argument to `corpus_sample()`, for use in bootstrap resampling of sub-document units.
- Added `bootstrap_dfm()` to generate a list of dimensionally equivalent dfm objects based on sentence-level resampling of the original documents.
- Improved `tokens()` and `dfm()` for passing docvars through to tokens and dfm objects, and added `docvars()` and `metadoc()` methods for tokens and dfm class objects. Overall, the code for docvars and metadoc is now more robust and consistent.
- `docvars()` on eligible objects that contain no docvars now returns an empty 0 x 0 data.frame (in the spirit of #242).
- `textplot_scale1d` now produces sorted and grouped document positions for fitted wordfish models, and produces a ggplot2 plot object.
- `textmodel_wordfish()` now preserves sparsity while processing the dfm, and uses a fast approximation to an SVD to get starting values. This also dramatically improves performance in computing this model. (#482, #124)
- `kwic()` is now dramatically improved, and also returns an indexed set of tokens that makes subsequent commands on a kwic class object much faster. (#603)
- New `quanteda_options()` function for setting and retrieving package options.
- Improvements to `corpus_segment()`. (#634)
- Added `corpus_trimsentences()` and `char_trimsentences()` to remove sentences from a corpus or character object, based on token length or pattern matching.
- Two new arguments for `textstat_readability()`: `min_sentence_length` and `max_sentence_length`. (#632)
- Added methods for subsetting (`[`) or accessing values directly (`[[`). (#651)
- New `textstat_collocations()`, which combines the existing `collocations()` and `sequences()` functions (#434). Collocations now behave as sequences for other functions (such as `tokens_compound()`) and have greatly improved performance for such uses.
- `docvars()` now permits direct access to "metadoc" fields (starting with `_`, e.g. `_document`).
- `metadoc()` now returns a vector instead of a data.frame for a single variable, similar to `docvars()`.
- `verbose` options now take their default from `getOption("verbose")` rather than fixing the value in the function signatures. (#577)
- `textstat_dist()` and `textstat_simil()` now return a matrix if a `selection` argument is supplied, and coercion to a list produces a list of distances or similarities only for that selection.
- In `tokens()`, the old arguments (e.g. `removePunct`) still produce the same behaviour, but with a deprecation warning.
- Added `n_target` and `n_reference` columns to `textstat_keyness()` to return counts for each category being compared for keyness.

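A minimal sketch of reshaping a corpus to sentences and back to the original documents:

```r
library(quanteda)
corp <- corpus(c(d1 = "First sentence. Second sentence.", d2 = "Only one sentence here."))
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)                                    # 3 sentence-level documents
ndoc(corpus_reshape(corp_sent, to = "documents"))  # back to the 2 original documents
```
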
- Fixed a bug in `str()` on a corpus with no docvars. (#571)
- `removeURL` in `tokens()` now removes URLs where the first part of the URL is a single letter. (#587)
- `dfm_select` now works correctly for ngram features. (#589)
- Fixed a bug in `dfm_select(x, features)` when `features` was a dfm, which failed to produce the intended featnames matches for the output dfm.
- Fixed a bug in `corpus_segment(x, what = "tags")` when a document contained a whitespace just before a tag, at the beginning of the file, or ended with a tag followed by no text. (#618, #634)
- `corpus()` now works for a `tm::SimpleCorpus` object. (#680)
- Added `corpus_trim()` and `char_trim()` functions for selecting documents or subsets of documents based on sentence, paragraph, or document lengths.
- Metadata is now stored in `$meta` of the return object.
- Added a `dfm_group(x, groups = )` command, a convenience wrapper around `dfm.dfm(x, groups = )` (#725). See the example following this list.
- `corpus()` now accepts data.frame inputs containing `doc_id` and `text` fields, which also provides interoperability with the readtext package. Corpus construction methods are now more explicitly tailored to input object classes.
- `dfm_lookup()` behaves more robustly on different platforms, especially for keys whose values match no features. (#704)
- `textstat_simil()` and `textstat_dist()` no longer take the `n` argument, as this was not sorting features in the correct order.
- Fixed a bug in `tokens(x, what = "character")` when `x` included the Twitter characters `@` and `#`. (#637)
- Fixed a bug in which `ntype.dfm()` produced an incorrect result.
- Fixed bugs in `textstat_readability()` and `textstat_lexdiv()` for single-document returns when `drop = TRUE`.
- Fixed a bug in `corpus_reshape()`.
- `print`, `head`, and `tail` methods for `dfm` are more robust. (#684)
- Fixed a bug in `convert(x, to = "stm")` caused by zero-count documents and zero-count features in a dfm (#699, #700, #701). This also removes docvar rows from `$meta` when this is passed through the dfm, for zero-count documents.
- Fixed a bug in `dictionary()`. (#722)
- `dfm_compress` now preserves a dfm's docvars if collapsing only on the features margin, which means that `dfm_tolower()` and `dfm_toupper()` no longer remove the docvars.
- `fcm_compress()` now retains the fcm class, and generates an error when an asymmetric compression is attempted. (#728)
- `textstat_collocations()` now returns the collocations as character, not as a factor. (#736)
- Fixed a bug in `dfm_lookup(x, exclusive = FALSE)` in which an empty dfm was returned when there was no match. (#116)
- Argument passing from `dfm()` to `tokens()` is now robust, and preserves variables defined in the calling environment. (#721)
- Fixed a problem affecting `str()`, `names()`, and other indexing operations, which started happening on Linux and Windows platforms following the CRAN move to R 3.4.0. (#744)
- `dfm_weight()` now prints friendlier error messages when the weight vector contains features not found in the dfm. This improvement was prompted by a Stack Overflow question about this use case.
- New implementation of `textstat_collocations()`, which computes only the `lambda` method for now, but does so accurately and efficiently (#753, #803). This function is still under development and likely to change further.
- Added new `quanteda_options` settings that affect the maximum documents and features displayed by the dfm print method. (#756)
- `ngram` formation is now significantly faster, including with skips (skipgrams).
- New options for `topfeatures()`.

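A minimal sketch of the new `dfm_group()` wrapper, using a hypothetical `party` docvar:

```r
library(quanteda)
corp <- corpus(c("tax cut now", "tax rise later", "more schools"),
               docvars = data.frame(party = c("A", "B", "B")))
dfmat <- dfm(corp)
dfm_group(dfmat, groups = docvars(corp, "party"))  # one row per party, equivalent to dfm(x, groups = ...)
```
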
- `phrase()` converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in `tokens/dfm_select/remove`, `tokens_compound`, `tokens/dfm_lookup`, and `kwic`. `phrase()` and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838, and the example following this list.)
- `corpus.Corpus()` for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package. (#849)
- `textplot_keyness()` plots term "keyness", the association of words with contrasting classes as measured by `textstat_keyness()`.
- Changes to `tokens()` that improve the consistency and efficiency of the tokenization.
- Two new options for `quanteda_options()`: `language_stemmer` and `language_stopwords`, now used as the defaults in the `*_wordstem` functions and in `stopwords()`, respectively. `dfm()` also uses the stemmer option when `stem = TRUE`, rather than hard-wiring in the "english" stemmer. (#386)
- Added `textstat_frequency()` to compile feature frequencies, possibly by groups. (#825)

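A minimal sketch of `phrase()` applied to pattern matching in `tokens_select()` and `kwic()`:

```r
library(quanteda)
toks <- tokens("The United Nations met in New York.")
tokens_select(toks, phrase(c("United Nations", "New York")))  # keeps the multi-word sequences
kwic(toks, phrase("New York"))                                # keyword-in-context for a phrase
```
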
- Added a `nomatch` option to `tokens_lookup()` and `dfm_lookup()`, to provide token or feature counts for categories not matched to any dictionary key. (#496)
- `sequences()` and `collocations()` have been removed and replaced by `textstat_collocations()`.
- `dfm` objects with one or both dimensions having zero length, and empty `kwic` objects, now display more appropriately in their print methods (per #811).
- In `*_select`, `*_remove`, and `tokens_compound`, `features` has been replaced by `pattern`, and in `kwic`, `keywords` has been replaced by `pattern`. These all behave consistently with respect to `pattern`, which now has a unified single help page and parameter description. (#839) See also the new features related to `phrase()` above.
- Performance improvements to the `tokens_*` functions using hashed tokens, making some of them 10x faster. (#853)
- The `dfm_group()` function now allows "empty" documents to be created using the `fill = TRUE` option, for making documents conform to a selection (similar to how `dfm_select()` works for features when supplied a dfm as the pattern argument). The `groups` argument now behaves consistently across the functions where it is used. (#854)
- `dictionary()` now requires its main argument to be a list, not a series of elements that can be used to build a list.
- Changes to `tokens()` have improved the behaviour of `remove_hyphens = FALSE`, which now behaves more correctly regardless of the setting of `remove_punct`. (#887)
- A new `cbind.dfm()` function allows cbinding vectors, matrices, and (recyclable) scalars to dfm objects.
- In `textstat_collocations()`, we corrected the word matching and the lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
- The Bernoulli distribution in `textmodel_NB(x, y, distribution = "Bernoulli")` was previously inactive even when this option was set. It has now been fully implemented and tested. (#776, #780)
- Fixed a bug affecting the `remove_separators` argument in `tokens()`. See #796.
- Improvements to `ntoken()` and `ntype()`. (#795)
- `quanteda_options()` no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when options initialization takes place on attaching the package.
- Fixed a bug in `textstat_readability()` that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the `FOG.NRI` and `Linsear.Write` measures only.
- Fixed the computation of the `"logave"` and `"inverseprob"` weighting schemes.
- Fixed a bug in which `quanteda_options()` did not actually set the number of threads. In addition, we fixed a bug that turned threading off on macOS (due to a check for a gcc version that is not used for compiling the macOS binaries), which had prevented multi-threading from being used at all on that platform.
- Fixed an issue occurring when `quanteda_options()` is called without the namespace or package being attached or loaded. (#864)

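A minimal sketch of `quanteda_options()`; the `threads`, `language_stemmer`, and `language_stopwords` option names are taken from the notes above, and their availability may depend on the installed version:

```r
library(quanteda)
quanteda_options(threads = 2)                   # number of threads used by parallelized functions
quanteda_options(language_stemmer = "english")  # default stemmer language for *_wordstem()
quanteda_options("threads")                     # retrieve a single option value
```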