The workflow is now more logical and streamlined, with a new workflow vignette as well as a design vignette explaining the principles behind the workflow and the commands that encourage it. The design vignette also details the development plans and the work remaining on the project.
Newly rewritten command encoding() detects encoding for character, corpus, and corpusSource objects (created by textfile()). When creating a corpus using corpus(), conversion to UTF-8 is automatic if an encoding other than UTF-8, ASCII, or ISO-8859-1 is detected.
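For example, a minimal sketch (the input texts here are invented; only encoding() and corpus() are from this release):

```r
## Invented inputs, illustrating the detection described above
txts <- c(doc1 = "café au lait", doc2 = "plain ascii text")
encoding(txts)            # reports detected encoding(s) for the character vector
mycorpus <- corpus(txts)  # non-UTF-8/ASCII/ISO-8859-1 texts converted to UTF-8
```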
The tokenization, cleaning, lower-casing, and dfm construction functions now use the stringi package, based on the ICU library. This results not only in substantial speed improvements, but also in more correct handling of Unicode characters and strings.
tokenize() and clean() now use stringi, resulting in much faster performance and more consistent behaviour across platforms.
tokenize() now works on sentences
summary.corpus() and summary.character() now use the new tokenization functions for counting tokens
dfm(x, dictionary = mydict) now uses stringi and is both more reliable and many times faster.
phrasetotoken() now uses stringi.
removeFeatures() now uses stringi, with fixed, binary matching on tokenized texts.
textfile() has a new option, cache = FALSE, for storing the object in memory rather than writing the data to a temporary file, if that is preferred.
language() is removed. (See Encoding… section above for changes to encoding().)
new object encodedTexts contains some encoded character objects for testing.
ie2010Corpus now has UTF-8 encoded texts (previously Unicode-escaped for non-ASCII characters).
texts() and docvars() methods added for corpusSource objects.
new methods for tokenizedTexts objects: dfm(), removeFeatures(), and syllables()
syllables() is now much faster, using matching through stringi and merging using data.table.
added readability() to compute (fast!) readability indexes on a text or corpus
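A hedged usage sketch (the input text is invented, and any measure-selection arguments are omitted since the entry does not name them):

```r
## readability() on a plain character vector, as described above
txt <- "The cat sat on the mat. It was a very fine mat."
readability(txt)   # computes readability indexes for the text
```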
tokenize() now creates ngrams of any length, with two new arguments: ngrams = and concatenator = "_". The new arguments to tokenize() can be passed through from dfm().
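The two new arguments can be sketched as follows (the input text is invented; the form of the resulting tokens follows from the concatenator):

```r
## bigrams joined by the new concatenator argument, as described above
toks <- tokenize("a b c d", ngrams = 2, concatenator = "_")
# tokens of the form "a_b", "b_c", "c_d"

## the same arguments passed through from dfm()
mydfm <- dfm("a b c d", ngrams = 2, concatenator = "_")
```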
0.8.2-1: Changed R version dependency to 3.2.0 so that Mac binary would build on CRAN.
0.8.2-1: sample.corpus() now samples documents from a corpus, and sample.dfm() samples documents or features from a dfm. The trim() method with the nsample argument now calls sample.dfm().
tokenize improvements for what = "sentence": more robust to specifying options, and no longer splits sentences after common abbreviations such as "Dr.", "Prof.", etc.
corpus() no longer automatically converts encodings detected as non-UTF-8, as this detection is too imprecise.
new function scrabble() computes English Scrabble word values for any text, applying any summary numerical function.
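A hedged sketch (the argument name FUN is an assumption; the entry says only that any summary numerical function can be applied):

```r
## Scrabble word values summed over the text; the FUN argument name is assumed
scrabble("quixotic zephyr", FUN = sum)
```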
dfm() now 2x faster, replacing previous data.table matching with direct construction of sparse matrix from match().
Code is also much simpler, based on three new functions that are also available directly.
Fixed a bug in subset.corpus() related to environments that sometimes caused the method to break if nested in function environments.
addto option removed from dfm().
ignoredFeatures and removeFeatures() can now be applied to ngrams; changed behaviour of stem = TRUE applied to ngrams (in dfm()).
new ngrams.tokenizedTexts() method, replacing the current ngrams() and bigrams() functions.
ngrams() rewritten to accept fully vectorized arguments for n and for window, thus implementing "skip-grams". The separate function skipgrams() behaves in the standard "skipgram" fashion. bigrams(), deprecated since 0.7, has been removed from the namespace.
corpus() no longer checks all documents for text encoding; rather, this is now based on a random sample of max()
wordstem.dfm() both faster and more robust when working with large objects.
toLower.NULL() now allows toLower() to work on texts with no words (returns NULL for NULL input)
textfile() now works on zip archives of *.txt files, although this may not be entirely portable.
removeFeatures.dfm(x, stopwords), selectFeatures.dfm(x, features), and dfm(x, ignoredFeatures) now work on objects created with ngrams. (Any ngram containing a stopword is removed.) Performance on these functions is already good but will be improved further soon.
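For example (invented input; stopwords("english") is the package's built-in stopword list):

```r
## any ngram containing a stopword is removed, as described above
mydfm <- dfm("the quick brown fox jumps", ngrams = 2,
             ignoredFeatures = stopwords("english"))
# e.g. "the_quick" is dropped because it contains "the"
```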
selectFeatures(x, features =
head.dfm() and tail.dfm() methods added.
kwic() has new formals and new functionality, including a completely flexible set of matching for phrases, as well as control over how the texts and matching keyword(s) are tokenized.
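An illustrative sketch (the pattern syntax shown is an assumption based on the description; inaugTexts is the package's built-in set of inaugural speech texts):

```r
## keyword in context, with control over the context window
kwic(inaugTexts, "terror", window = 3)
## phrase matching (syntax assumed from the description above)
kwic(inaugTexts, "secure border")
```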
segment(x, what = "sentence") and changeunits(x, to = "sentences") now use tokenize(x, what = "sentence"). Annoying warning messages are now gone.
smoother() and weight() formal "smooth" now changed to "smoothing" to avoid clashes with stats::smooth().
Updated corpus.VCorpus() to work with recent updates to the tm package.
added print method for tokenizedTexts
fixed signature error message caused by weight(x, "relFreq") and weight(x, "tfidf"). Both now correctly produce objects of class dfmSparse.
fixed bug in dfm(x, keptFeatures = "whatever") that passed it through as a glob rather than a regex to selectFeatures(). Now takes a regex, as per the manual description.
fixed textfeatures() for type json, where now it can call jsonlite::fromJSON() on a file directly.
dictionary(x, format = "LIWC") now expanded to 25 categories by default, and handles entries that are listed on multiple lines in .dic files, such as those distributed with the LIWC.