Implements a "word4" tokeniser based on new RBBI (RuleBasedBreakIterator) rules. The rules are defined in a new .yml file that users can edit, and the defaults represent a significant improvement in pattern handling for words, sentences, and other forms of patterns. They are customised from the ICU break rules; both the standard and the customised rules are now found in the breakrules/ system folder, so that they could, in principle, be modified by the user.
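A minimal sketch of selecting this tokeniser explicitly. It assumes the tokeniser is requested through the what argument of tokens() as "word4"; the entry above does not state the calling convention, and the sample text is invented purely for illustration.

```r
library(quanteda)

# Hypothetical sample text, chosen only to exercise hyphens, URLs, and hashtags
txt <- c(d1 = "Self-driving cars at https://example.com #future")

# Assumption: the new tokeniser is selected with what = "word4"
toks <- tokens(txt, what = "word4")
print(toks)
```

Comparing the result against another value of what on the same text is a quick way to see where the customised break rules change the segmentation.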
Other minor changes:
"word2"
:
\\p{M}
).preserve_special()
that rejoined splits created by the default stringi tokeniser machinery.
dfm_group() now works correctly with an empty dfm (#2225).
convert(x, to = "stm") is no longer vulnerable to large numbers of removed features, as in #2189.
Fixed a potential crash when calling tokens_compound() with patterns containing paddings (#2254).
Updated for compatibility with (forthcoming) Matrix 1.5.5 handling of dimnames() for empty dimensions.
Restores readtext object class method extensions, to work better with the readtext package.
Removes some unused internal methods, such as docvars.kwic(), that were not exported despite matching exported generics.