Create a set of n-grams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skip-grams. Both the n-gram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.
x: a tokens object, a character vector, or a list of character vectors
n: integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines an \(n\) in the \(n\)-gram(s) that are produced.
skip: integer vector specifying the adjacency skip size for tokens forming the n-grams; the default is 0, for only immediately neighbouring words. For skip-grams, skip can be a vector of integers, as the "classic" approach to forming skip-grams is to set skip = \(k\), where \(k\) is the distance for which \(k\) or fewer skips are used to construct the \(n\)-gram. Thus a "4-skip-n-gram", defined as skip = 0:4, produces results that include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (where 0 skips are typical n-grams formed from adjacent words). See Guthrie et al. (2006).
concatenator: character for combining words; the default is _ (underscore)
a tokens object consisting of a list of character vectors of n-grams, one list element per text, or a character vector if called on a simple character vector
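The effect of a vector-valued n and of the concatenator can be sketched in base R. This is only an illustration of the behaviour described above, not quanteda's C++ implementation; ngrams_sketch() is a hypothetical helper name introduced here, and it covers adjacent tokens only (the skip = 0 case).

```r
# Hypothetical helper, NOT part of quanteda: contiguous n-grams in base R.
# x is a character vector of tokens; n may hold several lengths.
ngrams_sketch <- function(x, n = 2L, concatenator = "_") {
  unlist(lapply(n, function(k) {
    # No n-gram of length k exists if the text is shorter than k tokens
    if (k > length(x)) return(character(0))
    starts <- seq_len(length(x) - k + 1L)
    # Join the k adjacent tokens beginning at each start position
    vapply(starts,
           function(i) paste(x[i:(i + k - 1L)], collapse = concatenator),
           character(1))
  }))
}

res <- ngrams_sketch(c("a", "b", "c", "d", "e"), n = 2:3)
res
# all bigrams first, then all trigrams:
# "a_b" "b_c" "c_d" "d_e" "a_b_c" "b_c_d" "c_d_e"
```

Each element of n contributes its own batch of n-grams, so n = 2:3 yields the bigrams followed by the trigrams in a single result.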
Normally, these functions will be called through tokens(x, ngrams = , ...), but they are provided in case a user wants to perform lower-level n-gram construction on tokenized texts.
tokens_skipgrams() is a wrapper to tokens_ngrams() that requires arguments to be supplied for both n and skip. For \(k\)-skip skip-grams, set skip to 0:\(k\), in order to conform to the definition of skip-grams found in Guthrie et al. (2006): A \(k\) skip-gram is an n-gram which is a superset of all n-grams and each \((k-i)\) skip-gram until \((k-i)==0\) (which includes 0 skip-grams).
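To make the skip = 0:\(k\) convention concrete for the bigram case, here is a minimal base R sketch. skip_bigrams() is a hypothetical helper, not a quanteda function; it handles only n = 2 and makes no claim about how the package implements skip-grams internally. Each skip value s pairs a token with the token s + 1 positions later, so a vector of skips yields the union of the 0-skip, 1-skip, ... bigrams.

```r
# Hypothetical helper, NOT part of quanteda: skip-gram bigrams in base R.
skip_bigrams <- function(x, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (i in seq_along(x)) {
    for (s in skip) {
      j <- i + s + 1L  # partner token: s tokens are skipped in between
      if (j <= length(x)) {
        out <- c(out, paste(x[i], x[j], sep = concatenator))
      }
    }
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")

# 1-skip bigrams in the Guthrie et al. sense: 0 skips and 1 skip together
one_skip <- skip_bigrams(toks, skip = 0:1, concatenator = " ")
length(one_skip)  # 7

# 2-skip bigrams: a superset of the 1-skip result
two_skip <- skip_bigrams(toks, skip = 0:2, concatenator = " ")
all(one_skip %in% two_skip)  # TRUE
```

With these five tokens the sketch gives the same seven and nine bigrams as the tokens_skipgrams() examples for skip = 0:1 and skip = 0:2.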
char_ngrams is a convenience wrapper for a (non-list) vector of characters, so named to be consistent with quanteda's naming scheme.
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006.
"A Closer Look at Skip-Gram Modelling." https://aclanthology.org/L06-1210/
# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a_b" "b_c" "c_d" "d_e" "a_b_c" "b_c_d" "c_d_e"
#>
#> text2 :
#> [1] "c_d" "d_e" "e_f" "f_g" "c_d_e" "d_e_f" "e_f_g"
#>
toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "the" "quick" "brown" "fox" "jumped"
#> [6] "over" "the" "lazy" "dog" "the_quick"
#> [11] "quick_brown" "brown_fox"
#> [ ... and 12 more ]
#>
tokens_ngrams(toks, n = c(2,4), concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "the quick" "quick brown" "brown fox"
#> [4] "fox jumped" "jumped over" "over the"
#> [7] "the lazy" "lazy dog" "the quick brown fox"
#> [10] "quick brown fox jumped" "brown fox jumped over" "fox jumped over the"
#> [ ... and 2 more ]
#>
tokens_ngrams(toks, n = c(2,4), skip = 1, concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "the brown" "quick fox" "brown jumped"
#> [4] "fox over" "jumped the" "over lazy"
#> [7] "the dog" "the brown jumped the" "quick fox over lazy"
#> [10] "brown jumped the dog"
#>
# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "insurgents killed" "insurgents in" "killed in"
#> [4] "killed ongoing" "in ongoing" "in fighting"
#> [7] "ongoing fighting"
#>
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "insurgents killed" "insurgents in" "insurgents ongoing"
#> [4] "killed in" "killed ongoing" "killed fighting"
#> [7] "in ongoing" "in fighting" "ongoing fighting"
#>
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "insurgents killed in" "insurgents killed ongoing"
#> [3] "insurgents killed fighting" "insurgents in ongoing"
#> [5] "insurgents in fighting" "insurgents ongoing fighting"
#> [7] "killed in ongoing" "killed in fighting"
#> [9] "killed ongoing fighting" "in ongoing fighting"
#>