vignettes/pkgdown/examples/phrase.Rmd
quanteda has the functionality to select, remove or compound multi-word expressions, such as phrasal verbs (“try on”, “wake up” etc.) and place names (“New York”, “South Korea” etc.).
toks <- tokens(data_corpus_inaugural)
Functions for tokens objects take a character vector, a dictionary or
collocations as pattern. All three can be used for multi-word
expressions, but you should be aware of their differences.
The most basic way to define multi-word expressions is to separate
the words by whitespace and wrap the character vector in
phrase().
multiword <- c("United States", "New York")
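To see how these patterns will be matched, you can print the wrapped object; phrase() splits each element on whitespace, so each pattern is treated as a sequence of tokens (a quick sketch, output not shown):

```r
library("quanteda")

multiword <- c("United States", "New York")

# phrase() splits each element on whitespace, so each pattern is
# matched as a sequence of tokens rather than as a single token
phrase(multiword)
```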
kwic() is useful for finding multi-word expressions in
tokens. If you are not sure whether “United” and “States” are separated,
check their positions (e.g. “433:434”).
head(kwic(toks, pattern = phrase(multiword)))
## Keyword-in-context with 6 matches.
## [1789-Washington, 433:434] of the people of the | United States |
## [1789-Washington, 529:530] more than those of the | United States |
## [1797-Adams, 524:525] saw the Constitution of the | United States |
## [1797-Adams, 1716:1717] to the Constitution of the | United States |
## [1797-Adams, 2480:2481] support the Constitution of the | United States |
## [1805-Jefferson, 441:442] sees a taxgatherer of the | United States |
##
## a Government instituted by themselves
## . Every step by which
## in a foreign country.
## , and a conscientious determination
## , I entertain no doubt
## ? These contributions enable us
Similarly, you can select or remove multi-word expressions using
tokens_select().
head(tokens_select(toks, pattern = phrase(multiword)))
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "United" "States" "United" "States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United" "States" "United" "States" "United" "States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United" "States"
##
## 1809-Madison :
## [1] "United" "States" "United" "States"
tokens_compound() joins the elements of multi-word
expressions with an underscore, so they become “United_States” and
“New_York”.
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))
head(tokens_select(comp_toks, pattern = c("United_States", "New_York")))
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "United_States" "United_States"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "United_States" "United_States" "United_States"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "United_States"
##
## 1809-Madison :
## [1] "United_States" "United_States"
Elements of multi-word expressions should be separated by
whitespace in a dictionary, but you do not need phrase()
here.
dict_multiword <- dictionary(list(country = "United States",
city = "New York"))
head(tokens_lookup(toks, dictionary = dict_multiword))
## Tokens consisting of 6 documents and 4 docvars.
## 1789-Washington :
## [1] "country" "country"
##
## 1793-Washington :
## character(0)
##
## 1797-Adams :
## [1] "country" "country" "country"
##
## 1801-Jefferson :
## character(0)
##
## 1805-Jefferson :
## [1] "country"
##
## 1809-Madison :
## [1] "country" "country"
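If you want to keep the tokens that did not match any dictionary key, you can set exclusive = FALSE in tokens_lookup(); note that the keys are then capitalized by default. A sketch using the objects defined above:

```r
# exclusive = FALSE keeps unmatched tokens and replaces matched
# phrases with their dictionary keys (capitalized by default)
toks_mixed <- tokens_lookup(toks, dictionary = dict_multiword,
                            exclusive = FALSE)
head(toks_mixed, 1)
```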
With textstat_collocations()
, it is possible to discover
multi-word expressions through statistical scoring of the associations
of adjacent words.
If textstat_collocations()
is applied to a tokens object
composed only of capitalized words, it usually returns multi-word proper
names.
library("quanteda.textstats")
col <- toks |>
tokens_remove(stopwords("en")) |>
tokens_select(pattern = "^[A-Z]", valuetype = "regex",
case_insensitive = FALSE, padding = TRUE) |>
textstat_collocations(min_count = 5, tolower = FALSE)
head(col)
## collocation count count_nested length lambda z
## 1 United States 158 0 2 8.669783 28.56498
## 2 Federal Government 32 0 2 5.594674 21.95312
## 3 Chief Justice 14 0 2 8.932609 18.49827
## 4 Almighty God 15 0 2 7.071014 18.16631
## 5 Constitution United 19 0 2 4.028614 15.99904
## 6 North South 8 0 2 8.170469 15.44760
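Before compounding, you may want to keep only the collocations with strong statistical support. Since the result of textstat_collocations() behaves like a data.frame, you can subset it by its z statistic (a sketch; the threshold here is arbitrary):

```r
# keep only collocations whose z statistic exceeds an arbitrary threshold
col_strong <- col[col$z > 10, ]
comp_toks_strong <- tokens_compound(toks, pattern = col_strong)
```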
Collocations are automatically recognized as multi-word expressions
by tokens_compound()
through case-sensitive fixed pattern
matching. This is the fastest way to compound large numbers of
multi-word expressions, but make sure to set tolower = FALSE
in textstat_collocations()
for this to work.
comp_toks2 <- tokens_compound(toks, pattern = col)
head(kwic(comp_toks2, pattern = c("United_States", "New_York")))
## Keyword-in-context with 6 matches.
## [1789-Washington, 433] of the people of the | United_States |
## [1789-Washington, 528] more than those of the | United_States |
## [1797-Adams, 524] saw the Constitution of the | United_States |
## [1797-Adams, 1715] to the Constitution of the | United_States |
## [1797-Adams, 2478] support the Constitution of the | United_States |
## [1805-Jefferson, 441] sees a taxgatherer of the | United_States |
##
## a Government instituted by themselves
## . Every step by which
## in a foreign country.
## , and a conscientious determination
## , I entertain no doubt
## ? These contributions enable us
You can use phrase()
on collocations if more flexibility
is needed. This is usually the case when you compound tokens from a
different corpus.
comp_toks3 <- tokens_compound(toks, pattern = phrase(col$collocation))
head(kwic(comp_toks3, pattern = c("United_States", "New_York")))
## Keyword-in-context with 6 matches.
## [1789-Washington, 433] of the people of the | United_States |
## [1789-Washington, 528] more than those of the | United_States |
## [1797-Adams, 524] saw the Constitution of the | United_States |
## [1797-Adams, 1715] to the Constitution of the | United_States |
## [1797-Adams, 2478] support the Constitution of the | United_States |
## [1805-Jefferson, 441] sees a taxgatherer of the | United_States |
##
## a Government instituted by themselves
## . Every step by which
## in a foreign country.
## , and a conscientious determination
## , I entertain no doubt
## ? These contributions enable us