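duration
Creates an object that stores a length of time measured in seconds.
period
Creates a length of time measured in human calendar units.
interval
Computes the length of time between two date-time objects.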
Base R provides string functions such as paste(), substr(), substring(), grep(), gsub(), and strsplit().
The R package stringr offers more functions for working with text and string data.
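Entering text is far more complicated than entering numbers: you have to consider upper and lower case, spaces or Tab characters, single or double quotes, special symbols and characters, and so on.
In R, the way a special symbol is displayed differs somewhat from the symbol you actually meant to enter.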
st1 <- "This is a book"
st1
## [1] "This is a book"
st2 <- 'To include a double "quote" inside a string, use single quotes'
st2
## [1] "To include a double \"quote\" inside a string, use single quotes"
st3 <- "To include a single 'quote' inside a string, use double quotes"
st3
## [1] "To include a single 'quote' inside a string, use double quotes"
double_quote <- "\"" # or '"'
double_quote
## [1] "\""
single_quote <- '\'' # or "'"
single_quote
## [1] "'"
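Similarly, to enter a backslash \, type two consecutive backslashes: \\.
R displays this backslash as "\\", which differs from the single backslash that is actually stored.
To show a string as it is actually stored, use the function writeLines().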
x.char <- c("\"", "\\")
x.char
## [1] "\"" "\\"
writeLines(x.char)
## "
## \
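The difference between the displayed form and the stored form can be checked with a short sketch. The Windows-style path below is only an illustrative value, not one taken from the text.

```r
# Compare the printed (escaped) form with the characters actually stored.
x <- "C:\\temp\\new"    # typed with doubled backslashes
print(x)                # shows the escaped form, with "\\"
writeLines(x)           # shows the stored characters, with single "\"
n_chars <- nchar(x)     # counts stored characters, not typed ones
n_chars
```

Each doubled backslash counts as a single character, so nchar() reports fewer characters than were typed.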
The command ?'"' or ?"'" shows how to enter special symbols.

\n            newline
\r            carriage return
\t            tab
\b            backspace
\a            alert (bell)
\f            form feed
\v            vertical tab
\\            backslash \
\'            ASCII apostrophe '
\"            ASCII quotation mark "
\`            ASCII grave accent (backtick) `
\nnn          character with given octal code (1, 2 or 3 digits)
\xnn          character with given hex code (1 or 2 hex digits)
\unnnn        Unicode character with given code (1–4 hex digits)
\Unnnnnnnn    Unicode character with given code (1–8 hex digits)

10.2 The stringr package
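Base R has many functions for handling text or strings, but their arguments are not consistent and are easy to confuse. The functions in the tidyverse package stringr therefore all start with str_. For example, str_length() returns the number of characters in each element of a character vector.
library(stringr)
str_length(c("a", "Biostatistics", "Medical Statistics", "\'\b\t", NA))
## [1] 1 13 18 3 NA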
10.3 Combining strings: str_c()
The function str_c() combines strings, much like the base R function paste(). Use the argument sep to set the separator placed between the pieces.
str_c("medical", "statistics")
## [1] "medicalstatistics"
str_c("medical", "statistics", sep = " ")
## [1] "medical statistics"
str_c("medical", "statistics", sep = "-")
## [1] "medical-statistics"
str_c("medical", "statistics", sep = " + ")
## [1] "medical + statistics"
str_c("|-", "medical", "statistics", "-|")
## [1] "|-medicalstatistics-|"
If any input is a missing value NA, the result is still NA. To have the missing value printed as the literal text NA instead, wrap the input in str_replace_na().
x.char <- c("bio", NA, "statistics")
str_c("pre-", x.char, "-end")
## [1] "pre-bio-end" NA "pre-statistics-end"
str_c("pre-", str_replace_na(x.char), "-end")
## [1] "pre-bio-end" "pre-NA-end" "pre-statistics-end"
To combine the elements of a character vector into a single string, use the argument collapse. Note that sep has no effect when only one vector is supplied.
char.vec <- c("I", "love", "biostatistics")
str_c(char.vec, collapse = ", ")
## [1] "I, love, biostatistics"
str_c(char.vec, collapse = "+")
## [1] "I+love+biostatistics"
str_c(char.vec, collapse = " ")
## [1] "I love biostatistics"
str_c(char.vec, sep = " ")
## [1] "I" "love" "biostatistics"
str_c("I", "love", "biostatistics", sep = " ")
## [1] "I love biostatistics"
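As a small sketch of how sep and collapse work together, the example below joins two vectors element by element and then flattens the result into one string. The names are made-up sample data.

```r
library(stringr)

first <- c("Ada", "Grace")
last  <- c("Lovelace", "Hopper")

# sep joins the vectors element by element: the result has length 2.
full_names <- str_c(first, last, sep = " ")

# collapse then merges those elements into a single string.
one_line <- str_c(first, last, sep = " ", collapse = "; ")

full_names
one_line
```

Here sep works across the arguments and collapse works along the resulting vector.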
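10.4 Extracting part of each string: str_sub()
The function str_sub() extracts part of the text of each element in a character vector. The arguments start and end give the first and last character positions within each element. An element shorter than the requested range is still returned, as far as it goes.
char.vec <- c("I", "love", "medical", "statistics")
str_sub(char.vec, start = 1, end = 3)
## [1] "I" "lov" "med" "sta"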
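Beyond the positive positions used above, str_sub() also accepts negative positions, which count from the end of the string, and it has an assignment form. The sketch below shows both; the variable names are illustrative.

```r
library(stringr)

words <- c("medical", "statistics")

# Negative positions count backwards from the last character.
last_three <- str_sub(words, start = -3, end = -1)

# The assignment form replaces the selected part in place.
capitalised <- words
str_sub(capitalised, 1, 1) <- str_to_upper(str_sub(capitalised, 1, 1))

last_three
capitalised
```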
10.5 Locale settings, letter case, and sorting
Different languages may use similar-looking letters, and case conversion between them often goes wrong. To make sure case conversion or sorting is correct, the functions in stringr let you set the locale used for the text. Examples are the case-conversion functions str_to_lower(), str_to_upper(), and str_to_title().
char.vec <- c("I", "Love", "Medical", "Statistics")
str_to_upper(char.vec)
## [1] "I" "LOVE" "MEDICAL" "STATISTICS"
str_to_lower(char.vec)
## [1] "i" "love" "medical" "statistics"
str_to_title(str_to_upper(char.vec))
## [1] "I" "Love" "Medical" "Statistics"
str_to_title(str_to_lower(char.vec))
## [1] "I" "Love" "Medical" "Statistics"
The base R functions sort() and order() use the locale of the current R session. The stringr functions str_sort() and str_order() take a locale argument that sets the locale explicitly.
veg.vec <- c("apple", "eggplant", "banana")
sort(veg.vec)
## [1] "apple" "banana" "eggplant"
order(veg.vec)
## [1] 1 3 2
str_sort(veg.vec, locale = "en") # English
## [1] "apple" "banana" "eggplant"
str_sort(veg.vec, locale = "haw") # Hawaiian
## [1] "apple" "eggplant" "banana"
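str_sort() and str_order() also take a numeric argument, which the text above does not cover. The sketch below assumes labels that end in digits and shows digits being ordered as numbers rather than character by character.

```r
library(stringr)

ids <- c("x10", "x2", "x1")

# numeric = TRUE compares runs of digits as numbers.
by_number <- str_sort(ids, locale = "en", numeric = TRUE)

# str_order() returns the index vector that produces the same ordering.
idx <- str_order(ids, locale = "en", numeric = TRUE)

by_number
ids[idx]
```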
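10.6 Removing whitespace, adding padding, and truncating text: str_trim() and str_pad()
The stringr functions str_trim() and str_pad() remove whitespace from, or add padding to, the start and end of each string in a character vector. The function str_trunc() shortens a string to a given width.
str_trim(string, side = c("both", "left", "right"))
str_pad(string, width, side = c("left", "right", "both"), pad = " ")
str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...")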
The side values both, left, and right act on both ends, the left end, and the right end of the string. width is the length of the string after padding, and pad is the character used for padding in place of a space.
veg.vec <- c("apple ", " eggplant ", " banana")
str_trim(veg.vec, side = c("both"))
## [1] "apple" "eggplant" "banana"
str_trim(veg.vec, side = c("left"))
## [1] "apple " "eggplant " "banana"
str_trim(veg.vec, side = c("right"))
## [1] "apple" " eggplant" " banana"
str_pad("a", width = 15, side = c("both"), pad = " ")
## [1] "       a       "
str_pad("a", width = 15, side = c("both"), pad = c("_"))
## [1] "_______a_______"
str_pad(veg.vec, width = 15, side = c("both"))
## [1] "    apple      " "   eggplant    " "     banana    "
str_pad(veg.vec, width = 15, side = c("left"))
## [1] "         apple " "      eggplant " "         banana"
str_pad(veg.vec, width = 15, side = c("right"))
## [1] "apple          " " eggplant      " " banana        "
str_pad(veg.vec, width = 15, side = c("both"), pad = c("_"))
## [1] "____apple _____" "__ eggplant ___" "____ banana____"
char.vec <- c("I love biostatistics")
str_trunc(char.vec, width = 10, side = c("center"))
## [1] "I lo...ics"
str_trunc(char.vec, width = 10, side = c("left"))
## [1] "...tistics"
str_trunc(char.vec, width = 10, side = c("right"))
## [1] "I love ..."
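Two common uses of these functions are zero-padding identifiers to a fixed width and cleaning up irregular spacing. The sketch below uses str_pad() as described above together with str_squish(), a stringr function not covered in the text that also collapses repeated internal whitespace. The input values are invented.

```r
library(stringr)

# Pad on the left with "0" so every ID has exactly 5 characters.
ids <- c("7", "42", "123")
padded <- str_pad(ids, width = 5, side = "left", pad = "0")

# str_trim() only touches the ends; str_squish() also collapses inner runs.
messy <- "  I   love  biostatistics "
trimmed <- str_trim(messy, side = "both")
clean <- str_squish(messy)

padded
trimmed
clean
```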
10.7 Finding patterns in strings
An important part of string processing is finding text that fits a given pattern and then detecting, locating, extracting, matching, replacing, or splitting it.
10.7.1 Detecting matches: str_detect()
The stringr function str_detect() tests whether each element of a character vector contains the pattern and returns a logical vector, like the base R function grepl(pattern, x). The function str_count() counts the number of matches within each string.
The argument pattern defines the pattern to look for. With negate = TRUE, the result is TRUE for the elements that do not match.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_detect(char.vec, pattern = "statistics", negate = FALSE)
## [1] TRUE TRUE FALSE FALSE
str_detect(char.vec, pattern = "statistics", negate = TRUE)
## [1] FALSE FALSE TRUE TRUE
str_detect(char.vec, pattern = "ti", negate = FALSE)
## [1] TRUE TRUE FALSE TRUE
str_detect(char.vec, pattern = "function", negate = FALSE)
## [1] FALSE FALSE FALSE FALSE
str_count(char.vec, pattern = "ti")
## [1] 2 2 0 1
str_count(char.vec, pattern = "b")
## [1] 0 1 2 1
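Because str_detect() returns a logical vector, it combines naturally with sum(), mean(), and logical indexing. A minimal sketch, reusing the same four words under a different variable name:

```r
library(stringr)

courses <- c("statistics", "biostatistics", "probability", "distribution")
has_stat <- str_detect(courses, pattern = "stat")

n_match <- sum(has_stat)      # TRUE counts as 1, so this is the number of matches
share   <- mean(has_stat)     # proportion of elements that match
matched <- courses[has_stat]  # keep only the matching elements

n_match
share
matched
```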
10.7.2 Locating matches: str_locate()
The function str_locate() finds the position of the first match in each string and returns a matrix with the start and end positions. It is similar to the base R functions regexpr() and gregexpr().
The function str_locate_all() finds the positions of all matches and returns a list.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_locate(char.vec, pattern = "ti")
##      start end
## [1,]     4   5
## [2,]     7   8
## [3,]    NA  NA
## [4,]     9  10
str_locate_all(char.vec, pattern = "ti")
## [[1]]
##      start end
## [1,]     4   5
## [2,]     7   8
##
## [[2]]
##      start end
## [1,]     7   8
## [2,]    10  11
##
## [[3]]
##      start end
##
## [[4]]
##      start end
## [1,]     9  10
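The positions returned by str_locate() can be passed to str_sub(). The sketch below keeps the text in front of the first match; treat it as one possible way to combine the two functions, not the only one.

```r
library(stringr)

x <- c("statistics", "probability")
pos <- str_locate(x, pattern = "ti")   # matrix with columns "start" and "end"

# Everything before the first match; NA where the pattern is absent.
before <- str_sub(x, start = 1, end = pos[, "start"] - 1)

pos
before
```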
10.7.3 Matching elements and their indices: str_subset() and str_which()
The function str_subset() returns the elements of a character vector that match the pattern, and the function str_which() returns the indices of those elements.
With negate = TRUE, both return the elements or indices that do not match. str_subset() works like x[str_detect(x, pattern)] and is equivalent to the base R call grep(pattern, x, value = TRUE). str_which() works like which(str_detect(x, pattern)) and is equivalent to the base R call grep(pattern, x), just as str_detect() is equivalent to the base R call grepl(pattern, x).
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_subset(char.vec, pattern = "ti")
## [1] "statistics" "biostatistics" "distribution"
str_which(char.vec, pattern = "ti")
## [1] 1 2 4
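The stated equivalences can be checked directly with identical(). This sketch uses a plain literal pattern; for more complex patterns the regular expression engines behind stringr and base R may differ in detail.

```r
library(stringr)

x <- c("statistics", "biostatistics", "probability", "distribution")

same_subset <- identical(str_subset(x, "ti"), grep("ti", x, value = TRUE))
same_which  <- identical(str_which(x, "ti"), grep("ti", x))
same_detect <- identical(str_detect(x, "ti"), grepl("ti", x))

# negate = TRUE returns the elements that do not match.
not_matched <- str_subset(x, "ti", negate = TRUE)

c(same_subset, same_which, same_detect)
not_matched
```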
10.7.4 Extracting matches: str_extract()
The function str_extract() extracts the first match from each string and returns a character vector.
The function str_extract_all() extracts all matches and returns a list of character vectors. The argument simplify = TRUE simplifies the result to a character matrix.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_extract(char.vec, pattern = "ti")
## [1] "ti" "ti" NA "ti"
str_extract_all(char.vec, pattern = "ti")
## [[1]]
## [1] "ti" "ti"
##
## [[2]]
## [1] "ti" "ti"
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "ti"
10.7.5 Matching groups: str_match()
The function str_match() is used with patterns that contain groups. For the first match in each string it returns a character matrix: the first column holds the complete match, and the remaining columns hold the text matched by each group.
The function str_match_all() does the same for all matches.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_match(char.vec, pattern = "(a|ti)")
##      [,1] [,2]
## [1,] "a"  "a"
## [2,] "a"  "a"
## [3,] "a"  "a"
## [4,] "ti" "ti"
str_match_all(char.vec, pattern = "(a|ti)")
## [[1]]
##      [,1] [,2]
## [1,] "a"  "a"
## [2,] "ti" "ti"
## [3,] "ti" "ti"
##
## [[2]]
##      [,1] [,2]
## [1,] "a"  "a"
## [2,] "ti" "ti"
## [3,] "ti" "ti"
##
## [[3]]
##      [,1] [,2]
## [1,] "a"  "a"
##
## [[4]]
##      [,1] [,2]
## [1,] "ti" "ti"
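The group columns are most useful when a pattern has several groups. As a sketch, the made-up strings below contain dates in year-month-day form, and each part is captured by its own group.

```r
library(stringr)

notes <- c("visit 2021-03-15", "no date", "follow-up 2022-11-02")

# Column 1 is the whole match; columns 2 to 4 are the three groups.
parts <- str_match(notes, pattern = "(\\d{4})-(\\d{2})-(\\d{2})")
years <- parts[, 2]

parts
years
```

A string without a match yields a row of NA values, so the result always has one row per input element.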
10.7.6 Replacing matches: str_replace()
The function str_replace() replaces the first match of the pattern in each string with another string.
The argument replacement gives the new string that takes the place of the matched text. The function str_replace_all() replaces all matches.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_replace(char.vec, pattern = "ti", replacement = "--")
## [1] "sta--stics" "biosta--stics" "probability" "distribu--on"
str_replace_all(char.vec, pattern = "b", replacement = "+++")
## [1] "statistics" "+++iostatistics" "pro+++a+++ility" "distri+++ution"
10.7.7 Splitting strings: str_split()
The function str_split() splits each string in a character vector at the matches of the pattern and returns the pieces as a list.
str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n)
str_split_i(string, pattern, i)
The argument n sets the maximum number of pieces to return, and simplify = TRUE simplifies the result to a character matrix. The function str_split_fixed() returns a character matrix with n columns. The function str_split_i() returns a character vector holding the i-th piece of each string.
char.vec <- c("a b c", "d e", "bio-statistics required-courses")
str_split(char.vec, pattern = " ", n = Inf, simplify = FALSE)
## [[1]]
## [1] "a" "b" "c"
##
## [[2]]
## [1] "d" "e"
##
## [[3]]
## [1] "bio-statistics" "required-courses"
str_split(char.vec, pattern = " ", n = Inf, simplify = TRUE)
##      [,1]             [,2]               [,3]
## [1,] "a"              "b"                "c"
## [2,] "d"              "e"                ""
## [3,] "bio-statistics" "required-courses" ""
str_split_fixed(char.vec, pattern = " ", n = 2)
##      [,1]             [,2]
## [1,] "a"              "b c"
## [2,] "d"              "e"
## [3,] "bio-statistics" "required-courses"
str_split_fixed(char.vec, pattern = "-", n = 2)
##      [,1]    [,2]
## [1,] "a b c" ""
## [2,] "d e"   ""
## [3,] "bio"   "statistics required-courses"
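A typical use of splitting is parsing a delimited record. The sketch below assumes a hypothetical record of key=value pairs separated by semicolons, and combines str_split() with str_split_fixed().

```r
library(stringr)

record <- "id=7;name=Ada;dept=stats"   # hypothetical input format

# First split the record into fields; [[1]] takes the only list element.
fields <- str_split(record, pattern = ";")[[1]]

# Then split each field into exactly two columns: key and value.
kv <- str_split_fixed(fields, pattern = "=", n = 2)
values <- setNames(kv[, 2], kv[, 1])

fields
values
```

Using n = 2 keeps any further "=" inside a value intact, because only the first match is used as a split point.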
10.8 Matching several patterns at once
Sometimes a search has to cover more than one form of pattern. This calls for the concepts of alternation, anchors, and look-arounds. For example, to search for either b or ti, enter b|ti.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_replace(char.vec, pattern = "b|ti", replacement = "--")
## [1] "sta--stics" "--iostatistics" "pro--ability" "distri--ution"
str_replace_all(char.vec, pattern = "b|ti", replacement = "+++")
## [1] "sta+++s+++cs" "+++iosta+++s+++cs" "pro+++a+++ility" "distri+++u+++on"
Anchors: the start symbol ^ matches a pattern at the beginning of a string, and the end symbol $ matches a pattern at the end of a string. For example, ^b finds strings that start with b, and n$ finds strings that end with n.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_replace(char.vec, pattern = "^b", replacement = "--")
## [1] "statistics" "--iostatistics" "probability" "distribution"
str_replace_all(char.vec, pattern = "n$", replacement = "+++")
## [1] "statistics" "biostatistics" "probability" "distributio+++"
Sometimes the search is for text that comes before or after a given pattern, for example the character before ti or the character after p. Parentheses () express this before/after relationship. a(?=c) matches an a that is followed by c, a(?!c) matches an a that is not followed by c, (?<=b)a matches an a that is preceded by b, and (?<!b)a matches an a that is not preceded by b.
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_replace(char.vec, pattern = "t(?=i)", replacement = "--")
## [1] "sta--istics" "biosta--istics" "probability" "distribu--ion"
str_replace(char.vec, pattern = "t(?!i)", replacement = "--")
## [1] "s--atistics" "bios--atistics" "probabili--y" "dis--ribution"
str_replace(char.vec, pattern = "(?<=i)o", replacement = "--")
## [1] "statistics" "bi--statistics" "probability" "distributi--n"
str_replace_all(char.vec, pattern = "(?<=t)i", replacement = "--")
## [1] "stat--st--cs" "biostat--st--cs" "probability" "distribut--on"
str_replace(char.vec, pattern = "(?<!t)i", replacement = "--")
## [1] "statistics" "b--ostatistics" "probab--lity" "d--stribution"
str_replace_all(char.vec, pattern = "(?<!t)i", replacement = "--")
## [1] "statistics" "b--ostatistics" "probab--l--ty" "d--str--bution"
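Look-arounds are handy for extraction, because the surrounding text is checked but not included in the match. The sketch below pulls out the digits that follow a dollar sign; the sample strings are invented.

```r
library(stringr)

prices <- c("cost: $25", "25 items", "fee: $7")

# (?<=\\$) requires a literal "$" just before the digits,
# but the "$" itself is not part of the extracted text.
amount <- str_extract(prices, pattern = "(?<=\\$)\\d+")

amount
```

The second string contains digits but no dollar sign, so its result is NA.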
10.9 Finding consecutive repeats of a pattern
A string may contain a pattern repeated several times in a row. A stringr pattern can also state how many consecutive repeats to look for: ? (zero or one), * (zero or more), + (one or more), {n} (exactly n), {n,} (n or more), and {n,m} (between n and m). No spaces are allowed inside {}. After a group (), adding \\1, \\2, … refers back to the text that group matched, which makes it possible to look for repeated text.
x.vec <- c(".a.aa.aaa.aaaa")
str_replace(x.vec, pattern = "a?", replacement = "-")
## [1] "-.a.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a*", replacement = "-")
## [1] "-.a.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a+", replacement = "-")
## [1] ".-.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a{2}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
str_replace(x.vec, pattern = "a{2,}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
str_replace(x.vec, pattern = "a{2,3}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
char.vec <- c("statistics", "biostatistics", "probability", "distribution")
str_replace(char.vec, pattern = "i?", replacement = "-")
## [1] "-statistics" "-biostatistics" "-probability" "-distribution"
str_replace(char.vec, pattern = "i*", replacement = "-")
## [1] "-statistics" "-biostatistics" "-probability" "-distribution"
str_replace(char.vec, pattern = "i+", replacement = "-")
## [1] "stat-stics" "b-ostatistics" "probab-lity" "d-stribution"
str_replace(char.vec, pattern = "i{2}", replacement = "-")
## [1] "statistics" "biostatistics" "probability" "distribution"
str_replace(char.vec, pattern = "i{2,}", replacement = "-")
## [1] "statistics" "biostatistics" "probability" "distribution"
str_replace(char.vec, pattern = "i{2,3}", replacement = "-")
## [1] "statistics" "biostatistics" "probability" "distribution"
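Quantifiers are often combined with anchors to validate a whole string. The sketch below assumes a made-up code format of two or three capital letters followed by two or three digits, and also uses \\1 to detect a doubled character.

```r
library(stringr)

codes <- c("A1", "AB12", "ABC123", "ABCD1234")

# ^ and $ force the pattern to cover the entire string.
valid <- str_detect(codes, pattern = "^[A-Z]{2,3}\\d{2,3}$")

# (.)\\1 matches any character immediately followed by itself.
doubled <- str_detect(c("letter", "later"), pattern = "(.)\\1")

valid
doubled
```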
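10.10 Regular expressions (wildcards)
R can search for patterns with regular expressions, which are common across programming languages. What is typed in stringr differs somewhat from the literal text being matched; Table 3 summarizes the differences.
Table 3: Regular expressions (wildcards)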
char.vec <- c("statistics.123", "biostatistics.a.b.c", "probability.a.c", "distribution.a c")
str_replace(char.vec, pattern = ".i.", replacement = "-")
## [1] "sta-tics.123" "-statistics.a.b.c" "proba-ity.a.c" "-tribution.a c"
str_replace_all(char.vec, pattern = ".i.", replacement = "-")
## [1] "sta--s.123" "-sta--s.a.b.c" "proba-ity.a.c" "-t-u-n.a c"
str_replace(char.vec, pattern = "y\\.a", replacement = "-")
## [1] "statistics.123" "biostatistics.a.b.c" "probabilit-.c"
## [4] "distribution.a c"
str_replace(char.vec, pattern = "a[.]c", replacement = "-")
## [1] "statistics.123" "biostatistics.a.b.c" "probability.-"
## [4] "distribution.a c"
str_replace(char.vec, pattern = "a[ ]", replacement = "-")
## [1] "statistics.123" "biostatistics.a.b.c" "probability.a.c"
## [4] "distribution.-c"
str_replace(char.vec, pattern = "b[ab]+", replacement = "-")
## [1] "statistics.123" "biostatistics.a.b.c" "pro-ility.a.c"
## [4] "distribution.a c"
y.vec <- c("set", "sat", "sit", "sout")
str_replace(y.vec, pattern = "s(a|i)t", replacement = "-")
## [1] "set" "-" "-" "sout"
fruits.vec <- c("banana", "coconut", "cucumber", "jujube", "papaya", "berry")
str_replace(fruits.vec, pattern = "(..)\\1", replacement = "-")
## [1] "b-a" "-nut" "-mber" "-be" "-ya" "berry"
str_replace(fruits.vec, pattern = "(.)(.)\\2\\1", replacement = "-")
## [1] "banana" "coconut" "cucumber" "jujube" "papaya" "berry"
z.vec <- c("3 house", "4 cars", "5 dogs")
str_replace_all(z.vec, c("3" = "three", "4" = "four", "5" = "five"))
## [1] "three house" "four cars" "five dogs"
sent.vec <- sentences[1:5]
sent.vec
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
sent.vec %>%
  str_subset(pattern = "(a|the) ([^ ]+)") %>%
  str_extract(pattern = "(a|the) ([^ ]+)")
## [1] "the smooth" "the sheet" "the depth" "a chicken"
sent.vec %>%
  str_subset(pattern = "(a|the) ([^ ]+)") %>%
  str_match(pattern = "(a|the) ([^ ]+)")
##      [,1]         [,2]  [,3]
## [1,] "the smooth" "the" "smooth"
## [2,] "the sheet"  "the" "sheet"
## [3,] "the depth"  "the" "depth"
## [4,] "a chicken"  "a"   "chicken"
sent.vec %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
## [1] "The canoe birch slid on the smooth planks."
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."
## [4] "These a days chicken leg is a rare dish."
## [5] "Rice often is served in round bowls."
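The examples above escape a literal dot with \\. or [.]. Another option, not mentioned in the text, is the stringr helper fixed(), which treats the whole pattern as literal text. The file names in this sketch are invented.

```r
library(stringr)

files <- c("report.csv", "reportXcsv", "data.csv.bak")

# As a regular expression, "." matches any character, so "Xcsv" also matches.
as_regex <- str_detect(files, pattern = ".csv")

# Escaping the dot and anchoring with $ matches only a true ".csv" ending.
escaped <- str_detect(files, pattern = "\\.csv$")

# fixed() compares the pattern literally, with no special characters.
literal <- str_detect(files, pattern = fixed(".csv"))

as_regex
escaped
literal
```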