program story

Applying a function to every row of a table using dplyr?

inputbox 2020. 7. 28. 08:30


When working with plyr, I often found it useful to use adply for scalar functions that I have to apply to each and every row.

For example:

data(iris)
library(plyr)
head(
     adply(iris, 1, transform , Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     5.1
2          4.9         3.0          1.4         0.2  setosa     4.9
3          4.7         3.2          1.3         0.2  setosa     4.7
4          4.6         3.1          1.5         0.2  setosa     4.6
5          5.0         3.6          1.4         0.2  setosa     5.0
6          5.4         3.9          1.7         0.4  setosa     5.4

Now that I'm using dplyr more, I'm wondering whether there is a tidy, natural way to do this, because this is NOT what I want:

library(dplyr)
head(
     mutate(iris, Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     7.9
2          4.9         3.0          1.4         0.2  setosa     7.9
3          4.7         3.2          1.3         0.2  setosa     7.9
4          4.6         3.1          1.5         0.2  setosa     7.9
5          5.0         3.6          1.4         0.2  setosa     7.9
6          5.4         3.9          1.7         0.4  setosa     7.9

As of dplyr 0.2 (I think), rowwise() is implemented, so the answer to this problem becomes:

iris %>% 
  rowwise() %>% 
  mutate(Max.Len= max(Sepal.Length,Petal.Length))
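As a minimal sketch of the same idea (the trailing ungroup() and the sanity check are additions for illustration, not part of the original answer), the row-wise result can be compared against base R's element-wise pmax(), and the row-wise state dropped afterwards so that later verbs operate on whole columns again:

```r
library(dplyr)

res <- iris %>%
  rowwise() %>%
  mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
  ungroup()  # drop the row-wise state; later verbs see whole columns again

# Sanity check: the row-wise maximum should agree with the element-wise pmax()
matches_pmax <- all(res$Max.Len == pmax(iris$Sepal.Length, iris$Petal.Length))
print(matches_pmax)
```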

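Non-`rowwise` alternative

Five years (!) later this answer still gets a lot of traffic. Since it was given, rowwise has become increasingly discouraged, although many people seem to find it intuitive. Do yourself a favour and go through Jenny Bryan's "Row-oriented workflows in R with the tidyverse" material to get a good handle on this topic.

The most straightforward way I have found is based on one of Hadley's examples using pmap: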

iris %>% 
  mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))

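Using this approach, you can give an arbitrary number of arguments to the function (.f) inside pmap.

pmap is a good conceptual approach because it reflects the fact that when you're doing row-wise operations you're actually working with tuples from a list of vectors (the columns in a data frame).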
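To illustrate the "tuples from a list of vectors" view, here is a small sketch with three columns. The helper row_range and the argument names sl, sw and pl are hypothetical, chosen only for this example; pmap_dbl() matches the names in the list to the function's arguments and calls the function once per row.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: spread (max - min) across three measurements of one row
row_range <- function(sl, sw, pl) max(sl, sw, pl) - min(sl, sw, pl)

res <- iris %>%
  mutate(Range = pmap_dbl(
    # Each list element is a column; names are matched to row_range's arguments
    list(sl = Sepal.Length, sw = Sepal.Width, pl = Petal.Length),
    row_range
  ))

head(res)
```

Naming the list elements keeps the call readable when the function takes several arguments; with an unnamed list the columns are passed positionally instead.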

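The idiomatic approach is to create an appropriately vectorised function.

R provides pmax, which is suitable here. It also provides Vectorize, a wrapper around mapply, which lets you create a vectorised version of an arbitrary function.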

library(dplyr)
# use base R pmax (vectorized in C)
iris %>% mutate(max.len = pmax(Sepal.Length, Petal.Length))
# use vectorize to create your own function
# for example, a horribly inefficient get first non-Na value function
# a version that is not vectorized
coalesce <- function(a,b) {r <- c(a[1],b[1]); r[!is.na(r)][1]}
# a vectorized version
Coalesce <- Vectorize(coalesce, vectorize.args = c('a','b'))
# some example data
df <- data.frame(a = c(1:5,NA,7:10), b = c(1:3,NA,NA,6,NA,10:8))
df %>% mutate(ab =Coalesce(a,b))

Note that implementing the vectorisation in C / C++ will be faster, but there isn't a magicPony package that will write the function for you.
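Since Vectorize is described above as a wrapper around mapply, a short base-R sketch can make that concrete. The name first_non_na is a stand-in chosen here to avoid masking dplyr's own coalesce; the data are the same as in the example above.

```r
# Scalar helper: first non-NA value of a single pair (same logic as coalesce above)
first_non_na <- function(a, b) {
  r <- c(a[1], b[1])
  r[!is.na(r)][1]
}

a <- c(1:5, NA, 7:10)
b <- c(1:3, NA, NA, 6, NA, 10:8)

# Vectorize() builds a function that calls mapply() under the hood ...
via_vectorize <- Vectorize(first_non_na)(a, b)

# ... so calling mapply() directly should give the same values
via_mapply <- mapply(first_non_na, a, b)

print(all.equal(via_vectorize, via_mapply))
```

Either way the scalar function is still called once per element in R code, which is why a natively vectorised function such as pmax remains the faster choice when one exists.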

You need to group by row:

iris %>% group_by(1:n()) %>% mutate(Max.Len= max(Sepal.Length,Petal.Length))

This is what the 1 did in adply.
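A small sketch of why the grouping matters, using a made-up three-row data frame (df and row_id are names invented for this illustration): without groups, max() sees the whole columns; with one group per row, it sees one row at a time.

```r
library(dplyr)

df <- data.frame(x = c(1, 5, 3), y = c(4, 2, 6))

# No grouping: max() collapses both full columns into one value
ungrouped <- df %>% mutate(m = max(x, y))

# One group per row: max() is evaluated separately for each row
by_row_id <- df %>%
  mutate(row_id = row_number()) %>%
  group_by(row_id) %>%
  mutate(m = max(x, y)) %>%
  ungroup()

print(ungrouped$m)
print(by_row_id$m)
```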

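Update 2017-08-03

After I wrote this, Hadley changed some things again. The functions that used to be in purrr are now in a new mixed package called purrrlyr, described as:

"purrrlyr contains some functions that lie at the intersection of purrr and dplyr. They have been removed from purrr in order to make the package lighter and because they have been replaced by other solutions in the tidyverse."

So you will need to install and load that package to make the code below work.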

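Original post

Hadley frequently changes his mind about what we should use, but I think we are supposed to switch to the functions in purrr to get the by-row functionality. At least, they offer the same functionality and have almost the same interface as adply from plyr.

There are two related functions, by_row and invoke_rows. My understanding is that you use by_row when you want to loop over rows and add the results to the data.frame. invoke_rows is used when you loop over the rows of a data.frame and pass each column as an argument to a function. We will only use the first.

Examples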

library(tidyverse)
library(purrrlyr)  # by_row() now lives in purrrlyr (see the 2017-08-03 update)

iris %>% 
  by_row(..f = function(this_row) {
    browser()
  })

This lets you see the internals (so you can see what you are doing), which is the same as doing it with adply.

Called from: ..f(.d[[i]], ...)
Browse[1]> this_row
# A tibble: 1 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1          5.1         3.5          1.4         0.2  setosa
Browse[1]> Q

By default, by_row adds a list column based on the output:

iris %>% 
  by_row(..f = function(this_row) {
      this_row[1:4] %>% unlist %>% mean
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>    <list>
1           5.1         3.5          1.4         0.2  setosa <dbl [1]>
2           4.9         3.0          1.4         0.2  setosa <dbl [1]>
3           4.7         3.2          1.3         0.2  setosa <dbl [1]>
4           4.6         3.1          1.5         0.2  setosa <dbl [1]>
5           5.0         3.6          1.4         0.2  setosa <dbl [1]>
6           5.4         3.9          1.7         0.4  setosa <dbl [1]>
7           4.6         3.4          1.4         0.3  setosa <dbl [1]>
8           5.0         3.4          1.5         0.2  setosa <dbl [1]>
9           4.4         2.9          1.4         0.2  setosa <dbl [1]>
10          4.9         3.1          1.5         0.1  setosa <dbl [1]>
# ... with 140 more rows

If instead we return a data.frame, we get a list column containing data.frames:

iris %>% 
  by_row( ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species                 .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>               <list>
1           5.1         3.5          1.4         0.2  setosa <data.frame [1 × 2]>
2           4.9         3.0          1.4         0.2  setosa <data.frame [1 × 2]>
3           4.7         3.2          1.3         0.2  setosa <data.frame [1 × 2]>
4           4.6         3.1          1.5         0.2  setosa <data.frame [1 × 2]>
5           5.0         3.6          1.4         0.2  setosa <data.frame [1 × 2]>
6           5.4         3.9          1.7         0.4  setosa <data.frame [1 × 2]>
7           4.6         3.4          1.4         0.3  setosa <data.frame [1 × 2]>
8           5.0         3.4          1.5         0.2  setosa <data.frame [1 × 2]>
9           4.4         2.9          1.4         0.2  setosa <data.frame [1 × 2]>
10          4.9         3.1          1.5         0.1  setosa <data.frame [1 × 2]>
# ... with 140 more rows

How the output of the function is added is controlled by the .collate parameter. There are three options: list, rows, cols. When the output has length 1, it doesn't matter whether we use rows or cols.

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

both produce:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <dbl>
1           5.1         3.5          1.4         0.2  setosa 2.550
2           4.9         3.0          1.4         0.2  setosa 2.375
3           4.7         3.2          1.3         0.2  setosa 2.350
4           4.6         3.1          1.5         0.2  setosa 2.350
5           5.0         3.6          1.4         0.2  setosa 2.550
6           5.4         3.9          1.7         0.4  setosa 2.850
7           4.6         3.4          1.4         0.3  setosa 2.425
8           5.0         3.4          1.5         0.2  setosa 2.525
9           4.4         2.9          1.4         0.2  setosa 2.225
10          4.9         3.1          1.5         0.1  setosa 2.400
# ... with 140 more rows

If we output a data.frame with 1 row, it matters only slightly which we use:

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
      )
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

both give:

# A tibble: 150 × 8
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .row new_col_mean new_col_median
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <int>        <dbl>          <dbl>
1           5.1         3.5          1.4         0.2  setosa     1        2.550           2.45
2           4.9         3.0          1.4         0.2  setosa     2        2.375           2.20
3           4.7         3.2          1.3         0.2  setosa     3        2.350           2.25
4           4.6         3.1          1.5         0.2  setosa     4        2.350           2.30
5           5.0         3.6          1.4         0.2  setosa     5        2.550           2.50
6           5.4         3.9          1.7         0.4  setosa     6        2.850           2.80
7           4.6         3.4          1.4         0.3  setosa     7        2.425           2.40
8           5.0         3.4          1.5         0.2  setosa     8        2.525           2.45
9           4.4         2.9          1.4         0.2  setosa     9        2.225           2.15
10          4.9         3.1          1.5         0.1  setosa    10        2.400           2.30
# ... with 140 more rows

except that the second has the column called .row and the first does not.

Finally, if our output is longer than length 1 either as a vector or as a data.frame with rows, then it matters whether we use rows or cols for .collate:

mtcars[1:2] %>% by_row(function(x) 1:5)
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "rows")
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "cols")

produces, respectively:

# A tibble: 32 × 3
     mpg   cyl      .out
   <dbl> <dbl>    <list>
1   21.0     6 <int [5]>
2   21.0     6 <int [5]>
3   22.8     4 <int [5]>
4   21.4     6 <int [5]>
5   18.7     8 <int [5]>
6   18.1     6 <int [5]>
7   14.3     8 <int [5]>
8   24.4     4 <int [5]>
9   22.8     4 <int [5]>
10  19.2     6 <int [5]>
# ... with 22 more rows

# A tibble: 160 × 4
     mpg   cyl  .row  .out
   <dbl> <dbl> <int> <int>
1     21     6     1     1
2     21     6     1     2
3     21     6     1     3
4     21     6     1     4
5     21     6     1     5
6     21     6     2     1
7     21     6     2     2
8     21     6     2     3
9     21     6     2     4
10    21     6     2     5
# ... with 150 more rows

# A tibble: 32 × 7
     mpg   cyl .out1 .out2 .out3 .out4 .out5
   <dbl> <dbl> <int> <int> <int> <int> <int>
1   21.0     6     1     2     3     4     5
2   21.0     6     1     2     3     4     5
3   22.8     4     1     2     3     4     5
4   21.4     6     1     2     3     4     5
5   18.7     8     1     2     3     4     5
6   18.1     6     1     2     3     4     5
7   14.3     8     1     2     3     4     5
8   24.4     4     1     2     3     4     5
9   22.8     4     1     2     3     4     5
10  19.2     6     1     2     3     4     5
# ... with 22 more rows

So, bottom line. If you want the adply(.margins = 1, ...) functionality, you can use by_row.

Extending BrodieG's answer,

If the function returns more than one row, then instead of mutate(), do() must be used. Then to combine it back together, use rbind_all() from the dplyr package.

In dplyr version dplyr_0.1.2, using 1:n() in the group_by() clause doesn't work for me. Hopefully Hadley will implement rowwise() soon.

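iris %>%
    group_by(1:nrow(iris)) %>%
    do(do_fn(.)) %>%   # do_fn takes one row and returns a data.frame (defined in the benchmark below)
    bind_rows()        # originally rbind_all(), which later dplyr releases replaced with bind_rows()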
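As a self-contained sketch of the do() pattern for functions that return more than one row, the following uses a made-up two-row input and a hypothetical helper expand_row (neither appears in the original answer). It assumes a dplyr version in which do() is still available.

```r
library(dplyr)

d <- data.frame(a = c(0.1, 0.2))

# Hypothetical helper: turns one input row into three output rows
expand_row <- function(row) data.frame(a = row$a, b = seq_len(3))

out <- d %>%
  mutate(row_id = row_number()) %>%
  group_by(row_id) %>%
  do(expand_row(.)) %>%  # called once per group, i.e. once per original row
  ungroup()

print(out)
```

do() already stacks the per-group data frames and prepends the grouping column, so no separate row-binding step is needed in this sketch.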

Testing the performance,

library(plyr)    # plyr_1.8.4.9000
library(dplyr)   # dplyr_0.8.0.9000
library(purrr)   # purrr_0.2.99.9000
library(microbenchmark)

d1_count <- 1000
d2_count <- 10

d1 <- data.frame(a=runif(d1_count))

do_fn <- function(row){data.frame(a=row$a, b=runif(d2_count))}
do_fn2 <- function(a){data.frame(a=a, b=runif(d2_count))}

op <- microbenchmark(
        plyr_version = plyr::adply(d1, 1, do_fn),
        dplyr_version = d1 %>%
            dplyr::group_by(1:nrow(d1)) %>%
            dplyr::do(do_fn(.)) %>%
            dplyr::bind_rows(),
        purrr_version = d1 %>% purrr::pmap_dfr(do_fn2),
        times=50)

it has the following results:

Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval
  plyr_version 1227.2589 1275.1363 1317.3431 1293.5759 1314.4266 1616.5449    50
 dplyr_version  977.3025 1012.6340 1035.9436 1025.6267 1040.5882 1449.0978    50
 purrr_version  609.5790  629.7565  643.8498  644.2505  656.1959  686.8128    50

In this benchmark the purrr version is the fastest, with a median of about 644 ms against about 1026 ms for dplyr and about 1294 ms for plyr.

Something like this?

iris$Max.Len <- pmax(iris$Sepal.Length, iris$Petal.Length)

Reference URL: https://stackoverflow.com/questions/21818181/applying-a-function-to-every-row-of-a-table-using-dplyr
