1 ggplot2 példák

1.1 Adatok

Az alábbi példákban főleg a ggplot2 könyvtár beépített adatállományait használjuk, melyek dokumentációja a help() függvénnyel érhető el a csomag betöltése után, mint például help(mpg) vagy ?mpg.

1.2 Pontdiagram

Vizsgáljuk meg az mpg adatállományt:

library(ggplot2)
str(mpg)

## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

mpg

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Ábrázoljuk az országúti fogyasztást (hwy) a motor űrtartalom (displ) függvényében (az alábbiak mind ugyanazt az eredményt adják):

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

ggplot(mpg, aes(displ, hwy)) + geom_point()

ggplot(mpg) + geom_point(aes(x = displ, y = hwy))

ggplot(mpg) + geom_point(aes(displ, hwy))

Az alábbi módon adható meg a pontok színe:

ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")

Az ábrához az alábbi módon adható meg felirat:

ggplot(mpg, aes(displ, hwy)) + geom_point() + labs(title = "Engine displacement vs highway miles per gallon")

ggplot(mpg, aes(displ, hwy)) + geom_point() + ggtitle("Engine displacement vs highway miles per gallon")

A pontok színét, alakját, méretét szolgáltathatják az adatok:

ggplot(mpg, aes(displ, hwy, color = class)) + geom_point()

ggplot(mpg, aes(displ, hwy)) + geom_point(aes(shape = drv))

ggplot(mpg) + geom_point(aes(displ, hwy, size = cyl))

ggplot(mpg) + geom_point(aes(displ, hwy, color = drv, size = cty, shape = fl))

Ábra készítése minden egyes autótípushoz, hengerszámhoz és évjárathoz:

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~cyl)

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~year)

Regressziós modell illesztése a pontokra:

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) + geom_point() +
    geom_smooth(method = lm) # lineáris modell illesztése

ggplot(mpg, aes(displ, hwy)) + geom_point() +
    geom_smooth(method = "lm") # lineáris modell illesztése

Akár több regressziós modell is illeszthető:

ggplot(mpg, aes(displ, hwy)) + geom_point() +
    geom_smooth(aes(color = "loess"), method = "loess") +
    geom_smooth(aes(color = "lm"), method = "lm") +
    labs(color = "Method")

De az alábbiak is használhatók:

ggplot(mpg, aes(displ, hwy, color = drv)) + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy, color = drv)) + geom_point() + geom_smooth() + facet_wrap(~class)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.6935

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5065

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.65044

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 5.6935

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.5065

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.65044

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2.793

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.307

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.2254

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2.793

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.307

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 1.2254

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.008

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.708

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.25

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 4.008

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.708

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.25

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2.1985

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.3015

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.090902

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2.1985

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.3015

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.090902

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.8

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.0069e-16

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 1.8

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.2

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 1.0069e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.792

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.808

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 6.7154e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.9825

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 3.792

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.808

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 6.7154e-17

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 1.9825

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.593

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.807

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.2915e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.01

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 4.593

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.807

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 5.2915e-17

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.01

1.3 Dobozdiagram

Figyeljük meg, hogy alapértelmezésben a címkék lexikografikus sorrendjében kerülnek megjelenítésre az \(x\) tengely fölött a dobozok:

ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

ggplot(mpg, aes(class, hwy)) + geom_boxplot()

A reorder() függvénnyel rendezhetjük át az \(x\) tengely fölött a dobozokat:

ggplot(mpg, aes(reorder(class, hwy), hwy)) + geom_boxplot()

ggplot(mpg, aes(reorder(class, hwy, median), hwy)) + geom_boxplot()

1.4 Hisztogram

ggplot(mpg, aes(hwy)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)

ggplot(mpg, aes(hwy, color = drv)) + geom_freqpoly(binwidth = 1)

ggplot(mpg, aes(hwy, fill = drv)) + geom_histogram(binwidth = 1)

ggplot(mpg, aes(hwy, fill = drv)) + geom_histogram(binwidth = 1, position = "dodge")

ggplot(mpg, aes(hwy)) + geom_histogram() + facet_wrap(~drv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_histogram() + facet_wrap(~drv, ncol = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.5 Oszlopdiagram

ggplot(mpg, aes(manufacturer)) + geom_bar() + labs(title = "Number of cars by manufacturer")

Az \(x\) tengelyen egymásra lógnak a feliratok. Esztétikusabb eredményt kapunk a tengelyek felcserélésével:

ggplot(mpg, aes(manufacturer)) + geom_bar() + coord_flip() + labs(title = "Number of cars by manufacturer")

A függőleges (\(x\)) tengely fölött alapértelmezésben a címkék lexikografikus sorrendjében kerülnek megjelenítésre az oszlopok, például az alábbi módon rendezhetjük át őket:

library(forcats)
ggplot(mpg, aes(fct_infreq(manufacturer))) + geom_bar() + coord_flip() + xlab("manufacturer") + labs(title = "Number of cars by manufacturer")

ggplot(mpg, aes(fct_rev(fct_infreq(manufacturer)))) + geom_bar() + coord_flip() + xlab("manufacturer") + labs(title = "Number of cars by manufacturer")

A rendezéshez a forcats csomag fct_infreq() és fct_rev() függvényeit használtuk.

Az alábbi módon jeleníthetők meg az oszlopok különböző színekkel:

ggplot(mpg, aes(manufacturer)) + geom_bar(aes(fill = manufacturer)) + coord_flip()

Így szabadulhatunk meg a jobb oldali jelmagyarázattól:

ggplot(mpg, aes(manufacturer)) + geom_bar(aes(fill = manufacturer), show.legend = FALSE) + coord_flip()

ggplot(mpg, aes(manufacturer)) + geom_bar(aes(fill = manufacturer)) + guides(fill = "none") + coord_flip()

Az alábbi módon jeleníthető meg az oszlopok fölött a magasságuk:

ggplot(mpg, aes(manufacturer)) + geom_bar(aes(fill = manufacturer)) + guides(fill = "none") + coord_flip() +
    geom_text(stat = "count", aes(label = stat(count)), hjust = -1)

Az alábbi módon jeleníthetjük meg a gyártóhoz az autók számát minden egyes meghajtáshoz (alapértelmezésben egymás tetejére kerülnek a különböző színű oszlopok):

ggplot(mpg, aes(manufacturer, fill = drv)) + geom_bar() + coord_flip()

Az oszlopok egymás mellé helyezéséhez az alábbi paraméterezés szükséges:

ggplot(mpg, aes(manufacturer, fill = drv)) + geom_bar(position = "dodge") + coord_flip()

1.6 Idősor adatok

Vizsgáljuk meg és ábrázoljuk az economics adatállományt:

str(economics)

## Classes 'tbl_df', 'tbl' and 'data.frame':    574 obs. of  6 variables:
##  $ date    : Date, format: "1967-07-01" "1967-08-01" ...
##  $ pce     : num  507 510 516 513 518 ...
##  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
##  $ psavert : num  12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
##  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
##  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...

summary(economics)

##       date                 pce               pop            psavert      
##  Min.   :1967-07-01   Min.   :  507.4   Min.   :198712   Min.   : 1.900  
##  1st Qu.:1979-06-08   1st Qu.: 1582.2   1st Qu.:224896   1st Qu.: 5.500  
##  Median :1991-05-16   Median : 3953.6   Median :253060   Median : 7.700  
##  Mean   :1991-05-17   Mean   : 4843.5   Mean   :257189   Mean   : 7.937  
##  3rd Qu.:2003-04-23   3rd Qu.: 7667.3   3rd Qu.:290291   3rd Qu.:10.500  
##  Max.   :2015-04-01   Max.   :12161.5   Max.   :320887   Max.   :17.000  
##     uempmed         unemploy    
##  Min.   : 4.00   Min.   : 2685  
##  1st Qu.: 6.00   1st Qu.: 6284  
##  Median : 7.50   Median : 7494  
##  Mean   : 8.61   Mean   : 7772  
##  3rd Qu.: 9.10   3rd Qu.: 8691  
##  Max.   :25.20   Max.   :15352

economics

## # A tibble: 574 x 6
##    date         pce    pop psavert uempmed unemploy
##    <date>     <dbl>  <int>   <dbl>   <dbl>    <int>
##  1 1967-07-01  507. 198712    12.5     4.5     2944
##  2 1967-08-01  510. 198911    12.5     4.7     2945
##  3 1967-09-01  516. 199113    11.7     4.6     2958
##  4 1967-10-01  513. 199311    12.5     4.9     3143
##  5 1967-11-01  518. 199498    12.5     4.7     3066
##  6 1967-12-01  526. 199657    12.1     4.8     3018
##  7 1968-01-01  532. 199808    11.7     5.1     2878
##  8 1968-02-01  534. 199920    12.2     4.5     3001
##  9 1968-03-01  545. 200056    11.6     4.1     2877
## 10 1968-04-01  545. 200208    12.2     4.6     2709
## # … with 564 more rows

ggplot(economics, aes(date, uempmed)) + geom_line()

ggplot(economics, aes(date, unemploy / pop)) + geom_line()

Tekintsük most a txhousing adatállományt:

str(txhousing)

## Classes 'tbl_df', 'tbl' and 'data.frame':    8602 obs. of  9 variables:
##  $ city     : chr  "Abilene" "Abilene" "Abilene" "Abilene" ...
##  $ year     : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ month    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sales    : num  72 98 130 98 141 156 152 131 104 101 ...
##  $ volume   : num  5380000 6505000 9285000 9730000 10590000 ...
##  $ median   : num  71400 58700 58100 68600 67300 66900 73500 75000 64500 59300 ...
##  $ listings : num  701 746 784 785 794 780 742 765 771 764 ...
##  $ inventory: num  6.3 6.6 6.8 6.9 6.8 6.6 6.2 6.4 6.5 6.6 ...
##  $ date     : num  2000 2000 2000 2000 2000 ...

txhousing

## # A tibble: 8,602 x 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # … with 8,592 more rows

ggplot(txhousing, aes(date, sales, color = city)) + geom_line() + labs(title = "Number of sales") + theme(legend.position = "bottom")

## Warning: Removed 430 rows containing missing values (geom_path).

ggplot(txhousing, aes(date, sales, group = city)) + geom_line(color = "lightgray") + labs(title = "Number of sales")

## Warning: Removed 430 rows containing missing values (geom_path).

A dplyr csomag filter() függvényével szűrjük ki a három legnépesebb texasi várost (San Antonio, Dallas és Houston):

library(dplyr)
ggplot(filter(txhousing, city == "San Antonio" | city == "Dallas" | city == "Houston"), aes(date, sales, color = city)) + geom_line() + labs(title = "Number of sales") + theme(legend.position = "bottom")

ggplot(filter(txhousing, city %in% c("San Antonio", "Dallas", "Houston")), aes(date, sales, color = city)) + geom_line() + labs(title = "Number of sales") + theme(legend.position = "bottom")

1.7 Két dimenziós hisztogram

Az alábbi módon generáljunk mesterségesen adatokat:

X <- data.frame(x = rnorm(10000), y = rnorm(10000))
ggplot(X, aes(x, y)) + geom_point()

A túl sok pont elfedi egymást, a jobb megjelenítéshez az alábbi módon tegyük átlátszóvá a pontokat:

ggplot(X, aes(x, y)) + geom_point(alpha = 1 / 5)

Ha alpha = 1, akkor a pontok teljesen átlátszatlanok, alpha = 0 esetén pedig teljesen átlátszóak. Ha alpha értéke \(1/n\), akkor \(n\) darab egymásra kerülő pont eredményez egy teljesen átlátszatlan pontot.

Két dimenziós hisztogram:

ggplot(X, aes(x, y)) + geom_bin2d()

ggplot(X, aes(x, y)) + geom_bin2d(bins = 100)

ggplot(X, aes(x, y)) + geom_bin2d(binwidth = 0.1)

A hexbin csomag révén hatszögrács is használható:

library(hexbin)
ggplot(X, aes(x, y)) + geom_hex()

1.8 Kördiagram

Jelenítsük meg kördiagrammal az autók meghajtás (drv) szerinti megoszlását!

Az adatok aggregálásához használhatjuk az sqldf és a dplyr csomagot is.

Aggregálás az sqldf csomaggal:

library(sqldf)
drvs <- sqldf("SELECT drv, COUNT(*) count FROM mpg GROUP BY drv ORDER BY count")
drvs

##   drv count
## 1   r    25
## 2   4   103
## 3   f   106

Aggregálás a dplyr csomaggal:

library(dplyr)
drvs <- arrange(summarize(group_by(mpg, drv), count = n()), count)
drvs

## # A tibble: 3 x 2
##   drv   count
##   <chr> <int>
## 1 r        25
## 2 4       103
## 3 f       106

Ugyanez sokkal olvashatóbban írható fel a magrittr csomag csővezeték operátorával (%>%):

library(magrittr)
drvs <- mpg %>% group_by(drv) %>% summarize(count = n()) %>% arrange(count)
drvs

## # A tibble: 3 x 2
##   drv   count
##   <chr> <int>
## 1 r        25
## 2 4       103
## 3 f       106

A fenti kódban például az mpg %>% group_by(drv) kifejezés a group_by(mpg, drv) kifejezéssel ekvivalens.

Az alábbi módon készíthető oszlopdiagram, melyen az oszlopok egymás fölött kerülnek elhelyezésre:

ggplot(drvs, aes(x = "", y = count, fill = drv)) +
    geom_bar(stat = "identity", color = "white") +
    xlab(NULL)

Az alábbi módon helyezhetjük el az ábrára feliratként az oszlopok magasságát szolgáltató számokat (mindkét megoldás ugyanazt eredményezi):

ggplot(drvs, aes(x = "", y = count, fill = drv)) +
    geom_bar(stat = "identity", color = "white") +
    xlab(NULL) +
    geom_text(aes(label = count, y = cumsum(count) - 0.5 * count))

ggplot(drvs, aes(x = "", y = count, fill = drv)) +
    geom_bar(stat = "identity", color = "white") +
    xlab(NULL) +
    geom_text(aes(label = count), position = position_stack(0.5))

Végül az alábbi módon hajlítható az oszlopdiagram kördiagrammá:

ggplot(drvs, aes(x = "", y = count, fill = drv)) +
    geom_bar(stat = "identity", color = "white") +
    geom_text(aes(label = count), position = position_stack(0.5)) +
    coord_polar("y")

Az alábbi kiegészítéssel szabadulhatunk meg az ábra díszítéseitől:

ggplot(drvs, aes(x = "", y = count, fill = drv)) +
    geom_bar(stat = "identity", color = "white") +
    geom_text(aes(label = count), position = position_stack(0.5)) +
    coord_polar("y") +
    theme_void()

A kód kis módosításával gyűrűdiagram is készíthető:

ggplot(drvs, aes(x = 2, y = count, fill = drv)) +
    xlim(0.5, 2.5) +
    geom_bar(stat = "identity", color = "white") +
    geom_text(aes(label = count), position = position_stack(0.5)) +
    coord_polar("y") +
    theme_void()

1.9 Függvény

Gamma eloszlásból származó minta hisztogramjának megjelenítése az elméleti sűrűségfüggvénnyel együtt:

X <- data.frame(x = rgamma(1000, 5, 0.5))
ggplot(X) +
    geom_histogram(aes(x = x, y = stat(density)), bins = 15, fill = "wheat", color = "black") +
    stat_function(fun = dgamma, args = list(5, 0.5), color = "red", size = 1) +
    geom_rug(aes(x)) +
    theme_linedraw() +
    labs(title = substitute(paste("Gamma distribution with parameters ", alpha == a, " and ", beta == b), list(a = 5, b = 0.5)))

1.10 Ábra mentése

A ggsave() függvény szolgál ábrák mentésére, alapértelmezésben a legutóbbi ábrát menti az adott állományba:

p <- ggplot(mpg, aes(displ, hwy)) + geom_point() + labs(title = "Engine displacement vs highway miles per gallon")
p

ggsave("plot.pdf")

## Saving 7 x 5 in image

ggsave("plot.png")

## Saving 7 x 5 in image

ggsave("plot.svg") # az svglite csomag rendelkezésre állása szükséges

## Saving 7 x 5 in image

1.11 Témák

A témák révén szabhatók testre az ábrák az adatoktól különböző komponensei, például a címek, betűtípusok, margók. A theme() függvény révén állíthatók be a téma elemek. Példa a theme() függvény használatára:

ggplot(mpg, aes(displ, hwy, color = class)) + geom_point() + labs(title = "Engine displacement vs highway miles per gallon") +
    theme(plot.background = element_rect(fill = "darkseagreen1"),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "in"),
        panel.background = element_rect(fill = "aliceblue"),
        legend.background = element_rect(fill = "white"),
        legend.position = "bottom",
    plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = rel(1.5)))

Példa a témák gyakorlati felhasználására (a \(x\) tengely címkéinek elforgatása):

ggplot(mpg, aes(manufacturer)) + geom_bar() + labs(title = "Number of cars by manufacturer") + theme(axis.text.x = element_text(angle = -35))

Több beépített téma is van, ezeket olyan függvények szolgáltatják, melyek a theme() függvényt hívják meg megfelelően felparaméterezve:

ggplot(mpg, aes(displ, hwy, color = class)) + geom_point() + theme_classic()

ggplot(mpg, aes(displ, hwy, color = class)) + geom_point() + theme_light()

ggplot(mpg, aes(displ, hwy, color = class)) + geom_point() + theme_dark()

Jelenítsük meg ugyanazt az ábrát a rendelkezésre álló beépített témák mindegyikével! Ehhez határozzuk meg azon függvényeket a ggplot2 csomagból, melyek neve a theme_ karakterlánccal kezdődik:

themes <- ls("package:ggplot2", pattern = "^theme_")
themes

##  [1] "theme_bw"       "theme_classic"  "theme_dark"     "theme_get"     
##  [5] "theme_gray"     "theme_grey"     "theme_light"    "theme_linedraw"
##  [9] "theme_minimal"  "theme_replace"  "theme_set"      "theme_test"    
## [13] "theme_update"   "theme_void"

Hagyjuk el az eredményül kapott vektorból a témákat kezelő függvényeket:

themes <- setdiff(themes, c("theme_get", "theme_replace", "theme_set", "theme_update"))
themes

##  [1] "theme_bw"       "theme_classic"  "theme_dark"     "theme_gray"    
##  [5] "theme_grey"     "theme_light"    "theme_linedraw" "theme_minimal" 
##  [9] "theme_test"     "theme_void"

Készítsünk egy ábrát, de ne jelenítsük meg, hanem tároljuk el egy változóban:

p <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")

Minden egyes beépített témára hajtsuk végre a következőt: adjuk hozzá az ábrához címként a téma nevét és alkalmazzuk rá a témát.

plots <- lapply(themes, function(theme) p + ggtitle(theme) + do.call(theme, list()))

A fenti kódban a do.call() függvény az első paraméterben adott nevű függvényt hívja meg a második argumentumként adott lista elemeivel, mint argumentumokkal, esetünkben az üres argumentumlistával. Az eredmény egy olyan lista, melyek elemei ábrák:

class(plots[[1]])

## [1] "gg"     "ggplot"

plots[[1]]

Az ábralistát a cowplot csomag segítségével jelenítjük meg:

library(cowplot)
g <- plot_grid(plotlist = plots)
g

1.12 Komplex ábra

library(dplyr)
presidential <- filter(presidential, start > economics$date[1])
ggplot(economics) +
    geom_rect(aes(xmin = start, xmax = end, fill = party), ymin = -Inf, ymax = Inf, alpha = 0.2, data = presidential) +
    geom_line(aes(date, unemploy / pop)) +
    geom_text(aes(x = start, y = 0, label = name), data = presidential, vjust = 1, hjust = 0, size = 2) +
    scale_fill_manual(values = c("blue", "red")) +
    labs(title = "Number of unemployed in thousands")

ggplot2

Jeszenszky Péter

2019.06.18.