[1] "Hello world!"

Data visualization is not a decoration; it is an analytical instrument.
Human cognition is optimized for pattern recognition in visual space, not for reading tables or numbers.

Can you introduce yourself to R?
We first need to install the packages we want to use.
tibble: a tibble is a nice way to view our data. With -> earnings we tell R to save this data set in memory under the name earnings, so we can use it whenever we want.
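A minimal sketch of this setup. The real data set is not reproduced in these notes, so the values below are simulated placeholders with the same three columns; the package list (tibble, plus dplyr for the %>% pipe) is also an assumption.

```r
# Install once (uncomment on first use); the package names are assumptions
# install.packages(c("tibble", "dplyr"))
library(tibble)  # tibble()
library(dplyr)   # provides the pipe %>%

set.seed(1)  # make the simulated numbers reproducible

# Simulated stand-in for the course data: 200 workers, three columns
tibble(
  id     = 1:200,
  gender = sample(c("Male", "Female"), size = 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))
) -> earnings  # right assignment: save the tibble under the name earnings

nrow(earnings)  # number of observations stored in earnings
```

The arrow points at the name, so tibble(...) -> earnings stores the same object as earnings <- tibble(...).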
Let’s see what the data looks like. There are two ways:
# A tibble: 200 × 3
id gender income
<int> <chr> <dbl>
1 1 Male 19376.
2 2 Male 9540.
3 3 Male 10501.
4 4 Male 26262.
5 5 Male 23313.
6 6 Male 15835.
7 7 Male 12728.
8 8 Female 22247.
9 9 Female 41456.
10 10 Male 16254.
# ℹ 190 more rows
To see the variable names and summary information on the numerical variables, we use the function summary().
So we need to tell R to take the data set named earnings and then ( %>% ) apply summary() to it: earnings %>% summary().
For non-numeric variables (like gender) we tell R to:
Take my data set named earnings and do the following ( %>% ).
Apply tabyl({variable name}) to show statistics about the variable we want to see. We want to apply the tabyl() function to our non-numeric variable named gender.
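The two steps above can be sketched as follows. The earnings data here is a simulated stand-in (an assumption, not the course data), and tabyl() is loaded from the janitor package.

```r
library(tibble)
library(dplyr)    # provides %>%
library(janitor)  # tabyl() comes from the janitor package

set.seed(1)
# Simulated stand-in for the course data (assumption)
earnings <- tibble(
  id     = 1:200,
  gender = sample(c("Male", "Female"), size = 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))
)

# Numerical variables: minimum, quartiles, mean, maximum
earnings %>% summary()

# Non-numeric variable: counts and shares of each gender
gender_table <- earnings %>% tabyl(gender)
gender_table
```

The tabyl() result has one row per category, with the count in column n and the share in column percent.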
We are trying to find out how income is distributed.
Income is a continuous variable.
To see how a continuous variable is distributed, use a histogram or a density plot.
Let’s have a look at the data again.
# A tibble: 200 × 3
id gender income
<int> <chr> <dbl>
1 1 Male 19376.
2 2 Male 9540.
3 3 Male 10501.
4 4 Male 26262.
5 5 Male 23313.
6 6 Male 15835.
7 7 Male 12728.
8 8 Female 22247.
9 9 Female 41456.
10 10 Male 16254.
# ℹ 190 more rows
To see how income is distributed, we start with a histogram. A histogram groups continuous data into intervals (bins) and counts how many observations fall inside each interval.
A bin is basically an interval. For example, each of the following is a bin with a width of 1000.
We need to group the continuous data into bins and count how many of the observations belong to each bin.
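Here is a small sketch of that grouping and counting in base R. The income values are simulated placeholders (an assumption), and the bins start at the minimum income with a width of 1000.

```r
set.seed(1)
# Simulated incomes standing in for the course data (assumption)
income <- round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))

binwidth <- 1000
start <- min(income)  # the first bin starts at the minimum income

# Which bin does each observation fall into?
# 0 means [start, start + 1000), 1 means the next interval, and so on
bin_index <- floor((income - start) / binwidth)

# Lower edge of the bin each observation belongs to
bin_lower <- start + bin_index * binwidth

# Count how many observations fall inside each bin
bin_counts <- table(bin_lower)
head(bin_counts)
```

Each count is the height of one bar in the histogram.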
The choice of binwidth depends on the situation; try different binwidths and you’ll get different shapes.
Let’s start by drawing a histogram of income by hand, with a bin width of 1000. Our bins will start from the minimum value of income.
For each observation we look at the income of the person, identify which bin it belongs to, and start counting. As we see more observations falling into the same bin, the height of the bin (count) increases.
Before doing this, we first need to draw our x and y axes.
Let’s plot the first 10 observations.
# A tibble: 200 × 3
id gender income
<int> <chr> <dbl>
1 1 Male 19376.
2 2 Male 9540.
3 3 Male 10501.
4 4 Male 26262.
5 5 Male 23313.
6 6 Male 15835.
7 7 Male 12728.
8 8 Female 22247.
9 9 Female 41456.
10 10 Male 16254.
# ℹ 190 more rows
It is very tiring to go over all 200 observations.
Instead we can tell R to do it for us!
Take my data set named earnings and then ( %>% )
apply ggplot() to it, because I want to make a visualization.
What we get is a blank piece of paper. We have not specified our axes yet.
As when drawing by hand, we first draw the x and y axes. To do this in R we use the function aes(x = {variable name}), and we add it to our existing blank page using the + symbol.
The + always goes at the end of a line.
For histograms, we only need to give the name of the x variable; R will count the observations by itself!
Now it is time to tell R to make a histogram.
We use geom_{plot type} to tell R what kind of plot we want.
For example, to make a histogram, we use geom_histogram().
To make a density plot, we use geom_density().
To make a bar plot, we use geom_bar().
To make a scatter plot, we use geom_point().
Take earnings and then ( %>% ) apply ggplot() to make a visualization; on top of that (+), the x variable is income; and on top of that (+), I am adding a histogram with geom_histogram().
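The sentence above can be written out one layer at a time. This is only a sketch: earnings is a simulated stand-in for the course data, and the object names blank_page, with_axis and first_histogram are made up for illustration.

```r
library(tibble)
library(dplyr)    # provides %>%
library(ggplot2)

set.seed(1)
# Simulated stand-in for the course data (assumption)
earnings <- tibble(
  id     = 1:200,
  gender = sample(c("Male", "Female"), size = 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))
)

# Step 1: a blank piece of paper
blank_page <- earnings %>%
  ggplot()

# Step 2: on top of that (+), put income on the x axis
with_axis <- blank_page +
  aes(x = income)

# Step 3: on top of that (+), add the histogram layer
first_histogram <- with_axis +
  geom_histogram()

first_histogram  # print the plot
```

Printing blank_page or with_axis on their own shows the intermediate stages.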
First, let’s fix this awful plot by adding a theme on top of our visualization. theme_{theme name} allows us to use a theme. Make sure the theme is always on the last line.
By default, geom_histogram() chooses the bins by itself: it uses 30 bins (bins = 30), even though we don’t see it written. We want to set the bin width ourselves: binwidth = 1000.
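To see what R chose versus what we ask for, we can look at the bins behind the plot with layer_data(). This is a sketch on simulated data (an assumption); base_plot, default_bins and custom_bins are illustrative names.

```r
library(tibble)
library(dplyr)    # provides %>%
library(ggplot2)

set.seed(1)
# Simulated stand-in for the course data (assumption)
earnings <- tibble(
  id     = 1:200,
  gender = sample(c("Male", "Female"), size = 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))
)

base_plot <- earnings %>%
  ggplot() +
  aes(x = income)

# Bins R builds when we say nothing
default_bins <- layer_data(base_plot + geom_histogram())

# Bins R builds when we ask for a width of 1000
custom_bins <- layer_data(base_plot + geom_histogram(binwidth = 1000))

nrow(default_bins)                   # how many bins R chose by default
custom_bins$xmax - custom_bins$xmin  # width of each of our bins
```

Every observation is still counted exactly once; only the way the observations are grouped changes.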
R also chooses which color to fill the bins with. We want to change the color of the filling to lightskyblue inside geom_histogram(binwidth = 1000, fill={"name of the color"}).
Note that we use , to separate each argument.
To change the color of the edges of the bins, we use the color = {"name of the color"} argument inside geom_histogram(). Let’s change it to black.
To try another theme, remove the theme_classic() we set earlier and start typing theme_; R will show you suggestions.
Before we move on to interpreting the plot, I want to increase the number of values on the x axis. Currently we have 5 values, which is not really enough to see the distribution.
To increase the number of values from 5 to 10 on the x axis, use scale_x_continuous(n.breaks=10) layer.
Remember to add this before the theme!
We have the same problem on the y axis as well.
How can we increase the number of values on the y axis? For the x axis the function was scale_x_continuous(n.breaks = 10). Can you set the number of breaks on the y axis to 10?
scale_y_continuous(n.breaks =10)
Now it is time to interpret the plot.
Now that we understand our plot, we should help other people understand it too by adding a title and a subtitle.
Use the layer labs(title = "{your title}", subtitle = "{your subtitle}", x = "{x axis label}", y = "{y axis label}").
When we write text, we put it between quotation marks.
Right: labs(title = "Monthly earnings...")
Wrong: labs(title = Monthly earnings...)
Right: labs(x = "Income", y = "N. of observations")
Wrong: labs(x = Income, y = N. of observations)
earnings %>%
ggplot() +
aes(x = income) +
geom_histogram(binwidth = 1000, fill = "lightskyblue", color = "black") +
scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 10) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "Most workers earn between 15,000–25,000 TL; distribution is right-skewed with a small number of high-income outliers",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Number of Workers"
) +
theme_classic()

Now let’s make a density plot.
We want to use our data named earnings and then ( %>% )
apply the ggplot() visualization to it
Add x axis variable as income
Add density geometry: geom_density()
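These four steps translate into the short sketch below. The earnings data is simulated (an assumption), and density_plot is just an illustrative name.

```r
library(tibble)
library(dplyr)    # provides %>%
library(ggplot2)

set.seed(1)
# Simulated stand-in for the course data (assumption)
earnings <- tibble(
  id     = 1:200,
  gender = sample(c("Male", "Female"), size = 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))
)

density_plot <- earnings %>%  # use our data and then
  ggplot() +                  # apply ggplot() to it
  aes(x = income) +           # x axis variable is income
  geom_density()              # density geometry

density_plot  # print the plot
```

Compared with the histogram, only the geom_ line changes.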

Immediately apply a theme you prefer.
Well, before that, let’s install a nice package so we can use a nice theme.
We want to install the package ggthemes to use its themes.
To be able to use the themes from the ggthemes package, we need to tell R to load the package.
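In code, installing and loading look like this. The install line is commented out because it only needs to run once per computer.

```r
# Install once (uncomment on first use)
# install.packages("ggthemes")

# Tell R to use the package in this session
library(ggthemes)

# After loading, the extra themes are available, for example theme_economist()
economist_theme_loaded <- is.function(theme_economist)
economist_theme_loaded
```

library() has to be run again in every new R session; install.packages() does not.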
We were here
Now we add a theme of your choice. Start typing theme_; on the right side of the suggestions you will see ggthemes next to some of them. Choose one from there.
I don’t like the black color of the density line. Let’s change it to “azure4” using the color argument inside geom_density(color = {"color you prefer"}).
The line seems a bit thin. We can make it thicker with the linewidth argument inside geom_density(). By default it is set to geom_density(linewidth = 0.5).
geom_density(color = {"color you prefer"}, linewidth = {width number you prefer})
The label of the y axis has to change, because we are not showing counts anymore. Our heights now represent probability density, not the number of workers; we should reflect this in the y label.
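One way to see what probability density means: the area under the curve adds up to about 1, whatever the number of workers. The sketch below checks this with base R’s density() on simulated incomes. Both are assumptions: the data is not the course data, and density() is used here as a stand-in for the curve geom_density() draws.

```r
set.seed(1)
# Simulated incomes standing in for the course data (assumption)
income <- round(rlnorm(200, meanlog = log(18000), sdlog = 0.4))

# Kernel density estimate: x positions and the curve height at each position
d <- density(income)

# Area under the curve, using the trapezoid rule
widths  <- diff(d$x)
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2
area    <- sum(widths * heights)

area  # very close to 1
```

This is why the y values are tiny numbers: they are heights of a curve whose total area is 1, not counts of workers.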
The x and y values seem OK. We don’t need to use scale_x_continuous() and scale_y_continuous().
While our title is still fine, our subtitle could be better since we are not showing counts anymore. Let’s change it to “The highest concentration of earnings lies between 15,000–25,000 TL; the distribution is right-skewed with a long upper tail”.
earnings %>%
ggplot() +
aes(x = income) +
geom_density(color = "azure4", linewidth = 1.5) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "The highest concentration of earnings lies between 15,000–25,000 TL; the distribution is right-skewed",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
Now, what if we wanted to see whether the income of the workers changes based on their gender?
Since we have gender for each observation, we can use this information as well.
There are different ways to do this.
This was our code and plot:
earnings %>%
ggplot() +
aes(x = income) +
geom_density(color = "azure4", linewidth = 1.5) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "The highest concentration of earnings lies between 15,000–25,000 TL; the distribution is right-skewed",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
We want the color of the line to change based on the gender variable.
What if we change geom_density(color = "azure4") to geom_density(color = gender)?
earnings %>%
ggplot() +
aes(x = income) +
geom_density(color = gender, linewidth = 1.5) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "The highest concentration of earnings lies between 15,000–25,000 TL; the distribution is right-skewed",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
We get an error: Error: object 'gender' not found.
This is because R thinks we are giving it a color name, like red, blue, or green. But gender is not a color name. It is a variable.
Whenever we want to change something based on a variable, we do it inside aes(), not inside geom_.
This is why we set aes(x = income) instead of geom_density(x = income): x should be a variable, not just one value.
So what we want to do is go back up to aes(x = income) and add the color = gender argument to it, separating the two with a comma.
earnings %>%
ggplot() +
aes(x = income, color = gender) +
geom_density(linewidth = 1.5) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "The highest concentration of earnings lies between 15,000–25,000 TL; the distribution is right-skewed",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
We added the gender information into our plot.
Now we are no longer describing one distribution.
We are comparing two distributions.
The thing we show now is “Do earnings differ by gender?”.
Our title and subtitle should reflect this.
Change the title to “Monthly Earnings by Gender”.
Change the subtitle to “No substantial visual difference between male and female workers”.
earnings %>%
ggplot() +
aes(x = income, color = gender) +
geom_density(linewidth = 1.5) +
labs(
title = "Monthly Earnings by Gender",
subtitle = "No substantial visual difference between male and female workers",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
Now, instead of the color, we could have changed the linetype based on gender.
Can you do it?
earnings %>%
ggplot() +
aes(x = income, linetype = gender) +
geom_density(linewidth = 1.5) +
labs(
title = "Monthly Earnings by Gender",
subtitle = "No substantial visual difference between male and female workers",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Probability Density"
) +
theme_economist()
Remember our histogram code?
earnings %>%
ggplot() +
aes(x = income) +
geom_histogram(binwidth = 1000, fill = "lightskyblue", color = "black") +
scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 10) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "Most workers earn between 15,000–25,000 TL; distribution is right-skewed with a small number of high-income outliers",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Number of Workers"
) +
theme_classic()
Instead of setting one color to fill the bins, let’s fill the bins based on gender. While there, change the theme to a nicer one.
earnings %>%
ggplot() +
aes(x = income, fill = gender) +
geom_histogram(binwidth = 1000, color = "black") +
scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 10) +
labs(
title = "Monthly Earnings Distribution of Manufacturing Workers in Cilek Mobilya, Mamak",
subtitle = "Most workers earn between 15,000–25,000 TL; distribution is right-skewed with a small number of high-income outliers",
x = "Monthly Earnings (Turkish Lira, TL)",
y = "Number of Workers"
) +
theme_classic()
So many things are wrong with this plot.
Use only one variable in histograms; if you need to compare distributions across groups, use a density plot.
If your goal is to show “How many workers fall into each income range?”, a histogram is the right plot (a density plot is also OK, by the way; the interpretation differs slightly).
If your goal is to show “Do the income distributions differ by gender?”, a density plot is the right one.