Scatter Plots & Relationships

What we learned so far

Recap

We learned how to visualize the distribution of one continuous variable.
Histogram: groups data into bins, counts how many observations fall in each bin.
Density plot: estimates a smooth curve representing the distribution.
We also learned how to compare distributions across groups (e.g., income by gender).

But what if we want to ask a different kind of question?

“Is there a relationship between two variables?”

New Question

In Week 3 we asked: “How is income distributed?” → one variable.
This week we ask: “Does income inequality relate to hate crime rates?” → two variables.
When we want to see the relationship between two continuous variables, we use a scatter plot.

When to use a scatter plot?

You have two continuous variables.
You want to see if there is a pattern, trend, or correlation between them.
Each observation (row) becomes one point on the plot.
The x axis shows one variable, the y axis shows the other.
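To make "each row becomes one point" concrete, here is a minimal sketch with a made-up three-row table. The column names and values are hypothetical and are not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data: three observations, two continuous variables
toy <- data.frame(
  hours_studied = c(1, 3, 5),
  exam_score    = c(52, 68, 81)
)

# One point per row: x comes from one column, y from the other
p <- ggplot(toy) +
  aes(x = hours_studied, y = exam_score) +
  geom_point()

p

# layer_data() returns the data ggplot2 draws for a layer;
# it should have as many rows as the toy table
nrow(layer_data(p))
```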

Our Data: Hate Crimes in the US

The Data

Our data comes from a FiveThirtyEight article:
- “Higher Rates Of Hate Crimes Are Tied To Income Inequality”
It contains 50 rows: US states plus the District of Columbia.
For each state we have:
- hate_crimes_fbi: hate crimes per 100,000 people (reported to FBI)
- gini_index: a measure of income inequality (higher = more unequal)
- median_income: median household income
- highschool: share of population with a high school degree
- unemployment: share of unemployed population
- trump_vote: share of voters who voted for Trump in 2016
- region: US Census region (Northeast, Midwest, South, West)

Read the data

library(tidyverse)
library(janitor)
library(ggthemes)
read.csv("https://raw.githubusercontent.com/mucahitzor/IKT2010/refs/heads/main/data/hate_crimes.csv") %>%
  as_tibble() -> hate_crimes

1: Load the tidyverse, janitor, and ggthemes libraries
2: Read the CSV file from the URL, and then ( %>% ) pass the result to the next step
3: Convert it to a tibble and save it as hate_crimes
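As a possible alternative, readr's read_csv() returns a tibble directly. The sketch below assumes readr 2.0 or newer and uses a tiny inline CSV with made-up values (not the real file), so it runs without a download.

```r
library(readr)

# Hypothetical two-row CSV typed inline; I() marks the string as literal
# data rather than a file path (assumes readr >= 2.0)
csv_text <- I("state,gini_index,hate_crimes_fbi\nA,0.45,1.2\nB,0.47,2.5")

# read_csv() already returns a tibble, so as_tibble() is not needed
toy_crimes <- read_csv(csv_text, show_col_types = FALSE)

toy_crimes
```

With the real data, the same call would take the URL string in place of csv_text.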

Have a look at the data

hate_crimes

# A tibble: 50 × 13
   state             median_income unemployment metro_pop highschool non_citizen
   <chr>                     <int>        <dbl>     <dbl>      <dbl>       <dbl>
 1 Alabama                   42278        0.06       0.64      0.821        0.02
 2 Alaska                    67629        0.064      0.63      0.914        0.04
 3 Arizona                   49254        0.063      0.9       0.842        0.1 
 4 Arkansas                  44922        0.052      0.69      0.824        0.04
 5 California                60487        0.059      0.97      0.806        0.13
 6 Colorado                  60940        0.04       0.8       0.893        0.06
 7 Connecticut               70161        0.052      0.94      0.886        0.06
 8 Delaware                  57522        0.049      0.9       0.874        0.05
 9 District of Colu…         68277        0.067      1         0.871        0.11
10 Florida                   46140        0.052      0.96      0.853        0.09
# ℹ 40 more rows
# ℹ 7 more variables: white_poverty <dbl>, gini_index <dbl>, non_white <dbl>,
#   trump_vote <dbl>, hate_crimes_splc <dbl>, hate_crimes_fbi <dbl>,
#   region <chr>

Understanding Our Data

What are our variables?

hate_crimes %>%
  summary()

    state           median_income    unemployment       metro_pop     
 Length:50          Min.   :35521   Min.   :0.02800   Min.   :0.3100  
 Class :character   1st Qu.:48358   1st Qu.:0.04225   1st Qu.:0.6300  
 Mode  :character   Median :54613   Median :0.05100   Median :0.7900  
                    Mean   :54904   Mean   :0.04988   Mean   :0.7500  
                    3rd Qu.:60653   3rd Qu.:0.05775   3rd Qu.:0.8975  
                    Max.   :76165   Max.   :0.07300   Max.   :1.0000  
                                                                      
   highschool      non_citizen      white_poverty      gini_index    
 Min.   :0.7990   Min.   :0.01000   Min.   :0.0400   Min.   :0.4190  
 1st Qu.:0.8397   1st Qu.:0.03000   1st Qu.:0.0800   1st Qu.:0.4400  
 Median :0.8740   Median :0.04000   Median :0.0900   Median :0.4545  
 Mean   :0.8684   Mean   :0.05404   Mean   :0.0922   Mean   :0.4542  
 3rd Qu.:0.8978   3rd Qu.:0.08000   3rd Qu.:0.1000   3rd Qu.:0.4667  
 Max.   :0.9180   Max.   :0.13000   Max.   :0.1700   Max.   :0.5320  
                  NA's   :3                                          
   non_white        trump_vote     hate_crimes_splc  hate_crimes_fbi  
 Min.   :0.0600   Min.   :0.0400   Min.   :0.06745   Min.   : 0.2669  
 1st Qu.:0.1925   1st Qu.:0.4200   1st Qu.:0.14271   1st Qu.: 1.2931  
 Median :0.2750   Median :0.4950   Median :0.22620   Median : 1.9871  
 Mean   :0.3058   Mean   :0.4938   Mean   :0.30409   Mean   : 2.3676  
 3rd Qu.:0.4200   3rd Qu.:0.5775   3rd Qu.:0.35693   3rd Qu.: 3.1843  
 Max.   :0.6300   Max.   :0.7000   Max.   :1.52230   Max.   :10.9535  
                                   NA's   :3                          
    region         
 Length:50         
 Class :character  
 Mode  :character

Understanding Our Data

Non-numeric variable: region

hate_crimes %>%
  tabyl(region)

    region  n percent
   Midwest 12    0.24
 Northeast  9    0.18
     South 17    0.34
      West 12    0.24

We have 4 regions of unequal size: the South is the largest (17) and the Northeast the smallest (9), while the Midwest and West have 12 each.
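The same kind of table can be sketched with dplyr's count(). To keep the example self-contained, the region column is rebuilt here from the counts printed by tabyl(); the object names are illustrative only.

```r
library(dplyr)

# Region labels rebuilt from the counts in the tabyl() output
regions <- tibble(
  region = rep(c("Midwest", "Northeast", "South", "West"),
               times = c(12, 9, 17, 12))
)

# count() tallies rows per group; mutate() adds each group's share
region_counts <- regions %>%
  count(region) %>%
  mutate(percent = n / sum(n))

region_counts
```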

Our Goal Today

Is there a relationship between income inequality and hate crimes?

gini_index measures income inequality. Higher values = more inequality.
hate_crimes_fbi measures hate crimes per 100,000 people.
We want to see: do states with more inequality also have more hate crimes?
Both variables are continuous → we need a scatter plot.

Building a Scatter Plot Step by Step

Step 1: Start with the data

Start with our data set, hate_crimes:

hate_crimes

To make a visualization, we pipe the data set into ggplot(). This gives us a blank canvas.

hate_crimes %>%
  ggplot()

Step 2: Set the axes

Now we need to tell R what goes on the x axis and what goes on the y axis.

For scatter plots we need both x and y inside aes().

x = gini_index (income inequality)
y = hate_crimes_fbi (hate crimes per 100k)

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi)
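You may also see aes() written inside ggplot() in other tutorials. The sketch below, using hypothetical toy data, suggests that both styles lead to the same points being drawn.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x_var = c(1, 2, 3), y_var = c(2, 4, 5))

# Style used in these slides: aes() added as its own step
p1 <- ggplot(toy) + aes(x = x_var, y = y_var)

# Common alternative: aes() passed as the mapping argument of ggplot()
p2 <- ggplot(toy, aes(x = x_var, y = y_var))

# Add points to both and compare the coordinates that get drawn
d1 <- layer_data(p1 + geom_point())
d2 <- layer_data(p2 + geom_point())
identical(d1[c("x", "y")], d2[c("x", "y")])
```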

Step 3: Add the points

Remember, for histograms we used geom_histogram(), for density we used geom_density().

For scatter plots we use geom_point().

Each row in our data becomes one point on the plot.

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point()

Step 4: Add a theme

Let's immediately clean it up with a theme.

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point() +
  theme_fivethirtyeight()

Step 5: Customize the points

The default points are small and hard to see. We can change them inside geom_point():

size = how big the points are (default is about 1.5)
color = the color of the points

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue") +
  theme_fivethirtyeight()

Step 6: Fix the axes

As in Week 3, let's increase the number of values shown on both axes.

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue") +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  theme_fivethirtyeight()
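n.breaks is a target rather than an exact count: ggplot2 may choose a slightly different number to get round labels. The sketch below assumes the default continuous scale picks its breaks with the "extended" algorithm from the scales package, and uses the gini_index range from the summary() output.

```r
library(scales)

# Range of gini_index taken from the summary() output (min and max)
gini_range <- c(0.419, 0.532)

# breaks_extended(n) builds a function that proposes about n "nice" breaks
default_breaks <- breaks_extended(n = 5)(gini_range)
more_breaks    <- breaks_extended(n = 10)(gini_range)

default_breaks
more_breaks
```

Comparing the two vectors shows how asking for more breaks gives a finer axis.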

Interpreting the scatter plot

What do we see?
- As gini_index increases (more inequality), hate crimes tend to increase too.
- This suggests a positive relationship between income inequality and hate crimes.
But not all states follow this pattern perfectly.
- Some states have high inequality but low hate crimes.
- Some states have low inequality but high hate crimes.
- This is normal — we are looking at a trend, not a perfect rule.
There is one point very high up — can you guess which state that might be?

hate_crimes %>% filter(hate_crimes_fbi > 10)

# A tibble: 1 × 13
  state              median_income unemployment metro_pop highschool non_citizen
  <chr>                      <int>        <dbl>     <dbl>      <dbl>       <dbl>
1 District of Colum…         68277        0.067         1      0.871        0.11
# ℹ 7 more variables: white_poverty <dbl>, gini_index <dbl>, non_white <dbl>,
#   trump_vote <dbl>, hate_crimes_splc <dbl>, hate_crimes_fbi <dbl>,
#   region <chr>
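If we do not want to guess a cutoff such as 10, another option is to ask for the row with the largest value. This is a minimal sketch: the three lettered states and their rates are made up, and only the District of Columbia value (10.9535) comes from the summary() output.

```r
library(dplyr)

# Hypothetical rates for three made-up states, plus the maximum
# reported by summary() for the District of Columbia
toy <- tibble(
  state           = c("State A", "State B", "State C", "District of Columbia"),
  hate_crimes_fbi = c(1.3, 2.0, 3.2, 10.9535)
)

# slice_max() keeps the n rows with the largest values of a column
top_row <- toy %>%
  slice_max(hate_crimes_fbi, n = 1)

top_row
```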

Add labels

Now let's add a title, subtitle, and axis labels using labs().

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue") +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  labs(
    title = "Income Inequality and Hate Crimes in US States",
    subtitle = "States with higher Gini index tend to have more hate crimes per 100k people",
    x = "Gini Index (Income Inequality)",
    y = "Hate Crimes per 100,000 (FBI)"
  ) +
  theme_fivethirtyeight()

Coloring Points by a Group

Color by region

Remember in Week 3 we changed the color of density lines based on gender?
We can do the same thing here: color the points based on region.
Since region is a variable, we put it inside aes(), not inside geom_point().

Can you guess the code?

Color by region

We move color from geom_point(color = "steelblue") to aes(color = region):

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi, color = region) +
  geom_point(size = 3) +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  labs(
    title = "Income Inequality and Hate Crimes in US States",
    subtitle = "Colored by US Census region",
    x = "Gini Index (Income Inequality)",
    y = "Hate Crimes per 100,000 (FBI)"
  ) +
  theme_fivethirtyeight()
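The rule "a variable goes inside aes(), a fixed value goes inside geom_point()" can be checked with a small sketch. The toy data and its group column are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with one grouping column
toy <- data.frame(
  x_var = c(1, 2, 3, 4),
  y_var = c(2, 3, 5, 4),
  group = c("a", "a", "b", "b")
)

# Setting: a fixed colour, the same for every point
p_fixed <- ggplot(toy) +
  aes(x = x_var, y = y_var) +
  geom_point(color = "steelblue")

# Mapping: colour depends on a column, so ggplot2 picks one colour per group
p_mapped <- ggplot(toy) +
  aes(x = x_var, y = y_var, color = group) +
  geom_point()

# Number of distinct colours drawn in each plot
length(unique(layer_data(p_fixed)$colour))
length(unique(layer_data(p_mapped)$colour))
```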

What do we see?

The South tends to cluster in the lower-right: high inequality, but lower FBI-reported hate crimes.
The Northeast tends to have higher hate crime rates.
Does this mean the South has fewer hate crimes? Not necessarily — it could mean different reporting practices across regions.
This is an important lesson: correlation does not mean causation, and data quality matters!

Try it yourself!

Now it is your turn. Using the same hate_crimes data:
1. Make a scatter plot of median_income (x axis) vs hate_crimes_fbi (y axis).
2. Color the points by region.
3. Add a nice theme.
4. Add a title and axis labels.
5. What pattern do you see? Do richer states have more or fewer hate crimes?

Making Points Transparent

The alpha argument

Sometimes points overlap and we cannot see how many points are in the same area.
We can make points semi-transparent using alpha inside geom_point().
alpha = 1 means fully opaque (the default); alpha = 0 means fully transparent (invisible).

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi, color = region) +
  geom_point(size = 3, alpha = 0.7) +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  labs(
    title = "Income Inequality and Hate Crimes in US States",
    subtitle = "Colored by US Census region",
    x = "Gini Index (Income Inequality)",
    y = "Hate Crimes per 100,000 (FBI)"
  ) +
  theme_fivethirtyeight() +
  scale_color_fivethirtyeight()
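To see why transparency helps with overlap, here is a small sketch with hypothetical data in which three observations sit at exactly the same spot.

```r
library(ggplot2)

# Hypothetical toy data: three observations share the same coordinates
toy <- data.frame(x_var = c(1, 1, 1, 2), y_var = c(1, 1, 1, 2))

# With alpha = 0.3 the stacked points at (1, 1) look darker than the single
# point at (2, 2), which hints at how many observations overlap
p <- ggplot(toy) +
  aes(x = x_var, y = y_var) +
  geom_point(size = 6, alpha = 0.3)

p
```

With alpha = 1 the two locations would look identical, and the overlap would be hidden.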

Adding a Trend Line

What is a trend line?

A scatter plot shows us the individual data points.
But sometimes it is hard to see the overall pattern just from the points.
A trend line (also called a regression line or line of best fit) helps us see the general direction of the relationship.
If the line goes up → positive relationship (as x increases, y increases).
If the line goes down → negative relationship (as x increases, y decreases).
If the line is flat → no clear linear relationship.
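The direction of a straight trend line is the sign of its slope. As a minimal sketch with made-up numbers, we can fit the line with lm() and look at that slope directly.

```r
# Hypothetical toy data in which y rises with x
toy <- data.frame(x_var = 1:5, y_var = c(2.1, 3.9, 6.2, 7.8, 10.1))

# lm() fits the straight line y = intercept + slope * x
fit <- lm(y_var ~ x_var, data = toy)

# The sign of the slope gives the direction of the trend line:
# positive means the line goes up, negative means it goes down
slope <- coef(fit)[["x_var"]]
slope
sign(slope)
```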

Adding a trend line with geom_smooth()

We use geom_smooth() to add a trend line on top of our scatter plot.

We add method = "lm" inside geom_smooth() to get a straight line. lm stands for linear model.

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm") +
  theme_fivethirtyeight()

Understanding the trend line

The blue line shows the overall trend: as inequality increases, hate crimes tend to increase.
The gray shaded area around the line shows the confidence interval (95% by default). It tells us how uncertain the trend is.
- A narrow band means we are more confident.
- A wide band means we are less confident.
We can remove the confidence interval with se = FALSE:

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_fivethirtyeight()
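The width of the band is not constant along the line. The sketch below uses hypothetical data and assumes the band drawn by geom_smooth(method = "lm") matches the pointwise confidence interval that predict() reports for an lm() fit.

```r
# Hypothetical toy data
toy <- data.frame(
  x_var = c(1, 2, 3, 4, 5, 6, 7, 8),
  y_var = c(2.3, 2.9, 4.1, 4.8, 6.2, 6.6, 8.1, 8.7)
)

fit <- lm(y_var ~ x_var, data = toy)

# Predict at the smallest, the mean, and the largest x value
new_x <- data.frame(x_var = c(min(toy$x_var), mean(toy$x_var), max(toy$x_var)))
band  <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Width of the interval at each x: narrowest near the mean of x
band_width <- band[, "upr"] - band[, "lwr"]
band_width
```

This is why the shaded area usually flares out at the left and right edges of the plot, where we have less information.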

Putting it all together

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi) +
  geom_point(size = 3, color = "steelblue", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  labs(
    title = "Income Inequality and Hate Crimes in US States",
    subtitle = "A positive trend: states with higher inequality tend to report more hate crimes",
    x = "Gini Index (Income Inequality)",
    y = "Hate Crimes per 100,000 (FBI)"
  ) +
  theme_fivethirtyeight()

Trend line by group

What happens if we add color = region back into aes()?

hate_crimes %>%
  ggplot() +
  aes(x = gini_index, y = hate_crimes_fbi, color = region) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Income Inequality and Hate Crimes by Region",
    subtitle = "Each region gets its own trend line",
    x = "Gini Index (Income Inequality)",
    y = "Hate Crimes per 100,000 (FBI)"
  ) +
  theme_fivethirtyeight() +
  scale_color_fivethirtyeight()

Trend line by group

When color = region is inside aes(), both geom_point() and geom_smooth() split by region.
Each region gets its own points and its own trend line.
This is useful to see if the relationship is different across groups.
The direction of the relationship varies by region, which is another reason to be careful about drawing conclusions!
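Per-group trend lines correspond to fitting one straight line per group. Here is a minimal sketch with hypothetical data for two made-up groups whose trends point in opposite directions.

```r
# Hypothetical toy data: group "a" slopes up, group "b" slopes down
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x_var = c(1, 2, 3, 4, 1, 2, 3, 4),
  y_var = c(1.0, 2.1, 2.9, 4.2, 8.0, 6.9, 6.1, 5.0)
)

# split() cuts the data into one piece per group;
# sapply() fits a line to each piece and keeps its slope
slopes <- sapply(
  split(toy, toy$group),
  function(d) coef(lm(y_var ~ x_var, data = d))[["x_var"]]
)

slopes
```

A single line fitted to all eight rows would hide the fact that the two groups move in opposite directions.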

Summary

What we learned today

Scatter plot: shows the relationship between two continuous variables using geom_point().
Building a scatter plot:
1. ggplot() → blank canvas
2. aes(x = ..., y = ...) → set both axes
3. geom_point() → add the points
4. Customize with size, color, alpha
5. labs() → add title, subtitle, axis labels
Color by group: put color = variable inside aes() to color points by a categorical variable.
Trend line: geom_smooth(method = "lm") adds a line of best fit.
Key takeaway: Correlation does not imply causation! Just because two variables move together does not mean one causes the other.

Exercise

Using the hate_crimes data, explore one of these questions:

Is there a relationship between unemployment and hate_crimes_fbi?
Is there a relationship between highschool (education) and hate_crimes_fbi?
Is there a relationship between median_income and gini_index?

For the question you choose:

1. Make a scatter plot with geom_point().
2. Add a trend line with geom_smooth(method = "lm").
3. Color by region.
4. Add a title and axis labels.
5. Interpret: what does the plot tell you?