Trends in Space

In this page we will explore where UFO sightings are reported in the United States. First we will visualize where throughout the U.S. these reports are being made, and then compare reports made in urban/suburban areas (i.e. close to major cities), and those made in rural areas (i.e. far from major cities). We separate these two groups using a cutoff criteria of being within 5 miles of a city with more than 200,000 residents.

## Default packages & settings
# Load packages
library(tidyverse)
library(knitr)
library(leaflet)
library(usmap)

# Set seed for reproducibility
set.seed(1)

# Set default figure options
knitr::opts_chunk$set(
  fig.width = 6,
  out.width = "90%"
)

theme_set(theme_bw() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d

## Data import
# Import UFO data
df_ufo = read_csv("data/ufo_clean.csv")

# Add population data
df_pop = 
  read_csv("data/us_census.csv") |>
  rename(state = abbrv) |> 
  select(state, census_2010)

Reports across the U.S.

Raw Data

To begin, we simply visualize a random 2% of these reports to get a sense of where these sightings are throughout the country.

df_ufo |> 
  sample_frac(0.02) |> 
  leaflet() |> 
  addProviderTiles(providers$CartoDB.Positron) |> 
  addCircleMarkers(~city_longitude,~city_latitude, radius = 1)

At first glance, these reports seem to cluster along the coasts and in other regions of high population density. So, we will adjust for population at the state level and re-visualize the results.

Population-adjusted reports

df_ufo |> 
  group_by(state) |> 
  summarize(n_obs = n()) |> 
  left_join(df_pop, by = join_by(state)) |> 
  mutate(obs_per = n_obs/census_2010*100000) |> 
  plot_usmap(data = _, values = "obs_per", color = "#333333") +
  labs(
    title = "UFO Reports per 100,000 Population"
  ) + 
  scale_fill_continuous(name = "Reports per 100k") +
  theme(legend.position = "bottom")

This heatmap reveals three clusters with particularly high UFO reports. The Northwest, including Washington, Oregon, Idaho, Montana, and Alaska are the 5 states with the highest population-adjusted UFO reports ranging from 66 (AK) to 81 (WA) UFO reports per 100,000 residents. Next is northern New England, with New Hampshire, Vermont, and Maine taking positions 6-8 in the list. These states have between 61 (ME) and 66 (NH) UFO reports per 100,000. Rounding out the top 10, we have Arizona and New Mexico, with 61 and 59 reports per 100,000 residents, respectively.

Table 1. States with highest reports per 100,000 population

## Table form
df_ufo |> 
  group_by(state) |> 
  summarize(n_obs = n()) |> 
  left_join(df_pop, by = join_by(state)) |> 
  mutate(obs_per = n_obs/census_2010*100000) |> 
  arrange(desc(obs_per)) |> 
  head(10) |> 
  kable(
    col.names = c("State", "Number of UFO Sightings",
                  "Population (2010 census)", "UFO sightings per 100k"),
    digits = 1)

State	Number of UFO Sightings	Population (2010 census)	UFO sightings per 100k
WA	5443	6724540	80.9
MT	768	989415	77.6
OR	2818	3831074	73.6
ID	1062	1567582	67.7
AK	473	710231	66.6
NH	871	1316470	66.2
VT	405	625741	64.7
ME	818	1328361	61.6
AZ	3894	6392017	60.9
NM	1223	2059179	59.4

If we cluster the number of observations by nearest large city, we see a similar result. The two cities with the most UFO sightings nearby are are Seattle, WA and Portland, OR. These are large cities but not the largest in the U.S. The West Coast seems to have more reports than other areas, though St. Louis is a bit of an outlier in this regard.

Table 2. Cities with most sightings nearby

df_ufo |> 
  group_by(closest_city) |> 
  summarize(n_obs = n()) |> 
  arrange(desc(n_obs)) |> 
  head(10) |> 
  kable()

closest_city	n_obs
Portland	2646
Seattle	2472
Chicago	1842
Los Angeles	1378
St. Louis	1361
Atlanta	1332
Sacramento	1254
Arlington	1211
Philadelphia	1191
Spokane	1174

Distance to cities

We also plotted the distribution of the dist variable, the distance to the nearest city of at least 200,000 residents. This histogram shows that the most reports occur close to large cities, but there is a long tail of reports that were made further away. There are also 1065 reports that were made further than 200 miles from a large city, and 90 that were more than 400 miles away.

df_ufo |> 
  filter(dist <= 200) |> 
  ggplot(aes(x = dist)) +
  geom_histogram(bins = 21, alpha = 0.7, col = I("black")) +
  labs(
    title = "Distribution of reports by distance",
    y = "Number of reports",
    x = "Distance to large city (miles)"
  )

Rural vs Urban Differences

With this preliminary understanding of the geographic distribution of UFO reports, we wanted to explore if the distance from large cities is related to characteristics of the report. To do this, we separate reports into urban/suburban and rural categories, using a cutoff value of 5 miles to a large city. This is not a perfect division, as this distinction should really be done by population density, but should be servicable to identify major differences, as will be seen below.

UFO Shape

Our first question was whether the type of object reported differed between these group, as determined by the shape of the UFO.

df_loc_shape = 
  df_ufo |> 
  group_by(location, shape) |> 
  summarize(n_obs = n()) |> 
  arrange(desc(n_obs)) |> 
  pivot_wider(
    names_from = "location",
    values_from = "n_obs"
  ) 

df_loc_shape |> 
  head(10) |> 
  kable(col.names = c("Shape", "Rural", "Urban/Suburban"))

Shape	Rural	Urban/Suburban
light	16728	5065
circle	8559	2632
triangle	7531	2210
fireball	5714	1864
unknown	5482	1648
other	5298	1749
sphere	5130	1722
disk	4098	1381
oval	3360	1116
formation	2677	848

Both rural and urban reports look similar, with the same ranking of the reported shapes. The most common descriptor of the UFO was just a light, followed by either circle or triangle. To analyze whether descriptions of UFOs varied between these groups, we conducted a Chi-Squared test of homogeneity. This revealed that the distribution of shapes did differ significantly between the urban/suburban and rural categories.

df_loc_shape |> 
  select(-shape) |> 
  as.matrix() |> 
  chisq.test()

## 
##  Pearson's Chi-squared test
## 
## data:  as.matrix(select(df_loc_shape, -shape))
## X-squared = 72.097, df = 22, p-value = 3.082e-07

This result is not being influenced by the comparatively few observations for some shapes. If the data set is restricted to only the top 10 or top 15 shapes, the p-value is still far below 0.05.

Encounter Duration

Next, we wish to consider whether the length of UFO encounters is different between the urban/suburban and rural groups. First, we will look at the data, which skewed right (more observations on the second scale than the minute scale, and more on the minute scale than the hour scale). So to understand the distribution, we will look at the data on the log scale.

df_ufo |> 
  drop_na(duration_clean) |> 
  filter(duration_clean != 0) |>
  ggplot(aes(x = duration_clean, fill = location)) +
  geom_histogram(alpha = 0.5, bins = 41, col = I("black")) +
  scale_x_continuous(
    trans = "log10",
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
                     ) +
  labs(
    title = "Distribution of encounter duration",
    y = "Number of reports",
    x = "Encounter Duration (seconds)",
    fill = "Location"
  )

Even though the data are highly right-skewed, there are enough observations that it is still valid to compare the data using a t-test. There are 1266 reports with a duration over 2 hours, comprising about 1% of the sample. These reports range from several hours to several months. I will restrict the comparison of the data to those with encounters lasting at most hours.

df_ufo |> 
  drop_na(duration_clean) |> 
  filter(duration_clean <= 7200) |> 
  t.test(duration_clean ~ location, data = _)

## 
##  Welch Two Sample t-test
## 
## data:  duration_clean by location
## t = 6.2378, df = 38429, p-value = 4.485e-10
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
##  38.34605 73.48570
## sample estimates:
## mean in group rural mean in group urban 
##            625.4033            569.4874

The results of this test suggest that there is a significant difference between the average duration of the rural UFO sightings and the urban UFO sightings. We estimate that sightings in rural areas last for approximately one minute longer than sightings in urban areas (10.5 minutes vs. 9.5 minutes).