We began with a collection of UFO reports pulled from the National UFO Reporting Center (NUFORC) by Timothy Renner. This CSV, nuforc_reports.csv, was obtained from https://data.world/timothyrenner. We then selected only reports from the U.S.A.
library(tidyverse)

df_reports =
  read_csv("data/nuforc_reports.csv") |>
  # keep only U.S. reports (the country labels are inconsistent)
  filter(country %in% c("USA", "usa", "U.S.A.", "United States of America")) |>
  select(-c(summary, country, stats, report_link)) |>
  drop_na() |>
  # drop reports tagged with Canadian province codes
  filter(!(state %in% c("AB", "BC", "MB", "NB", "NS", "SK", "ON"))) |>
  # drop clearly misgeocoded coordinates (keep the Western Hemisphere, north of 10 degrees N)
  filter(city_longitude < -30, city_latitude > 10)
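As a quick sanity check on the filtering (a sketch only; the exact counts depend on the snapshot of nuforc_reports.csv used), the remaining state values should all be U.S. state codes:

# Sketch: tally the most common state codes after filtering
df_reports |>
  count(state, sort = TRUE) |>
  head(10)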
The duration data in the reports was not standardized and varied wildly in format. Examples include “~8 minutes”, “30 SECS”, or “3 hours or more”. To resolve this, we standardized the durations by writing a regex-based function that pulls both the numeric value and the time unit (seconds, minutes, or hours) out of each report. The output is a duration in seconds, which we obtained for about 90% of the reports. The missing 10% came from duration values such as “Few minutes” or “18:00”, which were not interpretable in a standard fashion.
time_str = function(str){
  # Pull the numeric value from the string; clock-style strings
  # such as "18:00" are treated as uninterpretable
  if (str_detect(str, "\\:")) {
    num = NA
  } else {
    num =
      str_extract(str, "\\d*\\.?\\d+(?= *[smhSMH])") |>
      as.numeric()
  }
  # Detect the reported unit (sec, min, h) and convert it to seconds
  if (str_detect(str, "[Ss][Ee][Cc]")) {
    unit = 1
  } else if (str_detect(str, "[Mm][Ii][Nn]")) {
    unit = 60
  } else if (str_detect(str, "[Hh]")) {
    unit = 3600
  } else {
    unit = NA
  }
  # Combine value and unit (NA if either is missing)
  return(num * unit)
}
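Applied to the example strings above, the function behaves as intended (the values in the comments follow from the regexes; behavior on other free-text formats will vary):

time_str("~8 minutes")       # 8 * 60   = 480 seconds
time_str("30 SECS")          # 30 * 1   = 30 seconds
time_str("3 hours or more")  # 3 * 3600 = 10800 seconds
time_str("Few minutes")      # NA (no numeric value to extract)
time_str("18:00")            # NA (clock-style strings are skipped)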
df_reports =
  df_reports |>
  mutate(duration_clean = map(duration, time_str)) |>
  unnest(duration_clean)
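The roughly 90% coverage mentioned above can be checked directly (a sketch; the exact proportion depends on the data snapshot):

# Proportion of reports with an interpretable duration
mean(!is.na(df_reports$duration_clean))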
We wanted to examine how distance from urban centers impacted UFO reports. To do this, we used a collection of data on U.S. cities from https://simplemaps.com/data/us-cities, and selected only cities with a population over 200,000.
df_cities <- read_csv("data/uscities.csv") |>
  filter(population > 200000) |>
  select(city, lat, lng)
We then created a function to compute the distance from each UFO report to the closest city with a population over 200,000. We use the haversine formula to calculate distance over a sphere from latitude and longitude; this is implemented in R by the haversine() function from the pracma package.
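As a quick illustration (the distance in the comment is approximate), haversine() takes two (latitude, longitude) pairs in decimal degrees and returns the great-circle distance in kilometers:

# Approximate great-circle distance from New York City to Los Angeles
pracma::haversine(c(40.7128, -74.0060), c(34.0522, -118.2437))  # roughly 3900 km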
dist_close_city = function(x, y){
  # pull in cities data from df_cities
  lat = df_cities |> pull(lat)
  lng = df_cities |> pull(lng)
  cities = df_cities |> pull(city)
  n = length(lat)
  # calculate the distance (in km) from the report to every large city
  vec = vector(mode = "numeric", length = n)
  for (i in 1:n){
    vec[i] = pracma::haversine(c(x, y), c(lat[i], lng[i]))
  }
  # determine the minimum distance and convert from km to miles
  out = tibble(
    dist_km = min(vec),
    closest_city = cities[match(dist_km, vec)],
    dist = dist_km * 0.621
  ) |>
    select(-dist_km)
  return(out)
}
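For example, a report geocoded near downtown Chicago should come back as being a few miles from Chicago (a sketch; the coordinates below are illustrative and the exact output depends on the city centroids in uscities.csv):

dist_close_city(41.88, -87.63)
# expected: a one-row tibble with closest_city = "Chicago" and dist of a few miles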
This function was then applied to our dataset to calculate the
distances, and a location
variable was added indicating
whether the report took place within 5 miles of a large U.S. city.
df_reports =
  df_reports |>
  mutate(out = map2(city_latitude, city_longitude, dist_close_city)) |>
  unnest(out) |>
  mutate(location = case_when(
    dist <= 5 ~ "urban",
    dist > 5 ~ "rural"
  ))
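A quick tally of the new variable (a sketch; the counts depend on the data snapshot) shows how the reports split between the two groups:

df_reports |>
  count(location)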
This clean version of the dataset will be exported and used for our analyses on other pages.
df_reports |>
  write_csv(file = "data/ufo_clean.csv")
The variables in our dataset are:

city
: The city where the sighting occurred.
state
: The state where the sighting occurred.
date_time
: The date and time of the sighting.
shape
: The shape of the UFO.
duration
: The duration of the UFO sighting, as originally reported.
text
: The text of the report.
posted
: The time that the report was posted.
city_latitude
: The latitude of where the sighting occurred.
city_longitude
: The longitude of where the sighting occurred.
duration_clean
: The duration in seconds of the UFO sighting.
dist
: The distance in miles to the closest city with a population of 200,000 or more.
closest_city
: The closest city with a population of 200,000 or more.
location
: Whether the sighting was urban or rural. Urban was defined as within 5 miles of a city with population in excess of 200,000.