Initial Data File

We began with a collections of UFO reports pulled from the NUFORC by Timothy Renner. This CSV, nuforc_reports.csv was obtained from https://data.world/timothyrenner. We then selected only reports from the U.S.A.

df_reports = 
  read_csv("data/nuforc_reports.csv") |> 
  filter(country %in% c("USA", "usa", "U.S.A.", "United States of America")) |> 
  select(-c(summary, country, stats, report_link)) |> 
  drop_na() |> 
  filter(!(state %in% c("AB", "BC", "MB", "NB", "NS", "SK", "ON"))) |> 
  filter(city_longitude < -30, city_latitude > 10)

Duration

The duration data in the reports was not standardized and varied wildly in format. Examples include “~8 minutes”, “30 SECS”, or “3 hours or more”. To resolve this we standardized the time by writing a regex function that pulled both numeric values and time units (seconds, minutes, or hours) out of these reports. Our output was a value in seconds, and was obtained for about 90% of these reports. The missing 10% came from duration values such as “Few minutes” or “18:00” which were not interpretable in a standard fashion.

time_str = function(str){
  # Pulls numeric values from string
  if (str_detect(str, "\\:")) {
    num = NA
  } else {
    num = 
      str_extract(str,"\\d*\\.*\\d+(?= *[s|m|h|S|M|H])") |> 
      as.numeric()}
  
  # Detects reported units (sec, min, h)
  if (str_detect(str, "[S|s][E|e][C|c]")) {
    unit = 1
  } else if (str_detect(str, "[M|m][I|i][N|n]")) {
    unit = 60
  } else if (str_detect(str, "[H|h]")) {
    unit = 3600
  } else {
    unit = NA
  }
  
  #  Combines outputs
  return(num * unit)
}
df_reports = 
  df_reports |> 
  mutate(duration_clean = map(duration, time_str)) |> 
  unnest(duration_clean)

Distance

We wanted to examine how distance from urban centers impacted UFO reports. To do this, we used a collection of data on U.S. cities from https://simplemaps.com/data/us-cities, and selected only cities with a population over 200,000.

df_cities <- read_csv("data/uscities.csv") |>
  filter(population > 200000) |> 
  select(city,lat,lng)

We then created a function to identify the distance from the UFO report to the closest city with a population over 200,000. We are using the haversine formula to calculate distance over a sphere using latitude and longitude. This is implemented in R using the haversine() function from the pracma package.

dist_close_city = function(x,y){
  # pull in cities data from df_cities
  lat = df_cities |> pull(lat)
  lng = df_cities |> pull(lng)
  cities = df_cities |> pull(city)
  n = length(lat)
  
  # calculate distance to all cities
  vec = vector(mode = "numeric", length = n)
  cit_vec = vector(mod = "numeric", length = n)
  
  for (i in 1:n){
    vec[i] = pracma::haversine(c(x,y),c(lat[i],lng[i]))
  }
  
  # determine min distance and change from km to miles
  out = tibble(
    dist_km = min(vec),
    closest_city = cities[match(dist_km, vec)],
    dist = dist_km * 0.621
  ) |> 
    select(-dist_km)
  
  return(out)
}

This function was then applied to our dataset to calculate the distances, and a location variable was added indicating whether the report took place within 5 miles of a large U.S. city.

df_reports =
  df_reports |> 
  mutate(out = map2(city_latitude, city_longitude, dist_close_city)) |> 
  unnest(out) |> 
  mutate(location = case_when(
           dist <= 5 ~ "urban",
           dist > 5 ~ "rural"
         )) 

Export

This clean version of the dataset will be exported and used for our analyses on other pages.

df_reports |> 
  write_csv(file = "data/ufo_clean.csv")

Final Dataset Description

The variables in our dataset are:

  • city: The city where the sighting occurred.
  • state: The state where the sighting occurred.
  • date_time: The date and time of the sighting.
  • shape: The shape of the UFO.
  • duration: The duration of the UFO sighting.
  • text: The text of the report.
  • posted: The time that the report was made.
  • city_latitude: The latitude of where the sighting occured.
  • city_longitude: The longitude of where the sighting occured.
  • duration_clean: The duration in seconds of the UFO sighting.
  • dist: The distance in miles to the closest city with a population of 200,000 or more.
  • closest_city: The closest city with a population of 200,000 or more.
  • location: Whether the sighting was urban or rural. Urban was defined as within 5 miles of a city with population in excess of 200,000.