bucky.util.update_data_repos

Data Updating Utility (bucky.util.update_data_repos).

A utility for fetching updated data for mobility and case data from public repositories.

This module pulls from public git repositories and preprocesses the data if necessary. For case data, unallocated or unassigned cases are distributed as necessary.

Module Contents

Functions

distribute_data_by_population(total_df, dist_vect, data_to_dist, replace)

Distributes data by population across a state or territory.

distribute_mdoc(df, csse_deaths_file)

Distributes Michigan Department of Corrections data across Michigan counties by population.

distribute_territory_data(df, add_american_samoa)

Distributes territory-wide case and death data for territories.

distribute_unallocated_csse(confirmed_file, deaths_file, hist_df)

Distributes unallocated historical case and death data from CSSE.

distribute_utah_data(df, csse_deaths_file)

Distributes Utah case data for local health departments spanning multiple counties.

get_county_population_data(csse_deaths_file, county_fips)

Uses the JHU CSSE deaths file to get county population data as a fraction of the total population across a list of counties.

get_timeseries_data(col_name, filename, fips_key='FIPS', is_csse=True)

Transforms a historical data file into a DataFrame with FIPS, date, and case or death data.

git_pull(abs_path)

Updates a git repository given its path.

main()

Uses git to update public data repos.

process_csse_data()

Performs pre-processing on CSSE data.

update_hhs_hosp_data()

Retrieves updated historical data from healthdata.gov and writes to CSV.

Attributes

ADD_AMERICAN_SAMOA

MI_PRISON_UIDS

TERRITORY_DATA

UT_LHD_UIDS

bucky.util.update_data_repos.ADD_AMERICAN_SAMOA = False
bucky.util.update_data_repos.MI_PRISON_UIDS = [84070004, 84070005]
bucky.util.update_data_repos.TERRITORY_DATA
bucky.util.update_data_repos.UT_LHD_UIDS = [84070015, 84070016, 84070017, 84070018, 84070019, 84070020]
bucky.util.update_data_repos.distribute_data_by_population(total_df, dist_vect, data_to_dist, replace)

Distributes data by population across a state or territory.

Parameters
  • total_df (pandas.DataFrame) – DataFrame containing confirmed and death data indexed by date and FIPS code

  • dist_vect (pandas.DataFrame) – Population data for each county as proportion of total state population, indexed by FIPS code

  • data_to_dist (pandas.DataFrame) – Data to distribute, indexed by date

  • replace (bool) – If true, distributed values overwrite current historical data in DataFrame. If false, distributed values are added to current data

Returns

total_df – Modified input dataframe with distributed data

Return type

pandas.DataFrame
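
The idea of distributing by population can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the function name `distribute_by_fraction`, the `pop_fraction` column, and the `Confirmed`/`Deaths` sample values are assumptions made for the example.

```python
import pandas as pd


def distribute_by_fraction(total_df, dist_vect, data_to_dist, replace):
    """Spread date-indexed totals across counties by population fraction.

    Assumed (hypothetical) layout: total_df is indexed by (date, FIPS),
    dist_vect has a "pop_fraction" column indexed by FIPS, and
    data_to_dist is indexed by date.
    """
    # One scaled copy of the totals per county, tagged with that county's FIPS
    pieces = [
        data_to_dist.mul(frac).assign(FIPS=fips)
        for fips, frac in dist_vect["pop_fraction"].items()
    ]
    shares = pd.concat(pieces).set_index("FIPS", append=True)
    if replace:
        # Distributed values win; rows not covered keep their old values
        return shares.combine_first(total_df)
    # Otherwise add the shares onto whatever is already recorded
    return total_df.add(shares, fill_value=0)


# Made-up sample data: two counties on one date
idx = pd.MultiIndex.from_tuples(
    [("2020-04-01", 1), ("2020-04-01", 2)], names=["date", "FIPS"]
)
total_df = pd.DataFrame(
    {"Confirmed": [10.0, 20.0], "Deaths": [1.0, 2.0]}, index=idx
)
dist_vect = pd.DataFrame(
    {"pop_fraction": [0.25, 0.75]}, index=pd.Index([1, 2], name="FIPS")
)
data_to_dist = pd.DataFrame(
    {"Confirmed": [8.0], "Deaths": [4.0]},
    index=pd.Index(["2020-04-01"], name="date"),
)

added = distribute_by_fraction(total_df, dist_vect, data_to_dist, replace=False)
replaced = distribute_by_fraction(total_df, dist_vect, data_to_dist, replace=True)
```

With `replace=False` the county shares are added to the existing counts; with `replace=True` they overwrite them.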

bucky.util.update_data_repos.distribute_mdoc(df, csse_deaths_file)

Distributes Michigan Department of Corrections data across Michigan counties by population.

Parameters
  • df (pandas.DataFrame) – Current historical DataFrame indexed by FIPS and date, which includes MDOC and FCI data

  • csse_deaths_file (str) – File location of CSSE deaths file (contains population data)

Returns

df – Modified historical dataframe with Michigan prison data distributed and added to Michigan data

Return type

pandas.DataFrame

bucky.util.update_data_repos.distribute_territory_data(df, add_american_samoa)

Distributes territory-wide case and death data for territories.

Uses county population to distribute cases for US Virgin Islands, Guam, and CNMI. Optionally adds a single case to the most populous American Samoan county.

Parameters
  • df (pandas.DataFrame) – Current historical DataFrame indexed by FIPS and date, which includes territory-wide case and death data

  • add_american_samoa (bool) – If true, adds 1 case to American Samoa

Returns

df – Modified historical dataframe with territory-wide data distributed to counties

Return type

pandas.DataFrame

bucky.util.update_data_repos.distribute_unallocated_csse(confirmed_file, deaths_file, hist_df)

Distributes unallocated historical case and death data from CSSE.

JHU CSSE data contains state-level unallocated data, indicated with “Unassigned” or “Out of” for each state. This function distributes these unallocated cases based on the proportion of cases in each county relative to the state.

Parameters
  • confirmed_file (str) – filename of CSSE confirmed data

  • deaths_file (str) – filename of CSSE death data

  • hist_df (pandas.DataFrame) – current historical DataFrame containing confirmed and death data indexed by date and FIPS code

Returns

hist_df – modified historical DataFrame with cases and deaths distributed

Return type

pandas.DataFrame
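
The proportional allocation described above can be illustrated for a single state on a single date. This is a sketch under stated assumptions: `spread_unallocated` is a hypothetical helper, the sample counts are made up, and the real function works on the full historical DataFrame rather than one Series.

```python
import pandas as pd


def spread_unallocated(county_cases, unallocated):
    """Allocate a state's unallocated count in proportion to county cases.

    county_cases: Series of case counts indexed by FIPS (one state, one date).
    unallocated: scalar count reported as "Unassigned" or "Out of" the state.
    """
    state_total = county_cases.sum()
    if state_total == 0:
        # No county has cases yet, so there is no proportion to follow
        return county_cases.astype(float)
    weights = county_cases / state_total
    return county_cases + unallocated * weights


# Made-up example: 8 unallocated cases split 3:1 between two counties
county_cases = pd.Series([30, 10], index=pd.Index([49001, 49003], name="FIPS"))
adjusted = spread_unallocated(county_cases, 8)
```

The adjusted counts sum to the county total plus the unallocated count, so no cases are lost or duplicated.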

bucky.util.update_data_repos.distribute_utah_data(df, csse_deaths_file)

Distributes Utah case data for local health departments spanning multiple counties.

Utah has 13 local health districts, six of which span multiple counties. This function distributes those cases and deaths by population across their constituent counties.

Parameters
  • df (pandas.DataFrame) – DataFrame containing historical data indexed by FIPS and date

  • csse_deaths_file (str) – File location of CSSE deaths file

Returns

df – Modified DataFrame containing corrected Utah historical data indexed by FIPS and date

Return type

pandas.DataFrame

bucky.util.update_data_repos.get_county_population_data(csse_deaths_file, county_fips)

Uses the JHU CSSE deaths file to get county population data as a fraction of the total population across a list of counties.

Parameters
  • csse_deaths_file (str) – filename of CSSE deaths file

  • county_fips (numpy.ndarray) – list of FIPS to return population data for

Returns

population_df – DataFrame with population fraction data indexed by FIPS

Return type

pandas.DataFrame
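
A minimal sketch of the population-fraction calculation follows. It relies on the `FIPS` and `Population` columns of the CSSE deaths file; the function name `population_fractions`, the in-memory sample CSV, and the choice to keep the fractions in a column named `Population` are assumptions for illustration, not the module's actual code.

```python
import io

import numpy as np
import pandas as pd


def population_fractions(csse_deaths_file, county_fips):
    """Return each listed county's share of the listed counties' population."""
    pop = pd.read_csv(csse_deaths_file, usecols=["FIPS", "Population"])
    # Keep only the requested counties, then normalize by their combined total
    pop = pop[pop["FIPS"].isin(county_fips)].set_index("FIPS")
    return pop / pop["Population"].sum()


# Made-up stand-in for the deaths file (the real file has many more columns)
sample_csv = io.StringIO(
    "FIPS,Population,1/22/20\n"
    "49001,100,0\n"
    "49003,300,0\n"
    "49005,600,0\n"
)
population_df = population_fractions(sample_csv, np.array([49001, 49003]))
```

Because the fractions are normalized over the requested counties only, they sum to 1 regardless of how many other counties the file contains.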

bucky.util.update_data_repos.get_timeseries_data(col_name, filename, fips_key='FIPS', is_csse=True)

Transforms a historical data file into a DataFrame with FIPS, date, and case or death data.

Parameters
  • col_name (str) – Column name to extract from data.

  • filename (str) – Location of filename to read.

  • fips_key (str, optional) – Key used in file for indicating county-level field.

  • is_csse (bool, optional) – Indicates whether the file is CSSE data. If True, certain areas without FIPS are included.

Returns

df – Dataframe with the historical data indexed by FIPS, date

Return type

pandas.DataFrame
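
CSSE time series files are wide, with one column per date, so the transformation amounts to a wide-to-long reshape. The sketch below shows one way to do that with `pandas.melt`. The helper name `wide_to_long`, the rule that a date column is any column whose name starts with a digit, and the sample rows are assumptions; the `is_csse` handling of areas without FIPS is not reproduced.

```python
import io

import pandas as pd


def wide_to_long(filename, col_name, fips_key="FIPS"):
    """Reshape a wide per-date file into a frame indexed by (FIPS, date)."""
    wide = pd.read_csv(filename).dropna(subset=[fips_key])
    wide = wide.astype({fips_key: int})
    # Assumption: date columns look like "1/22/20", so they start with a digit
    date_cols = [c for c in wide.columns if c[:1].isdigit()]
    long_df = wide.melt(
        id_vars=[fips_key],
        value_vars=date_cols,
        var_name="date",
        value_name=col_name,
    )
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    return long_df.set_index([fips_key, "date"]).sort_index()


# Made-up two-county, two-day sample in the wide layout
sample_csv = io.StringIO(
    "FIPS,Admin2,1/22/20,1/23/20\n"
    "1001,Autauga,0,1\n"
    "1003,Baldwin,2,3\n"
)
df = wide_to_long(sample_csv, "Confirmed")
```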

bucky.util.update_data_repos.git_pull(abs_path)

Updates a git repository given its path.

Parameters

abs_path (str) – Absolute path of the repository to update
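
One way to update a repository from Python is to call the `git` command line through `subprocess`. This is only a sketch: the helper names are hypothetical, and the module itself may use a different mechanism to run the pull.

```python
import subprocess


def build_pull_command(abs_path):
    """Build a `git pull` command that runs inside the given repository."""
    # `git -C <path>` makes git behave as if started in <path>
    return ["git", "-C", abs_path, "pull"]


def git_pull_sketch(abs_path):
    """Run the pull and return (exit code, captured stdout)."""
    result = subprocess.run(
        build_pull_command(abs_path),
        capture_output=True,
        text=True,
        check=False,  # report failures via the exit code instead of raising
    )
    return result.returncode, result.stdout
```

Passing the command as a list avoids shell quoting problems when the path contains spaces.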

bucky.util.update_data_repos.main()

Uses git to update public data repos.

bucky.util.update_data_repos.process_csse_data()

Performs pre-processing on CSSE data.

CSSE data is separated into two different files: confirmed cases and deaths. These two files are combined into one DataFrame, indexed by FIPS and date, with two columns, Confirmed and Deaths.

This function distributes CSSE data that is either unallocated or territory-wide instead of county-wide. Michigan data from the state Department of Corrections and the Federal Correctional Institution is distributed to Michigan counties. New York City data, which is currently all placed in one county (New York County), is distributed to the other NYC counties. Territory data for Guam, CNMI, and the US Virgin Islands is also distributed. The resulting data is written to a CSV.
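
The first step, joining the confirmed and deaths series into one two-column frame, might look like the following. This is an illustrative sketch with made-up values and a hypothetical helper name, `combine_csse`; the distribution steps that follow it are not shown.

```python
import pandas as pd


def combine_csse(confirmed, deaths):
    """Join two (FIPS, date)-indexed Series into Confirmed/Deaths columns."""
    return pd.concat(
        [confirmed.rename("Confirmed"), deaths.rename("Deaths")], axis=1
    )


# Made-up sample: one county over two dates
idx = pd.MultiIndex.from_tuples(
    [(26163, "2020-04-01"), (26163, "2020-04-02")], names=["FIPS", "date"]
)
confirmed = pd.Series([100, 120], index=idx)
deaths = pd.Series([3, 5], index=idx)
combined = combine_csse(confirmed, deaths)
```

Concatenating along `axis=1` aligns the two series on their shared (FIPS, date) index, so rows present in only one file would show up with a missing value in the other column.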

bucky.util.update_data_repos.update_hhs_hosp_data()

Retrieves updated historical data from healthdata.gov and writes to CSV.