Advanced Python
Homework Discussion, Pandas Part II
9.1 Analyze COVID-19 Infection Rates or U.S. Population Rates by County over Time
1: import the data
I used the index_col= parameter to pd.read_csv() to set the 'date' column as the index. You can also accomplish this with the DataFrame .set_index() method after importing the data.
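A minimal sketch of that import, assuming the case data lives in a file named us-counties.csv (the actual filename may differ in your copy of the assignment):

import pandas as pd

# read the case data with the 'date' column as the index
cases = pd.read_csv('us-counties.csv', index_col='date')

# equivalent alternative: import first, then set the index afterward
# cases = pd.read_csv('us-counties.csv').set_index('date')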
It's also important to set the type of the index to a datetime type. You can accomplish this with the following command:
cases.index = cases.index.astype('datetime64[ns]')
[ns] stands for 'nanosecond' and refers to the datetime's precision (the other option is a timezone-aware datetime). Finally, I was tempted to open the data in Excel, but I wanted the notebook to let me browse the data as I wished, so I set the max_rows option as noted in the assignment:
pd.set_option('display.max_rows', 500)
2: explore the DataFrame
Please perform each of the operations on the DataFrame to explore its size, shape and contents.

3: count of unique states found in the DataFrame
For the count of unique states, I used the Series method .unique() on the state name column and passed the result to len() to see how many there were. I also found that the unique values have a .size attribute that gives the same count. Always good to look for another way! (Steps 3-5 are sketched below, after step 10.)

4: show those 'states' not in a 50-state list
I made the 'state' column into a set() and then used the .difference() method to compare it to the 50-state list imported from the state_name_abbrev.csv reference file. But there is a pandas-centric way to do this:
1. read the state-name file; the 'name' column holds the 50 state names
2. use .unique() on cases.state to get the unique states from the main DataFrame
3. filter these unique states to those that are not in states['name'] using the .isin() method

5: count of unique counties represented in the data
I think the easy way to do this is a two-column .groupby() on 'county' and 'state'. This creates a groupby object with one group for each county-state combination, and you can see how many groups there are by passing the groupby object to len(). (You don't even need an aggregation method like .count() or .sum().) Other solutions I found: creating a new column with the state and county concatenated and calling .unique() on it; or creating a set of 2-item tuples built from the .values of the two columns and taking its len(). (This last solution is Python-centric, and we should always be wary of "thinking in Python" when we could much more efficiently "think in pandas".)

6: cases by state, sorted
This is a simple .groupby() on the state, using the .sum() method. Using the .cases attribute to isolate the 'cases' column results in a Series with a state index and 'cases' values. Then use .sort_values() on the Series with ascending=False to sort from highest to lowest. (See the second sketch below.)

7: bar chart: cases by state, 10 highest
Slice the Series produced in the previous exercise down to its first 10 items. You can .plot() the resulting Series with kind='bar'; I also added the argument rot=45 to angle the x labels.

8: show cases by county for a given state
Taking user input for a state, slice out the rows for that state, then group by county with .sum(), access .cases, and use .sort_values() with ascending=False as we did in 'cases by state'.

9: bar chart: cases by county for a given state, 10 highest
Slice the Series produced in the previous exercise down to its first 10 items. Again, you can .plot() the resulting Series with kind='bar' and rot=45 to angle the x labels.

10: pie chart showing all 50 states and each share of cases
Going back to the 'cases by state, sorted' calculation, you can call .plot() on the Series with kind='pie', figsize=(5, 5) and labels=None. The last argument is necessary to keep the chart from listing all 50 states in an unreadable jumble. You can pass a list of labels to this parameter, but the list must be the same length as the Series. To set a legend, I had to assign the call to .plot() to a variable (this is a subplot object) and then call .legend() on that object. The two arguments I passed to .legend() were labels=, to which I assigned a list of the top states, and loc=(.95, .35), which places the legend relative to the overall image. Without precise placement, the legend will overlap the pie. (This is sketched in the last example below.)
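A sketch of steps 3-5, continuing in the same notebook session and assuming the cases DataFrame from step 1 (with 'state' and 'county' columns) plus the state_name_abbrev.csv reference file with a 'name' column:

# step 3: count of unique states
unique_states = cases['state'].unique()
print(len(unique_states))      # pass the unique values to len()
print(unique_states.size)      # .size on the unique values gives the same count

# step 4: 'states' in the data that are not in the 50-state list
states = pd.read_csv('state_name_abbrev.csv')   # 'name' column holds the 50 state names
uniq = pd.Series(unique_states)
print(uniq[~uniq.isin(states['name'])])

# the set-based version described above
print(set(cases['state']).difference(states['name']))

# step 5: count of unique county-state combinations
print(len(cases.groupby(['county', 'state'])))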
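A sketch of steps 6-9 under the same assumptions; the state name used for steps 8-9 is only a placeholder, and in a notebook each chart would typically go in its own cell:

# step 6: total cases by state, sorted highest to lowest
# (selecting the 'cases' column before .sum() keeps the result a Series)
by_state = cases.groupby('state').cases.sum().sort_values(ascending=False)

# step 7: bar chart of the 10 highest states
by_state[:10].plot(kind='bar', rot=45)

# step 8: cases by county for a given state ('Ohio' is just an example)
state = 'Ohio'
by_county = (cases[cases.state == state]
             .groupby('county').cases.sum()
             .sort_values(ascending=False))

# step 9: bar chart of the 10 highest counties in that state
by_county[:10].plot(kind='bar', rot=45)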
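And a sketch of the step 10 pie chart with a legend, reusing by_state from the previous sketch; using the top 10 states for the legend labels is an assumption (the writeup only says "a list of the top states"):

# step 10: pie chart of every state's share, with a legend instead of 50 wedge labels
ax = by_state.plot(kind='pie', figsize=(5, 5), labels=None)
ax.legend(labels=by_state.index[:10], loc=(.95, .35))   # placed to the right so it clears the pie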
11: pie chart showing top 10 states and share of cases among them
This solution is almost identical to the prior one, except that the Series has been shortened to its first 10 items.

12: cases over time for a selected state
Take input for a state name. Slicing only the rows for that state, we need to accomplish two things: a sum of 'cases' values for each date, and then a .cumsum() to create a running sum of cases as the days progress. After grouping by date with .sum() and accessing the .cases column, I called .cumsum() to produce the running sum, then .plot() to produce the line chart. My arguments to .plot() were style='b.' (blue dot markers), rot=45 and a title= with the state name.

13: (extra credit) cases over time for selected states
I am always suspicious anytime I think about looping when I'm working with pandas. However, we do have a list of the user's states, and we do need a .groupby() on a DataFrame that contains only those states. I am pretty sure my solution is neither optimal nor even very pandas-like, but it's what came to mind. I started with an empty DataFrame, to which I intended to concatenate each state's column of aggregated totals by day and a running sum of them. After initializing the empty DataFrame, I looped through the user's states and did the same selection of only the rows for that state, a .groupby() on date to .sum() the county totals by date, and .cumsum() to produce the running sum, just as was done for one state. As above, this returns a Series with a date index and the running sum for each state. I then concatenated each Series to the empty DataFrame, knowing that they would align on date (so any dates missing from a Series would just produce an empty cell). For the plot, a simple df.plot() produces a line for each column of the DataFrame. My arguments to this call were rot=45 and legend=True. I also set style=['b.', 'r--'] (first column blue dots, second column a red dashed line), although this obviously only applies to the first two states of a multi-state comparison; matplotlib would choose the colors for any additional lines. For the chart's title I used the state names, so title=' vs. '.join(user_states), where user_states is the list of states input by the user. (A sketch of this approach follows the input snippet below.) To take user input for multiple values, I asked the user to separate the state names with commas; rather than split the input on a bare comma, though, I chose to use re to split on a comma with optional surrounding whitespace:
import re

ustates = input('please enter states separated by commas: ')
user_states = re.split(r'\s*,\s*', ustates)   # split on a comma plus any surrounding whitespace
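A sketch of the loop-and-concatenate approach from steps 12 and 13, assuming the cases DataFrame from step 1 and the user_states list produced just above; pd.concat() on axis=1 is what "concatenate each Series to the empty DataFrame" looks like in code:

# each state's running total becomes one column of this DataFrame
combined = pd.DataFrame()

for state in user_states:
    # rows for this state only, county totals summed by date, then a running sum
    running = cases[cases.state == state].groupby('date').cases.sum().cumsum()
    running.name = state                                # becomes the column label
    combined = pd.concat([combined, running], axis=1)   # aligns on the date index

# one line per state; the style list only covers the first two columns
combined.plot(rot=45, legend=True, style=['b.', 'r--'],
              title=' vs. '.join(user_states))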
14: (extra credit) scatter plot showing cases by state population
I can discuss this upon request.

15: (extra credit) bar chart showing daily increase in cases for a state
After taking the user's input for a state name, we do a simple filter on the 'state' column to select only the rows for that state. We can reset the index to integers starting at 0 if we wish. Next, group by date and sum the cases column. This produces a Series, which you can convert to a DataFrame with the .to_frame() method. Because the default display shows every date and renders as an unreadable blob, I spaced out the x-axis labels by retrieving the subplot object from the DataFrame .plot() method (with kind='bar'), using .xaxis.set_ticks() to place a tick every 10 values and .set_xticklabels() with every 10th date in the index, plus a rotation and ha= to align the dates with the ticks. The preparation is sketched just below, followed by the plotting calls.
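A minimal sketch of that preparation, assuming the same cases DataFrame; the name daily_increase simply matches the variable used in the plotting calls below. As in the writeup, this sums the 'cases' column by date; if the source data is cumulative, an extra .diff() (not mentioned in the writeup) would be needed to turn totals into true day-over-day increases:

state = input('please enter a state: ')

# rows for the chosen state, summed by date, converted to a one-column DataFrame
daily_increase = (cases[cases.state == state]
                  .groupby('date').cases.sum()
                  .to_frame())
# daily_increase = daily_increase.diff()   # optional: day-over-day change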
# plotting: one bar per date, with a tick every 10 positions and rotated, right-aligned labels
ax = daily_increase.plot(kind='bar');
ax.xaxis.set_ticks(range(0, len(daily_increase.index), 10));
ax.set_xticklabels(daily_increase.index[0::10], rotation=45, ha='right');