pandas is a Python module used for manipulation and analysis of tabular data. It offers:

* Excel-like numeric calculations, particularly column-wise and row-wise calculations (vectorization)
* SQL-like merging, grouping and aggregating
* visualizing (line chart, bar chart, etc.)
* emphasis on:
  - aligning data from multiple sources
  - "slicing and dicing" by rows and columns
  - concatenating and joining
  - cleaning and normalizing missing or incorrect data
  - working with time series
  - categorizing
* ability to read and write CSV, XML, Excel, database queries, etc.

numpy is the data analysis library upon which pandas is built. We sometimes make direct calls to numpy -- to some of its variables (such as np.nan), variable-generating functions (such as np.arange or np.linspace) and some processing functions.
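As a taste of the features listed above, here is a minimal sketch (using a small, invented sales table) of an Excel-like vectorized column calculation and a SQL-like groupby aggregation:

import pandas as pd

# invented sample data, just for illustration
df = pd.DataFrame({'region': ['East', 'West', 'East', 'West'],
                   'units':  [10, 20, 30, 40],
                   'price':  [1.5, 2.0, 2.5, 3.0]})

# column-wise ("vectorized") calculation: no explicit loop
df['revenue'] = df['units'] * df['price']

# SQL-like grouping and aggregating
print(df.groupby('region')['revenue'].sum())
# region
# East     90.0
# West    160.0
# Name: revenue, dtype: float64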
Use the docs for an ongoing study of pandas' rich feature set.
full docs (HTML, pdf)
http://pandas.pydata.org/pandas-docs/stable
"10 minutes to pandas"
https://pandas.pydata.org/pandas-docs/stable/10min.html
pandas cookbook
http://pandas.pydata.org/pandas-docs/stable/cookbook.html
matplotlib official documentation
http://matplotlib.org/api/pyplot_api.html
Online resources are many; when you find one you like, stick with it.
Python for Data Analysis (Wes McKinney, O'Reilly)
http://astronomi.erciyes.edu.tr/wp-content/uploads/astronom/pdf/OReilly%20Python%20for%20Data%20Analysis.pdf
(If the above link goes stale, simply search Python for Data Analysis pdf.)
The Second Edition is available from O'Reilly on Safari Bookshelf:
http://shop.oreilly.com/product/0636920050896.do
Please keep in mind that pandas is in active development, which means that features may be added, removed and changed (latest version at this writing: 0.25.2).
Tom Augspurger blog (6-part series)
http://tomaugspurger.github.io/modern-1.html
Greg Reda blog (3-part series)
http://gregreda.com/2013/10/26/intro-to-pandas-data-structures/
pandas cheat sheet (DataCamp)
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf
An object type that is new to us can be explored through attribute inspection -- we can list the object's attributes with dir() and see brief documentation on an attribute with help().
import pandas as pd
# list of pandas attributes (global vars)
print(dir(pd))
# list of pandas functions (filtering for <class 'function'> only)
import types
print([ attr for attr in dir(pd) if isinstance(getattr(pd, attr), types.FunctionType) ])
# short doc on read_csv() function
help(pd.read_csv) # help on the read_csv function of pandas
# list of Series attributes
s1 = pd.Series()
print(dir(s1))
# list of DataFrame attributes
df = pd.DataFrame()
print(dir(df))
# list of DataFrame methods (filtering for <class 'method'> only)
print([ attr for attr in dir(df)
if isinstance(getattr(df, attr), types.MethodType) ])
# short doc on DataFrame join() method
help(df.join) # help on the join() method of a DataFrame
DataFrame: rows and columns; Series: a single column or single row; Index: column or row labels.
DataFrame:

* is the core pandas structure -- a 2-dimensional array / list of lists
* is like an Excel spreadsheet -- rows and columns with row and column labels
* is like a "dict of dicts" in that columns and rows can be indexed by label
* is like a "list of lists" in that columns and rows can be indexed by integer index
* offers "vectorized" operations (sum rows or columns, modify values across rows, etc.)
* offers database-like and Excel-like manipulations (merge, groupby, pivot table, etc.)
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3],
'b': [10, 20, 30],
'c': [100, 200, 300]},
index=['r1', 'r2', 'r3'])
print(df)
# a b c
# r1 1 10 100
# r2 2 20 200
# r3 3 30 300
print(df['c']['r2']) # 200
Series:

* a sequence of items
* items addressable by integer index ("list-like")
* items addressable by index label ("dict-like")
* has a dtype attribute that holds its items' common type
# read a column as a Series (use DataFrame subscript)
bcol = df['b']
print(bcol)
# r1 10
# r2 20
# r3 30
# Name: b, dtype: int64
# read a row as a Series (use subscript of df.loc[])
oneidx = df.loc['r2']
print(oneidx)
# a 2
# b 20
# c 200
# Name: r2, dtype: int64
# create a Series from scratch
s1 = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s1)
# a 1
# b 3
# c 5
# d 7
# e 9
# dtype: int64
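To illustrate the "list-like" and "dict-like" access described above, a quick sketch using s1 (positional access is done explicitly through .iloc[] to avoid ambiguity with labels):

# dict-like: access by index label
print(s1['c'])        # 5

# list-like: access by integer position
print(s1.iloc[2])     # 5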
Index:

* an object that provides indexing for both the Series (its item index) and the DataFrame (its column or row index)
* is also list-like
columns = df.columns # Index(['a', 'b', 'c'], dtype='object')
idx = df.index # Index(['r1', 'r2', 'r3'], dtype='object')
Like a list, but with item labels... so like a dict too.
Initialize a single series
import pandas as pd
s1 = pd.Series([10, 20, 30], index=['r1', 'r2', 'r3'], name='a')
print(s1)
# r1 10
# r2 20
# r3 30
# Name: a, dtype: int64
s2 = pd.Series(['x', 'y', 'z'], index=['r1', 'r2', 'r3'], name='b')
# combine Series to make a DataFrame
df = pd.concat([s1, s2], axis=1)   # columns 'a' and 'b', index ['r1', 'r2', 'r3']
A Series is a pandas object representing a column or row in a DataFrame.
Every DataFrame column or row is a Series:
df = pd.DataFrame( {'a': [1, 2],
'b': [8, 9] },
index=['r1', 'r2'] )
print(df)
# a b
# r1 1 8
# r2 2 9
# DataFrame string subscript accesses a column
print(df['a']) # r1 1
# r2 2
# Name: a, dtype: int64
print(type(df['a'])) # <class 'pandas.core.series.Series'>
# DataFrame .loc[] indexer accesses the rows
print(df.loc['r1']) # a 1
# b 8
# Name: r1, dtype: int64
print(type(df.loc['r1'])) # <class 'pandas.core.series.Series'>
A Series can also be initialized on its own:
s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])
We can combine Series into DataFrames:
df = pd.DataFrame([s1, s2]) # add Series as rows
df = pd.concat([s1, s2], axis=1) # add Series as columns
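The two approaches produce differently shaped DataFrames; a quick sketch of what each returns for the s1 and s2 above:

print(pd.DataFrame([s1, s2]))       # Series as rows: 2 rows x 4 columns
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8

print(pd.concat([s1, s2], axis=1))  # Series as columns: 4 rows x 2 columns
#    0  1
# 0  1  5
# 1  2  6
# 2  3  7
# 3  4  8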
The DataFrame is the pandas workhorse structure. It is a 2-dimensional structure with columns and rows (i.e., a lot like a spreadsheet).
Rows and columns can be accessed by index labels (dict-like) or by integer indices (list-like).
Initializing
import pandas as pd
# initialize a new, empty DataFrame
df = pd.DataFrame()
# init with dict of lists (keyed to columns) and index
df = pd.DataFrame( {'a': [1, 2, 3],
'b': [1.0, 1.5, 2.0],
'c': ['a', 'b', 'c'] },
index=['r1', 'r2', 'r3'] )
print(df)
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
previewing the DataFrame
print(len(df)) # 3 (# of rows)
print(df.head(2)) # 1st 2 rows
print(df.tail(2)) # last 2 rows
column attribute / subscripting: delivers a Series
sa = df.a # or df['a']
print(sa) # pandas.Series([1, 2, 3], index=['r1', 'r2', 'r3'], name='a')
label and positional access
print(df.loc['r2', 'b']) # 1.5
print(df.iloc[1, 1]) # 1.5
.columns and .index attributes
Columns and rows can be accessed through the DataFrame's attributes:
print(df.columns) # Index(['a', 'b', 'c'], dtype='object')
print(df.index) # Index(['r1', 'r2', 'r3'], dtype='object')
DataFrame as a list of Series objects
Again, any DataFrame's columns or rows can be sliced out as a Series:
# read a column as a Series (use DataFrame subscript)
bcol = df['b']
print(bcol)
# r1 1.0
# r2 1.5
# r3 2.0
# Name: b, dtype: float64
# read a row as a Series (use subscript of df.loc[])
oneidx = df.loc['r2']
print(oneidx)
# a 2
# b 1.5
# c b
# Name: r2, dtype: object
Note: df is a common variable name for pandas DataFrame objects; you will see this name used frequently in these examples.
A DataFrame can be indexed like a list and subscripted like a dictionary.
import pandas as pd
import numpy as np
# initialize a new, empty DataFrame
df = pd.DataFrame()
# initialize a DataFrame with sample data
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a DataFrame, printed
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
DataFrame subscript: column Series
s = df['a']
print(s) # r1 1
# r2 2
# r3 3
# r4 4
# Name: a, dtype: int64
print(type(s)) # Series
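The "indexed like a list" half of the statement above uses the .iloc[] indexer (covered in more detail later); a quick sketch on the same df:

# list-like access by integer position
print(df.iloc[0])      # the first row, as a Series
# a      1
# b    1.0
# c      a
# Name: r1, dtype: object

print(df.iloc[0, 2])   # 'a'  (first row, third column)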
An Index object is used to specify a DataFrame's columns or index, or a Series index.
Columns and Indices
A DataFrame makes use of two Index objects: one to represent the columns, and one to represent the rows.
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'],
'd': [100, 200, 300, 400] },
index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c d
# r1 1 1.0 a 100
# r2 2 1.5 b 200
# r3 3 2.0 c 300
# r4 4 2.5 d 400
print(df.index)    # Index(['r1', 'r2', 'r3', 'r4'], dtype='object')
print(df.columns)  # Index(['a', 'b', 'c', 'd'], dtype='object')
# set name for index and columns
df.index.name = 'year'
df.columns.name = 'state'
s_index = s1.index # Index(['r1', 'r2', 'r3'])
columns = df.columns # Index(['a', 'b', 'c'], dtype='object')
idx = df.index # Index(['r1', 'r2', 'r3'], dtype='object')
.rename() method: columns or index labels can be reset using this DataFrame method.
df = df.rename(columns={'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'},
index={'r1': 'R1', 'r2': 'R2', 'r3': 'R3', 'r4': 'R4'})
print(df)
# A B C D
# R1 1 1.0 a 100
# R2 2 1.5 b 200
# R3 3 2.0 c 300
# R4 4 2.5 d 400
.columns, .index: the columns or index can also be set directly using the DataFrame's attributes (this would have the same effect as above):
df.columns = ['A', 'B', 'C', 'D']
df.index = ['r1', 'r2', 'r3', 'r4']
we can set names for index and column indices:
df.index.name = 'year'
df.columns.name = 'state'
There are a number of "exotic" Index object types:

* Index (the standard, default and most common Index type)
* RangeIndex (index built from an integer range)
* Int64Index, UInt64Index, Float64Index (index values of specific types)
* DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex (datetime-related indices)
* CategoricalIndex (index related to the Categorical type)
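For instance, a minimal sketch creating two of these directly (a RangeIndex and a DatetimeIndex) and using them as Series indices:

ri = pd.RangeIndex(start=0, stop=4)           # RangeIndex(start=0, stop=4, step=1)
di = pd.date_range('2017-01-01', periods=4)   # DatetimeIndex of four daily timestamps

print(pd.Series([10, 20, 30, 40], index=ri))
print(pd.Series([10, 20, 30, 40], index=di))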
An Index can be set with a column or other sequence.
Sometimes data read with pd.read_excel() includes the index labels in its first column. We can easily set the index with .set_index():
print(df)
# 0 a b c d
# 0 r1 1 1.0 a 2016-11-01
# 1 r2 2 1.5 b 2016-12-01
# 2 r3 3 2.0 c 2017-01-01
# 3 r4 4 2.5 d 2018-02-01
df = df.set_index(df[0])        # use the values in column 0 as the new index
df = df[['a', 'b', 'c', 'd']]   # keep only the data columns (drops column 0)
print(df)
# a b c d
# 0
# r1 1 1.0 a 2016-11-01
# r2 2 1.5 b 2016-12-01
# r3 3 2.0 c 2017-01-01
# r4 4 2.5 d 2018-02-01
We can reset the index with .reset_index(), although this makes the index into a new column.
df2 = df.reset_index()
print(df2)
# 0 a b c d
#
# 0 r1 1 1.0 a 2016-11-01
# 1 r2 2 1.5 b 2016-12-01
# 2 r3 3 2.0 c 2017-01-01
# 3 r4 4 2.5 d 2018-02-01
Or, to discard the old index entirely when resetting, we can use drop=True:
df3 = df.reset_index(drop=True)
print(df3)
#    a    b  c           d
# 0  1  1.0  a  2016-11-01
# 1  2  1.5  b  2016-12-01
# 2  3  2.0  c  2017-01-01
# 3  4  2.5  d  2018-02-01
We can also sort the DataFrame by index using .sort_index():
df4 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
index=['Cello', 'Alpha', 'Bow'])
df5 = df4.sort_index()
print(df5)
# a b
# Alpha 2 5
# Bow 3 6
# Cello 1 4
The default is to sort by the row index; axis=1 sorts the columns by their labels instead, and ascending=False reverses the sort order:
df6 = df5.sort_index(axis=1, ascending=False)
print(df6)
# b a
# Alpha 5 2
# Bow 6 3
# Cello 4 1
Note that .sort_values() offers similar options for sorting rows by the values in a specified column (or columns by the values in a specified row).
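A quick sketch of .sort_values() on the same df4, sorting the rows by the values in column 'b' (df7 is just a new variable name for the result):

df7 = df4.sort_values(by='b', ascending=False)
print(df7)
#        a  b
# Bow    3  6
# Alpha  2  5
# Cello  1  4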
.reindex(): we can change the order of the indices and thus the rows.
df5 = df.reindex(df.index[::-1])             # reverse the row order
df5 = df5.reindex(columns=df.columns[::-1])  # reverse the column order
print(df5)
# state D C B A
# year
# R4 400 d 2.5 4
# R3 300 c 2.0 3
# R2 200 b 1.5 2
# R1 100 a 1.0 1
DataFrames behave as you might expect when converted to any Python container
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'b', 'a'] }, index=['r1', 'r2', 'r3', 'r4'] )
print(len(df)) # 4
print(len(df.columns)) # 3
print(max(df['a'])) # 4
print(list(df['a'])) # [1, 2, 3, 4] (column for 'a')
print(list(df.loc['r2'])) # [2, 1.5, 'b'] (row for 'r2')
print(set(df['c'])) # {'b', 'a'} (a set of unique values)
DataFrame .values -- convert to a numpy array
A numpy array is a list-like object; iterating over a 2-dimensional array yields its rows. A simple list comprehension can convert these rows to a list of lists:
print(df.values)
# array([[1, 1.0, 'a'],
# [2, 1.5, 'b'],
# [3, 2.0, 'b'],
# [4, 2.5, 'a']], dtype=object)
lol = [ list(item) for item in df.values ]
print(lol)
# [ [1, 1.0, 'a'],
# [2, 1.5, 'b'],
# [3, 2.0, 'b'],
# [4, 2.5, 'a'] ]
looping - loops through columns
for colname in df:
print('{}: {}'.format(colname, type(df[colname])))
# a: <class 'pandas.core.series.Series'>
# b: <class 'pandas.core.series.Series'>
# c: <class 'pandas.core.series.Series'>
# looping with iterrows -- loops through rows
for row_index, row_series in df.iterrows():
print('{}: {}'.format(row_index, type(row_series)))
# r1: <class 'pandas.core.series.Series'>
# r2: <class 'pandas.core.series.Series'>
# r3: <class 'pandas.core.series.Series'>
# r4: <class 'pandas.core.series.Series'>
Keep in mind, though, that we generally prefer vectorized operations across columns or rows to looping (discussed later); a brief preview follows.
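As that preview, a minimal sketch of two vectorized operations on the same df, each replacing an explicit loop:

# vectorized: operate on a whole column at once, no explicit loop
print(df['a'].sum())    # 10
print(df['a'] * 2)      # a new Series: r1 2, r2 4, r3 6, r4 8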
A DataFrame can be read from and written to CSV, Excel, JSON, relational databases, the clipboard and other formats.
CSV
# read from file
df = pd.read_csv('quarterly_revenue_2017Q4.csv')
# write to file
wfh = open('output.csv', 'w')
df.to_csv(wfh, na_rep='NULL')
# reading from Fama-French file (the abbreviated file, no header)
# sep= indicates the delimiter on which to split the fields (here, a whitespace regex)
# names= supplies the column names
df = pd.read_csv('FF_abbreviated.txt', sep=r'\s+',
                 names=['date', 'MktRF', 'SMB', 'HML', 'RF'])
# reading from Fama-French non-abbreviated (the main file including headers and footers)
# skiprows=5: start reading 5 rows down
df = pd.read_csv('F-F_Research_Data_Factors_daily.txt', skiprows=5, sep=r'\s+',
                 names=['date', 'MktRF', 'SMB', 'HML', 'RF'])
df.to_csv('newfile.csv')
Excel
# reading from excel file to DataFrame
df = pd.read_excel('revenue.xlsx', sheet_name='Sheet1')
# optional: produce a 'reader' object used to obtain sheet names, etc.
xls_file = pd.ExcelFile('data.xls') # produce a file 'reader' object
df = xls_file.parse('Sheet1') # parse a selected sheet
# write to excel
df.to_excel('data2.xls', sheet_name='Sheet1')
JSON
# sample df for demo purposes
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'] }, index=['r1', 'r2', 'r3', 'r4'] )
# write DataFrame to JSON
df.to_json('df.json')
# read it back into a new DataFrame
new_df = pd.read_json('df.json')
Relational Database
import sqlite3 # file-based database format
conn = sqlite3.connect('example.db') # a db connection object
df = pd.read_sql('SELECT this FROM that', conn)
The above can be used with any database connection (MySQL, Oracle, etc.)
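Writing a DataFrame back to a database table works similarly; a minimal sketch using the same connection and an arbitrary table name (if_exists= controls replace/append behavior):

df.to_sql('that', conn, if_exists='replace')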
From Clipboard: this option is excellent for cutting and pasting data from websites
df = pd.read_clipboard(skiprows=5, sep=r'\s+',
                       names=['date', 'MktRF', 'SMB', 'HML', 'RF'])
pandas infers a column type based on values and applies it to the column automatically.
pandas is built on top of numpy, a numeric processing module compiled in C for efficiency. Unlike core Python containers (but similar to a database table), numpy cares about object type. Wherever possible, numpy will assign a type to a column of values and attempt to maintain the type's integrity. This is done for the same reason it is done with database tables: speed and space efficiency. In the DataFrame below, numpy/pandas "sniffs out" the type of each column Series and sets the type most appropriate to its values.
import pandas as pd
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'],
'd': ['2016-11-01', '2016-12-01', '2017-01-01', '2018-02-01'] },
index=['r1', 'r2', 'r3', 'r4'] )
print(df)
# a b c d
# r1 1 1.0 a 2016-11-01
# r2 2 1.5 b 2016-12-01
# r3 3 2.0 c 2017-01-01
# r4 4 2.5 d 2018-02-01
print(df.dtypes)
# a int64 # note special pandas types int64 and float64
# b float64
# c object # 'object' is general-purpose type,
# d object # covers strings or mixed-type columns
# dtype: object
You can set element values in an existing Series (such as a DataFrame column) by label or integer position. However, the new value should be the same type as that defined for the Series; if not, pandas may refuse, or it may upconvert or cast the column to a more general type (usually object, because numpy is focused on the efficiency of numeric and datetime types).
print(df.b.dtype) # float64
df.loc['r1', 'b'] = 'hello'      # assign a string into the float column 'b'
print(df.b.dtype) # object
Note that we never told pandas to store these values as floats. But since they are all floats, pandas decided to set the type.
We can change a dtype for a Series ourselves with .astype():
df.a = df.a.astype('object') # or df['a'] = df['a'].astype('object')
df.loc['r1', 'a'] = 'hello'      # now allowed: column 'a' is type object
The numpy dtypes you are most likely to see are:
int64 float64 datetime64 object
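Column 'd' above was typed as object because its values are strings; a quick sketch (assuming the same df) of converting those strings to datetime64 with pd.to_datetime(), producing a new Series and leaving df unchanged:

d_dt = pd.to_datetime(df['d'])
print(d_dt.dtype)    # datetime64[ns]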
Checking the memory usage of a DataFrame
.info() provides approximate memory size of a DataFrame
df.info() # on the original example at the top
# <class 'pandas.core.frame.DataFrame'>
# Index: 4 entries, r1 to r4
# Data columns (total 4 columns):
# a 4 non-null int64
# b 4 non-null float64
# c 4 non-null object
# d 4 non-null object
# dtypes: float64(1), int64(1), object(2)
# memory usage: 160.0+ bytes
The '+' means "probably larger" -- .info() only sizes numeric types, not 'object'.
With memory_usage='deep', the reported size includes 'object' columns as well:
df.info(memory_usage='deep')
# memory usage: 832 bytes
Use a subscript (or attribute) to access columns by label; use the .loc[] or .iloc[] attributes to access rows by label or integer index.
a DataFrame:
df = pd.DataFrame( {'a': [1, 2, 3, 4],
'b': [1.0, 1.5, 2.0, 2.5],
'c': ['a', 'b', 'c', 'd'],
'd': [100, 200, 300, 400] },
index=['r1', 'r2', 'r3', 'r4'] )
access column as Series:
cola = df['a'] # Series with [1, 2, 3, 4] and index ['r1', 'r2', 'r3', 'r4']
cola = df.a # same -- can often use attribute labels for column name
print(cola)
# r1 1
# r2 2
# r3 3
# r4 4
# Name: a, dtype: int64
access row as Series using index label 'r2':
row2 = df.loc['r2'] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd']
access row as Series using integer index:
row2 = df.iloc[1] # Series [2, 1.5, 'b', 200] and index ['a', 'b', 'c', 'd'] (same as above)
print(row2)
# a 2
# b 1.5
# c b
# d 200
# Name: r2, dtype: object
(Note that the .ix DataFrame indexer is a legacy feature and is deprecated.)
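The mixed label/position lookups that .ix used to allow can be done with .loc[] and .iloc[] by converting between labels and positions; a minimal sketch on the df above:

# label-based row, positional column: convert the position to a label
print(df.loc['r2', df.columns[1]])          # 1.5

# positional row, label-based column: convert the label to a position
print(df.iloc[1, df.columns.get_loc('b')])  # 1.5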
Some DataFrame operations provide the inplace=True option.
Keep in mind that many operations return a new DataFrame copy; if you are working with a large dataset, inplace=True lets you avoid allocating memory for that copy by modifying the existing DataFrame instead.
import pandas as pd
df = pd.DataFrame({ 'a': [1, 2, 3, 4],
                    'b': [1.0, 1.5, 2.0, 2.5],
                    'c': ['a', 'b', 'c', 'd'] },
                  index=['r1', 'r2', 'r3', 'r4'])
print(df)
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
df2 = df.set_index('a')
print(df2) # new dataframe
# b c
# a
# 1 1.0 a
# 2 1.5 b
# 3 2.0 c
# 4 2.5 d
print(df) # unchanged
# a b c
# r1 1 1.0 a
# r2 2 1.5 b
# r3 3 2.0 c
# r4 4 2.5 d
df.set_index('a', inplace=True)
print(df)
# b c
# a
# 1 1.0 a
# 2 1.5 b
# 3 2.0 c
# 4 2.5 d