These custom pandas objects provide powerful date calculation and generation.
Timestamp: a single timestamp representing a date/time
Timedelta: a date/time interval (like 1 month, 5 days or 2 hours)
Period: a particular date span (like 4/1/16 - 4/3/16 or 4Q17)
DatetimeIndex: a DataFrame or Series Index of Timestamp
PeriodIndex: a DataFrame or Series Index of Period
Timestamp: a single point in time
Timestamp() constructor: creating a Timestamp object from a string, ints or a datetime.datetime object:
import datetime
tstmp = pd.Timestamp('2012-05-01')
tstmp = pd.Timestamp(2012, 5, 1)
tstmp = pd.Timestamp(datetime.datetime(2012, 5, 1))
year = tstmp.year # 2012
month = tstmp.month # 5
day = tstmp.day # 1
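Beyond .year, .month and .day, a Timestamp exposes the usual datetime attributes and helper methods. A brief sketch (day_name() is available in pandas 0.23+):
print(tstmp.dayofweek) # 1 (Monday == 0, so May 1, 2012 is a Tuesday)
print(tstmp.day_name()) # 'Tuesday'
print(tstmp.is_month_start) # True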
pd.to_datetime(): convert a string, list of strings or Series to dates
tseries = pd.to_datetime(['2005/11/23', '2010.12.31'])
# DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)
tseries = pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
# using European (day-first) dates
tstmp = pd.to_datetime('11/12/2010', dayfirst=True) # 2010-12-11
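By default to_datetime() raises an error on an unparseable string; passing errors='coerce' converts such values to NaT instead (a standard option):
tseries = pd.to_datetime(pd.Series(['2010-01-10', 'not a date']), errors='coerce')
# 0   2010-01-10
# 1          NaT
# dtype: datetime64[ns]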
Timedelta: a time interval
Timedelta() constructor: creating an interval
# strings
td = pd.Timedelta('1 days') # Timedelta('1 days 00:00:00')
td = pd.Timedelta('1 days 00:00:00') # Timedelta('1 days 00:00:00')
td = pd.Timedelta('1 days 2 hours') # Timedelta('1 days 02:00:00')
td = pd.Timedelta('-1 days 2 min 3us') # Timedelta('-2 days +23:57:59.999997')
# negative Timedeltas
td = pd.Timedelta('-1us') # Timedelta('-1 days +23:59:59.999999')
# with args similar to datetime.timedelta
# note: these MUST be specified as keyword arguments
td = pd.Timedelta(days=1, seconds=1) # Timedelta('1 days 00:00:01')
# integers with a unit
td = pd.Timedelta(1, unit='d') # Timedelta('1 days 00:00:00')
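There is also a pd.to_timedelta() converter, parallel to pd.to_datetime(), that accepts a string, list of strings or Series:
tdindex = pd.to_timedelta(['1 days', '2 hours', '3 min'])
# TimedeltaIndex(['1 days 00:00:00', '0 days 02:00:00', '0 days 00:03:00'],
#                dtype='timedelta64[ns]', freq=None)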
Period: a specific datetime->datetime interval
Period constructor: creating a date-to-date timespan
perimon = pd.Period('2011-01') # freq inferred from the string as 'M' (end time is 2011-01-31 23:59:59.999)
periday = pd.Period('2012-05-01', freq='D') # specify 'daily' (end datetime is 2012-05-01 23:59:59.999)
Let's start with data as it might come from a CSV file. We've designed the date column to be the DataFrame's index:
import pandas as pd
import numpy as np
df = pd.DataFrame( {'impressions': [9, 10, 8, 3, 7, 12 ],
'sales': [2.03, 2.38, 1.93, 0.63, 1.85, 2.53 ],
'clients': [4, 6, 5, 1, 5, 7 ] },
index=[ '2016-11-15', '2016-12-01', '2016-12-15',
'2017-01-01', '2017-01-15', '2017-02-01' ] )
print(df)
# clients impressions sales
# 2016-11-15 4 9 2.03
# 2016-12-01 6 10 2.38
# 2016-12-15 5 8 1.93
# 2017-01-01 1 3 0.63
# 2017-01-15 5 7 1.85
# 2017-02-01 7 12 2.53
print(type(df.index[0])) # <class 'str'>
Note that the index values are strings. This is standard when reading from a plaintext format like CSV (although not from a date-formatted column in Excel).
We can convert the strings to Timestamp with astype():
df.index = df.index.astype('datetime64[ns]')
print(type(df.index)) # <class 'pandas.tseries.index.DatetimeIndex'>
print(type(df.index[0])) # <class 'pandas.tslib.Timestamp'>
Now the index is a DatetimeIndex (no longer an Index), consisting of Timestamp objects, optimized for date calculation and selection.
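As an aside, when reading directly from a CSV file the conversion can be requested at read time; a minimal sketch, assuming a hypothetical file sales.csv with the same layout:
# parse_dates=True asks pandas to parse the index column into a DatetimeIndex
dfcsv = pd.read_csv('sales.csv', index_col=0, parse_dates=True)
print(type(dfcsv.index)) # DatetimeIndex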
Filtering: with a DatetimeIndex of Timestamp objects, rows can be selected or filtered quite easily:
# all entries in 2016
print(df.loc['2016'])
# clients impressions sales
# 2016-11-15 4 9 2.03
# 2016-12-01 6 10 2.38
# 2016-12-15 5 8 1.93
# all entries in Dec. 2016
print(df.loc['2016-12'])
# clients impressions sales
# 2016-12-01 6 10 2.38
# 2016-12-15 5 8 1.93
# all entries from Dec. 10 2016 onward
print(df['2016-12-10':])
# clients impressions sales
# 2016-12-15 5 8 1.93
# 2017-01-01 1 3 0.63
# 2017-01-15 5 7 1.85
# 2017-02-01 7 12 2.53
# all entries from 12/10/16 - 1/10/17
print(df['2016-12-10': '2017-01-10'])
# clients impressions sales
# 2016-12-15 5 8 1.93
# 2017-01-01 1 3 0.63
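The DatetimeIndex also exposes component attributes (.year, .month, .dayofweek, etc.) that can be combined into boolean filters; a short sketch using the df above:
# all rows whose date falls in December (of any year)
print(df[df.index.month == 12])
# clients impressions sales
# 2016-12-01 6 10 2.38
# 2016-12-15 5 8 1.93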
We can also compare Timestamps, and add or subtract a Timedelta interval to or from a Timestamp.
Comparing Timestamps
ts1 = pd.Timestamp('2011-07-09 11:30')
ts2 = pd.Timestamp('2011-07-10 11:35')
print(ts1 > ts2) # False
print(ts1 < ts2) # True
Computing Timedeltas
td1 = ts2 - ts1
print(td1) # 1 days 00:05:00
print(type(td1)) # <class 'pandas._libs.tslib.Timedelta'>
# values in a Timedelta boil down to days and seconds
print(td1.days) # 1
print(td1.seconds) # 300
ts3 = ts2 + td1 # adding 1 day and 5 minutes
print(ts3) # Timestamp('2011-07-11 11:40:00')
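Timedelta arithmetic also works element-wise against a whole DatetimeIndex; a quick sketch shifting the index of the df built earlier by one week:
shifted = df.index + pd.Timedelta(days=7)
print(shifted[:2])
# DatetimeIndex(['2016-11-22', '2016-12-08'], dtype='datetime64[ns]', freq=None)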
Creating Timedeltas
# strings
pd.Timedelta('1 days') # Timedelta('1 days 00:00:00')
pd.Timedelta('1 days 00:00:00') # Timedelta('1 days 00:00:00')
pd.Timedelta('1 days 2 hours') # Timedelta('1 days 02:00:00')
pd.Timedelta('-1 days 2 min 3us') # Timedelta('-2 days +23:57:59.999997')
# like datetime.timedelta
# note: these MUST be specified as keyword arguments
pd.Timedelta(days=1, seconds=1) # Timedelta('1 days 00:00:01')
# integers with a unit
pd.Timedelta(1, unit='d') # Timedelta('1 days 00:00:00')
# from a datetime.timedelta/np.timedelta64
pd.Timedelta(datetime.timedelta(days=1, seconds=1))
# Timedelta('1 days 00:00:01')
pd.Timedelta(np.timedelta64(1, 'ms')) # Timedelta('0 days 00:00:00.001000')
# negative Timedeltas
pd.Timedelta('-1us') # Timedelta('-1 days +23:59:59.999999')
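A Timedelta can also be reduced to a single number of seconds with .total_seconds(), a method it shares with datetime.timedelta:
td = pd.Timedelta(days=1, seconds=1)
print(td.total_seconds()) # 86401.0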
date_range() provides evenly spaced Timestamp objects.
date_range() with a start date, periods= and freq=:
# By default date_range() returns a DatetimeIndex.
# 5 hours starting with midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=5, freq='H')
print(rng)
# DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
# '2011-01-01 02:00:00', '2011-01-01 03:00:00',
# '2011-01-01 04:00:00'],
# dtype='datetime64[ns]', freq='H')
ts = pd.Series(list(range(0, len(rng))), index=rng)
print(ts)
# 2011-01-01 00:00:00 0
# 2011-01-01 01:00:00 1
# 2011-01-01 02:00:00 2
# 2011-01-01 03:00:00 3
# 2011-01-01 04:00:00 4
# Freq: H, dtype: int64
date_range() with a start date and end date
start = pd.Timestamp('1/1/2011')
end = pd.Timestamp('1/5/2011')
tindex = pd.date_range(start, end)
print(tindex)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03',
#                '2011-01-04', '2011-01-05'],
#               dtype='datetime64[ns]', freq='D')
# note default frequency: 'D' (days)
date_range() with a monthly frequency ('M'); dates are set to the end of the month:
tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='M')
date_range() with a month-start frequency ('MS'); dates are set to the start of the month:
tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='MS')
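To see the difference, compare the first few dates generated by each frequency:
print(pd.date_range(start='1/1/1980', periods=3, freq='M'))
# DatetimeIndex(['1980-01-31', '1980-02-29', '1980-03-31'],
#               dtype='datetime64[ns]', freq='M')
print(pd.date_range(start='1/1/1980', periods=3, freq='MS'))
# DatetimeIndex(['1980-01-01', '1980-02-01', '1980-03-01'],
#               dtype='datetime64[ns]', freq='MS')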
date_range() with a start date, periods= and a weekly freq=:
tindex = pd.date_range('1/1/2011', periods=3, freq='W')
print(tindex)
# DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16'],
# dtype='datetime64[ns]', freq='W-SUN')
Note that freq= has defaulted to 'W-SUN', which indicates a weekly frequency anchored on Sunday; pandas even adjusted our first date forward to the next Sunday on this basis! We can specify the anchor day of the week ourselves to start on a precise date, as shown below.
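For example, anchoring the weekly frequency on Saturday keeps our original start date (Jan. 1, 2011 was a Saturday):
tindex = pd.date_range('1/1/2011', periods=3, freq='W-SAT')
print(tindex)
# DatetimeIndex(['2011-01-01', '2011-01-08', '2011-01-15'],
#               dtype='datetime64[ns]', freq='W-SAT')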
bdate_range() provides a date range that includes "business days" only:
tbindex = pd.bdate_range(start, end)
print(tbindex)
# DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05'],
# dtype='datetime64[ns]', freq='B')
# (the 1st and 2nd of Jan. 2011 are Saturday and Sunday)
See the offset aliases portion of the documentation.
The Period represents an interval with a start date/time. The .end_time attribute value is calculated as the start date/time plus the freq= interval.
# a 'day' period
per = pd.Period('2016-05-03') # Period('2016-05-03', 'D')
print(per.start_time) # Timestamp('2016-05-03 00:00:00')
print(per.end_time) # Timestamp('2016-05-03 23:59:59.999999999')
# a 'month' period
pdfm = pd.Period('2016-05-03', freq='M')
print(pdfm.start_time) # Timestamp('2016-05-01 00:00:00')
print(pdfm.end_time) # Timestamp('2016-05-31 23:59:59.999999999')
"frequency" (or freq=) is a bat of a misnomer. It describes the size of the period -- that is, the amount of time it covers. Thus a freq='M' (month) period ends a month later than the start date.
The Period object can be incremented to produce a new Period object. The freq interval determines the start date/time and size of the next Period.
# a 'month' period
pdfm = pd.Period('2016-05-03', freq='M')
pdfm2 = pdfm + 1
print(pdfm2.start_time) # Timestamp('2016-06-01 00:00:00')
print(pdfm2.end_time) # Timestamp('2016-06-30 23:59:59.999999999')
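The same increment applied to a 'day' period moves forward one day rather than one month:
per = pd.Period('2016-05-03') # Period('2016-05-03', 'D')
per2 = per + 1 # Period('2016-05-04', 'D')
print(per2.start_time) # Timestamp('2016-05-04 00:00:00')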
period_range(): produce a range of Period objects
ps = pd.Series(list(range(12)), pd.period_range('1/2017', '12/2017', freq='M'))
print(ps)
# 2017-01 0
# 2017-02 1
# 2017-03 2
# 2017-04 3
# 2017-05 4
# 2017-06 5
# 2017-07 6
# 2017-08 7
# 2017-09 8
# 2017-10 9
# 2017-11 10
# 2017-12 11
# Freq: M, dtype: int64
Above we have an index of Period objects; each period represents a monthly interval.
This differs from a Timestamp index in that a comparison or selection (such as a slice) will include any Period that the requested dates fall within, even if the overlap is only partial:
print(ps['2017-03-15': '2017-06-15'])
# 2017-03 2
# 2017-04 3
# 2017-05 4
# 2017-06 5
# Freq: M, dtype: int64
Note that both 03 and 06 were included in the results, because the slice endpoints fall within those periods.
Quarterly Period Range
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-JAN')
sq = pd.Series(range(0, len(prng)), prng)
print(sq)
# 1990Q1 0
# 1990Q2 1
# 1990Q3 2
# 1990Q4 3
# 1991Q1 4
# 1991Q2 5
# 1991Q3 6
# 1991Q4 7
# ... (continues through 2000Q4)
# Freq: Q-JAN, dtype: int64
sq[pd.Timestamp('1990-02-13')] # 4 (Feb. 13, 1990 falls within period 1991Q1)
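If needed, a Period-indexed Series can be converted to a Timestamp index (and back) with .to_timestamp() and .to_period():
sqts = sq.to_timestamp() # index becomes start-of-period Timestamps
print(type(sqts.index)) # DatetimeIndex
sqp = sqts.to_period('Q-JAN') # back to a PeriodIndex
print(type(sqp.index)) # PeriodIndex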
Dividing values into bins based on a category scheme
Bins allow us to categorize values (often dates) into ranges ("bins"), each of which maps to a value to be applied. Consider the table below, which might come from an Excel spreadsheet:
dfbin = pd.DataFrame({'start_date': [1, 6, 11, 16],
'end_date': [5, 10, 15, 20],
'percent': [1, 2, 3, 10]})
# order the columns
dfbin = dfbin[['start_date', 'end_date', 'percent']]
print(dfbin)
# start_date end_date percent
# 0 1 5 1
# 1 6 10 2
# 2 11 15 3
# 3 16 20 10
Any date from 1-5 should key to 1%; any from 6-10, 2%, etc.
We have data that needs to be categorized into the above bins:
data = pd.DataFrame({'period': range(1, 21)})
print(data)
# period
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# 5 6
# 6 7
# 7 8
# 8 9
# 9 10
# 10 11
# 11 12
# 12 13
# 13 14
# 14 15
# 15 16
# 16 17
# 17 18
# 18 19
# 19 20
print(dfbin)
# start_date end_date percent
# 0 1 5 1
# 1 6 10 2
# 2 11 15 3
# 3 16 20 10
# converting the 'start_date' field into a list
bins = list(dfbin['start_date'])
# adding the last 'end_date' value + 1 as the final (exclusive) bin edge
bins.append(dfbin.loc[len(dfbin)-1, 'end_date']+1)
# category labels (which can be strings, but here are integers)
cats = list(range(1, len(bins)))
print(bins)
print(cats)
# [1, 6, 11, 16, 21]
# [1, 2, 3, 4]
The cut function takes the data, bins and labels and assigns each value to a bin:
# 'right=False' makes each bin include its left edge and exclude its right edge
data['cat'] = pd.cut(data['period'], bins, labels=cats, right=False)
print(data)
# period cat
# 0 1 1
# 1 2 1
# 2 3 1
# 3 4 1
# 4 5 1
# 5 6 2
# 6 7 2
# 7 8 2
# 8 9 2
# 9 10 2
# 10 11 3
# 11 12 3
# 12 13 3
# 13 14 3
# 14 15 3
# 15 16 4
# 16 17 4
# 17 18 4
# 18 19 4
# 19 20 4
We are now free to use the bin mapping to apply the proper percent value to each row, as sketched below.
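One way to finish the job is to map each bin label back to its percent value (a sketch; the labels 1-4 line up with the rows of dfbin):
pct_map = dict(zip(cats, dfbin['percent'])) # {1: 1, 2: 2, 3: 3, 4: 10}
data['percent'] = data['cat'].astype(int).map(pct_map)
print(data.tail(3))
#     period cat  percent
# 17      18   4       10
# 18      19   4       10
# 19      20   4       10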