Python 3

home

pandas: TimeSeries, Binning and Categorizing

TimeSeries: objects and methods

These custom pandas objects provide powerful date calculation and generation.


Timestamp: a single timestamp representing a date/time Timedelta: a date/time interval (like 1 months, 5 days or 2 hours) Period: a particular date span (like 4/1/16 - 4/3/16 or 4Q17) DatetimeIndex: DataFrame or Series Index of Timestamp PeriodIndex: DataFrame or Series Index of Period Timestamp: a single point in time


Timestamp() constructor: creating a Timestamp object from string, ints or datetime():

tstmp = pd.Timestamp('2012-05-01')
tstmp = pd.Timestamp(2012, 5, 1)
tstmp = pd.Timestamp(datetime.datetime(2012, 5, 1))

year  = tstmp.year    # 2012
month = tstmp.month   # 5
day   = tstmp.day     # 1

.to_datetime(): convert a string, list of strings or Series to dates

tseries = pd.to_datetime(['2005/11/23', '2010.12.31'])
    # DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)

tseries = pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))

# using European dates
tstmp = pd.to_datetime('11/12/2010', dayfirst=True)   # 2010-11-12


Timedelta: a time interval


Timedelta() constructor: creating an interval

# strings
td = pd.Timedelta('1 days')                 # Timedelta('1 days 00:00:00')
td =  pd.Timedelta('1 days 00:00:00')       # Timedelta('1 days 00:00:00')
td = pd.Timedelta('1 days 2 hours')         # Timedelta('1 days 02:00:00')
td = pd.Timedelta('-1 days 2 min 3us')      # Timedelta('-2 days +23:57:59.999997')

# negative Timedeltas
td = pd.Timedelta('-1us')                   # Timedelta('-1 days +23:59:59.999999')


# with args similar to datetime.timedelta
# note: these MUST be specified as keyword arguments
td = pd.Timedelta(days=1, seconds=1)        # Timedelta('1 days 00:00:01')


# integers with a unit
td = pd.Timedelta(1, unit='d')              # Timedelta('1 days 00:00:00')


Period: a specific datetime->datetime interval


Period constructor: creating a date-to-date timespan

perimon = pd.Period('2011-01')               # default interval is 'month' (end time is 2011-01-31 23:59:59.999)
periday = pd.Period('2012-05-01', freq='D')  # specify 'daily' (end datetime is 2012-05-01 23:59:99.999)




Filtering / Selecting Dates

Let's start with data as it might come from a CSV file. We've designed the date column to be the DataFrame's index:

import pandas as pd
import numpy as np

df = pd.DataFrame( {'impressions': [9,    10,   8,    3,    7,    12    ],
                    'sales':       [2.03, 2.38, 1.93, 0.63, 1.85, 2.53  ],
                    'clients':     [4,    6,    5,    1,    5,    7     ]  },
                    index=[ '2016-11-15', '2016-12-01', '2016-12-15',
                            '2017-01-01', '2017-01-15', '2017-02-01' ] )

print(df)
                  #             clients  impressions  sales
                  # 2016-11-15        4            9   2.03
                  # 2016-12-01        6           10   2.38
                  # 2016-12-15        5            8   1.93
                  # 2017-01-01        1            3   0.63
                  # 2017-01-15        5            7   1.85
                  # 2017-02-01        7           12   2.53
print(type(df.index[0]))                           # <class 'str'>

Note that the index is listed as string. This would be standard in a read from a plaintext format like CSV (although not from a date-formatted column in Excel)


We can convert the strings to Timestamp with astype():

df.index = df.index.astype(np.datetime64)

print(type(df.index))                      # <class 'pandas.tseries.index.DatetimeIndex'>
print(type(df.index[0]))                   # <class 'pandas.tslib.Timestamp'>

Now the index is a DatetimeIndex (no longer an Index), consisting of Timestamp objects, optimized for date calculation and selection.


Filtering: a Series or DatetimeIndex of np.Timestamp objects, they can be selected or filtered quite easily:

rng = pd.date_range('1/1/2016', periods=24, freq='M')


# all entries from 2016 onward
print(df['2016'])

                  #             clients  impressions  sales
                  # 2016-11-15        4            9   2.03
                  # 2016-12-01        6           10   2.38
                  # 2016-12-15        5            8   1.93


# all entries from Dec. 2016 onward
print(df['2016-12'])

                  #             clients  impressions  sales
                  # 2016-12-01        6           10   2.38
                  # 2016-12-15        5            8   1.93


# all entries from Dec. 10 2016 onward
print(df['2016-12-10':])

                  #             clients  impressions  sales
                  # 2016-12-15        5            8   1.93
                  # 2017-01-01        1            3   0.63
                  # 2017-01-15        5            7   1.85
                  # 2017-02-01        7           12   2.53


# all entries from 12/10/16 - 1/10/17
print(df['2016-12-10': '2017-01-10'])

                  #             clients  impressions  sales
                  # 2016-12-15        5            8   1.93
                  # 2017-01-01        1            3   0.63




Creating, comparing and calculating dates with pd.Timedelta

We add or subtract a Timedelta interval from a Timestamp


Comparing Timestamps

ts1 = pd.Timestamp('2011-07-09 11:30')
ts2 = pd.Timestamp('2011-07-10 11:35')

print(ts1 > ts2)            # False
print(ts1 < ts2)            # True

Computing Timedeltas

td1 = ts2 - ts1
print(td1)                 # 1 days 00:05:00
print((type(td1)))           # <class 'pandas._libs.tslib.Timedelta'>

# values in a Timedelta boil down to days and seconds
print(td.days)              # 1
print(td.seconds)           # 300

ts3 = ts2 + td              # adding 1 day and 5 minutes
print(ts3)                  # Timestamp('2011-07-11 11:40:00')

Creating Timedeltas

# strings
pd.Timedelta('1 days')                 # Timedelta('1 days 00:00:00')

pd.Timedelta('1 days 00:00:00')        # Timedelta('1 days 00:00:00')

pd.Timedelta('1 days 2 hours')         # Timedelta('1 days 02:00:00')

pd.Timedelta('-1 days 2 min 3us')      # Timedelta('-2 days +23:57:59.999997')

# like datetime.timedelta
# note: these MUST be specified as keyword arguments
pd.Timedelta(days=1, seconds=1)        # Timedelta('1 days 00:00:01')

# integers with a unit
pd.Timedelta(1, unit='d')              # Timedelta('1 days 00:00:00')

# from a datetime.timedelta/np.timedelta64
pd.Timedelta(datetime.timedelta(days=1, seconds=1))
                                       # Timedelta('1 days 00:00:01')

pd.Timedelta(np.timedelta64(1, 'ms'))  # Timedelta('0 days 00:00:00.001000')

# negative Timedeltas
pd.Timedelta('-1us')                   # Timedelta('-1 days +23:59:59.999999')




Generating a date range with pd.date_range()

date_range() provides evenly spaced Timestamp objects.


date_range() with a start date, periods= and freq=:

# By default date_range() returns a DatetimeIndex.
# 5 hours starting with midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=5, freq='H')
print(rng)
        # DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
        #                '2011-01-01 02:00:00', '2011-01-01 03:00:00',
        #                '2011-01-01 04:00:00'],
        #                dtype='datetime64[ns]', freq='H')


ts = pd.Series(list(range(0, len(rng))), index=rng)

print(ts)
    # 2011-01-01 00:00:00    0
    # 2011-01-01 01:00:00    1
    # 2011-01-01 02:00:00    2
    # 2011-01-01 03:00:00    3
    # 2011-01-01 04:00:00    4
    # Freq: H, dtype: int64

date_range() with a start date and end date

start = pd.Timestamp('1/1/2011')
end =  pd.Timestamp('1/5/2011')
tindex = pd.date_range(start, end)

print(tindex)
    # DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03',
    #                '2011-01-04', '2011-01-05'])
    #               dtype='datetime64[ns]', length=5, freq='D')

    # note default frequency:  'D' (days)

date_range() with a monthly period, dates are set to end of the month:

tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='M')

date_range() with a monthly period, dates are set to start of the month:

tindex = pd.date_range(start='1/1/1980', end='11/1/1990', freq='MS')

date_range() with a start date, periods and freq

tindex = pd.date_range('1/1/2011', periods=3, freq='W')

print(tindex)
    # DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16'],
    #               dtype='datetime64[ns]', freq='W-SUN')

Note that freq= has defaulted to W-SUN which indicates weekly beginning on Sunday. pandas even adjusted our first day on this basis! We can specify the day of the week ourselves to start on a precise date.


bdate_range() provides a date range that includes "business days" only:

tbindex = pd.bdate_range(start, end)

print(tbindex)
    # DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05'],
    #               dtype='datetime64[ns]', freq='B')

    # (the 1st and 2nd of Jan. 2011 are Saturday and Sunday)

See the offset aliases portion of the documentation.





Comparing dates within intervals with pd.Period

The Period represents an interval with a start date/time


The .end_time attribute value is calulated as the start date/time + freq= value.

# a 'day' period
per = pd.Period('2016-05-03')    # Period('2016-05-03', 'D')

print(per.start_time)            # Timestamp('2016-05-03 00:00:00')
print(per.end_time)              # Timestamp('2016-05-03 23:59:59.999999999')

# a 'month' period
pdfm = pd.Period('2016-05-03', freq='M')

print(pdfm.start_time)           # Timestamp('2016-05-01 00:00:00')

print(pdfm.end_time)             # Timestamp('2016-05-31 23:59:59.999999999')

"frequency" (or freq=) is a bat of a misnomer. It describes the size of the period -- that is, the amount of time it covers. Thus a freq='M' (month) period ends a month later than the start date.


The Period object can be incremented to produce a new Period object. The freq interval determines the start date/time and size of the next Period.

# a 'month' period
pdfm = pd.Period('2016-05-03', freq='M')

pdfm2 = pdfm + 1

print(pdfm2.start_time)           # Timestamp('2016-06-01 00:00:00')
print(pdfm2.end_time)             # Timestamp('2016-06-30 23:59:59.999999999')

period_range(): produce a range of Period objects


ps = pd.Series(list(range(12)), pd.period_range('1/2017', '12/2017', freq='M'))

print(ps)
    # 2017-01     0
    # 2017-02     1
    # 2017-03     2
    # 2017-04     3
    # 2017-05     4
    # 2017-06     5
    # 2017-07     6
    # 2017-08     7
    # 2017-09     8
    # 2017-10     9
    # 2017-11    10
    # 2017-12    11
    # Freq: M, dtype: int64

Above we have an index of Period objects; each period represents a monthly interval.


This differs from TimeStamp in that a comparison or selection (such as a slice) will include any value that falls within the requested period, even if the date range is partial:

print(ps['2017-03-15': '2017-06-15'])

    # 2017-03    2
    # 2017-04    3
    # 2017-05    4
    # 2017-06    5
    # Freq: M, dtype: int64

Note that both 03 and 06 were included in the results, because the slice fell between their ranges.


Quarterly Period Range

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-JAN')

sq = pd.Series(range(0, len(prng)), prng)
print(sq)

    # 1990Q1    0
    # 1990Q2    1
    # 1990Q3    2
    # 1990Q4    3
    # 1991Q1    4
    # 1991Q2    5
    # 1991Q3    6
    # 1991Q4    7
    # Freq: Q-JAN, dtype: int64

sq[pd.Timestamp('1990-02-13')]    # 4




Binning

Dividing values into bins based on a category scheme


Bins allow us to categorize values (often dates) into "bins" which are mapped to a value to be applied. Consider the table below, which might come from an Excel spreadsheet:


dfbin = pd.DataFrame({'start_date': [1, 6, 11, 16],
                      'end_date': [5, 10, 15, 20],
                      'percent': [1, 2, 3, 10]})

# order the columns
dfbin = dfbin[['start_date', 'end_date', 'percent']]

print(dfbin)
         #    start_date  end_date  percent
         # 0           1         5        1
         # 1           6        10        2
         # 2          11        15        3
         # 3          16        20       10

Any date from 1-5 should key to 1%; any from 6-10, 2%, etc.


We have data that needs to be categorized into the above bins:

data = pd.DataFrame({'period': range(1, 21)})

print(data)
         #         period
         #     0        1
         #     1        2
         #     2        3
         #     3        4
         #     4        5
         #     5        6
         #     6        7
         #     7        8
         #     8        9
         #     9       10
         #     10      11
         #     11      12
         #     12      13
         #     13      14
         #     14      15
         #     15      16
         #     16      17
         #     17      18
         #     18      19
         #     19      20


print(dfbin)
         #    start_date  end_date  percent
         # 0           1         5        1
         # 1           6        10        2
         # 2          11        15        3
         # 3          16        20       10

# converting the 'start_date' field into a list
bins = list(dfbin['start_date'])

# adding the last 'end_date' value to the end
bins.append(dfbin.loc[len(dfbin)-1, 'end_date']+1)

# category labels (which can be strings, but here are integers)
cats = list(range(1, len(bins)))

print(bins)
print(cats)
         # [1, 6, 11, 16, 21]
         # [1, 2, 3, 4, 5]

The cut function takes the data, bins and labels and sorts them by bin value:

# 'right=False' keeps bins from overlapping (the bin does not include the rightmost edge)
data['cat'] = pd.cut(data['period'], bins, labels=cats, right=False)
print(data)

         #         period cat
         #     0        1   1
         #     1        2   1
         #     2        3   1
         #     3        4   1
         #     4        5   1
         #     5        6   2
         #     6        7   2
         #     7        8   2
         #     8        9   2
         #     9       10   2
         #     10      11   3
         #     11      12   3
         #     12      13   3
         #     13      14   3
         #     14      15   3
         #     15      16   4
         #     16      17   4
         #     17      18   4
         #     18      19   4
         #     19      20   4

We are now free to use the bin mapping to apply the proper pct value to each row.





[pr]