Python 3

home

CSV Parsing Using Built-In Python

Table Data: Rows and Fields

Tables consist of records (rows) and fields (column values).

Tabular data may come in many forms, but the most common are CSV (comma-separated values) files and RDBMS (relational / SQL databases). In this session, we'll look at CSV. CSV files are plaintext, meaning they consist solely of text characters. In order to be structured, CSV files must include delimiters, or characters the indicate the start and end of each data value. The row delimiter is usually the comma, but may be any character, or even even be one or more spaces.

comma-separated values file (CSV)

    Date,Mkt-RF,SMB,HML,RF
    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

space-separated values file

    Date        Mkt-RF  SMB     HML
    19260701    0.09    0.22    0.30   0.009
    19260702    0.44    0.35    0.08   0.009
    19270103    0.97    0.21    0.24   0.010
    19270104    0.30    0.15    0.73   0.010
    19280103    0.43    0.90    0.20   0.010
    19280104    0.14    0.47    0.01   0.010

Our job for this lesson is to parse (separate) these values into usable data. We use the delimiter characters in a CSV to programmatically locate the values.

Here's what a CSV file really looks like under the hood:

19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,
0.009\n19270103,0.97,0.21,0.24,0.010\n19270104,0.30,0.15,
0.73,0.010\n19280103,0.43,0.90,0.20,0.010\n19280104,0.14,
0.47,0.01,0.010

The newline character (\n) separates the records (or rows) in a CSV file. The row delimeter (in this case, a comma) separates the fields (column values). When displaying a file, your text editor will translate the newlines into a line break, and drop down to the next line. This makes it seem as if each line is separate, but in fact they are a continuous stream of characters, punctuated by the newline and row delimiter characters.

Tabular Data: Looping, Parsing and Summarizing

We iterate over each row, parse each row to isolate field values, then add to variables to build a collection of values.

Here we loop through each string line in the file, strip and split the lineto a list, then subscript the list to retrieve a value. Note the value produced by each code line:

filename = '../revenue.csv'

fh = open(filename)                # 'file' object

headers = next(fh)                 # 'company,state,price\n'

mysum = 0.0

for row in fh:                     # "Haddad's,PA,239.5\n"
    row = row.rstrip()             # "Haddad's,PA,239.5"
    fields = row.split(',')        # ["Haddad's", 'PA', '239.5']
    row_val = fields[-1]           # '239.5'
    float_val = float(row_val)     # 239.5
    mysum = mysum + float_val      # 239.5

print(mysum)                       # 662.010000000002

File Writing

When writing a line to a file we must include the newline.

Writing free text to a file (must include newlines)

header = ['company', 'state', 'price']
line1 = ['Alpha Corp.', 'NY', '239.5']
line2 = ['Beta Corp.', 'NJ', '101.03']

header_line = ','.join(header)
line1_line = ','.join(line1)
line2_line = ','.join(line2)

wfh = open('new_file.txt', 'w')

wfh.write(header_line + '\n')
wfh.write(line1_line + '\n')
wfh.write(line2_line + '\n')

wfh.close()

Methods Used in File Parsing: File Object

We use 'for' to iterate over lines from a file. We can also parse the file as a whole.

for: read one string line at a time

fh = open('pyku.txt')      # file object

for my_file_line in fh:    # "We're all out of gouda.\n"
    print(my_file_line)

fh.close()                 # close the file

read(): read entire file as a single string

fh = open('pyku.txt')     # file object

text = fh.read()          # "We're all out of gouda.\nThis
                          # parrot has ceased to be.\nSpam,
                          # spam, spam, spam, spam."

fh.close()

read() with str.split(): read entire file as a list of words

fh = open('pyku.txt')      # file object

text = fh.read()           # "We're all out of gouda.\nThis
                           #  parrot has ceased to be.\nSpam,
                           #  spam, spam, spam, spam."

words = text.split()       # ["We're", 'all', 'out', 'of',
                           #  'gouda.', 'This', 'parrot',
                           #  'has', 'ceased', 'to', 'be.',
                           #  'Spam,', 'spam,', 'spam,',
                           #  'spam,', 'spam.' ]

fh.close()                                # close the file

print(len(words), ' words in the file')

print(f'first word: {words[0]}')          # We're
print(f'last word: {words[-1]}')          # spam.

readlines(): read as a list of strings (each string a line)

fh = open('pyku.txt')         # file object

file_lines = fh.readlines()   # ["We're all out of gouda.\n",
                              #  'This parrot has ceased to
                              #   be\n', 'Spam, spam, spam,
                              #   spam, spam.\n']

fh.close()

print(file_lines[0])           # "We're all out of gouda.\n"
print(file_lines[-1])          # "Spam, spam, spam, spam, spam."

Methods Used in File Parsing: String Object

We can strip, split and slice a string; and we can join strings.

rstrip() a string

xx = 'this is a line with a newline at the end\n'

yy = xx.rstrip()                 # return a new string without the newline

print(yy)                        # 'this is a line with a newline at the end'

split() a string into a list of strings

mystr = 'jw234,Joe,Wilson,Smithtown,NJ,2015585894'
elements = mystr.split(',')
print(elements)                           # ['jw234', 'Joe', 'Wilson',
                                          #  'Smithtown', 'NJ', '2015585894']


# alternative:  "multi-target" assignment
# allows us to name each value on a row
stuid, fname, lanme, city, state, stuzip = mystr.split(',')

join() a list of strings into a string

mylist = ['jw234', 'Joe', 'Wilson', 'Smithtown', 'NJ', '2015585894']

line = ','.join(mylist)          # 'jw234,Joe,Wilson,Smithtown,NJ,2015585894'

slice a string

mystr = '2014-03-13 15:33:00'
year =  mystr[0:4]               # '2014'
month = mystr[5:7]               # '03'
day =   mystr[8:10]              # '13'

Methods Used in File Parsing: List Object

We can subscript or slice a list.

initialize a list: lists are initalized with square brackets and comma-separated objects.

aa = ['a', 'b', 'c', 3.5, 4.09, 2]

subscript a list: select an item using item index, starting at 0

elements = ['jw234', 'Joe', 'Wilson', 'Smithtown', 'NJ', '2015585894']

var = elements[0]               # 'jw234'
var2 = elements[4]              # 'NJ'
var3 = elements[-1]             # '2015585894' (-1 means last index)

slice a list: select a range of items and optional 'step' for skipping items

aa = ['a', 'b', 'c', 3.5, 4.09, 2, 19]

s1 = aa[2:5]       # ['c', 3.5, 4.09]  (3rd through 5th item)

s2 = aa[3:]        # [3.5, 4.09, 2, 19]  (4th item to the end)

s3 = aa[1:6:2]     # ['b', 3.5, 2]  (every other item)

In a slice, the "upper bound" index is "non-inclusive" meaning we include all items up to but not including the upper bound. If the upper bound is omitted, the slice extends to the end of the list. With a "step" 3rd index, Python will select every 2nd item, every 3rd item, etc.

[pr]