Python 3home |
Tables consist of records (rows) and fields (column values).
Tabular data may come in many forms, but the most common are CSV (comma-separated values) files and RDBMS (relational / SQL databases). In this session, we'll look at CSV. CSV files are plaintext, meaning they consist solely of text characters. In order to be structured, CSV files must include delimiters, or characters the indicate the start and end of each data value. The row delimiter is usually the comma, but may be any character, or even even be one or more spaces.
comma-separated values file (CSV)
Date,Mkt-RF,SMB,HML,RF 19260701,0.09,0.22,0.30,0.009 19260702,0.44,0.35,0.08,0.009 19270103,0.97,0.21,0.24,0.010 19270104,0.30,0.15,0.73,0.010 19280103,0.43,0.90,0.20,0.010 19280104,0.14,0.47,0.01,0.010
space-separated values file
Date Mkt-RF SMB HML 19260701 0.09 0.22 0.30 0.009 19260702 0.44 0.35 0.08 0.009 19270103 0.97 0.21 0.24 0.010 19270104 0.30 0.15 0.73 0.010 19280103 0.43 0.90 0.20 0.010 19280104 0.14 0.47 0.01 0.010
Our job for this lesson is to parse (separate) these values into usable data. We use the delimiter characters in a CSV to programmatically locate the values.
Here's what a CSV file really looks like under the hood:
19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08, 0.009\n19270103,0.97,0.21,0.24,0.010\n19270104,0.30,0.15, 0.73,0.010\n19280103,0.43,0.90,0.20,0.010\n19280104,0.14, 0.47,0.01,0.010
The newline character (\n) separates the records (or rows) in a CSV file. The row delimeter (in this case, a comma) separates the fields (column values). When displaying a file, your text editor will translate the newlines into a line break, and drop down to the next line. This makes it seem as if each line is separate, but in fact they are a continuous stream of characters, punctuated by the newline and row delimiter characters.
We iterate over each row, parse each row to isolate field values, then add to variables to build a collection of values.
Here we loop through each string line in the file, strip and split the lineto a list, then subscript the list to retrieve a value. Note the value produced by each code line:
filename = '../revenue.csv'
fh = open(filename) # 'file' object
headers = next(fh) # 'company,state,price\n'
mysum = 0.0
for row in fh: # "Haddad's,PA,239.5\n"
row = row.rstrip() # "Haddad's,PA,239.5"
fields = row.split(',') # ["Haddad's", 'PA', '239.5']
row_val = fields[-1] # '239.5'
float_val = float(row_val) # 239.5
mysum = mysum + float_val # 239.5
print(mysum) # 662.010000000002
When writing a line to a file we must include the newline.
Writing free text to a file (must include newlines)
header = ['company', 'state', 'price']
line1 = ['Alpha Corp.', 'NY', '239.5']
line2 = ['Beta Corp.', 'NJ', '101.03']
header_line = ','.join(header)
line1_line = ','.join(line1)
line2_line = ','.join(line2)
wfh = open('new_file.txt', 'w')
wfh.write(header_line + '\n')
wfh.write(line1_line + '\n')
wfh.write(line2_line + '\n')
wfh.close()
We use 'for' to iterate over lines from a file. We can also parse the file as a whole.
for: read one string line at a time
fh = open('pyku.txt') # file object
for my_file_line in fh: # "We're all out of gouda.\n"
print(my_file_line)
fh.close() # close the file
read(): read entire file as a single string
fh = open('pyku.txt') # file object
text = fh.read() # "We're all out of gouda.\nThis
# parrot has ceased to be.\nSpam,
# spam, spam, spam, spam."
fh.close()
read() with str.split(): read entire file as a list of words
fh = open('pyku.txt') # file object
text = fh.read() # "We're all out of gouda.\nThis
# parrot has ceased to be.\nSpam,
# spam, spam, spam, spam."
words = text.split() # ["We're", 'all', 'out', 'of',
# 'gouda.', 'This', 'parrot',
# 'has', 'ceased', 'to', 'be.',
# 'Spam,', 'spam,', 'spam,',
# 'spam,', 'spam.' ]
fh.close() # close the file
print(len(words), ' words in the file')
print(f'first word: {words[0]}') # We're
print(f'last word: {words[-1]}') # spam.
readlines(): read as a list of strings (each string a line)
fh = open('pyku.txt') # file object
file_lines = fh.readlines() # ["We're all out of gouda.\n",
# 'This parrot has ceased to
# be\n', 'Spam, spam, spam,
# spam, spam.\n']
fh.close()
print(file_lines[0]) # "We're all out of gouda.\n"
print(file_lines[-1]) # "Spam, spam, spam, spam, spam."
We can strip, split and slice a string; and we can join strings.
rstrip() a string
xx = 'this is a line with a newline at the end\n'
yy = xx.rstrip() # return a new string without the newline
print(yy) # 'this is a line with a newline at the end'
split() a string into a list of strings
mystr = 'jw234,Joe,Wilson,Smithtown,NJ,2015585894'
elements = mystr.split(',')
print(elements) # ['jw234', 'Joe', 'Wilson',
# 'Smithtown', 'NJ', '2015585894']
# alternative: "multi-target" assignment
# allows us to name each value on a row
stuid, fname, lanme, city, state, stuzip = mystr.split(',')
join() a list of strings into a string
mylist = ['jw234', 'Joe', 'Wilson', 'Smithtown', 'NJ', '2015585894']
line = ','.join(mylist) # 'jw234,Joe,Wilson,Smithtown,NJ,2015585894'
slice a string
mystr = '2014-03-13 15:33:00'
year = mystr[0:4] # '2014'
month = mystr[5:7] # '03'
day = mystr[8:10] # '13'
We can subscript or slice a list.
initialize a list: lists are initalized with square brackets and comma-separated objects.
aa = ['a', 'b', 'c', 3.5, 4.09, 2]
subscript a list: select an item using item index, starting at 0
elements = ['jw234', 'Joe', 'Wilson', 'Smithtown', 'NJ', '2015585894']
var = elements[0] # 'jw234'
var2 = elements[4] # 'NJ'
var3 = elements[-1] # '2015585894' (-1 means last index)
slice a list: select a range of items and optional 'step' for skipping items
aa = ['a', 'b', 'c', 3.5, 4.09, 2, 19]
s1 = aa[2:5] # ['c', 3.5, 4.09] (3rd through 5th item)
s2 = aa[3:] # [3.5, 4.09, 2, 19] (4th item to the end)
s3 = aa[1:6:2] # ['b', 3.5, 2] (every other item)
In a slice, the "upper bound" index is "non-inclusive" meaning we include all items up to but not including the upper bound. If the upper bound is omitted, the slice extends to the end of the list. With a "step" 3rd index, Python will select every 2nd item, every 3rd item, etc.