Python 3home |
An integer value set to 0 before a 'for' loop over a file can be used to count lines in the file.
This algorithm uses a basic 'for' loop counter, which you can apply to any 'for' loop if you want to count the iterations. For a file, counting iterations means counting lines from the file. We might use this pattern to count lines that have a particular value, or only those lines that have data.
the basic pattern is as follows:
set counter to integer zero loop over lines in the file increment counter print or use the count integer
This example counts the data lines - first we advance past the header line to the 2nd line, then begin counting:
fh = open('revenue.csv')
header_line = next(fh) # advance the file pointer past
# the first line (only if needed)
line_counter = 0
for line in fh:
line_counter = line_counter + 1 # 1 (first value)
fh.close()
print(f'{line_counter} lines') # '7 lines'
Of course if we're only interested in knowing how many lines are in the file, we can also read the file as a list of lines (e.j. with .readlines()) and measure the length of that list). A counter would be more appropriate if we weren't counting every line.
An integer or float value that is set to 0 before a 'for' loop over a file, can be used to sum up values from the file.
"Column summing" is a common and useful algorithm. It's also meaningful as an exercise in splitting out or selecting out values to summarize some of the data in a file.
set summing variable to 0 loop over lines in the file split or select out data from the line (e.g. a column value) add value to summing variable print or use the sum
This example splits out the 3rd column value from each row, converts to float and adds the value to a float variable initialized before the loop begins:
fh = open('revenue.csv')
header_line = next(fh) # advance the file pointer past
# the first line (only if needed)
value_summer = 0.0 # set to a float because a
# float value is expected
# first values:
for line in fh: # "Haddad's,PA,239.5\n"
line = line.rstrip() # "Haddad's,PA,239.5"
items = line.split(',') # ["Haddad's", 'PA', '239.5']
rev = items[2] # '239.5'
frev = float(rev) # 239.5
value_summer = value_summer + frev # 239.5
fh.close()
print(f'sum of values in 3rd column: {value_summer}')
# 662.0100000000001
# (note fractional value may differ)
An empty list or set initialized before a 'for' loop over a file can be used to collect values from the file.
Collecting values as a loop progresses is also a very common idiom. We may be looping over lines from a file, a database result set or a more complex structure such as that read from a JSON file.
The central concept behind this algorithm is a collector - such as a list or set - that we initialize as empty before the loop and add to as we loop through the data:
initialize a list or set as empty loop over lines in file split or select out value to add add value to list or set use the list of collected values
This example splits out the 3rd column value from each row, converts to float and append the value to a list that was initialized before the loop began:
fh = open('revenue.csv')
header_line = next(fh) # advance the file pointer past
# the first line (only if needed)
value_collector = [] # empty list or set for collecting
# first values:
for line in fh: # "Haddad's,PA,239.5\n"
line = line.rstrip() # "Haddad's,PA,239.5"
items = line.split(',') # [ "Haddad's", 'PA', '239.5' ]
rev = items[2] # '239.5'
frev = float(rev) # 239.5
value_collector.append(frev) # [ 239.5 ]
fh.close()
print(value_collector) # [ 239.5, 53.9, 211.5, 11.98,
# 5.98, 239.5, 115.2 ]
A dict lookup taken from a file usually stores one pair per row of a file.
Dicts (or "mappings") pair up keys with values, and in the case of a lookup dict, we're interested in being able to look up a value associated with a key. For example, we might look up a full name based on an abbreviation; a city based on a zip code; an employee name based on an id, etc.
Similar to a list or set collection, we add pairs to a dict from each row or selected rows in a file:
initialize a dict as empty loop through file split or select out the key and value to be added add the key/value as a pair use the dict for lookup or other purpose
This example splits out the 1st and 2nd column value from each line, and adds that values as a pair to the dict.
fh = open('revenue.csv')
header_line = next(fh) # advance the file pointer past
# the first line (only if needed)
company_states = {} # empty dict for collecting
# first values:
for line in fh: # "Haddad's,PA,239.5\n"
items = line.split(',') # [ "Haddad's", 'PA', '239.5\n' ]
co = items[0] # "Haddad's"
st = items[1] # 'PA'
company_states[co] = st # {"Haddad's", 'PA'}
fh.close()
print(company_states) # { "Haddad's": 'PA',
# 'Westfield': 'NJ',
# 'The Store': 'NJ',
# 'Hipster's': 'NY',
# 'Dothraki Fashions': 'NY',
# "Awful's': 'PA',
# 'The Clothiers': 'NY',
# }
A dict aggregation from a file relies on the presence of a key to determine whether to set a pair or change the value associated with a key.
An aggregation, or grouping, allows us to compile a sum or a count based on a "primary key". For example: counting large cities in each state or students under each major; summing up revenue by sales associate or total population of cities in each country.
A dictionary can accomplish this by setting the "primary key" (the key under which values are collected) as the key in the dict and the value to an int or float that is then updated inside the file loop:
initialize a dict as empty loop through the file split and select out the "key" and associated "value" if the key is not in the dict set the key to 0 add the "value" to the current value for the key use the dict for lookup or other purpose
fh = open('revenue.csv')
header_line = next(fh) # advance the file pointer past
# the first line (only if needed)
company_states = {} # empty dict for collecting
# first values:
for line in fh: # "Haddad's,PA,239.5\n"
items = line.split(',') # [ "Haddad's", 'PA', '239.5\n' ]
state = items[1] # 'PA'
if state not in company_states:
company_states[state] = 0 # {'PA': 0}
company_states[state] = company_states[state] + 1
# {'PA': 1}
print(company_states) # {'PA': 2, 'NJ': 2, 'NY': 3}
fh.close()
We may "sculpt" a CSV file by reading the file and building a list of rows for writing to another file.
initialize an empty list open a file for reading, pass to a CSV reader read the file row-by-row select columns from or add column values to the row append the row to the list open a new file for writing, pass to a CSV writer write the list of rows to the file always remember to close a write file
Selecting column values for each row is as easy as creating a new row list, then appending the new list to a row collector list:
import csv
fh = open('revenue.csv')
reader = csv.reader(fh)
# advance the file pointer past
# the first line (only if needed)
header_row = next(reader) # ['company', 'state', 'price']
new_lines = [] # new collector list
new_lines.append(['company', 'price']) # add header line with selected columns
# first values:
for row in reader: # ["Haddad's", 'PA', '239.5']
name = row[0] # "Haddad's"
state = row[1] # 'PA'
price = row[2] # '239.5'
new_row = [name, price] # ["Haddad's", '239.5']
new_lines.append(new_row) # [ ['company', 'price'],
# ["Haddad's", '239.5'] ]
# (list of lists)
fh.close()
wfh = open('revenue_new.csv', 'w', newline='') # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh) # <B>newline=''</B> is necessary for Windows users
writer.writerows(new_lines) # writing list of lists to new file
# as all lines at once
wfh.close()
# company,price
# Haddad's,239.5
# Westfield,53.9
# The Store,211.5
# Hipster's,11.98
# Dothraki Fashions,5.98
# Awful's,23.95
# The Clothiers,115.2
Adding a column can be done just as easily - building a collector list of rows with an extra value on each row list:
import csv
fh = open('revenue.csv')
reader = csv.reader(fh)
# advance the file pointer past
# the first line (only if needed)
header_row = next(reader) # ['company', 'state', 'price']
header_row.append('tax') # ['company', 'state', 'price', 'tax']
new_lines = [ ] # new collector list
new_lines.append(header_row) # [ ['company', 'state', 'price', 'tax'] ]
# first values:
for row in reader: # ["Haddad's", 'PA', '239.5']
name = row[0] # "Haddad's"
state = row[1] # 'PA'
revenue = float(row[2]) # 239.5
tax = round(revenue * .08, 2) # 19.16
new_row = [ name, state, revenue, tax ] # [ "Haddad's", 'PA', 239.5, 19.16 ]
# adding above row to list - list of lists
new_lines.append(new_row) # [ ['company', 'state', 'price', 'tax'],
# ["Haddad's", 'PA', 239.5, 19.16] ]
wfh = open('revenue_new.csv', 'w', newline='') # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)
writer.writerows(new_lines) # writing list of lists to new file
# as all lines at once
fh.close()
# company,state,price,tax
# Haddad's,PA,239.5,19.16
# Westfield,NJ,53.9,4.31
# The Store,NJ,211.5,16.92
# Hipster's,NY,11.98,0.96
# Dothraki Fashions,NY,5.98,0.48
# Awful's,PA,23.95,1.92
# The Clothiers,NY,115.2,9.22
We may selectively choose rows from a file for addition to a new file.
Here we are building a new collector list of rows that have a column value above a threshold value:
import csv
fh = open('revenue.csv')
reader = csv.reader(fh)
# advance the file pointer past
# the first line (only if needed)
header_row = next(reader) # ['company', 'state', 'price']
new_lines = [ ] # new collector list
new_lines.append(header_row) # [ ['company', 'state', 'price'] ]
# first values:
for row in reader: # ["Haddad's", 'PA', '239.5']
name = row[0] # "Haddad's"
state = row[1] # 'PA'
revenue = float(row[2]) # 239.5
if revenue >= 100:
new_lines.append(row) # [ ['company', 'state', 'price'],
# ["Haddad's", 'PA', '239.5'] ]
wfh = open('revenue_new.csv', 'w', newline='') # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)
writer.writerows(new_lines) # writes entire list of lists to file
fh.close()
# company,state,price
# Haddad's,PA,239.5
# The Store,NJ,211.5
# The Clothiers,NY,115.2
We may choose rows from a file based on position, build them into a list of rows, then write those rows to a new file.
Similar to "whole file" parsing of lines, we can read the file into a list of rows, then select rows from the list for writing to a new file. The cvs .writerows() method makes this particularly easy.
import csv
fh = open('revenue.csv')
reader = csv.reader(fh)
# advance the file pointer past
# the first line (only if needed)
header_row = next(reader) # ['company', 'state', 'price']
new_lines = [ ] # new collector list
new_lines.append(header_row) # [ ['company', 'state', 'price'] ]
data_lines = list(reader) # list of lists - entire file
wanted_lines = data_lines[3:] # just the 4th through last rows
# (omitting 1st 3 rows)
new_lines.extend(wanted_lines) # add selected rows to list of rows
wfh = open('revenue_new.csv', 'w', newline='') # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)
writer.writerows(new_lines) # writes entire list of lists to file
fh.close()
# company,state,price
# Hipster's,NY,11.98
# Dothraki Fashions,NY,5.98
# Awful's,PA,23.95
# The Clothiers,NY,115.2
Files can be read as a single string and divided into words or lines.
Entire File as a String Treating an entire file as a string means we can process the entire file at once, using string transforming methods like .upper() or .replace(), "inspection" methods like .count(), in or len(), or methods that convert the string into another useful form like .split() (split into "words") or .splitlines() (split into lines).
fh = open('pyku.txt')
text = fh.read() # returns single string, full text of file
print(f'there are {len(text)} lines in the file')
fh.close()
(As with any file read, be sure to read the file only once - if you try to read it a second time, python will return an empty list or string.) Entire File as a List of String Lines Dividing a file into a list of lines means we can access any line by position in the file (first line, last line, 5th line, etc.) by subscripting the list. We can also slice a portion of the file by position (first 5 lines, last 3 lines, 2nd through last line).
str.readlines() returns a list of strines lines:
fh = open('revenue.csv')
lines = fh.readlines() # list of strings, each line in file
print(f'there are {len(lines)} in the file')
print(f'first line is {lines[0]}')
print(f'last line is {lines[-1]}')
fh.close()
Calling the .read() method on a file to get a string, and then the .splitlines() method on the string returns the list of lines (strings)
fh = open('revenue.csv')
text = fh.read() # single string, full text of file
lines = text.splitlines() # list of strings, lines from file
print(lines)
fh.close()
We may even want to combine the read and splitlines into one statement -- this is just a convenient way to say it quickly:
text = fh.read().splitlines()
(As with any file read, be sure to read the file only once - if you try to read it a second time, python will return an empty list or string.) Entire file as a list of string words
Since a string can be split, we can easily read the file as a list of words in the file:
fh = open('pyku.txt')
text = fh.read() # single string, full text of file
words = text.split() # entire file as a list of words
print(f'there are {len(words} in the file')
print(f'the first word is {words[0]}')
print(f'the last word is {words[-1]}')
fh.close()