Python 3

home

[Review] File Parsing Algorithms

File Looping and Counting

An integer value set to 0 before a 'for' loop over a file can be used to count lines in the file.

This algorithm uses a basic 'for' loop counter, which you can apply to any 'for' loop if you want to count the iterations. For a file, counting iterations means counting lines from the file. We might use this pattern to count lines that have a particular value, or only those lines that have data.

the basic pattern is as follows:

set counter to integer zero

loop over lines in the file
    increment counter

print or use the count integer

This example counts the data lines - first we advance past the header line to the 2nd line, then begin counting:

fh = open('revenue.csv')

header_line = next(fh)   # advance the file pointer past
                         # the first line (only if needed)

line_counter = 0

for line in fh:
    line_counter = line_counter + 1     # 1 (first value)

fh.close()

print(f'{line_counter} lines')          # '7 lines'

Of course if we're only interested in knowing how many lines are in the file, we can also read the file as a list of lines (e.j. with .readlines()) and measure the length of that list). A counter would be more appropriate if we weren't counting every line.

File Looping and Summing

An integer or float value that is set to 0 before a 'for' loop over a file, can be used to sum up values from the file.

"Column summing" is a common and useful algorithm. It's also meaningful as an exercise in splitting out or selecting out values to summarize some of the data in a file.

set summing variable to 0
loop over lines in the file
    split or select out data from the line (e.g. a column value)
    add value to summing variable

print or use the sum

This example splits out the 3rd column value from each row, converts to float and adds the value to a float variable initialized before the loop begins:

fh = open('revenue.csv')

header_line = next(fh)   # advance the file pointer past
                         # the first line (only if needed)

value_summer = 0.0       # set to a float because a
                         # float value is expected

                                        # first values:
for line in fh:                         # "Haddad's,PA,239.5\n"
    line = line.rstrip()                # "Haddad's,PA,239.5"
    items = line.split(',')             # ["Haddad's", 'PA', '239.5']
    rev = items[2]                      # '239.5'
    frev = float(rev)                   # 239.5

    value_summer = value_summer + frev  # 239.5

fh.close()

print(f'sum of values in 3rd column:  {value_summer}')

                                        # 662.0100000000001
                                        # (note fractional value may differ)

File Looping and Collecting

An empty list or set initialized before a 'for' loop over a file can be used to collect values from the file.

Collecting values as a loop progresses is also a very common idiom. We may be looping over lines from a file, a database result set or a more complex structure such as that read from a JSON file.

The central concept behind this algorithm is a collector - such as a list or set - that we initialize as empty before the loop and add to as we loop through the data:

initialize a list or set as empty

loop over lines in file
    split or select out value to add
    add value to list or set

use the list of collected values

This example splits out the 3rd column value from each row, converts to float and append the value to a list that was initialized before the loop began:

fh = open('revenue.csv')

header_line = next(fh)   # advance the file pointer past
                         # the first line (only if needed)

value_collector = []     # empty list or set for collecting

                                   # first values:
for line in fh:                    # "Haddad's,PA,239.5\n"
    line = line.rstrip()           # "Haddad's,PA,239.5"
    items = line.split(',')        # [ "Haddad's", 'PA', '239.5' ]
    rev = items[2]                 # '239.5'
    frev = float(rev)              # 239.5

    value_collector.append(frev)   # [ 239.5 ]

fh.close()

print(value_collector)             # [ 239.5, 53.9, 211.5, 11.98,
                                   #   5.98, 239.5, 115.2 ]

Building a dict Lookup from File

A dict lookup taken from a file usually stores one pair per row of a file.

Dicts (or "mappings") pair up keys with values, and in the case of a lookup dict, we're interested in being able to look up a value associated with a key. For example, we might look up a full name based on an abbreviation; a city based on a zip code; an employee name based on an id, etc.

Similar to a list or set collection, we add pairs to a dict from each row or selected rows in a file:

initialize a dict as empty
loop through file
    split or select out the key and value to be added
    add the key/value as a pair

use the dict for lookup or other purpose

This example splits out the 1st and 2nd column value from each line, and adds that values as a pair to the dict.

fh = open('revenue.csv')

header_line = next(fh)   # advance the file pointer past
                         # the first line (only if needed)

company_states = {}      # empty dict for collecting

                                  # first values:
for line in fh:                   # "Haddad's,PA,239.5\n"
    items = line.split(',')       # [ "Haddad's", 'PA', '239.5\n' ]
    co = items[0]                 # "Haddad's"
    st = items[1]                 # 'PA'
    company_states[co] = st       # {"Haddad's", 'PA'}

fh.close()

print(company_states)      # { "Haddad's": 'PA',
                           #   'Westfield': 'NJ',
                           #   'The Store': 'NJ',
                           #   'Hipster's': 'NY',
                           #   'Dothraki Fashions': 'NY',
                           #   "Awful's': 'PA',
                           #   'The Clothiers': 'NY',
                           #  }

Building a dict Aggregation (Summing or Counting) from File

A dict aggregation from a file relies on the presence of a key to determine whether to set a pair or change the value associated with a key.

An aggregation, or grouping, allows us to compile a sum or a count based on a "primary key". For example: counting large cities in each state or students under each major; summing up revenue by sales associate or total population of cities in each country.

A dictionary can accomplish this by setting the "primary key" (the key under which values are collected) as the key in the dict and the value to an int or float that is then updated inside the file loop:

initialize a dict as empty

loop through the file
    split and select out the "key" and associated "value"
    if the key is not in the dict
        set the key to 0
    add the "value" to the current value for the key

use the dict for lookup or other purpose

fh = open('revenue.csv')

header_line = next(fh)   # advance the file pointer past
                         # the first line (only if needed)

company_states = {}      # empty dict for collecting

                                     # first values:
for line in fh:                      # "Haddad's,PA,239.5\n"
    items = line.split(',')          # [ "Haddad's", 'PA', '239.5\n' ]
    state = items[1]                 # 'PA'

    if state not in company_states:
        company_states[state] = 0    # {'PA': 0}

    company_states[state] = company_states[state] + 1
                                     # {'PA': 1}

print(company_states)                # {'PA': 2, 'NJ': 2, 'NY': 3}

fh.close()

CSV <-> CSV Transforms: Selecting Columns

We may "sculpt" a CSV file by reading the file and building a list of rows for writing to another file.

initialize an empty list
open a file for reading, pass to a CSV reader
read the file row-by-row
    select columns from or add column values to the row
    append the row to the list

open a new file for writing, pass to a CSV writer
write the list of rows to the file
always remember to close a write file

Selecting column values for each row is as easy as creating a new row list, then appending the new list to a row collector list:

import csv

fh = open('revenue.csv')
reader = csv.reader(fh)

                                       # advance the file pointer past
                                       # the first line (only if needed)
header_row = next(reader)              # ['company', 'state', 'price']

new_lines = []                         # new collector list
new_lines.append(['company', 'price']) # add header line with selected columns

                                       # first values:
for row in reader:                     # ["Haddad's", 'PA', '239.5']
    name = row[0]                      # "Haddad's"
    state = row[1]                     # 'PA'
    price = row[2]                     # '239.5'

    new_row = [name, price]            # ["Haddad's", '239.5']

    new_lines.append(new_row)          # [ ['company', 'price'],
                                       #   ["Haddad's", '239.5'] ]
                                       # (list of lists)

fh.close()


wfh = open('revenue_new.csv', 'w', newline='')    # <B>newline=''</B> is necessary for Windows users

writer = csv.writer(wfh)            # <B>newline=''</B> is necessary for Windows users
writer.writerows(new_lines)         # writing list of lists to new file
                                    # as all lines at once

wfh.close()
                                    # company,price
                                    # Haddad's,239.5
                                    # Westfield,53.9
                                    # The Store,211.5
                                    # Hipster's,11.98
                                    # Dothraki Fashions,5.98
                                    # Awful's,23.95
                                    # The Clothiers,115.2

Adding a column can be done just as easily - building a collector list of rows with an extra value on each row list:

import csv

fh = open('revenue.csv')
reader = csv.reader(fh)

                                    # advance the file pointer past
                                    # the first line (only if needed)
header_row = next(reader)           # ['company', 'state', 'price']

header_row.append('tax')            # ['company', 'state', 'price', 'tax']

new_lines = [ ]                     # new collector list
new_lines.append(header_row)        # [ ['company', 'state', 'price', 'tax'] ]

                                    # first values:
for row in reader:                  # ["Haddad's", 'PA', '239.5']
    name = row[0]                   # "Haddad's"
    state = row[1]                  # 'PA'
    revenue = float(row[2])         # 239.5

    tax = round(revenue * .08, 2)   # 19.16

    new_row = [ name, state, revenue, tax ]   # [ "Haddad's", 'PA', 239.5, 19.16 ]

                                    # adding above row to list - list of lists
    new_lines.append(new_row)       # [ ['company', 'state', 'price', 'tax'],
                                    #   ["Haddad's", 'PA', 239.5, 19.16] ]


wfh = open('revenue_new.csv', 'w', newline='')    # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)

writer.writerows(new_lines)         # writing list of lists to new file
                                    # as all lines at once

fh.close()
                                    # company,state,price,tax
                                    # Haddad's,PA,239.5,19.16
                                    # Westfield,NJ,53.9,4.31
                                    # The Store,NJ,211.5,16.92
                                    # Hipster's,NY,11.98,0.96
                                    # Dothraki Fashions,NY,5.98,0.48
                                    # Awful's,PA,23.95,1.92
                                    # The Clothiers,NY,115.2,9.22

CSV <-> CSV Transforms: Selecting Rows by Column Value

We may selectively choose rows from a file for addition to a new file.

Here we are building a new collector list of rows that have a column value above a threshold value:

import csv

fh = open('revenue.csv')
reader = csv.reader(fh)
                                    # advance the file pointer past
                                    # the first line (only if needed)
header_row = next(reader)           # ['company', 'state', 'price']

new_lines = [ ]                     # new collector list
new_lines.append(header_row)        # [ ['company', 'state', 'price'] ]

                                    # first values:
for row in reader:                  # ["Haddad's", 'PA', '239.5']
    name = row[0]                   # "Haddad's"
    state = row[1]                  # 'PA'
    revenue = float(row[2])         # 239.5

    if revenue >= 100:
        new_lines.append(row)       # [ ['company', 'state', 'price'],
                                    #   ["Haddad's", 'PA', '239.5']    ]


wfh = open('revenue_new.csv', 'w', newline='')    # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)
writer.writerows(new_lines)         # writes entire list of lists to file

fh.close()
                                    # company,state,price
                                    # Haddad's,PA,239.5
                                    # The Store,NJ,211.5
                                    # The Clothiers,NY,115.2

CSV <-> CSV Transforms: Selecting Rows by Position

We may choose rows from a file based on position, build them into a list of rows, then write those rows to a new file.

Similar to "whole file" parsing of lines, we can read the file into a list of rows, then select rows from the list for writing to a new file. The cvs .writerows() method makes this particularly easy.

import csv

fh = open('revenue.csv')
reader = csv.reader(fh)
                                    # advance the file pointer past
                                    # the first line (only if needed)
header_row = next(reader)           # ['company', 'state', 'price']

new_lines = [ ]                     # new collector list
new_lines.append(header_row)        # [ ['company', 'state', 'price'] ]

data_lines = list(reader)           # list of lists - entire file

wanted_lines = data_lines[3:]       # just the 4th through last rows
                                    # (omitting 1st 3 rows)

new_lines.extend(wanted_lines)      # add selected rows to list of rows


wfh = open('revenue_new.csv', 'w', newline='')    # <B>newline=''</B> is necessary for Windows users
writer = csv.writer(wfh)
writer.writerows(new_lines)         # writes entire list of lists to file

fh.close()

                                    # company,state,price
                                    # Hipster's,NY,11.98
                                    # Dothraki Fashions,NY,5.98
                                    # Awful's,PA,23.95
                                    # The Clothiers,NY,115.2

"Whole File" Parsing: Accessing File Text as a String or List

Files can be read as a single string and divided into words or lines.

Entire File as a String Treating an entire file as a string means we can process the entire file at once, using string transforming methods like .upper() or .replace(), "inspection" methods like .count(), in or len(), or methods that convert the string into another useful form like .split() (split into "words") or .splitlines() (split into lines).

fh = open('pyku.txt')

text = fh.read()       # returns single string, full text of file

print(f'there are {len(text)} lines in the file')

fh.close()

(As with any file read, be sure to read the file only once - if you try to read it a second time, python will return an empty list or string.) Entire File as a List of String Lines Dividing a file into a list of lines means we can access any line by position in the file (first line, last line, 5th line, etc.) by subscripting the list. We can also slice a portion of the file by position (first 5 lines, last 3 lines, 2nd through last line).

str.readlines() returns a list of strines lines:

fh = open('revenue.csv')

lines = fh.readlines()     # list of strings, each line in file

print(f'there are {len(lines)} in the file')
print(f'first line is {lines[0]}')
print(f'last line is {lines[-1]}')

fh.close()

Calling the .read() method on a file to get a string, and then the .splitlines() method on the string returns the list of lines (strings)

fh = open('revenue.csv')

text = fh.read()                 # single string, full text of file
lines = text.splitlines()        # list of strings, lines from file

print(lines)

fh.close()

We may even want to combine the read and splitlines into one statement -- this is just a convenient way to say it quickly:

text = fh.read().splitlines()

(As with any file read, be sure to read the file only once - if you try to read it a second time, python will return an empty list or string.) Entire file as a list of string words

Since a string can be split, we can easily read the file as a list of words in the file:

fh = open('pyku.txt')

text = fh.read()                  # single string, full text of file
words = text.split()              # entire file as a list of words

print(f'there are {len(words} in the file')
print(f'the first word is {words[0]}')
print(f'the last word is {words[-1]}')

fh.close()

[pr]