Python 3


Introduction to Python

davidbpython.com

Data Parsing & Extraction: String Methods

our first data format: csv

The CSV format will allow us to explore Python's text parsing tools.

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010

data is commonly organized in tabular form: columns and rows
examples: Excel spreadsheet, CSV file, relational database
CSV stands for "comma-separated values"
CSV is used throughout the world to post, transmit and store data
in this lesson we will 'parse' CSV data (i.e., divide into usable pieces)
in the process, we'll learn Python's tools for reading file data and parsing strings
much of the data we are called upon to work with comes to us as strings

CSV structure: "fields" and "records"

Tables consist of records (rows) and fields (column values).

Tabular text files are organized into rows and columns.

comma-separated values file (CSV)

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

space-separated values file

    19260701   -0.09    0.22    0.30   0.009
    19260702    0.44    0.35   -0.08   0.009
    19270103    0.97   -0.21    0.24  -0.010
    19270104    0.30   -0.15    0.73   0.010
    19280103   -0.43    0.90    0.20   0.010
    19280104    0.14    0.47    0.01  -0.010

note the delimiters may be commas, colons, tabs, or any other non-alphanumeric character
in addition, the delimiter may be whitespace, in other words a run of one or more spaces
the delimiter is necessary to maintain the structure, but also must be removed during parsing
our job will be to turn the CSV into "fields", i.e. separated data values on each line
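As a small illustration of what "turning a line into fields" means, the sketch below takes one comma-separated line and one space-separated line from the samples above and separates each into its values. It relies on .split(), which is covered later in this lesson; the variable names are ours and are only for illustration.

```python
# One record from each sample file above (without the trailing newline).
csv_line = '19260701,0.09,0.22,0.30,0.009'
space_line = '19260701   -0.09    0.22    0.30   0.009'

csv_fields = csv_line.split(',')      # the comma is the delimiter
space_fields = space_line.split()     # runs of spaces act as the delimiter

print(csv_fields)      # ['19260701', '0.09', '0.22', '0.30', '0.009']
print(space_fields)    # ['19260701', '-0.09', '0.22', '0.30', '0.009']
```

In both cases the delimiter itself is gone from the result, and every field is still a string.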

presentation note: ask student to name the two structural characters

table data in text files

Text files are just sequences of characters. Commas and newline characters separate the data.

If we print a CSV text file, we may see this:

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

However, here's what a text file really looks like under the hood:

19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,
0.009\n19270103,0.97,0.21,0.24,0.010\n19270104,0.30,0.15,
0.73,0.010\n19280103,0.43,0.90,0.20,0.010\n19280104,0.14,
0.47,0.01,0.010

the newline character separates the records in a CSV file
the delimiter (in this case, a comma) separates the fields
the newline is an actual character stored in the file, not just a visual line break: it is a signal to your printer or display program to drop down a line and continue to print or display characters on the next line
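One way to see the "under the hood" view for ourselves is to compare print() with repr() on the same string. The sketch below uses a short string literal standing in for the file contents (an assumption made so the example runs without a file on disk).

```python
# The first two records from the sample above, as they are actually stored.
text = '19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,0.009\n'

print(text)          # the display drops down a line at each newline
print(repr(text))    # repr() shows each newline explicitly as \n

newline_count = text.count('\n')    # one newline ends each record
comma_count = text.count(',')       # four commas separate five fields

print(newline_count)    # 2
print(comma_count)      # 8
```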

tabular data: looping, parsing and summarizing

Looping through a file line by line, we can split each line string and isolate its fields.

The process:

1. Open the file for reading.

fh = open('myfile.csv')

2. Use a for loop to read each line of the file, one at a time. Each line will be represented as a string.

for line in fh:

3. Remove the newline from the end of each string with .rstrip()

    line = line.rstrip()

4. Divide (using .split()) the string into fields.

    fields = line.split(',')

5. Read a value from one of the fields, representing the data we want.

    val = fields[4]

6. As the loop progresses, build a sum of values from each line (mysum must be set to 0 before the loop begins).

    mysum = mysum + float(val)

We will begin by reviewing each feature necessary to complete this work, and then we will put it all together.
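As a preview of where this is heading, the six steps above can be sketched as one short program. Two things here are our assumptions rather than part of the steps: io.StringIO stands in for open('myfile.csv') so the sketch runs without a file on disk, and mysum is given a starting value of 0.0 before the loop.

```python
import io

# Stand-in for open('myfile.csv'): a file-like object holding sample rows.
fh = io.StringIO(
    '19260701,0.09,0.22,0.30,0.009\n'
    '19260702,0.44,0.35,0.08,0.009\n'
    '19270103,0.97,0.21,0.24,0.010\n'
)

mysum = 0.0                       # must exist before the loop adds to it

for line in fh:                   # each line is a string ending in '\n'
    line = line.rstrip()          # remove the newline
    fields = line.split(',')      # a list of 5 strings
    val = fields[4]               # the last column, e.g. '0.009'
    mysum = mysum + float(val)    # convert to float and accumulate

fh.close()

print(round(mysum, 3))            # 0.028
```

The float() conversion matters because every field comes out of .split() as a string, and strings cannot be added to a number.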

string method: .rstrip()

This method removes whitespace, or any characters we specify, from the right side of a string.

When no argument is passed, all trailing whitespace (including the newline character) is removed from the end of the line:

line_from_file = 'jw234,Joe,Wilson\n'

stripped = line_from_file.rstrip()      # str, 'jw234,Joe,Wilson'

When a string argument is passed, every trailing occurrence of the characters in that string is removed from the end of the line:

line_from_file = 'I have something to say.'

stripped = line_from_file.rstrip('.')   # str, 'I have something to say'

Whitespace characters are any characters that don't print directly, but we may see their presence: space, tab, or newline characters are whitespace.
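A few edge cases may help pin down what .rstrip() does and does not remove. The sample strings below are made up for illustration.

```python
# With no argument, all trailing whitespace goes: spaces, tabs and newlines.
padded = 'jw234,Joe,Wilson  \t\n'
stripped_padded = padded.rstrip()
print(repr(stripped_padded))        # 'jw234,Joe,Wilson'

# With an argument, every trailing occurrence is removed, not just one.
shout = 'Stop!!!'
stripped_shout = shout.rstrip('!')
print(stripped_shout)               # Stop

# The argument is treated as a set of characters, not as a suffix.
mixed = 'Really?!?!'
stripped_mixed = mixed.rstrip('?!')
print(stripped_mixed)               # Really

# Only the right side is affected; leading whitespace stays.
indented = '   left side kept\n'
stripped_indented = indented.rstrip()
print(repr(stripped_indented))      # '   left side kept'
```

Note that .rstrip() returns a new string; the original string is unchanged, which is why the result is assigned to a variable.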

string method: .split() with a delimiter

This method divides a delimited string into a list.

line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

xx = line_from_file.split(':')

print(xx)                         # ['jw234', 'Joe', 'Wilson',
                                  #  'Smithtown', 'NJ', '2015585894\n']

When we pass a delimiter string like the colon, ':', Python steps through the string character-by-character, looking for that character. When it finds it, it determines that all characters leading up to it are a new item in the resulting list.
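It then continues searching the string for the next instance of the delimiter. Each delimiter marks the end of another item in the resulting list; whatever follows the last delimiter becomes the final item, which is why the newline is still attached to '2015585894\n' above.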
A list is an object that can contain a sequence of other objects. We'll learn more about lists in the next lesson.
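Because .split(':') only looks for the delimiter, a trailing newline stays attached to the last item. A minimal sketch of the usual remedy, stripping first and splitting second, using the same sample line:

```python
line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

# Splitting alone leaves the newline on the final item.
raw_items = line_from_file.split(':')
print(repr(raw_items[-1]))          # '2015585894\n'

# Stripping first gives clean items throughout.
clean_items = line_from_file.rstrip().split(':')
print(repr(clean_items[-1]))        # '2015585894'
print(len(clean_items))             # 6

first_name = clean_items[1]         # items are read by position, from 0
print(first_name)                   # Joe
```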

string method: .split() without a delimiter

We can also think of a string as delimited by spaces.

gg = 'this is a file    with    some     whitespace'

hh = gg.split()                   # splits on any "whitespace character"

print(hh)                         # ['this', 'is', 'a', 'file',
                                  #  'with', 'some', 'whitespace']

If no delimiter is supplied, the string is split on whitespace.
Whitespace characters are any characters that don't print directly, but we may see their presence: space, tab, or newline characters are whitespace.
Also note that all whitespace is removed: any run of consecutive whitespace characters is treated as a single delimiter, and whitespace at the start or end of the string is ignored.
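To see why the no-argument form is usually the better choice for space-separated data, the sketch below contrasts it with passing a single space as the delimiter. The sample string is made up for illustration.

```python
gg = '  two   words\t\n'

# No argument: runs of whitespace count as one delimiter, and leading
# and trailing whitespace is ignored.
on_whitespace = gg.split()
print(on_whitespace)        # ['two', 'words']

# A single space as delimiter: every individual space splits the string,
# so empty strings appear and the tab and newline are left in place.
on_single_space = gg.split(' ')
print(on_single_space)      # ['', '', 'two', '', '', 'words\t\n']
```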

ex 4.1 - 4.2 (skipping 4.3, 4.4, slicing)
