Python 3


Introduction to Python

davidbpython.com




Data Parsing & Extraction: String Methods


our first data format: csv

The CSV format will allow us to explore Python's text parsing tools.


    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010






CSV structure: "fields" and "records"

Tables consist of records (rows) and fields (column values).


Tabular text files are organized into rows and columns.


comma-separated values file (CSV)

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

space-separated values file

    19260701   -0.09    0.22    0.30   0.009
    19260702    0.44    0.35   -0.08   0.009
    19270103    0.97   -0.21    0.24  -0.010
    19270104    0.30   -0.15    0.73   0.010
    19280103   -0.43    0.90    0.20   0.010
    19280104    0.14    0.47    0.01  -0.010


presentation note: ask student to name the two structural characters






table data in text files

Text files are just sequences of characters. In a CSV file, commas separate the fields and newline characters separate the records.


If we print a CSV text file, we may see this:

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

However, here's what a text file really looks like under the hood:

19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,
0.009\n19270103,0.97,0.21,0.24,0.010\n19270104,0.30,0.15,
0.73,0.010\n19280103,0.43,0.90,0.20,0.010\n19280104,0.14,
0.47,0.01,0.010
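To see the newline characters for ourselves, we can compare print() with repr(). This is a minimal sketch that holds two sample records in a string rather than reading a real file; the variable names are illustrative only.

```python
# Two sample records held in a string; "\n" marks the end of each record.
text = "19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,0.009\n"

print(text)          # newlines are rendered as line breaks
print(repr(text))    # newlines are shown as the escape sequence \n

newline_count = text.count("\n")   # one newline per record in this sample
comma_count = text.count(",")      # four commas per record in this sample
print(newline_count, comma_count)
```

Counting the two structural characters confirms the layout: each record here contributes four commas and one newline.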






tabular data: looping, parsing and summarizing

Looping through file line strings, we can split and isolate fields on each line.


The process:


1. Open the file for reading.

fh = open('myfile.csv')

2. Use a for loop to read each line of the file, one at a time. Each line will be represented as a string.

for line in fh:

3. Remove the newline from the end of each string with .rstrip().

    line = line.rstrip()

4. Divide (using .split()) the string into fields.

    fields = line.split(',')

5. Read a value from one of the fields, representing the data we want.

    val = fields[4]

6. As the loop progresses, build a sum of values from each line (mysum must be set to 0 before the loop begins).

    mysum = mysum + float(val)

We will begin by reviewing each feature necessary to complete this work, and then we will put it all together.
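As a preview, the six steps might be assembled as in the sketch below. It is one possible arrangement, not the only one. To stay self-contained it first writes three sample rows to a temporary file (the tempfile usage and the names sample and path are assumptions added for illustration), and it initializes mysum before the loop.

```python
import os
import tempfile

# Write sample rows to a temporary file so the sketch can run on its own.
sample = (
    "19260701,0.09,0.22,0.30,0.009\n"
    "19260702,0.44,0.35,0.08,0.009\n"
    "19270103,0.97,0.21,0.24,0.010\n"
)
path = os.path.join(tempfile.mkdtemp(), "myfile.csv")
with open(path, "w") as out:
    out.write(sample)

mysum = 0.0                        # start the running total before the loop
fh = open(path)                    # 1. open the file for reading
for line in fh:                    # 2. one line (a string) per iteration
    line = line.rstrip()           # 3. remove the trailing newline
    fields = line.split(',')       # 4. divide the line into fields
    val = fields[4]                # 5. select the fifth field
    mysum = mysum + float(val)     # 6. add it to the running total
fh.close()

print(round(mysum, 3))
```

The call to float() is needed because every field read from a text file is a string, even when it looks like a number.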






string method: .rstrip()

This method removes whitespace, or any characters we specify, from the right side of a string.


When no argument is passed, all trailing whitespace (including the newline character) is removed from the end of the line:

line_from_file = 'jw234,Joe,Wilson\n'

stripped = line_from_file.rstrip()      # str, 'jw234,Joe,Wilson'

When a string argument is passed, every trailing occurrence of the characters in that string is removed from the end of the line:

line_from_file = 'I have something to say.'

stripped = line_from_file.rstrip('.')   # str, 'I have something to say'

Whitespace characters are any characters that don't print directly, but we may see their presence: space, tab, or newline characters are whitespace.
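A few more cases may help show what .rstrip() does at the edges. The sample strings below are made up for illustration. Note that the argument is treated as a set of characters to remove, not as a literal ending.

```python
line = 'jw234,Joe,Wilson \t\n'

no_ws = line.rstrip()              # 'jw234,Joe,Wilson' (space, tab, newline removed)
only_newline = line.rstrip('\n')   # 'jw234,Joe,Wilson \t' (space and tab kept)

shout = 'Wait...!!'
trimmed = shout.rstrip('.!')       # 'Wait' (any trailing '.' or '!' removed)

fruit = 'banana'
leftover = fruit.rstrip('an')      # 'b' (trailing 'a' and 'n' removed in any order)

print(repr(no_ws), repr(only_newline), repr(trimmed), repr(leftover))
```

Passing '\n' explicitly is useful when trailing spaces or tabs are part of the data and should be kept.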






string method: .split() with a delimiter

This method divides a delimited string into a list.


line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

xx = line_from_file.split(':')

print(xx)                         # ['jw234', 'Joe', 'Wilson',
                                  #  'Smithtown', 'NJ', '2015585894\n']
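Notice that the last item above still carries the newline. A common way to avoid this is to call .rstrip() before .split(); the sketch below does that and then reads individual fields by index. The names user_id, phone and field_count are illustrative.

```python
line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

# Strip the newline first, then split on the delimiter.
fields = line_from_file.rstrip().split(':')

user_id = fields[0]          # 'jw234'
phone = fields[-1]           # '2015585894' (no trailing newline)
field_count = len(fields)    # 6

print(user_id, phone, field_count)
```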






string method: .split() without a delimiter

We can also think of a string as delimited by whitespace.


gg = 'this is a file    with    some     whitespace'

hh = gg.split()                   # splits on any "whitespace character"

print(hh)                         # ['this', 'is', 'a', 'file',
                                  #  'with', 'some', 'whitespace']
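The difference between .split() and .split(' ') is easy to miss, so here is a short comparison. The string gg2 is made up for illustration; row is taken from the space-separated sample earlier in these notes.

```python
gg2 = 'a  b\tc\n'

by_whitespace = gg2.split()     # ['a', 'b', 'c']
by_space = gg2.split(' ')       # ['a', '', 'b\tc\n']

# A row from the space-separated sample: runs of spaces count as one delimiter.
row = '19260701   -0.09    0.22    0.30   0.009'
row_fields = row.split()        # ['19260701', '-0.09', '0.22', '0.30', '0.009']

print(by_whitespace, by_space, row_fields)
```

With no argument, runs of whitespace are treated as a single delimiter and leading or trailing whitespace is ignored. With ' ' as the argument, every single space is a delimiter, so consecutive spaces produce empty strings and tabs and newlines are left in place.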


ex 4.1 - 4.2 (skipping 4.3, 4.4, slicing)




