Python 3

Introduction to Python

davidbpython.com




Data Parsing & Extraction: String Methods

our first data format: csv

The CSV format will allow us to explore Python's text parsing tools.


comma-separated values file (CSV)

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010





CSV structure: "fields" and "records"

Tables consist of records (rows) and fields (column values).


Tabular text files are organized into rows and columns.


comma-separated values file (CSV)

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

space-separated values file

    19260701    0.09    0.22    0.30   0.009
    19260702    0.44    0.35    0.08   0.009
    19270103    0.97    0.21    0.24   0.010
    19270104    0.30    0.15    0.73   0.010
    19280103    0.43    0.90    0.20   0.010
    19280104    0.14    0.47    0.01   0.010





table data in text files

Text files are just sequences of characters. Commas and newline characters separate the data.


If we print a CSV text file, we may see this:

    19260701,0.09,0.22,0.30,0.009
    19260702,0.44,0.35,0.08,0.009
    19270103,0.97,0.21,0.24,0.010
    19270104,0.30,0.15,0.73,0.010
    19280103,0.43,0.90,0.20,0.010
    19280104,0.14,0.47,0.01,0.010

However, here's what a text file really looks like under the hood:

19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,
0.009\n19270103,0.97,0.21,0.24,0.010\n19270104,0.30,0.15,
0.73,0.010\n19280103,0.43,0.90,0.20,0.010\n19280104,0.14,
0.47,0.01,0.010
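To see the "under the hood" view directly, one option is to compare print() with repr() on the same string. This is a minimal sketch using the first two rows of the sample data; the variable names are only illustrative.

```python
# two rows of the sample data held in a single string
text = '19260701,0.09,0.22,0.30,0.009\n19260702,0.44,0.35,0.08,0.009\n'

print(text)            # displays two rows, because each \n starts a new line
print(repr(text))      # displays the raw characters, with \n shown literally

newline_count = text.count('\n')    # int, 2 (one newline per row)
comma_count = text.count(',')       # int, 8 (four commas per row)
```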





tabular data: looping, parsing and summarizing

Looping through a file one line at a time, we can split each line string and isolate its fields.


The process:

1. Open the file for reading.
2. Use a for loop to read each line of the file, one at a time. Each line will be represented as a string.
3. Remove the newline from the end of each string with .rstrip().
4. Divide the string into fields with .split().
5. Read a value from one of the fields, representing the data we want.
6. As the loop progresses, build a sum of the values from each line.

We will begin by reviewing each feature necessary to complete this work, and then we will put it all together.
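As a preview, the six steps can be sketched as follows. This is a minimal sketch under stated assumptions: the sample rows are written to a temporary file so the example runs on its own, and the choice to sum the second field (index 1) is arbitrary.

```python
import os
import tempfile

# assumption: three sample rows written to a temporary file,
# standing in for a real CSV file on disk
sample_text = (
    '19260701,0.09,0.22,0.30,0.009\n'
    '19260702,0.44,0.35,0.08,0.009\n'
    '19270103,0.97,0.21,0.24,0.010\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample_text)
    filename = tmp.name

total = 0.0

fh = open(filename)                # 1. open the file for reading
for line in fh:                    # 2. loop through the file, one line string at a time
    line = line.rstrip()           # 3. remove the newline from the end of the line
    fields = line.split(',')       # 4. divide the line into a list of fields
    value = float(fields[1])       # 5. read the second field, converting str to float
    total = total + value          # 6. add the value to the running sum
fh.close()

os.remove(filename)                # clean up the temporary file

print(round(total, 2))             # 1.5
```

Each field comes out of .split() as a string, so it has to be converted with float() before it can be added to the sum.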





string method: .rstrip()

This method removes characters from the right end of a string.


When no argument is passed, all trailing "whitespace" characters (including the newline character) are removed from the end of the line:

line_from_file = 'jw234,Joe,Wilson\n'

stripped = line_from_file.rstrip()      # str, 'jw234,Joe,Wilson'

When a string argument is passed, every trailing occurrence of the characters in that string is removed from the end of the line:

line_from_file = 'I have something to say.'

stripped = line_from_file.rstrip('.')   # str, 'I have something to say'
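A few further cases may help show that .rstrip() keeps removing characters until it reaches one that is not in the set to remove. The strings below are invented for illustration.

```python
# with no argument, all trailing whitespace is removed, not just the newline
padded = 'jw234,Joe,Wilson  \t\n'
no_whitespace = padded.rstrip()          # str, 'jw234,Joe,Wilson'

# passing '\n' removes only trailing newlines and keeps the trailing space
only_newline = 'value \n'.rstrip('\n')   # str, 'value '

# repeated trailing characters are all removed
no_dots = 'Wait...'.rstrip('.')          # str, 'Wait'

# the argument is treated as a set of characters, in any order
no_punct = 'Done.!?'.rstrip('?!.')       # str, 'Done'

# characters on the left side are never touched
left_kept = '..start..'.rstrip('.')      # str, '..start'
```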




string method: .split() with a delimiter

This method divides a delimited string into a list.


line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

xx = line_from_file.split(':')

print(xx)                         # ['jw234', 'Joe', 'Wilson',
                                  #  'Smithtown', 'NJ', '2015585894\n']
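Note that the last item above still carries the newline. One way to avoid this, sketched here with illustrative variable names, is to call .rstrip() before .split() and then select individual fields by index.

```python
line_from_file = 'jw234:Joe:Wilson:Smithtown:NJ:2015585894\n'

# strip the newline first, then split on the delimiter
fields = line_from_file.rstrip().split(':')

user_id = fields[0]       # str, 'jw234'
state = fields[4]         # str, 'NJ'
phone = fields[-1]        # str, '2015585894' (no trailing newline)

print(len(fields))        # 6
```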




string method: .split() without a delimiter

We can also think of a string as delimited by whitespace.


gg = 'this is a file    with    some     whitespace'

hh = gg.split()                   # splits on any "whitespace character"

print(hh)                         # ['this', 'is', 'a', 'file',
                                  #  'with', 'some', 'whitespace']
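This behavior suits the space-separated values file shown earlier, where the number of spaces between columns varies. The sketch below also contrasts .split() with .split(' '), which treats every single space as a separate delimiter; the variable names are only illustrative.

```python
row = '19260701    0.09    0.22    0.30   0.009\n'

# no argument: runs of whitespace count as one delimiter,
# and the trailing newline is discarded as well
fields = row.split()         # ['19260701', '0.09', '0.22', '0.30', '0.009']

# a single-space argument: each space is its own delimiter,
# so consecutive spaces produce empty strings
spaced = 'a  b'.split(' ')   # ['a', '', 'b']

print(fields)
print(spaced)
```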




