Web Scraping with Beautiful Soup (bs4)

Beautiful Soup parses HTML or XML documents, making text and attribute extraction a snap.

Here we are passing the text of a web page (obtained by requests) to the bs4 parser:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.nytimes.com')

soup = BeautifulSoup(response.text, 'html.parser')

# show HTML in "pretty" form
print(soup.prettify())

# show all plain text in a page
print(soup.get_text())

The result is a BeautifulSoup object which we can use to search for tags and data.

For the following examples, let's use HTML based on the sample from the Beautiful Soup Quick Start page:

<!doctype html>
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story_title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">They were happy, and eventually died.  The End.</p>
  </body>
</html>
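The examples that follow assume this sample has already been parsed into a soup object. A minimal sketch of that step, assuming the HTML is held in a string (html_doc is our own name, not part of bs4):

```python
from bs4 import BeautifulSoup

# The sample document from above, held in a string (html_doc is our own name)
html_doc = """<!doctype html>
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story_title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">They were happy, and eventually died.  The End.</p>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title.string)          # The Dormouse's story
print(len(soup.find_all('a')))    # 3
```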

bs4: finding a tag with attributes, find() and find_all()

Finding the first tag by name using soup.attribute

The BeautifulSoup object's attributes can be used to search for a tag by name. Only the first tag with that name is returned.

# first (and only) <title> tag
print(soup.title)              # <title>The Dormouse's story</title>

# first (of several) <p> tags
print(soup.p)                  # <p class="story_title"><b>The Dormouse's story</b></p>

Attributes can be chained to drill down to a particular tag:

print(soup.body.p.b)           # <b>The Dormouse's story</b>

However, keep in mind that each attribute in the chain returns only the first tag with that name.

Finding the first tag by name: find()

find() works similarly to an attribute, but filters can be applied (discussed shortly).

print(soup.find('a'))         # <a class="sister eldest" href="http://example.com/elsie" id="link1">Elsie</a>

Finding all tags by name: find_all()

find_all() retrieves a list of all tags with a particular name.

tags = soup.find_all('a')     # a list of 'a' tags from the page
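The list returned by find_all() can be looped over like any other list. A short sketch, assuming a cut-down version of the sample page (html_doc and links are our own names), which also shows what each method returns when nothing matches:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
  <a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# collect (text, href) pairs from every <a> tag
links = []
for tag in soup.find_all('a'):
    links.append((tag.get_text(), tag['href']))

for text, href in links:
    print(text, href)

# find() returns None when nothing matches; find_all() returns an empty list
print(soup.find('table'))         # None
print(soup.find_all('table'))     # []
```

Because find() can return None, it is worth checking its result before reading attributes from it.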

bs4: finding a tag using varying criteria

Tag criteria can focus on a tag's name, its attributes, or text within the tag.

SEARCHING NAME, ATTRIBUTE OR TEXT

Finding a tag by name

Links in a page are marked with the <A> tag (usually seen as <A HREF="">). This call pulls out all links from a page:

link_tags = soup.find_all('a')

Finding a tag by attribute, with or without a tag name

# all <a> tags with an 'id' attribute of link1
link1_a_tags = soup.find_all('a', attrs={'id': "link1"})

# all tags (of any name) with an 'id' attribute of link1
link1_tags = soup.find_all(None, attrs={'id': "link1"})
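bs4 also accepts attribute filters as keyword arguments, which can be easier to read than an attrs dictionary. A brief sketch, assuming the three sister links from the sample page; note the trailing underscore in class_, needed because class is a reserved word in Python:

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# same search as attrs={'id': 'link1'}, written as a keyword argument
by_id = soup.find_all('a', id='link1')

# any tag name, filtered on id only
any_tag_by_id = soup.find_all(id='link1')

# filter on one CSS class
sisters = soup.find_all('a', class_='sister')

print(len(by_id), len(any_tag_by_id), len(sisters))   # 1 1 3
```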

"multi-value" tag attribute

HTML allows multiple space-separated values in the class attribute:

<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>

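If we'd like to find a tag through these values, we can pass a list. A list matches any tag whose class includes at least one of the listed values; find() returns the first such tag, which here is Elsie's link: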

link1_elsie_tag = soup.find('a', attrs={'class': ['sister', 'eldest']})
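Since a list filter matches a tag carrying any one of the listed values, it does not by itself require both classes. A sketch of the difference, assuming the three sister links from the sample page; select() takes a CSS selector and, as far as we know, relies on the soupsieve package that is normally installed alongside bs4:

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# list filter: tags whose class includes 'sister' OR 'eldest'
either = soup.find_all('a', attrs={'class': ['sister', 'eldest']})
print(len(either))                       # 3

# CSS selector: tags whose class includes 'sister' AND 'eldest'
both = soup.select('a.sister.eldest')
print([tag.get_text() for tag in both])  # ['Elsie']
```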

Finding a tag by string within the tag's text

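All <a> tags whose text is exactly 'Elsie' (text is matched with the string argument, not through attrs, which would look for an HTML attribute named "text"):

elsie_tags = soup.find_all('a', string='Elsie')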
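Text searches go through the string argument of find_all(). A plain string must equal the tag's text exactly, while a compiled regular expression can match part of it. A small sketch, assuming the sister links from the sample page:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# plain string: the tag's text must be exactly 'Elsie'
exact = soup.find_all('a', string='Elsie')
print(len(exact))                             # 1

# a partial string does not match
partial_plain = soup.find_all('a', string='Els')
print(len(partial_plain))                     # 0

# regular expression: the tag's text only needs to contain 'll'
partial_regex = soup.find_all('a', string=re.compile('ll'))
print([tag['id'] for tag in partial_regex])   # ['link3']
```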

FILTER TYPES: STRING, LIST, REGEXP, FUNCTION

string: filter on the tag's name

tags = soup.find_all('a')          # return a list of all <a> tags

list: filter on tag names

tags = soup.find_all(['a', 'b'])   # return a list of all <a> or <b> tags

regexp: filter on pattern match against name

import re
tags = soup.find_all(re.compile('^b'))      # a list of all tags whose names start with 'b'
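To see which tags a pattern actually selects, it helps to print the names of the results. A sketch on a tiny made-up document (html_doc is our own example, not the sample page); the second search shows that an unanchored pattern matches anywhere in the tag name:

```python
import re
from bs4 import BeautifulSoup

html_doc = ("<html><body><p><b>bold</b> and <a href='#'>link</a></p>"
            "<blockquote>quote</blockquote></body></html>")
soup = BeautifulSoup(html_doc, 'html.parser')

# anchored pattern: tag names that start with 'b'
starts_with_b = [tag.name for tag in soup.find_all(re.compile('^b'))]
print(starts_with_b)    # ['body', 'b', 'blockquote']

# unanchored pattern: tag names that contain 't' anywhere
contains_t = [tag.name for tag in soup.find_all(re.compile('t'))]
print(contains_t)       # ['html', 'blockquote']
```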

re.compile() produces a pattern object, which bs4 applies to each tag name using the pattern's search() method.

function: filter if function returns True

soup.find_all(lambda tag: tag.name == 'a' and 'mysite.com' in tag.get('href', ''))

A lambda is a small anonymous function of the form lambda arg: expression. It acts just like a function, except that the part to the right of the colon must be a single expression, whose value is returned. The find_all() above is saying 'for each tag in the soup, give me only those whose name is "a" and whose "href" attribute has a value that contains mysite.com'.
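The same filter can be written as a named function, which leaves room for comments and extra checks. A sketch, assuming a made-up page with one anchor that has no href at all (links_to_mysite is a hypothetical helper name); tag.get('href', '') supplies an empty string for such tags so that the "in" test does not fail on None:

```python
from bs4 import BeautifulSoup

html_doc = """
<a name="top">Top</a>
<a href="http://example.com/elsie">Elsie</a>
<a href="http://mysite.com/about">About</a>
<p>mysite.com mentioned in plain text</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

def links_to_mysite(tag):
    """Return True for <a> tags whose href contains 'mysite.com'."""
    # .get() with a default avoids a TypeError on <a> tags without an href
    return tag.name == 'a' and 'mysite.com' in tag.get('href', '')

matches = soup.find_all(links_to_mysite)
print([tag.get_text() for tag in matches])    # ['About']
```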

bs4: the Tag object

A tag's attributes and contents can be read; a tag can also be searched for the tags and text within it.

body_text = """
    <BODY class="someclass otherclass">
        <H1 id='mytitle'>This is a heading</H1>
        <A href="mysite.com">This is a link</A>
    </BODY>
"""

An HTML tag has four types of data:

1. The tag's name ('BODY', 'H1' or 'A')
2. The tag's attributes (<BODY class=, <H1 id= or <A href=)
3. The tag's text ('This is a heading' or 'This is a link')
4. The tag's contents (i.e., tags within it -- for <BODY>, the <H1> and <A> tags)

from bs4 import BeautifulSoup
soup = BeautifulSoup(body_text, 'html.parser')

h1 = soup.body.h1        # h1 is a Tag object
print(h1.name)            # h1
print(h1.get('id'))       # mytitle
print(h1.attrs)           # {'id': 'mytitle'}
print(h1.text)            # This is a heading

body = soup.body         # body is a Tag object
print(body.name)          # body
print(body.get('class'))  # ['someclass', 'otherclass']
print(body.attrs)         # {'class': ['someclass', 'otherclass']}
print(body.text)          # the text of the <h1> and <a> tags, plus the whitespace around them

A tag's child tags can be searched in the same way as the BeautifulSoup object:

body = soup.body         # find the <body> tag in this document

atag = body.find('a')    # find first <a> tag in this <body> tag
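Once a tag has been found this way, its attributes and text can be read directly. A short sketch, assuming the small <BODY> example above with well-formed tags; get_text(' ', strip=True) is one way to collapse the whitespace between nested tags:

```python
from bs4 import BeautifulSoup

body_text = """
    <BODY class="someclass otherclass">
        <H1 id='mytitle'>This is a heading</H1>
        <A href="mysite.com">This is a link</A>
    </BODY>
"""
soup = BeautifulSoup(body_text, 'html.parser')

body = soup.body
atag = body.find('a')               # first <a> tag inside <body>

print(atag['href'])                 # mysite.com
print(atag.get('target'))           # None: .get() is safe for missing attributes

# names of every tag nested inside <body>
child_names = [tag.name for tag in body.find_all(True)]
print(child_names)                  # ['h1', 'a']

# text of all nested tags, stripped and joined by a single space
clean_text = body.get_text(' ', strip=True)
print(clean_text)                   # This is a heading This is a link
```

Indexing with square brackets (atag['target']) raises a KeyError when the attribute is missing, so .get() is the safer choice when an attribute is optional.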
