Beautiful Soup parses HTML or XML documents, making text and attribute extraction a snap.
Here we are passing the text of a web page (obtained by requests) to the bs4 parser:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.nytimes.com')
soup = BeautifulSoup(response.text, 'html.parser')
# show HTML in "pretty" form
print(soup.prettify())
# show all plain text in a page
print(soup.get_text())
The result is a BeautifulSoup object which we can use to search for tags and data.
For the following examples, let's use the HTML provided on the Beautiful Soup Quick Start page:
<!doctype html>
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story_title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">They were happy, and eventually died. The End.</p>
</body>
</html>
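The examples that follow assume a soup built from this sample rather than from a live web page. A minimal sketch of that setup, assuming the HTML is stored in a string (the name html_doc is our own choice, not part of Beautiful Soup):

```python
from bs4 import BeautifulSoup

# The sample document, stored in a string (html_doc is a name chosen here).
html_doc = """<!doctype html>
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story_title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">They were happy, and eventually died. The End.</p>
</body>
</html>"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title.string)        # The Dormouse's story
print(len(soup.find_all('a')))  # 3
```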
Finding the first tag by name using soup.attribute
The BeautifulSoup object's attributes can be used to search for a tag by name. The first tag with that name is returned.
# first (and only) <title> tag
print(soup.title) # <title>The Dormouse's story</title>
# first (of several) <p> tags
print(soup.p) # <p class="story_title"><b>The Dormouse's story</b></p>
Attributes can be chained to drill down to a particular tag:
print(soup.body.p.b) # <b>The Dormouse's story</b>
However, keep in mind that each attribute in the chain returns only the first tag with that name.
Finding the first tag by name: find()
find() works similarly to an attribute, but filters can be applied (discussed shortly).
print(soup.find('a')) # <a class="sister eldest" href="http://example.com/elsie" id="link1">Elsie</a>
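Both attribute access and find() return None when no tag matches, which matters when lookups are chained. A small sketch, using a one-line document of our own rather than the sample above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>Hello</p></body>', 'html.parser')

# Neither form raises an error when the tag is absent; both give None.
missing_attr = soup.table
missing_find = soup.find('table')
print(missing_attr, missing_find)  # None None

# Chaining past a missing tag fails, because None has no 'tr' attribute.
try:
    soup.table.tr
    chained_ok = True
except AttributeError:
    chained_ok = False

# Guarding on None keeps the lookup safe.
table = soup.find('table')
first_row = table.find('tr') if table is not None else None
```

Checking for None before drilling further is one simple way to handle pages whose structure is not guaranteed.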
Finding all tags by name: find_all()
find_all() retrieves a list of all tags with a particular name.
tags = soup.find_all('a') # a list of 'a' tags from the page
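The list returned by find_all() can be looped over like any other list. As an illustration (with a shortened two-link document of our own), this sketch collects each link's text and href:

```python
from bs4 import BeautifulSoup

html = '''<p><a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></p>'''
soup = BeautifulSoup(html, 'html.parser')

# Build (text, href) pairs from every <a> tag in document order.
links = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]

for text, href in links:
    print(text, href)
```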
Tag criteria can focus on a tag's name, its attributes, or text within the tag.
SEARCHING NAME, ATTRIBUTE OR TEXT
Finding a tag by name
Links in a page are marked with the <A> tag (usually seen as <A HREF="">). This call pulls out all links from a page:
link_tags = soup.find_all('a')
Finding a tag by attribute, or by name and attribute
# all <a> tags with an 'id' attribute of link1
link1_a_tags = soup.find_all('a', attrs={'id': "link1"})
# all tags (of any name) with an 'id' attribute of link1
link1_tags = soup.find_all(None, attrs={'id': "link1"})
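Attribute filters can also be written as keyword arguments instead of an attrs dictionary. Because class is a reserved word in Python, Beautiful Soup spells that keyword class_. A brief sketch on a small document of our own:

```python
from bs4 import BeautifulSoup

html = '''<p id="intro">Hi</p>
<a href="/elsie" class="sister" id="link1">Elsie</a>
<a href="/lacie" class="sister" id="link2">Lacie</a>'''
soup = BeautifulSoup(html, 'html.parser')

# Keyword form and attrs form select the same tag.
by_kwarg = soup.find_all('a', id='link1')
by_attrs = soup.find_all('a', attrs={'id': 'link1'})

# class_ (with trailing underscore) filters on the class attribute.
sisters = soup.find_all('a', class_='sister')

# True matches any tag that has the attribute at all.
with_id = soup.find_all(id=True)
```

The attrs dictionary remains useful for attribute names that are not valid Python keyword names, such as those containing a hyphen.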
"multi-value" tag attribute
The class attribute can hold several space-separated values (one per CSS class):
<a href="http://example.com/elsie" class="sister eldest" id="link1">Elsie</a>
If we'd like to find a tag through these values, we can pass a list; a tag matches when it carries any one of the listed classes:
link1_elsie_tag = soup.find('a', attrs={'class': ['sister', 'eldest']})
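Since a list matches any of its values, it does not by itself require a tag to carry every class. One way to require all of them is a CSS selector via select(). The sketch below uses a three-link document of our own to show the difference:

```python
from bs4 import BeautifulSoup

html = '''<a class="sister eldest" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="eldest" id="other">Someone</a>'''
soup = BeautifulSoup(html, 'html.parser')

# List filter: tags having 'sister' OR 'eldest' among their classes.
any_match = soup.find_all('a', attrs={'class': ['sister', 'eldest']})
any_ids = [t['id'] for t in any_match]

# CSS selector: tags having 'sister' AND 'eldest'.
both_match = soup.select('a.sister.eldest')
both_ids = [t['id'] for t in both_match]

print(any_ids)   # ['link1', 'link2', 'other']
print(both_ids)  # ['link1']
```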
Finding a tag by string within the tag's text
# all <a> tags whose text contains 'Elsie' (string= filters on text; attrs= would look for an attribute named 'text')
import re
elsie_tags = soup.find_all('a', string=re.compile('Elsie'))
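The string argument compares against a tag's text. A plain string must match the whole text, while a compiled pattern can match part of it. A short sketch on a two-link document of our own:

```python
import re
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> <a id="link2">Elsie and Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Plain string: the tag's text must equal 'Elsie' exactly.
exact = soup.find_all('a', string='Elsie')
exact_ids = [t['id'] for t in exact]

# Compiled pattern: the tag's text only needs to contain 'Elsie'.
partial = soup.find_all('a', string=re.compile('Elsie'))
partial_ids = [t['id'] for t in partial]

print(exact_ids)    # ['link1']
print(partial_ids)  # ['link1', 'link2']
```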
FILTER TYPES: STRING, LIST, REGEXP, FUNCTION
string: filter on the tag's name
tags = soup.find_all('a') # return a list of all <a> tags
list: filter on tag names
tags = soup.find_all(['a', 'b']) # return a list of all <a> or <b> tags
regexp: filter on pattern match against name
import re
tags = soup.find_all(re.compile('^b')) # a list of all tags whose names start with 'b'
re.compile() produces a pattern object; Beautiful Soup applies it to each tag name with the pattern's search() method, so the pattern may match anywhere in the name unless it is anchored (as with ^ above).
function: filter if function returns True
soup.find_all(lambda tag: tag.name == 'a' and 'mysite.com' in tag.get('href', ''))
A lambda is an anonymous function of the form lambda arg: expression. It acts just like a function, except that the second half (the part to the right of the colon) must be a single expression, whose value is returned. The find_all() above is saying 'for each tag in the soup, give me only those where the name of the tag is "a" and the "href" attribute has a value that contains mysite.com'. The default '' in tag.get('href', '') keeps the test from failing on <a> tags that have no href.
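When the test grows beyond one expression, a named function can be passed in place of the lambda. A sketch with a hypothetical helper, links_to_mysite (our own name), on a small document of our own:

```python
from bs4 import BeautifulSoup

html = '''<a href="http://mysite.com/a">In</a>
<a href="http://other.org">Out</a>
<a name="anchor">No href</a>'''
soup = BeautifulSoup(html, 'html.parser')

def links_to_mysite(tag):
    """Return True for <a> tags whose href contains 'mysite.com'."""
    if tag.name != 'a':
        return False
    # get() with a default avoids testing 'in None' when href is absent.
    href = tag.get('href', '')
    return 'mysite.com' in href

matches = soup.find_all(links_to_mysite)
texts = [t.get_text() for t in matches]
print(texts)  # ['In']
```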
Tags' attributes and contents can be read; they can also be queried for tags and text within
body_text = """
<BODY class="someclass otherclass">
<H1 id='mytitle'>This is a heading</H1>
<A href="mysite.com">This is a link</A>
</BODY>
"""
An HTML tag has four types of data:
1. The tag's name ('BODY', 'H1' or 'A')
2. The tag's attributes (<BODY class=, <H1 id= or <A href=)
3. The tag's text ('This is a heading' or 'This is a link')
4. The tag's contents (i.e., tags within it -- for <BODY>, the <H1> and <A> tags)
from bs4 import BeautifulSoup
soup = BeautifulSoup(body_text, 'html.parser')
h1 = soup.body.h1 # h1 is a Tag object
print(h1.name) # h1
print(h1.get('id')) # mytitle
print(h1.attrs) # {'id': 'mytitle'}
print(h1.text) # This is a heading
body = soup.body # body is a Tag object
print(body.name) # body
print(body.get('class')) # ['someclass', 'otherclass']
print(body.attrs) # {'class': ['someclass', 'otherclass']}
print(body.text) # the string '\nThis is a heading\nThis is a link\n'
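Attributes can also be read with dictionary-style subscripts, and a tag's direct child tags can be listed. The sketch below reuses the same small document; the variable names are our own:

```python
from bs4 import BeautifulSoup

body_text = """
<BODY class="someclass otherclass">
<H1 id='mytitle'>This is a heading</H1>
<A href="mysite.com">This is a link</A>
</BODY>
"""
soup = BeautifulSoup(body_text, 'html.parser')
h1 = soup.body.h1

# Subscript access returns the value, like a dictionary lookup.
title_id = h1['id']

# get() returns None for a missing attribute; a subscript raises KeyError.
missing = h1.get('class')
try:
    h1['class']
    raised = False
except KeyError:
    raised = True

# Direct child tags of <body> only (recursive=False skips deeper levels).
child_names = [child.name for child in soup.body.find_all(True, recursive=False)]
print(child_names)  # ['h1', 'a']
```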
A tag's child tags can be searched the same as the BeautifulSoup object
body = soup.body # find the <body> tag in this document
atag = body.find('a') # find first <a> tag in this <body> tag
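Searching from a tag restricts the search to that tag's descendants. A short sketch, using a two-section document of our own, shows that links outside the chosen tag are not returned:

```python
from bs4 import BeautifulSoup

html = '''<div id="nav"><a href="/home">Home</a></div>
<div id="main"><p>Read <a href="/story">the story</a>.</p></div>'''
soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document finds both links.
all_hrefs = [a['href'] for a in soup.find_all('a')]

# Searching from the "main" <div> finds only the link inside it.
main = soup.find('div', id='main')
main_hrefs = [a['href'] for a in main.find_all('a')]

print(all_hrefs)   # ['/home', '/story']
print(main_hrefs)  # ['/story']
```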