Another week, another tutorial! This week, we will use the well-known Python packages requests and BeautifulSoup to extract information from websites. We will extract the text of an article from the web page www.espn.com/nba.
This tutorial was created in July 2017, and parts of the code may need to be adjusted if ESPN changes the structure of the page or its articles.
```python
import requests
from bs4 import BeautifulSoup
```
The first step is to send an HTTP request to the web page we want to scrape, to get the full content of the page. This is very simple in Python: we use the get() method from the requests module. Once this is done, we can create the BeautifulSoup object.
```python
# Base soup object
page = requests.get("http://www.espn.com/nba/")
soup = BeautifulSoup(page.content, 'html.parser')
print(type(soup))
```
FINDING THE <body> TAG OF A WEBPAGE
The great thing about BeautifulSoup is that the object is perfectly structured and you can easily access the body of your webpage.
```python
# indices 3 and 3 point at the <html> and <body> elements of this particular page
html_code = list(soup.children)[3]
body = list(html_code.children)[3]
body
```
Now we have the body object, which contains the source code of our web page. The problem is that each web page has a slightly different structure, so the indices 3 and 3 might not lead to the expected results on other pages. We can use a simple trick to tackle this problem:
```python
main_content = list(soup.children)
for i in main_content:
    # if the element starts with <html, it is the HTML code we want to access
    if str(i).find("<html") == 0:
        inner_content = i
        # loop through the HTML code to find the <body> tag
        for j in inner_content:
            if str(j).find("<body") == 0:
                body = j
                print("Finding HTML <body>: successful!")
if 'body' not in locals():
    print("body wasn't found")
```
We can wrap the code above into a function that we can use for each website.
```python
def find_body(soup):
    main_content = list(soup.children)
    for i in main_content:
        # if the element starts with <html, it is the HTML code we want to access
        if str(i).find("<html") == 0:
            inner_content = i
            # loop through the HTML code to find the <body> tag
            for j in inner_content:
                if str(j).find("<body") == 0:
                    body = j
                    print("Finding HTML <body>: successful!")
    if 'body' not in locals():
        print("body wasn't found")
        body = None
    return body
```
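As a quick sanity check, we can run the function on a tiny, made-up page (the definition is repeated here so the snippet runs on its own). It should skip the doctype and pick out the <body> tag:

```python
from bs4 import BeautifulSoup

def find_body(soup):
    main_content = list(soup.children)
    for i in main_content:
        # if the element starts with <html, it is the HTML code we want to access
        if str(i).find("<html") == 0:
            inner_content = i
            # loop through the HTML code to find the <body> tag
            for j in inner_content:
                if str(j).find("<body") == 0:
                    body = j
                    print("Finding HTML <body>: successful!")
    if 'body' not in locals():
        print("body wasn't found")
        body = None
    return body

# made-up minimal page with a doctype and whitespace between the tags
html = "<!DOCTYPE html>\n<html>\n<head><title>t</title></head>\n<body><p>Hello</p></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")
body = find_body(soup)
print(body.get_text())  # Hello
```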
IDENTIFICATION OF THE HEADLINE
Now we can continue with the extraction of the parts of the page we want. In our case, we will focus on the main article in the middle of the page, and our goal is to extract the title and the short summary below the title. We will use our browser's developer tools to identify the section and div of the article.
> In this tutorial, we assume that you have a basic knowledge of HTML. If you want to refresh your memory and find out which HTML tags can be used, you can visit https://www.w3schools.com/tags/default.asp, where you can find a nice overview of each tag.
In your browser, right-click on the headline you want to extract and click Inspect. This highlights the position of the headline in the HTML code, and it is then easy to extract the information: you just need to follow the HTML tags that lead to the content you want.
```python
list_of_possible_titles = body.select("""section#news-feed
                                         article.contentItem
                                         section.contentItem__wrapper
                                         section.contentItem__content
                                         a
                                         h1.contentItem__title""")
```
The other good thing about BeautifulSoup is that it supports most of the CSS selectors, so you can easily identify specific divs based on the id or class.
- section#news-feed - HTML tag section with id news-feed
- article.contentItem - HTML tag article with class contentItem
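As a quick illustration on a tiny, made-up snippet, these id and class selectors can be combined freely, with a space meaning "descendant of":

```python
from bs4 import BeautifulSoup

# made-up snippet illustrating id and class selectors
html = """
<section id="news-feed">
  <article class="contentItem"><h1>First headline</h1></article>
  <article class="contentItem"><h1>Second headline</h1></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# tag#id and tag.class, combined with a space for "descendant of"
for h1 in soup.select("section#news-feed article.contentItem h1"):
    print(h1.get_text())
```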
The select() method above returns a list with all the matches. We can check the number of articles that were returned with len(list_of_possible_titles):
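Since the live page changes over time, here is the same check on a small, made-up imitation of the ESPN markup, with exactly one matching headline:

```python
from bs4 import BeautifulSoup

# a stripped-down, made-up imitation of the ESPN markup from this tutorial
html = """
<section id="news-feed">
  <article class="contentItem">
    <section class="contentItem__wrapper">
      <section class="contentItem__content">
        <a href="/story"><h1 class="contentItem__title">Main headline</h1></a>
      </section>
    </section>
  </article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
list_of_possible_titles = soup.select("""section#news-feed
                                         article.contentItem
                                         section.contentItem__wrapper
                                         section.contentItem__content
                                         a
                                         h1.contentItem__title""")
print(len(list_of_possible_titles))  # 1
```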
We can see that only one result was returned, which is exactly what we wanted. Now we can access the object in the list:
```python
headline = list_of_possible_titles[0].get_text()
print(headline)
```
Next we will do the same thing for the short summary below the headline:
```python
list_of_possible_summaries = body.select("""section#news-feed
                                            article.contentItem
                                            section.contentItem__wrapper
                                            section.contentItem__content
                                            a
                                            p.contentItem__subhead""")
summary = list_of_possible_summaries[0].get_text()
print(summary)
```
SCRAPING THE ACTUAL ARTICLE
The last thing we will do in this tutorial is extract the article link from the home page and then extract the full text of the article. The process of link extraction is almost identical to the previous ones:
```python
list_of_possible_links = body.select("""section#news-feed
                                        article.contentItem
                                        section.contentItem__wrapper
                                        section.contentItem__content
                                        a""")
link = list_of_possible_links[0].get("href")
print(link)
```
Now we will repeat a similar process for the new link.
```python
# Base soup object
if link.find("/") == 0:
    page = requests.get("http://espn.com" + link)
else:
    # if it starts with something other than /, it is most probably a full URL
    page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
article_body = find_body(soup)
```
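The if/else above handles the two common cases by hand. The standard library's urljoin does the same resolution for you; the story path below is made up for illustration:

```python
from urllib.parse import urljoin

# a relative path is resolved against the base URL
print(urljoin("http://www.espn.com/nba/", "/story/_/id/12345"))
# a full URL is returned unchanged
print(urljoin("http://www.espn.com/nba/", "http://other.site/article"))
```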
In most cases, the article is spread across more than one paragraph (the HTML tag <p>). Therefore we need to extract the text from all of them and then merge it together.
```python
# extraction of the article text
text = ""
for article_part in article_body.select('div.container div.article-body p'):
    text = text + "\n" + article_part.get_text()
print(text)
```
The problem with the approach above is that the selector also matches <p> tags nested inside child elements, so get_text() extracts text we don't want. Sometimes that is useful, but in our case it causes trouble. We can tackle this problem by passing recursive=False to the methods find() or find_all(), which restricts the search to direct children.
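On a tiny, made-up snippet, the difference looks like this: without recursive=False, find_all() also returns the paragraph nested inside the aside.

```python
from bs4 import BeautifulSoup

# made-up snippet: a paragraph nested inside an aside within the article body
html = """
<div class="article-body">
  <p>First paragraph.</p>
  <aside><p>Unwanted caption.</p></aside>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="article-body")

# the default recursive search also returns the nested <p>
print(len(div.find_all("p")))                   # 3
# recursive=False keeps only the direct children of the div
print(len(div.find_all("p", recursive=False)))  # 2
```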
```python
# extraction of the article text
article = article_body.select('div.container div.article-body')[0]
for article_part in article.find_all("p", recursive=False):
    print(article_part.get_text())
```
We have shown you two different approaches to extracting the text of an article. Everyone has their own preference, so use whichever works for you. Let us know which approach you prefer! If you want to learn more about how to further analyze the data you scrape, you can sign up for one of our courses at Online.BaseCamp :-)