Another week, another tutorial! This week, we will use the well-known Python packages requests and BeautifulSoup to extract information from websites. We will extract the text of an article from the webpage www.espn.com/nba.

This tutorial was created in July 2017, so parts of the code may need to be adjusted if ESPN changes the structure of the page or its articles.

import requests
from bs4 import BeautifulSoup

The first step is to send an HTTP request to the web page we want to scrape, to get the full content of the page. This is very simple in Python: we use the get() method from the requests module. Once this is done, we can create the BeautifulSoup object.

# Base soup object
page = requests.get("http://www.espn.com/nba/")
soup = BeautifulSoup(page.content, 'html.parser')
print(type(soup))
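
Before creating the soup object, it is worth checking that the request actually succeeded. A minimal sketch using the status_code attribute of the requests response (raise_for_status() would be an equivalent shortcut that raises an exception on HTTP errors):

# stop early if the server did not return 200 OK
if page.status_code != 200:
    raise RuntimeError("Request failed with status code %s" % page.status_code)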

FINDING THE <body> TAG OF A WEBPAGE

The great thing about BeautifulSoup is that the parsed object mirrors the structure of the page, so you can easily access the body of your webpage.

html_code = list(soup.children)[3]
body = list(html_code.children)[3]
body

Now we have the body object, which contains the source code of our webpage. The problem is that each webpage has a slightly different structure, so the indexes 3 and 3 might not lead to the expected results on other webpages. We can use a simple trick to tackle this problem:

main_content = list(soup.children)
for i in main_content:
    # if the element starts with <html, it is the HTML code we want to access
    if str(i).find("<html") == 0:
        inner_content = i
        # looping through the HTML code to find the <body> tag
        for j in inner_content:
            if str(j).find("<body") == 0:
                body = j
                print("Found HTML <body>: Success!")
if 'body' not in locals():
    print("body wasn't found")

We can wrap the code above into a function that we can use for each website. 

def find_body(soup):
    main_content = list(soup.children)
    for i in main_content:
        # if the element starts with <html, it is the HTML code we want to access
        if str(i).find("<html") == 0:
            inner_content = i
            # looping through the HTML code to find the <body> tag
            for j in inner_content:
                if str(j).find("<body") == 0:
                    body = j
                    print("Found HTML <body>: Success!")
    if 'body' not in locals():
        print("body wasn't found")
        body = None
    return body
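
A quick usage example: calling the function on the soup object we created earlier should give us the same body object as before.

# find the <body> tag of the ESPN home page with our helper function
body = find_body(soup)
print(type(body))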


IDENTIFICATION OF THE HEADLINE

Now we can continue with extracting the parts of the page we want. In our case, we will focus on the main article in the middle of the page, and our goal is to extract the title and the short summary below the title. We will use our browser's developer tools to identify the section and div of the article.

> In this tutorial, we assume that you have basic knowledge of HTML. If you want to refresh your memory and find out which HTML tags can be used, you can visit https://www.w3schools.com/tags/default.asp, where you will find a nice overview of each tag.

In your browser, right-click on the headline you want to extract and click Inspect. This highlights the position of the headline in the HTML code, and it is very easy to extract the information afterward. You just need to follow the HTML tags that lead to the content you want.

list_of_possible_titles = body.select("""section#news-feed article.contentItem 
                                    section.contentItem__wrapper 
                                    section.contentItem__content 
                                    a h1.contentItem__title""")

The other good thing about BeautifulSoup is that it supports most CSS selectors, so you can easily identify specific divs based on their id or class:

  • section#news-feed - HTML tag section with id news-feed
  • article.contentItem - HTML tag article with class contentItem
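
To see how these selectors behave in isolation, here is a minimal, self-contained sketch on a made-up HTML snippet (the snippet below is hypothetical and only mimics the id and class structure used above):

demo = BeautifulSoup("""
<section id="news-feed">
  <article class="contentItem"><h1>First headline</h1></article>
  <article class="other"><h1>Second headline</h1></article>
</section>
""", 'html.parser')

# id selector: section#news-feed matches the section with id="news-feed"
print(len(demo.select("section#news-feed")))
# class + descendant selectors: only the article with class contentItem matches
print(demo.select("section#news-feed article.contentItem h1")[0].get_text())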

The select() method above returns a list with all the matches. We can check the number of articles that were returned:

print(len(list_of_possible_titles))

We can see that only one element was returned, which is exactly what we wanted. Now we can access the object in the list:

headline = list_of_possible_titles[0].get_text()
print(headline)

Next we will do the same thing for the short summary below the headline:

list_of_possible_summaries = body.select("""section#news-feed article.contentItem 
                                    section.contentItem__wrapper 
                                    section.contentItem__content 
                                    a p.contentItem__subhead""")
summary = list_of_possible_summaries[0].get_text()
print(summary)


SCRAPING THE ACTUAL ARTICLE

The last thing we will do in this tutorial is to extract the article link from the home page and then extract the full text of the article. The process of link extraction is almost identical to the previous ones:

list_of_possible_links = body.select("""section#news-feed article.contentItem 
                                    section.contentItem__wrapper 
                                    section.contentItem__content 
                                    a""")
link = list_of_possible_links[0].get("href")
print(link)

Now we will repeat a similar process for the new link.

# Base soup object
if link.find("/") == 0:
    page = requests.get("http://www.espn.com" + link)
else:
    # if it starts with something other than /, it is most probably a full URL
    page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
article_body = find_body(soup)
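
As a side note, the standard library offers a more robust way to handle relative links: urljoin from urllib.parse resolves a relative link against a base URL and leaves absolute URLs untouched, which covers both branches of the if/else above in one call. A minimal sketch:

from urllib.parse import urljoin

# resolves "/nba/story/..." to "http://www.espn.com/nba/story/..."
# and leaves full URLs such as "http://www.espn.com/..." unchanged
article_url = urljoin("http://www.espn.com/nba/", link)
page = requests.get(article_url)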

In most cases, the article is spread across more than one paragraph (HTML tag <p>). Therefore we need to extract the text from all of them and then merge it together.

# identification of the text of the article
text = ""
for article_part in article_body.select('div.container div.article-body p'):
    text = text + "\n" + article_part.get_text()
print(text)

The problem with the approach above is that the get_text() method also extracts the text from all the child tags. Sometimes this can be useful, but in our case it causes trouble because it also extracts text from child tags we don't want. We can tackle this problem by using the parameter recursive=False in the methods find() or find_all().

# identification of the text of the article
text = ""
article = article_body.select('div.container div.article-body')
for article_part in article[0].find_all("p", recursive=False):
    text = text + "\n" + article_part.get_text()
print(text)

CONCLUSION

We have shown you two different approaches to extracting the text of an article. Everyone has their own preference, so use whichever works for you. Let us know which approach you prefer! If you want to learn more about how to further analyze the data you scrape, you can sign up for one of our courses at Online.BaseCamp :-)
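
For reference, here is a minimal sketch that strings the steps of this tutorial together into a single function. It reuses the selectors from above, so, as noted at the beginning, it may stop working if ESPN changes the structure of its pages:

def scrape_top_article():
    # fetch the NBA home page and locate its <body> tag
    page = requests.get("http://www.espn.com/nba/")
    body = find_body(BeautifulSoup(page.content, 'html.parser'))

    # headline and link of the main article
    item = """section#news-feed article.contentItem
              section.contentItem__wrapper section.contentItem__content a"""
    headline = body.select(item + " h1.contentItem__title")[0].get_text()
    link = body.select(item)[0].get("href")
    if link.find("/") == 0:
        link = "http://www.espn.com" + link

    # fetch the article page and merge its paragraphs into one text
    article_soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    article_body = find_body(article_soup)
    paragraphs = article_body.select('div.container div.article-body')[0].find_all("p", recursive=False)
    return headline, "\n".join(p.get_text() for p in paragraphs)

headline, text = scrape_top_article()
print(headline)
print(text)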