Note: All links are from BS v4.12.0 documentation and reference, but the gist should probably hold for a while.
BeautifulSoup4 + Python is a formidable combo for web-scraping when the territory I’m scraping in is uncertain, webpages are static, and I need to spin up something quick and dirty so I can hunt around through the HTML to specifically extract what I need.
Another thing: when pulling web data with BS4, it’s often handy to keep separate files containing just the subset of tags/text of interest, but also to save the original HTML files so they’re on hand in case extraction does not go according to plan.
Through this scraping and cleaning process, there are a few considerations to keep in mind:
Parsers are important to specify because each parser behaves a bit differently, especially if webpages are formatted wonkily (which is decently often) and my work needs to be reproducible.
Parser choices include Python’s native html.parser, the C-based lxml parsers, and finally html5lib. lxml is the quickest and most efficient, but has external dependencies. html.parser has no dependencies, but being Python-native, it’s slower. It’s also less lenient about webpage inaccuracies/wonkiness. I find it best to use lxml.
Here’s the standard code I use:
# Imports
from bs4 import BeautifulSoup, Tag, NavigableString
import requests
# Query the website and return the html
response = requests.get(insert_url)
# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(
    response.text,  ### Remember, do not use response.content here, BAD THINGS HAPPEN
    features="lxml",
)
One thing worth remembering is that all HTML pages have some sort of text encoding they are written in (e.g. ASCII, UTF-8). Using the requests library in this case returns a library-specific Response object. This object’s encoding is guessed by the requests library “based solely on HTTP headers”, but it can also be set manually using the response.encoding property, if I know the correct encoding to use. Once this is sorted, I can access the content of the webpage using the response.text property, which returns the content as a Unicode string that can be passed into BS4. This is the sane and normal thing to do.
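As a quick illustration, here’s a sketch of what checking and overriding the encoding might look like before handing the text off to BS4 (the URL and the "utf-8" override are placeholders; the right value depends on the page being scraped):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")  # placeholder URL

# What requests guessed from the HTTP headers
print(response.encoding)

# requests' own content-based guess, handy for comparison
print(response.apparent_encoding)

# If I know better, override it before touching response.text
response.encoding = "utf-8"  # stand-in: use whatever the page actually uses

soup = BeautifulSoup(response.text, features="lxml")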
Now if I’m feeling edgy, I can pass the raw, still-encoded HTML response content to BS4 (not the Unicode string from requests), and it will automatically convert everything, regardless of its encoding, into the Unicode character set. Very occasionally, there might be errors in how certain characters are processed, because BS4 does the conversion based on guesswork about the original encoding of the webpage, using a sub-library called Unicode, Dammit. I have never actually done it this way, and there’s no reason to do it this way as far as I can tell.
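For reference, a minimal sketch of what that edgy route would look like, assuming the raw bytes come from requests’ response.content so that BS4 (via Unicode, Dammit) does the encoding detection itself:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")  # placeholder URL

# Hand BS4 the raw bytes instead of the decoded string;
# Unicode, Dammit guesses the encoding during parsing
soup = BeautifulSoup(response.content, features="lxml")

# BS4 records what it decided the original encoding was
print(soup.original_encoding)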
Errors in how the encoding for a webpage was interpreted, either by requests or BS4’s Unicode, Dammit, are usually revealed as I navigate the contents and tags of the page. If I uncover errors during analysis, it’s worth going back and using the response.encoding property of requests to set the right encoding. The docs say one can also use the from_encoding parameter of the BeautifulSoup() function when parsing the scraped webpage, whichever is relevant. If I don’t know what the correct encoding is, but I know that Unicode, Dammit is guessing wrong in the case of BS4, the docs say I can pass the wrong guesses in as a list to the exclude_encodings argument in BS4.
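A sketch of what those two options look like in practice. The encodings named here are just examples, response is the requests object from earlier, and both parameters only have an effect when BS4 is handed the raw bytes rather than an already-decoded string:

from bs4 import BeautifulSoup

# If I know the correct encoding, state it outright
soup = BeautifulSoup(response.content, features="lxml", from_encoding="iso-8859-1")

# If I only know which guesses are wrong, rule them out instead
soup = BeautifulSoup(response.content, features="lxml", exclude_encodings=["iso-8859-7"])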
Now, a really important one: using the requests library to generate the webpage’s Unicode string means that the from_encoding argument of the BeautifulSoup() function becomes useless. Intuitively this makes sense: it isn’t needed because there’s already a decoded Unicode string representing the webpage. However, this can still be easy to forget.

If I do want to use it, or any other features that BS4 makes available to control the decoding of a webpage, I have to query and store the webpage without creating a Unicode string to pass to BS4. This probably entails passing the raw, still-encoded response content to BS4 instead. If I try to pass the from_encoding argument to BeautifulSoup() along with the response.text object anyway, BS4 will be very kind and throw up the following warning:
UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.
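For reference, a minimal combination that triggers this warning would look something like the following; response.text is already a decoded Unicode string, so from_encoding has nothing to do:

# response.text is already Unicode, so from_encoding is ignored and BS4 warns
soup = BeautifulSoup(response.text, features="lxml", from_encoding="utf-8")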
Re: parsers, here’s a page from the docs also laying out the different parsers and their features, for further reference: table of parsers.
Digging through the resulting soup (“parse tree”) is easy enough with functions like soup.find_all(), plus isinstance() calls to check whether something is a Tag or a NavigableString. There are reams of documentation and SO content to help me dig through and extract what I want.
The docs show two ways to preview the soup or its selected sub-components: pretty vs. non-pretty printing. I tend to prefer pretty printing because it’s nicer and well-indented. All I need is a print(soup.prettify()) call. This method turns a BS4 parse tree into a nicely formatted Unicode string, with each tag and each string on its own line.
The BS4 documentation says you can pass an encoding to soup.prettify() as an argument, but I don’t know why or when I would do that. Experimenting with it leads to not-fun outcomes, so for now I should avoid this.
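For what it’s worth, the “not fun” part is likely that prettify() with an encoding argument returns an encoded bytestring rather than a Unicode string, so anything downstream that expects a str will complain. A quick sketch:

pretty_str = soup.prettify()            # Unicode string
pretty_bytes = soup.prettify("utf-8")   # bytestring in the requested encoding

print(type(pretty_str))    # <class 'str'>
print(type(pretty_bytes))  # <class 'bytes'>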
Once it’s time to write out the soup for later analysis/revisiting (instead of, say, conducting all my analysis ad hoc over a thousand scraped pages and deleting the scraped HTML once I think I’m done, because there’s no way I might have made an error and could have to re-scrape this in three weeks when I uncover said error), there are a couple of different ways to go about it.
One thing to know is that BS4 outputs a UTF-8 encoded document unless specified otherwise. I have never run into errors with this or had to use an alternate encoding for output, but just in case, here’s a handy link for navigating those issues: docs page for output encodings.
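If a non-default output encoding ever does become necessary, the docs’ approach boils down to encoding the soup explicitly at write time. A minimal sketch, with "latin-1" as a stand-in for whatever encoding is actually needed:

# Encode the parse tree into bytes in a specific encoding and write in binary mode
with open("filepath/scraped_page_latin1.html", mode="wb") as file:
    file.write(soup.encode("latin-1"))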
A simple way: save as an HTML file!
with open("filepath/scraped_page.html", mode="w", encoding='utf-8') as file:
file.write(str(soup))
I learnt how to do this here.
Some housekeeping notes: convert the soup to str before writing, and use text mode (w), not byte mode (wb). Also, code written in Python 2 used to convert the soup to unicode rather than str because of different syntax conventions, and some internet code might reflect that too.

Then, when it’s time to read the file back in:
with open("filepath/scraped_page.html", "r") as file:
soup = BeautifulSoup(file.read(), "lxml")
Something I still don’t know is what the difference is between writing out str(soup) vs. str(soup.prettify()). GFG recommends the latter, and I’m not sure why one would want to write out a modified and maybe even bloated version of the page with newline characters inserted all over. Is fidelity to the original not valuable? Unsure.
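A quick, illustrative way to see the difference is just to compare the two serialisations: prettify() re-indents the markup and is usually longer, while str(soup) stays close to the page as parsed.

original_markup = str(soup)       # markup roughly as parsed, whitespace untouched
pretty_markup = soup.prettify()   # re-indented, one tag/string per line

print(len(original_markup), len(pretty_markup))  # prettified version is typically longer
print(original_markup == pretty_markup)          # generally False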
Pickling is insecure, prone to recursion errors when a page is very nested, and maybe even a lazy way out because you don’t bother with encodings or anything of the sort. All of this is true. But pickling is also incredibly flexible: you can serialise anything! Those perks make it good for local storage and quick experimentation. Here’s how to do this:
# Imports for serialisation
import pickle
import sys

# Do this to prevent issues with recursion
sys.setrecursionlimit(8000)

# Save the soup object to a file
with open("soup.pickle", "wb") as file:
    pickle.dump(soup, file)

# Read the soup object from a file
with open("soup.pickle", "rb") as file:
    soup_obj = pickle.load(file)
I learnt how to do this from a very nice SO thread very long ago.
One case where pickling won’t help is when I want to keep previewing the saved files to check whether they “look alright” on my Mac. I can only preview the saved file if it’s HTML, not a pickle. In part due to this, and in part due to it being not-great practice for publicly hosted work, I’m mostly phasing out pickle. However, it IS a good solution in certain cases, so it’s a good one to keep in the back pocket.
Scraping webpages is very hard because the internet is a wild, wild place, and everyone uses different templates and patterns and schematics for how they organize their data. It’s almost like being a detective: looking for clues on the pages to see what the standard patterns and templates for a website are, how error handling could be done, and how flexible and adaptive functions can be written to extract the relevant data. But before all of that comes the tedious process of actually procuring the correctly decoded HTML in hand. This is one way to think of that process. I’m happy to hear from anyone who thinks they have cool ideas on this stuff and would want to think through them together!