Web scraping with Python is a popular subject among data science enthusiasts. Here is a piece of content aimed at beginners who want to learn web scraping with the Python lxml library.

What is lxml?
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python programming language. It provides a Pythonic binding to two C libraries, libxml2 and libxslt, and it is unique in that it combines the speed and XML feature completeness of those libraries with the simplicity of a native Python API. With the continued growth of both Python and XML, there is a plethora of packages out there that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml package has two big advantages: performance and ease of programming.
lxml is similar in many ways to two other, earlier packages that can be regarded as its parents.
However, most Python developers prefer lxml because it provides a number of additional features that make life easier. In particular, it supports XPath, which makes it considerably easier to work with complex XML structures.

The lxml library can be used either to create XML/HTML structures from elements, or to parse existing XML/HTML structures and retrieve information from them. Since many web services and web resources are delivered in XML or HTML, lxml is well suited to extracting information from them. The objective of this tutorial is to throw light on how lxml helps us get and process information from different web resources.

How to install lxml?

lxml can be installed as a Python package using pip, the package manager tool for Python. Below is the command you need to run to install it on your system:

pip install lxml

pip automatically installs all of lxml's dependencies as well. Alternatively, lxml can be installed as a system package using binary installers, depending on your OS. I prefer the former method, as many systems do not have a clean way to install this package if the latter is used.

How to use lxml?

Python is a very easy language to learn, but that does not mean every library written in Python is equally easy to pick up; reading a library's description alone rarely gives a clear picture of what it actually does.
Practical implementation brings us closer to understanding what a library actually does, so let us pick a few examples and use lxml in practical scenarios. A successful implementation of web scraping with Python takes time and practice.

As discussed earlier, we can use lxml both to create and to parse XML/HTML structures. In a first, very basic example, let's create an HTML page structure using lxml and define some elements and their attributes. So, let us begin!

lxml has many modules, and one of them is etree, which is responsible for creating elements and building structures from them. First, let's import the required module. I generally prefer to use the IPython command shell to execute Python programs, because it gives an extensive and clear prompt for exploring Python features.
Element nodes have multiple properties. For example, the text property sets a text value for a node, which the end user sees as the node's content. We can also set attributes on any node in the tree structure. Using etree, we can build an HTML tree structure that can be saved as an HTML web page.
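The following is a minimal sketch of building such a tree with etree; the tag names, text, and attribute values are illustrative choices, not taken from a specific page.

```python
from lxml import etree

# Build a small HTML tree; the tag names, text, and attribute values
# here are illustrative.
root = etree.Element('html')
body = etree.SubElement(root, 'body')
heading = etree.SubElement(body, 'h1')
heading.text = 'Hello, lxml!'   # the text property holds the node's content
heading.set('class', 'title')   # set an attribute on the node

# Serialize the tree; this string could be saved as an HTML web page.
print(etree.tostring(root, pretty_print=True).decode())
```

Running this prints the nested `<html><body><h1 class="title">…` structure, which can be written to a file as a complete page.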
Now, let's take another example, in which we parse an HTML tree structure. This is the core of scraping content from the web, so you can follow this process whenever you want to scrape data from a site and process it further.

In this example we use the requests module, which sends HTTP requests to web URLs. requests offers better speed and readability than the built-in urllib2 module, so it is the better choice. Along with requests, we use the html module from lxml to parse the response. First, let's import the required modules:
In [19]: import requests
In [20]: from lxml import html
Using the requests module, let's send a GET request to the cnn.com website to retrieve the top news stories. The web server sends back a Response object (with status code 200 on success). We store it in the page variable and then use the html module to parse it, saving the result in a tree. The Response object has multiple properties such as headers, content, cookies, etc.; Python's dir() function lists them all. Here I use page.content instead of page.text, because html.fromstring works best with bytes as input, whereas page.text provides the decoded text (its encoding depends on the web server's configuration).
In [21]: page = requests.get('http://www.cnn.com')
In [22]: html_content = html.fromstring(page.content)
The html_content variable now contains the whole HTML page in a nice tree structure, which we can traverse in two different ways: XPath and CSS selectors. In this example, we will focus on the former.

XPath is a way of locating information in structured documents such as HTML or XML. It uses path expressions to select nodes or node-sets in a document; a node is selected by following a path of steps. The most useful path expressions are listed below:
Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
Following are some path expressions and their results:
Path Expression   Result
bookstore         Selects all nodes with the name "bookstore"
/bookstore        Selects the root element bookstore (note: if the path starts with a slash ( / ), it always represents an absolute path to an element)
bookstore/book    Selects all book elements that are children of bookstore
//book            Selects all book elements, no matter where they are in the document
bookstore//book   Selects all book elements that are descendants of the bookstore element, no matter where they are under it
//@lang           Selects all attributes that are named lang
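The expressions above can be tried out on a small document. Below is a sketch with a made-up bookstore document (the element names follow the table; the content is illustrative).

```python
from lxml import etree

# A small bookstore document to try the path expressions on.
doc = etree.fromstring(
    "<bookstore>"
    "<book lang='en'><title>A</title></book>"
    "<shelf><book lang='fr'><title>B</title></book></shelf>"
    "</bookstore>"
)

print(len(doc.xpath('book')))     # 1  (direct children of bookstore only)
print(len(doc.xpath('//book')))   # 2  (book elements anywhere in the document)
print(doc.xpath('//@lang'))       # ['en', 'fr']
```

Note how `bookstore/book` (here the relative `book`) misses the nested element, while `//book` finds both.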
Let's get back to our scraping example. So far we have downloaded the web page and built a tree structure from its HTML, and we are using XPath to select nodes from this tree. Since we want the top stories, we have to analyse the web page to find the tags that store this information. Upon analysis, we can see that the h3 tags with a data-analytics attribute contain it. Selecting these nodes allows us to fetch the text of the news stories along with the links to the complete articles.
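A sketch of that selection is shown below. Since the real cnn.com markup changes over time, the h3/data-analytics structure here is an assumption, demonstrated on a minimal stand-in page rather than the live site.

```python
from lxml import html

# A minimal stand-in for the downloaded page; the attribute name and
# structure are assumptions about the markup, for illustration only.
sample = b"""
<html><body>
  <h3 data-analytics="headline"><a href="/story-1">First story</a></h3>
  <h3 data-analytics="headline"><a href="/story-2">Second story</a></h3>
</body></html>
"""

tree = html.fromstring(sample)
headlines = tree.xpath('//h3[@data-analytics]/a/text()')
links = tree.xpath('//h3[@data-analytics]/a/@href')
print(headlines)   # ['First story', 'Second story']
print(links)       # ['/story-1', '/story-2']
```

With the live page, `tree` would be the `html_content` parsed earlier from `page.content`, and the same xpath() calls would return the story titles and their links.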
Ta-da! We have successfully covered scraping using lxml and requests, and we have the results stored in memory as lists. Now we can do all sorts of cool stuff with them: analyse them using Python, or save them to a file and share them with the world.

We have covered most of the essentials of web scraping with the Python lxml module, and also seen how to combine it with other Python modules to do some impressive work. Below are a few references that can be helpful in learning more.

Do share this if you enjoyed reading this blog post on web scraping with Python. Write a web scraper of your own and share your experience with us.

References
Originally published at Datahut.co: https://blog.datahut.co/beginners-guide-to-web-scraping-with-python-lxml/