Hands-On Web Scraping with Python (eBook)
350 pages
Packt Publishing (publisher)
978-1-78953-619-5 (ISBN)
Web scraping is an essential technique used in many organizations to gather valuable data from web pages. This book will enable you to delve into web scraping techniques and methodologies.
The book will introduce you to the fundamental concepts of web scraping techniques and how they can be applied to multiple sets of web pages. You'll use powerful libraries from the Python ecosystem such as Scrapy, lxml, pyquery, and bs4 to carry out web scraping operations. You will then get up to speed with simple to intermediate scraping operations such as identifying information from web pages and using patterns or attributes to retrieve information. This book adopts a practical approach to web scraping concepts and tools, guiding you through a series of use cases and showing you how to use the best tools and techniques to efficiently scrape web pages. You'll even cover the use of other popular web scraping tools, such as Selenium, Regex, and web-based APIs.
By the end of this book, you will have learned how to efficiently scrape the web using different techniques with Python and other popular tools.
Collect and scrape different complexities of data from the modern Web using the latest tools, best practices, and techniques.

Key Features
- Learn different scraping techniques using a range of Python libraries such as Scrapy and Beautiful Soup
- Build scrapers and crawlers to extract relevant information from the web
- Automate web scraping operations to bridge the accuracy gap and manage complex business needs

What you will learn
- Analyze data and information from web pages
- Learn how to use browser-based developer tools from the scraping perspective
- Use XPath and CSS selectors to identify and explore markup elements
- Learn to handle and manage cookies
- Explore advanced concepts in handling HTML forms and processing logins
- Optimize web securities, data storage, and API use to scrape data
- Use Regex with Python to extract data
- Deal with complex web entities by using Selenium to find and extract data

Who this book is for
This book is for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need! A working knowledge of the Python programming language is expected.
Loading URLs
Now that we've confirmed the required libraries and system requirements, we will proceed with loading the URLs. When looking for content from a URL, it is also necessary to confirm and verify the exact URL that has been chosen for the required content. Content can be found on a single web page or scattered across multiple pages, and it might not always be in the HTML source we are looking for.
We will load some URLs and explore the content using a couple of tasks.
Task 1: View data related to the listing of the most popular websites on Wikipedia. We will identify data from the Site, Domain, and Type columns in the page source.
We will use the following link to achieve our task (data extraction-related activities will be carried out in Chapter 3, Using LXML, XPath, and CSS Selectors): https://en.wikipedia.org/wiki/List_of_most_popular_websites.
The preceding link can be easily viewed in a web browser to find the information we are looking for. The content is in tabular format (as shown in the following screenshot), so the data could be collected by repeatedly selecting, copying, and pasting it, or by collecting all the text inside the table.
However, such actions will not leave the content we are interested in in a desirable format, or they will require extra editing and formatting of the text to achieve the desired result. We are also not interested in the page source that's obtained from the browser:
After finalizing the link that contains the content we require, let's load the link using Python. We will make a request to the link and examine the response returned by two libraries, that is, urllib and requests:
1. Let's use urllib:
>>> import urllib.request as req #importing the required library
>>> link = "https://en.wikipedia.org/wiki/List_of_most_popular_websites"
>>> response = req.urlopen(link) #load the link using method urlopen()
>>> print(type(response)) #print type of response object
<class 'http.client.HTTPResponse'>
>>> print(response.read()) #read response content
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of most popular websites - Wikipedia</title>\n<script>…..,"wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_most_popular_websites","wgTitle":"List of most popular websites",……
The urlopen() function from urllib.request is passed the selected URL, a request is made to that URL, and a response, that is, an HTTPResponse object, is received. The response received for the request can be read using the read() method.
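The read() call returns raw bytes, and an HTTPResponse can only be read once. As a minimal sketch (not part of the book's listing), here is how a fresh response's status, headers, and decoded text could be inspected, assuming the page is UTF-8 encoded:

>>> response = req.urlopen(link)                #make a fresh request (a response can only be read once)
>>> print(response.status)                      #HTTP status code, for example 200 on success
>>> print(response.getheader('Content-Type'))   #a single header, such as text/html; charset=UTF-8
>>> html = response.read().decode('utf-8')      #decode the raw bytes into a str for further processing
>>> print(html[0:50])                           #first 50 characters of the decoded markup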
2. Now, let's use requests:
>>> import requests
>>> link = "https://en.wikipedia.org/wiki/List_of_most_popular_websites"
>>> response = requests.get(link)
>>> print(type(response))
<class 'requests.models.Response'>
>>> content = response.content #response content received
>>> print(content[0:150]) #print first 150 characters from content
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of most popular websites - Wikipedia</title>'
Here, we are using the requests module to load the page source, just like we did with urllib. requests is used with its get() method, which accepts a URL as a parameter. The response type for both examples has also been checked.
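Besides content, the requests Response object exposes attributes such as status_code, headers, encoding, and text (the body decoded to a string). A minimal illustrative sketch, reusing the response object from above:

>>> print(response.status_code)                 #HTTP status code, for example 200 on success
>>> print(response.headers['Content-Type'])     #a single response header
>>> print(response.encoding)                    #encoding detected by requests for the body
>>> text = response.text                        #body decoded to a str (content holds the raw bytes)
>>> print(text[0:50])                           #first 50 characters of the decoded markup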
In the preceding examples, the page content—or the response object—contains the details we were looking for, that is, the Site, Domain, and Type columns.
We can choose any one library to deal with the HTTP request and response. Detailed information on these two Python libraries with examples is provided in the next section, URL handling and operations with urllib and requests.
Let's have a look at the following screenshot:
Further activities like processing and parsing can be applied to content like this in order to extract the required data. More details about further processing tools/techniques and parsing can be found in Chapter 3, Using LXML, XPath, and CSS Selectors, Chapter 4, Scraping Using pyquery – a Python Library, and Chapter 5, Web Scraping Using Scrapy and Beautiful Soup.
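As a small preview of the parsing covered in those chapters, the following sketch uses requests and bs4 (Beautiful Soup) to list the column headers and the first few rows of the Wikipedia table, from which the Site, Domain, and Type values can be picked out. The table class "wikitable" and the column layout are assumptions about the live page, so treat this as illustrative only:

# Illustrative sketch: list headers and first rows of the rankings table.
# Assumes the table carries the (common) "wikitable" class; adjust for the live markup.
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_most_popular_websites"
soup = BeautifulSoup(requests.get(link).content, "html.parser")

table = soup.find("table", class_="wikitable")          #assumed class of the rankings table
headers = [th.get_text(strip=True) for th in table.find_all("th")]
print(headers)                                          #inspect the actual column names first

for row in table.find_all("tr")[1:6]:                   #skip the header row, preview five rows
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)                                    #one list of column values per row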
Task 2: Load and save the page content from https://www.samsclub.com/robots.txt and https://www.samsclub.com/sitemap.xml using urllib and requests.
Generally, websites provide files in their root path (for more information on these files, please refer to Chapter 1, Web Scraping Fundamentals, the Data finding techniques for the web section):
- robots.txt: This contains information for the crawler, web agents, and so on
- sitemap.xml: This contains links to recently modified files, published files, and so on
In Task 1, we were able to load a URL and retrieve its content. In this task, we will save the content to local files using library methods and file handling. Saving content to local files and then working on it with tasks like parsing and traversing can be really quick and also saves network resources:
1. Load and save the content from https://www.samsclub.com/robots.txt using urllib:
>>> import urllib.request
>>> link = "https://www.samsclub.com/robots.txt"
>>> urllib.request.urlretrieve('https://www.samsclub.com/robots.txt')
('C://Users//*****/AppData//Local//Temp//tmpjs_cktnc', <http.client.HTTPMessage object at 0x04029110>)
>>> urllib.request.urlretrieve(link, "testrobots.txt") #urlretrieve(url, filename=None)
('testrobots.txt', <http.client.HTTPMessage object at 0x04322DF0>)
The urlretrieve() function from urllib.request, that is, urlretrieve(url, filename=None, reporthook=None, data=None), returns a tuple containing the local filename and the HTTP headers. If no filename is given, the file is saved to a temporary location such as the C://Users..Temp directory; otherwise, the file is created in the current working directory with the name provided as the second argument to urlretrieve(). This was testrobots.txt in the preceding code:
>>> import urllib.request
>>> import os
>>> content = urllib.request.urlopen('https://www.samsclub.com/robots.txt').read() #reads robots.txt content from the provided URL
>>> file = open(os.getcwd()+os.sep+"contents"+os.sep+"robots.txt","wb") #creates the file robots.txt inside the directory 'contents' under the current working directory (os.getcwd()). If the directory 'contents' doesn't exist, Python raises FileNotFoundError
>>> file.write(content) #writes content to the file robots.txt opened in the line above
>>> file.close() #closes the file handle
In the preceding code, we are reading the URL and writing the content we found to a file using file handling concepts.
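As noted in the comments, open() raises FileNotFoundError if the 'contents' directory doesn't already exist. A minimal sketch (not from the book's listing) that creates the directory first and uses a context manager so the file is closed automatically:

import os
import urllib.request

content = urllib.request.urlopen('https://www.samsclub.com/robots.txt').read()

target_dir = os.path.join(os.getcwd(), "contents")
os.makedirs(target_dir, exist_ok=True)              #create 'contents' if it is missing

with open(os.path.join(target_dir, "robots.txt"), "wb") as file:
    file.write(content)                             #file is closed automatically on exit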
2. Load and save the content from https://www.samsclub.com/sitemap.xml using requests:
>>> import requests
>>> link = "https://www.samsclub.com/sitemap.xml"
>>> content = requests.get(link).content
>>> content
b'<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex...
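The task also asks us to save the content. A minimal sketch (not part of the book's listing) that writes the sitemap bytes to a local file, following the same file handling approach as before:

import os
import requests

content = requests.get("https://www.samsclub.com/sitemap.xml").content

target_dir = os.path.join(os.getcwd(), "contents")
os.makedirs(target_dir, exist_ok=True)              #make sure the target directory exists

with open(os.path.join(target_dir, "sitemap.xml"), "wb") as file:
    file.write(content)                             #save the raw sitemap XML locally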
Publication date (per publisher) | 15.7.2019
Language | English
Subject areas | Non-fiction / Guides ► Leisure / Hobby ► Collecting / Collectors' catalogues
 | Computer Science ► Databases ► Data Warehouse / Data Mining
 | Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
Keywords | Beautiful Soup • machine learning • pyquery • Python • regex • selenium • Web Scraping
ISBN-10 | 1-78953-619-7 / 1789536197
ISBN-13 | 978-1-78953-619-5 / 9781789536195
Digital Rights Management: no DRM
This eBook contains no DRM or copy protection. However, passing it on to third parties is not legally permitted, as the purchase only grants you the rights for personal use.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly suitable for displaying fiction and non-fiction. The body text adapts dynamically to the display and font size, which also makes EPUB well suited to mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need the free Adobe Digital Editions software.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a free app to do so.
Buying eBooks from abroad
For tax law reasons, we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.