Website Scraping with Python - Gábor László Hajba

Website Scraping with Python (eBook)

Using BeautifulSoup and Scrapy

Gábor László Hajba (Author)

eBook Download: PDF

2018 | 1st ed.
XVIII, 223 pages
Apress (Publisher)
978-1-4842-3925-4 (ISBN)

Website Scraping with Python starts by introducing and installing the scraping tools and explaining the features of the full application that readers will build throughout the book. You'll see how to use BeautifulSoup4 and Scrapy individually or together to achieve the desired results. Because many sites use JavaScript, you'll also employ Selenium with a browser emulator to render these sites and make them ready for scraping.

By the end of this book, you'll have a complete scraping application to use and rewrite to suit your needs. As a bonus, the author shows you options for deploying your spiders to the cloud, so that long-running scraping tasks no longer tie up your own computer.

What You'll Learn

Install and implement scraping tools individually and together
Run spiders to crawl websites for data from the cloud
Work with emulators and drivers to extract data from scripted sites

Who This Book Is For

Readers with some previous Python and software development experience, and an interest in website scraping.

Gábor László Hajba is an IT consultant who specializes in Java and Python and holds workshops on Java and Java Enterprise Edition. As the CEO of JaPy Szoftver Kft in Sopron, Hungary, he is responsible for designing and developing solutions for customer needs in the enterprise software world. He has also worked as a software developer at EBCONT Enterprise Technologies and as an Advanced Software Engineer at Zühlke Group. He considers himself a workaholic, a (hard)core and well-grounded developer, functionally minded, a fan of portable apps, and "a champion Javavore who loves pushing code" who also loves to develop in Python.

Closely examine website scraping and data processing: the technique of extracting data from websites in a format suitable for further analysis. You'll review which tools to use, and compare their features and efficiency. Focusing on BeautifulSoup4 and Scrapy, this concise, focused book highlights common problems and suggests solutions that readers can implement on their own.
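
As a rough illustration of what "extracting data from websites in a format suitable for further analysis" can mean, here is a minimal sketch that uses only the Python standard library rather than the BeautifulSoup4 or Scrapy code from the book. The sample HTML, the `LinkCollector` class, and the `links_to_csv` function are hypothetical names invented for this example.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_csv(html):
    """Return the links found in `html` as CSV text with a header row."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["href", "text"])
    writer.writerows(collector.links)
    return buffer.getvalue()


# Hypothetical page fragment; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li><a href="/meat">Meat</a></li>
  <li><a href="/fish">Fish &amp; seafood</a></li>
</ul>
"""

print(links_to_csv(SAMPLE_HTML))
```

The sketch turns unstructured markup into rows that a spreadsheet or data-analysis tool can read. Libraries such as BeautifulSoup4 and Scrapy make the same step far more convenient for real pages.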

Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Getting Started
Website Scraping
Projects for Website Scraping
Websites Are the Bottleneck
Tools in This Book
Preparation
Terms and Robots
robots.txt
Technology of the Website
Using Chrome Developer Tools
Set-up
Tool Considerations
Starting to Code
Parsing robots.txt
Creating a Link Extractor
Extracting Images
Summary
Chapter 2: Enter the Requirements
The Requirements
Preparation
Navigating Through “Meat & fish”
Selecting the Required Information
Outlining the Application
Navigating the Website
Creating the Navigation
The requests Library
Installation
Getting Pages
Switching to requests
Putting the Code Together
Summary
Chapter 3: Using Beautiful Soup
Installing Beautiful Soup
Simple Examples
Parsing HTML Text
Parsing Remote HTML
Parsing a File
Difference Between find and find_all
Extracting All Links
Extracting All Images
Finding Tags Through Their Attributes
Finding Multiple Tags Based on Property
Changing Content
Adding Tags and Attributes
Changing Tags and Attributes
Deleting Tags and Attributes
Finding Comments
Converting a Soup to HTML Text
Extracting the Required Information
Identifying, Extracting, and Calling the Target URLs
Navigating the Product Pages
Extracting the Information
Using Dictionaries
Using Classes
Unforeseen Changes
Exporting the Data
To CSV
Quick Glance at the csv Module
Line Endings
Headers
Saving a Dictionary
Saving a Class
To JSON
Quick Glance at the json module
Saving a Dictionary
Saving a Class
To a Relational Database
To an NoSQL Database
Installing MongoDB
Writing to MongoDB
Performance Improvements
Changing the Parser
Parse Only What’s Needed
Saving While Working
Developing on a Long Run
Caching Intermediate Step Results
Caching Whole Websites
File-Based Cache
Database Cache
Saving Space
Updating the Cache
Source Code for this Chapter
Summary
Chapter 4: Using Scrapy
Installing Scrapy
Creating the Project
Configuring the Project
Terminology
Middleware
Pipeline
Extension
Selectors
Implementing the Sainsbury Scraper
What’s This allowed_domains About?
Preparation
Using the Shell
def parse(self, response)
Navigating Through Categories
Navigating Through the Product Listings
Extracting the Data
Where to Put the Data?
Why Items?
Running the Spider
Exporting the Results
To CSV
To JSON
To Databases
MongoDB
SQLite
Bring Your Own Exporter
Filtering Duplicates
Silently Dropping Items
Fixing the CSV File
CSV Item Exporter
Caching with Scrapy
Storage Solutions
File System Storage
DBM Storage
LevelDB Storage
Cache Policies
Dummy Policy
RFC2616 Policy
Downloading Images
Using Beautiful Soup with Scrapy
Logging
(A Bit) Advanced Configuration
LOG_LEVEL
CONCURRENT_REQUESTS
DOWNLOAD_DELAY
Autothrottling
COOKIES_ENABLED
Summary
Chapter 5: Handling JavaScript
Reverse Engineering
Thoughts on Reverse Engineering
Summary
Splash
Set-up
A Dynamic Example
Integration with Scrapy
Adapting the basic Spider
What Happens When Splash Isn’t Running?
Summary
Selenium
Prerequisites
Basic Usage
Integration with Scrapy
scrapy-selenium
Summary
Solutions for Beautiful Soup
Splash
Selenium
Summary
Summary
Chapter 6: Website Scraping in the Cloud
Scrapy Cloud
Creating a Project
Deploying Your Spider
Start and Wait
Accessing the Data
API
Limitations
Summary
PythonAnywhere
The Example Script
PythonAnywhere Configuration
Uploading the Script
Running the Script
This Works Just Manually…
Storing Data in a Database?
Summary
What About Beautiful Soup?
Summary
Index
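
The table of contents lists "Parsing robots.txt" among the first coding topics. The following is a minimal sketch of that idea using the standard library's `urllib.robotparser`; it is not taken from the book. The robots.txt content, the `example.com` URLs, and the `build_parser` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would fetch the file
# from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each URL.
print(parser.can_fetch("*", "https://example.com/products/"))
print(parser.can_fetch("*", "https://example.com/private/data"))
```

Checking `can_fetch` before each request is one simple way for a scraper to respect a site's stated crawling rules.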

Publication date (per publisher)	14 September 2018
Additional info	XVIII, 223 p. 56 illus.
Place of publication	Berkeley
Language	English
Subject area	Computer Science ► Programming Languages / Tools ► Python
Subject area	Mathematics / Computer Science ► Computer Science ► Web / Internet
Keywords	BeautifulSoup 4 • Chrome Developer Tools • Cloud • CSS • Data processing • HTML • Python • ScrapingHub • Scrapy • Selenium • Spiders • Splash • Web Driver • Web Scraping • XML • XPath
ISBN-10	1-4842-3925-3 / 1484239253
ISBN-13	978-1-4842-3925-4 / 9781484239254

PDF (watermark)
Size: 4.9 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.