
Python Selenium Scraper. Website is Your API



Designed by Freepik

Hi, sometimes there’s a website whose content you want to use, but no API is provided. Today we’re going to look at web scraping. It’s not Android development, but it’s part of the bigger picture and a great opportunity. We’ll look at scraping both static and dynamic (JavaScript-generated) pages.

 

Why

Web scraping is very common, although it might seem like something wrong or illegal. The core algorithm of one of the giants is built on web scraping/crawling. I’m talking about Google, of course.

Scraping/crawling might seem like something you’re never going to need if you’ve never done it. But just as Google, Facebook, and other companies use our search history, preferences, etc. to enhance their ad services, we can use scraped data in a similar way to build cool services.

It’s not a trivial or straightforward thing to do. But if you’ve never done it before, after this post you’ll probably be interested in trying it and discovering a huge opportunity. Of course, you should only do it within legal limits.

So what are the possible options? First of all, you can do all the work in your app, or you can create a server that acts as an API over a website. Pages are also either static or dynamic, which makes a huge difference to the process.

Static Pages

What is a static page, and how do you determine whether you have one? Simply make an HTTP request from your app and use jsoup, a Java HTML parser library, to find elements and read their content. Before writing any code, use the browser’s Inspector tools to learn what you’re looking for. You can find elements by id, text, link, etc.

If everything works, congratulations: you’ve got a static page, and scraping it is a piece of cake. Alternatively, click View page source in Google Chrome and compare the result with what Inspect shows. If they have the same content, it’s a static page.
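Since the rest of this post moves to Python, here is a rough Python counterpart to the jsoup approach, using only the standard library. It is a sketch, not the original code: `fetch_html` and `text_by_id` are names I made up for illustration, and the parser only handles the simple "find one element by id and read its text" case.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collects the text inside the first element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while we are inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found = True
            if tag not in VOID_TAGS:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the element's text, or None if the id is not in the raw HTML."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


def fetch_html(url):
    """Download a page as text; this is the plain HTTP request step."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If `text_by_id(fetch_html(url), "some-id")` returns the text you see in the browser, the page is most likely static. If it returns None while the Inspector clearly shows the element, the content is probably generated by JavaScript.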

Now you can either leave the scraping in your app code or put it on your server. Choose the second option if you also have an iOS or web app and don’t want to rewrite everything in another language. Write once, use everywhere, as an API.
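To make the "server as an API over a website" idea a bit more concrete, here is a minimal sketch with Python's built-in `http.server`. Everything in it is an assumption for illustration: the `/items` route, the `latest_items` placeholder (which stands in for your real scraping code), and the port. A real service would add caching and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def latest_items():
    # Hypothetical scraper: replace with real fetching and parsing.
    return [{"title": "placeholder item"}]


# Maps a request path to the function that scrapes data for it.
ROUTES = {"/items": latest_items}


def handle_path(path):
    """Return (status code, JSON body) for a request path."""
    scraper = ROUTES.get(path)
    if scraper is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(scraper())


class ScraperAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_path(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the API until interrupted."""
    HTTPServer(("", port), ScraperAPI).serve_forever()
```

Calling `serve()` starts the server, and every client (Android, iOS, web) can then request `/items` and get the same JSON, so the scraping logic lives in one place.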

Dynamic Pages

But what if you’ve got a modern website where everything is generated with JavaScript? The first error you’ll get is not being able to find elements using jsoup. Now you need those scripts to run, which doesn’t happen with jsoup or any other HTML parser library.

How

What I’m using is Selenium, a website testing framework available in multiple languages, which can also be used for scraping/crawling. And surprise: from now on I’m not going to use Java. Let’s stick with Python.

Why change languages when there are probably Java libraries for this? First of all, if you’re comfortable switching between languages, it’s better to use the one that is commonly used for the task at hand.

What does that mean? Let’s say you need a server for your app. There are definitely a bunch of options, but I found App Engine (using Java) and Node.js to be the fastest. In most cases I’ll go with Node, because it’s popular for exactly those use cases and you can get results super fast, despite a possible struggle with JS syntax and errors.

It’s the same with Python. I had probably spent less than 5 hours writing in it before, so I’m no expert in its syntax. But when I saw how simple and fast it is to use Selenium with it, I realized I should stick with it.

There’s a huge number of Selenium samples in Python, and they’re pretty easy to find. Plus, using Java all the time gets boring.

Selenium with Python

If you’re like me and not a big Python pro, don’t worry: it’s a much simpler language than Java. If you don’t have Python installed, install it, then download geckodriver and extract it to a new folder. Install Selenium with this command: pip install selenium

Create a selenium_demo.py file and add this code to it
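The original listing isn't reproduced here, so what follows is a minimal sketch of what such a file might contain, assuming Selenium 4, where the driver path is passed through a `Service` object. The function name `open_page` and the `./geckodriver` default are my assumptions.

```python
def open_page(url, driver_path="./geckodriver"):
    """Open url in Firefox and return the page title."""
    # Imported inside the function so the file still loads on a machine
    # where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    # Selenium 4 style: the path to geckodriver goes into a Service object.
    driver = webdriver.Firefox(service=Service(executable_path=driver_path))
    try:
        driver.get(url)       # the browser loads the page and runs its scripts
        return driver.title
    finally:
        driver.quit()         # always close the browser, even on errors
```

Adding `print(open_page("https://example.com"))` at the bottom of the file and running `python selenium_demo.py` should open a Firefox window, load the page, and print its title.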

If you want to use Chrome, download chromedriver, extract it into your directory, replace executable_path, and change Firefox to Chrome. It’s much easier to add those files to your PATH environment variable so you never need to worry about them.

Find Elements

You can find elements by id or by attributes, but sometimes you need to wait until they appear on screen. Use this function or any variation of it
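The original helper isn't shown here; a common way to write one is with Selenium's `WebDriverWait` and `expected_conditions`. Treat this as a sketch: the name `wait_for_element`, the CSS-selector argument, and the 10-second default are my choices.

```python
def wait_for_element(driver, css_selector, timeout=10):
    """Wait until an element matching css_selector is in the DOM and return it.

    Raises selenium.common.exceptions.TimeoutException if it never appears.
    """
    # Imported lazily so the module loads even without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
```

For elements that are already on the page, `driver.find_element(By.ID, "some-id")` is enough. The variations mostly swap the condition, for example `EC.visibility_of_element_located` or `EC.element_to_be_clickable`, or the locator type, such as `By.ID` or `By.XPATH`.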

Install any missing dependencies if needed. Finding elements isn’t always trivial, though: in many cases Selenium doesn’t find them, so experiment with queries and expected conditions.

Facebook Login

Finally, here’s a sample for Facebook login. Install any missing packages.
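The original sample isn't included here, so this is only a sketch of how such a login could look. The login URL, the field ids `email` and `pass`, and the `FB_EMAIL` / `FB_PASSWORD` environment variable names are all assumptions; check the real form in the Inspector first, since the markup can change at any time.

```python
import os

# Assumed login page address.
LOGIN_URL = "https://www.facebook.com/login"


def credentials_from_env():
    """Read credentials from (hypothetical) environment variables."""
    return os.environ["FB_EMAIL"], os.environ["FB_PASSWORD"]


def facebook_login(driver, email, password, timeout=10):
    """Fill in the login form and submit it by pressing Enter."""
    # Imported lazily so the module loads even without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(LOGIN_URL)
    wait = WebDriverWait(driver, timeout)
    # Assumed field ids: "email" and "pass".
    email_field = wait.until(EC.presence_of_element_located((By.ID, "email")))
    email_field.send_keys(email)
    password_field = driver.find_element(By.ID, "pass")
    # Pressing Enter avoids having to guess the submit button's locator.
    password_field.send_keys(password, Keys.RETURN)
```

Usage would be something like `facebook_login(driver, *credentials_from_env())` with a driver you created earlier. Keeping the credentials in environment variables means they never end up in the script itself.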

 

There’s way more to cover. Everything we did so far ran with the browser window open, which is fine if you’re making a desktop script for yourself, but what about a server? You can use PhantomJS as a driver
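One caveat: the PhantomJS driver was removed from Selenium 4, so on a current install the usual substitute is a regular browser in headless mode. The sketch below swaps in headless Firefox rather than PhantomJS. It assumes geckodriver can be found (for example on your PATH), and the function name and window size are my choices.

```python
def make_headless_driver(window_size=(1366, 768)):
    """Create a Firefox driver that runs without a visible window."""
    # Imported lazily so the module loads even without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")   # Firefox's own headless flag
    driver = webdriver.Firefox(options=options)
    # Some pages lay out differently in tiny windows, so set an explicit size.
    driver.set_window_size(*window_size)
    return driver
```

The returned driver behaves like the visible one: `driver.get(...)`, the wait helpers, and `driver.quit()` all work the same way.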

And everything will run in the background.

 

Alright, it’s a huge topic with lots of opportunity. You can get the source code here. If you like the topic, let me know, and don’t forget to subscribe, follow me on Facebook, Twitter, G+ and share with friends if you think this will benefit them!