
Web scraping

Posted: Mon Feb 10, 2020 2:38 pm
by Ralan
Prior to discovering Ubuntu about 6 months ago, Windows XP was my sole OS. Now it's the other way round.

I'm a hobby programmer and many of my code concoctions involve scraping data off various websites. Windows has the WinHTTP library, which is the backbone of Internet Explorer, so it was an ideal starting block for some web scraping. In Python, I'm struggling to find an equivalent. My requirements are:

* automatic cookie handling (essential for login sessions)
* ok with http AND https
* supports both GET and POST requests
* browser-like request headers (particularly 'User-Agent')
* automatic gzip handling
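To make it concrete, here's a rough sketch of the kind of setup I mean, pieced together from the Python 3 stdlib docs (urllib.request plus http.cookiejar). The URL, headers, and function name are just placeholders for the example, not something I'm actually running:

```python
import gzip
import http.cookiejar
import urllib.request

# A shared cookie jar so login-session cookies persist across requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# Browser-like headers sent with every request through this opener
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)"),
    ("Accept-Encoding", "gzip"),
]

def fetch(url, data=None):
    """GET when data is None; POST when data is url-encoded bytes.

    Works for both http and https URLs.
    """
    with opener.open(url, data) as resp:
        body = resp.read()
        # urllib doesn't decompress for you, so handle gzip by hand
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        return body
```

That seems to tick the boxes on paper (cookies, GET/POST, headers, gzip), but it's a fair bit of plumbing compared to WinHTTP, which is why I'm asking what people actually use.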

I've played around with urllib and urllib2, but I'm not sure they can give me what I need. At the end of the day, all I require is the HTML, but of course it's not always that simple.

While on the subject, what is the best way to parse the data? The HTML often contains lots of JavaScript, from which much of my data is collected. In the past I handled it with string functions and delimiters. I'm not convinced that Beautiful Soup will be of use to me (because of the JavaScript?), but perhaps some kind of tokenizer library would make parsing less cumbersome?
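For context, my delimiter approach boils down to something like this: pulling the value of a JavaScript variable out of the raw HTML between two marker strings. The page snippet and variable names are made up for the example:

```python
# Toy HTML snippet standing in for a scraped page
html = """
<script>
var itemCount = 42;
var itemName = "widget";
</script>
"""

def between(text, start, end):
    """Return the substring between the first `start` and the next `end`,
    or None if either delimiter is missing."""
    _, sep, rest = text.partition(start)
    if not sep:
        return None
    value, sep, _ = rest.partition(end)
    return value if sep else None

count = between(html, "var itemCount = ", ";")   # "42"
name = between(html, 'var itemName = "', '";')   # "widget"
```

It works, but it gets fragile fast when the site tweaks its markup, hence the question about a tokenizer or something more robust.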

Suggestions please?