Basic Python Web Scraper

After wondering about the popularity of certain terms in certain areas, I decided to scrape a little data from Craigslist. Getting the URLs for all of the different Craigslist sites was trivial; cleaning and polishing the data was not. After breaking down the bulk of it with find-and-replace actions in
Atom, I used a Google Spreadsheet to organize it. The next step was to identify the location of each Craigslist board. I searched for an existing dataset to no avail, so I set out to geocode as much as I could in Google Spreadsheet. Using a script made this easy, with the type of geocoding response serving as a qualifier for accuracy. About 20% of the latitudes and longitudes needed to be adjusted, which also gave me time to clean up a few vague region labels.

The Python script that scrapes Craigslist uses the following libraries: BeautifulSoup, time, re, json, and urllib2. A 1-second delay between requests is enough of a wait to avoid 408 errors. The JSON output could be lighter (it comes in at 46 KB), but it contains all the information needed for either a row-and-column or a visual presentation. The script takes about 25 minutes per term to iterate through the 412 cities for which Craigslist maintains individual boards, which meets the needs of gathering some baseline data.
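The fetch loop can be sketched as below. This is a simplified, hedged version: the original post used Python 2's urllib2 (here swapped for Python 3's `urllib.request`), the city slugs and URL format are assumptions for illustration, and the `fetch` parameter is added so the loop can be exercised without hitting the network.

```python
import time
import urllib.request

SEARCH_TERM = "bicycle"                    # hypothetical example term
CITIES = ["sfbay", "newyork", "chicago"]   # the real list has 412 entries

def build_search_url(city, term):
    """Construct a search URL for one city board (assumed URL format)."""
    return "https://%s.craigslist.org/search/sss?query=%s" % (city, term)

def scrape(cities, term, fetch=None, delay=1.0):
    """Fetch each city's search page, sleeping between requests.

    The 1-second default delay is the wait the post describes as
    enough to avoid 408 errors.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    pages = {}
    for city in cities:
        pages[city] = fetch(build_search_url(city, term))
        time.sleep(delay)  # be polite: one request per second
    return pages
```

With 412 cities and a 1-second delay per request, the sleep alone accounts for about 7 of the 25 minutes per term; the rest is network and parsing time.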

Targeting the results number made the task simple: just parse a number. This number is not always accurate, because when there are too few results for a local area, Craigslist pads the page with postings from nearby regions. This does not affect the overall goal of estimating the popularity of a term in a certain area, and it even helps show where a rarer search term is popular.
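Parsing the number might look like the following sketch. The `totalcount` class name is an assumption about Craigslist's search-page markup, and returning 0 for a missing count is my own convention for pages where Craigslist shows only nearby-region postings.

```python
import re

def parse_result_count(html):
    """Extract the result count from a search results page.

    Assumes the count lives in an element with class "totalcount".
    Returns 0 when no count is present (e.g. a page padded entirely
    with nearby-region postings).
    """
    match = re.search(r'class="totalcount"[^>]*>(\d+)<', html)
    return int(match.group(1)) if match else 0
```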

Future plans include running the queries more asynchronously in a way that does not upset the reasonable-use policy Craigslist sets forth, making the JSON lighter, and providing strictly local results (explicit local counts, excluding nearby-region postings) for each region.

You can find the code here:
