Closing Thoughts on the Web Scraping Side

We figured out early on that web scraping would be an essential tool for our project. Augie Athletics' whole database system seems to be pretty jacked up, beyond even their own comprehension, which took away any hope of accessing their data directly through more straightforward means. So, scraping!

Having touched on scraping on the blog before, I'm only going to rehash the most important bit: web scraping involves singling out and using information from the source code of a website. Since we were working on Android, and therefore in Java, jsoup was the essential library that let us download and parse the HTML of different pages on the Augie Athletics website.

Some notable things we ran into included:

The discovery of Sidearm Sports and its ubiquity. It turns out that they partner with a vast number of colleges throughout the country to build their sports websites, Augustana College included. Their logo is actually visible, but easy to miss, on Augie Athletics web pages, and I had no idea they existed before digging into the website's HTML. That HTML contains numerous references to them, including various data objects labeled with a "sidearm-" prefix, such as "sidearm-roster-view." If you're a college sports fan and ever wondered why half the websites look the same, now you know!

The discovery of dynamically loaded information not contained in the HTML. Since this process involved learning scraping from the ground up, we hit a considerable hurdle when we realized that none of the actual calendar data, such as game details, is included in the calendar page's HTML! After a little frustration and a lot of digging, it turned out this was thanks to Ajax and JavaScript doing some work behind the scenes. To figure out where the data was coming from, I had to watch the network requests (specifically XHR) being sent from my browser until I found the URL that returns all upcoming games in JSON format. Tricky but ultimately rewarding.
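
Once the endpoint shows up in the browser's network tab, the scraping side mostly comes down to requesting that URL directly and reading the JSON body. Here's a minimal sketch of that step using only the Java standard library. The URL is a made-up placeholder (not the real Augie Athletics endpoint), and the headers are my assumption about what such an endpoint might want, not something it is known to require:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class CalendarJsonFetcher {

    // Hypothetical placeholder; the real URL is whatever the XHR tab reveals.
    static final String EXAMPLE_ENDPOINT =
            "https://athletics.example.edu/calendar-data.json";

    // Prepares (but does not yet send) a GET request that asks for JSON.
    static HttpURLConnection openJsonRequest(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        // Assumption: some Ajax endpoints check for this header; many ignore it.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        return conn;
    }

    // Reads an entire stream into a UTF-8 string.
    static String readBody(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }

    // Sends the request and returns the raw JSON text.
    static String fetchJson(String url) throws IOException {
        HttpURLConnection conn = openJsonRequest(url);
        try (InputStream in = conn.getInputStream()) {
            return readBody(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `CalendarJsonFetcher.fetchJson(...)` with the real endpoint would hand back the JSON text, which can then go to whatever JSON parser the app already uses. On Android this has to run off the main thread.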

Other than that, taking care of the web scraping for our project turned out to be a straightforward enough process, requiring just a bit of a learning curve with jsoup and its "CSS or jquery-like selector syntax" to find the relevant data. Spending hours of work to end up with just a couple lines of code felt a little ridiculous at times, but it was a great thing to get more familiar with, and it was really interesting to see just how powerful those couple lines of code could be. I have no doubt that the greater familiarity with HTML and the like will be valuable moving forward.
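
To give a feel for what those couple lines can look like, here's a small sketch of jsoup's selector syntax run against a made-up HTML snippet. It needs the jsoup library on the classpath. The markup is invented, every class name other than "sidearm-roster-view" is hypothetical, and treating the "sidearm-" names as CSS classes is my assumption for illustration. This is not the actual Augie Athletics markup or our project's code:

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RosterSelectorSketch {

    // Invented sample markup standing in for a downloaded roster page.
    static final String SAMPLE_HTML =
          "<div class=\"sidearm-roster-view\">"
        + "<ul>"
        + "<li class=\"sidearm-roster-player\">"
        + "<span class=\"player-name\">Alex Example</span></li>"
        + "<li class=\"sidearm-roster-player\">"
        + "<span class=\"player-name\">Sam Sample</span></li>"
        + "</ul>"
        + "</div>";

    // Descendant selectors work the same way they do in CSS.
    static List<String> playerNames(String html) {
        Document doc = Jsoup.parse(html);
        Elements names = doc.select(
                "div.sidearm-roster-view li.sidearm-roster-player span.player-name");
        return names.eachText();
    }

    // Attribute-prefix selector: elements whose class attribute starts with "sidearm-".
    static int countSidearmElements(String html) {
        return Jsoup.parse(html).select("[class^=sidearm-]").size();
    }
}
```

For a live page, `Jsoup.connect(url).get()` would take the place of `Jsoup.parse(html)`, and the selector strings are the part that takes the hours of poking around in the page source to get right.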
