Frank Fineis
Data Science blogging. Statistics. Data Engineering. Consulting.

Tired of MNIST?

You’ve heard of the debate. All data scientists have heard of the famous frequentist-Bayesian controversy. If you’re like me, when you’re asked, “Bayesian or frequentist?” you just say “frequentist,” because you’re guessing that Bayesian statistics has something to do with Bayes’ Theorem, and you haven’t used Bayes’ Theorem since college, when you had the Monty Hall problem on your homework. Am I alone? Quite possibly.
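If the Monty Hall problem is as hazy for you as it is for me, a quick simulation can stand in for the Bayes' Theorem algebra. This is only a minimal sketch of the standard three-door game; the function names `play` and `win_rate` are made up for illustration.

```python
import random


def play(switch, rng):
    """Play one round of the three-door game; return True if the car is won."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that hides a goat and is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=0):
    """Estimate the probability of winning under a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"stay:   {win_rate(switch=False):.3f}")
print(f"switch: {win_rate(switch=True):.3f}")
```

Switching should win roughly two thirds of the time and staying roughly one third, which is what Bayes' Theorem gives once you condition on the door the host opened.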

What’s OpenCV? Ahhh, computer vision, such a cool field! Lately, I’ve been trying to become more knowledgeable about CV and image processing in Python. OpenCV (CV = ‘computer vision’) is an excellent open source computer vision library written in C++ that offers C++, C, Python, Java, and MATLAB APIs. OpenCV supplies functions that let you detect faces in images, track objects in a video, and perform any number of image processing tasks.

After finishing Part 1 of this tutorial, we have our data features. Recall that we saved, in the ./data directory, the TF-IDF-transformed text from the names and description/caption fields, along with the country names we got from the GeoNames API. Now we’ll assemble our training and test data matrices. After that, we’ll train an xgboost model composed of trees and (briefly) tune a few hyperparameters.

Alright, so you’re an aspiring data scientist, say a graduate student in a STEM field trying to get into private industry, and congratulations, you’ve made it to the case study round! What do I mean by case study? Ah, I mean a timed test with a training set and a test set. You’ve been instructed to build a model that will give predicted values for the observations contained in the test set, and then your (hopefully) future employer will compare your values with the test set’s true values. Few things get more meritocratic than that!
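The comparison at the end might look something like the sketch below. Which metric an employer actually uses is unknown; root mean squared error and mean absolute error are assumed here only because they are common for numeric predictions, and the numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values: what the employer holds back vs. what you submit.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"MAE:  {mae(y_true, y_pred):.4f}")
```

Lower is better for both, and RMSE punishes the single large miss more heavily than MAE does.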