
Pick Your Fiction!

(This was originally posted as a guest post at the very awesome Insight Data Science blog. I highly recommend checking it out!)

It's more than a little frightening to make any major career transition in life, and switching from academia into industry is certainly one of the scarier ones. For me, graduate school, while full of frustrations and stressful moments, felt surprisingly safe. In academia, it's usually very clear what is expected of you, there's no real risk of being fired or of your company failing, and you're typically free to set your own schedule (albeit a long and not necessarily ideal one). In its own way, graduate school can feel almost comfortable, while switching careers into unknown territory can be very intimidating.

On the other hand, there are a number of real upsides to making the transition. Data science has the potential to be an extremely rewarding career, with the possibility of making a real, day-to-day impact on the lives of hundreds, thousands, even millions of people. There's nothing more satisfying than building some app, or website, or policy, or program, that you can point to and say, "I made that happen!" And you can do this not on a five-, six-, or seven-year timescale, but in weeks or even days (or hours!).

The hardest part is taking that leap of faith necessary to get your foot in the door. Although applying to graduate school was very stressful, it was also fairly objective. Getting the perfect job in industry is a completely different story - the good grades, test scores, and research results can certainly matter, but much of the job decision will come down to how you present yourself in your resume and through the interviews. 

Figuring out how to navigate the transition process is very tricky, so where do you start? The first (and probably the trickiest) step is to learn what you need to be focusing on in the first place. For example, I had a fairly strong background in programming prior to Insight, but I had learned all of it on my own through many different graduate school projects. In order to show that I had the right skills in interviews, I had to go back and learn many of the CS fundamentals that people normally pick up by studying computer science in a university setting: algorithms, dynamic programming, etc. Similarly, I had lots of experience with using data, making data usable in the first place, running statistical tests, and so on, but I had no way of knowing the right language to use in interviews or how to highlight the most important and relevant parts of the work that I'd done.

For me, learning what to focus on in order to be successful was by far the most helpful benefit of participating in the Insight program. Surrounding myself with people who had been through the same process, and with people who were going through it alongside me, was crucial to my confidence in my decision. If you're considering making the transition, and it really is an awesome one, I highly encourage you to seek out as many folks as you can, and ask a lot of questions.

Here are a few other questions I get asked a lot today, and my answers for those of you ready to take the first scary step:

What was your project at Insight? 

I had an absolute blast making my web-app, Pick Your Fiction. The idea behind the app is simple: sometimes, I want to read a book that features certain things - say, a fantasy novel with elves. But, there are many things I don't necessarily like in fantasy novels (gothic, angst-ridden, teenage vampires?). What I'd really like to be able to do is to say, "These are the things that make me happy, these are the things that make me sad, find me a book tailored to those preferences."

With this in mind, I developed Pick Your Fiction. Here, you can enter a title of a book you love (or leave it blank) and things that you like and/or don't like in books, and it will recommend a book based on your input. For example, suppose I want a book about elves:

However, what if I want a book that doesn't have dragons in it?

There are also a few additional options. You can control how important it is to you that the books are popular (for example, if you've read all the Harry Potter books, you may also have read other very popular young adult fantasy novels, and might want something less well known), or that the books are of the same genre as your original title (for example, you might want an adult version of a children's story, or a science fiction version of a historical novel). The app will also suggest a few extra features for you to consider, chosen from features that tend to be divisive within a subject.

How does all of this work? The app primarily relies upon a very large number of customer reviews scraped from a well-known website, as well as the description/summary provided for each book. Together, these reviews make up a large corpus of words for each of the books. The app starts off by computing the similarity between books in Python, using nltk for tokenizing and stemming, followed by TF-IDF weighting and cosine similarity. This similarity metric is then weighted by how similar the genres are and by the popularity of the book, where each of these weights is controlled by the user.
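To make the steps concrete, here is a minimal sketch of that kind of pipeline in plain Python. It is not the app's actual code: the crude suffix stripper only stands in for nltk's stemming, and the function names and the multiplicative genre/popularity formula in `weighted_score` are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and strip a few common suffixes.

    The suffix stripping is a crude stand-in for a real stemmer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [re.sub(r"(ing|es|s)$", "", t) if len(t) > 4 else t for t in tokens]


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)  # smoothed idf
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_score(text_sim, same_genre, popularity, genre_weight, popularity_weight):
    """Combine text similarity with user-controlled weights (hypothetical formula).

    same_genre is 0 or 1; popularity is assumed to be scaled to [0, 1].
    """
    return text_sim * (1 + genre_weight * same_genre) * (1 + popularity_weight * popularity)
```

With three toy "review" strings, the two elf-themed ones should score as more similar to each other than either does to the space-themed one, and setting both weights to zero leaves the text similarity unchanged.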

The final scores are then boosted based upon how often the user-added features appear in each book's text relative to their baseline frequencies. Finally, the additional suggestions are the words that appear in as close to 50% of the closest-matching books as possible, provided that this share is unusually high or low compared with the word's typical frequency. All of the information is stored in a MySQL database, and the front-end is built using Flask + Bootstrap + jQuery, and hosted on AWS.
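The boosting and suggestion steps can be sketched roughly as follows. Again, this is an illustration under stated assumptions rather than the app's real implementation: the ratio-based boost, the smoothing constant, the `min_gap` threshold, and all function and parameter names are hypothetical.

```python
def boost(score, book_freqs, baseline_freqs, likes=(), dislikes=(), smoothing=1e-3):
    """Scale a score up for liked features and down for disliked ones.

    book_freqs and baseline_freqs map a word to its relative frequency in
    this book's text and across all books. The smoothed ratio is an assumption.
    """
    for word in likes:
        ratio = (book_freqs.get(word, 0.0) + smoothing) / (baseline_freqs.get(word, 0.0) + smoothing)
        score *= ratio
    for word in dislikes:
        ratio = (book_freqs.get(word, 0.0) + smoothing) / (baseline_freqs.get(word, 0.0) + smoothing)
        score /= ratio
    return score


def divisive_words(top_books, baseline_share, min_gap=0.2, k=3):
    """Pick words that split the closest-matching books roughly in half.

    top_books is a list of word sets, one per closest-matching book.
    baseline_share maps a word to the fraction of all books containing it.
    A word qualifies only if its share among the top books differs from its
    baseline share by at least min_gap (a hypothetical threshold).
    """
    n_books = len(top_books)
    candidates = []
    for word in set().union(*top_books):
        share = sum(word in book for book in top_books) / n_books
        if abs(share - baseline_share.get(word, 0.0)) >= min_gap:
            # Sort key: distance from an even 50/50 split, then the word itself.
            candidates.append((abs(share - 0.5), word))
    return [word for _, word in sorted(candidates)[:k]]
```

In this sketch, a word like "dragon" that shows up in half of the top matches but in few books overall would be suggested first, since it is both divisive and unusual for the subject.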

How did you pick your project? 

This is a really tricky question, and was actually one of the hardest parts of the program for me. There is a huge amount of freedom in deciding what to work on, and there's essentially no limit to what you can do. There are, however, a few key things to think about when deciding. Obviously, the project has to be doable (and doable within a 2-3 week timespan!), and, ideally, it should involve using techniques that highlight your ability to work with data. Projects that involve cleaning data and using interesting techniques/software are good, since interviews will often involve a significant amount of time spent talking about your project. In my case, the project was heavy on natural language processing and I ended up talking about that quite a bit.

The most important thing, however, is to make sure that it's a project that you're enthusiastic about and will enjoy working on. The reason for this is that you're going to spend an inordinate amount of time talking about your project, and it shows right away if it's a project you care strongly about. In my case, Pick Your Fiction was an app I was extremely excited about - I loved testing out different people's requests, talking about all of the many issues I faced along the way, discussing what I would do if I were to try and monetize and grow the app, etc. I was always extremely happy when interviewers asked about my app, because I loved talking about it, and I genuinely thought it was a useful/interesting thing.

A few weeks ago, John Joo posted an excellent list of things to do to prepare for Insight in the weeks/months leading up to the program. However, what if you're still several years out? What can you do throughout your Ph.D. to prepare for a future career in data science? 

The best thing you can do for yourself, by far, is to work on some pet project that involves data. It doesn't have to be anything mind-blowing or enormous, just something that you can point to as evidence of your skills in data science, and also your interest and enthusiasm for the data science industry. As I mentioned in the previous question, the single best thing you can do to convince someone you're capable of being a great data scientist is to show them with something tangible. And perhaps even more importantly, steering interviews to a project that you've done yourself means you'll know all the answers to the questions they're likely to ask!

There are a bunch of other ways to get involved in data science as well. A big one for me was the data science competitions over at kaggle.com. This was another case of having a massive community to learn from. I started off with the basic tutorial competitions, which come with a ton of sample code snippets, and from there started to develop a better intuition for which techniques worked and which didn't. (These projects are also great things to talk about in interviews, and they typically provide good data sets to play around with.)

When you're ready to start making the transition to a career in data science (or any new career for that matter), your first step should be to reach out to people who have been in your position before and ask them about their experiences directly. I talked with many different alumni from the Insight program, as well as friends from my graduate program who had graduated before me, and other people in the alumni network. One of the biggest benefits of doing this is simply the exposure to the field: there were a huge number of phrases and keywords that I heard for the very first time in my first week of Insight and that are now just everyday occurrences. Knowing that many of my colleagues and peers had made the same transition made my decision that much less scary, and helped steer me towards Insight and my new role at Khan Academy. Good luck taking that first step!

