Category: PROspective
By: Nicole Luisi, MPH, MS
In movies and television, when a data analyst is assigned a task, they spring into action, typing hundreds of words per minute without ever looking down from their wall of monitors. Cut to them presenting the results, glossing over the details of how they scraped all the data from some website, performed a complex analysis, and solved a crime, all in about 60 seconds. Admittedly, it would make for some pretty boring television if they showed the 7 hours that person had to spend reformatting and cleaning a dataset, the 4 hours it took to resolve an error they encountered, or the 2 hours they spent staring at the screen trying to find the missing parenthesis that broke everything and caused them to question all of their life choices. Although we all envision ourselves solving the world’s problems with our fancy analyses, the reality is that real-world data can be messy, and we will probably spend 25% of our time on those fancy analyses and the other 75% on data cleaning and preparation.
In the classroom, we often use examples that demonstrate techniques (working with missing values, cleaning character data, etc.), but there is really no substitute for time and experience with real data. Real-world data is predictably unpredictable! Even the best systems can’t anticipate every issue that will occur during data collection, so it is safe to assume that you will encounter something you did not expect. Some things are just out of our control – your online survey platform could glitch and cause the skip patterns to fail, the intern you are training could accidentally enter data in the wrong table, a teenager might create a bunch of fake identities to repeatedly join your online study and scam you out of incentives, or someone might even show up to enroll in person for your study and turn out to be 2 kids in a trench coat.
So, what can aspiring data analysts do to prepare to work with messy, unpredictable data and stand out to potential employers?
- Get a solid foundation in the platform(s) you plan to use most. At Rollins we focus on SAS and R, which are both widely used in epidemiology, but if you have a dream job in mind, find out what that organization prefers. It’s great to have some working knowledge of a variety of tools, but you want to get really good with one or two. If you like the structure of a class, you can consider formal classes and workshops offered by companies like SAS or RStudio, and there are a number of great online platforms that offer training as well (LinkedIn Learning, Coursera, DataCamp). There are also a lot of useful books out there with companion websites that provide datasets and practice exercises.
- Get some experience with real data. Again, you can only simulate so much in the classroom – working with real data (and its issues) will expose you to all kinds of things. This experience might come through formal opportunities with employment, internships, or volunteer work, through a thesis or practicum, or even on your own through the use of publicly available data. There is a ton of public health data available online (Census, NIH, CDC), and if you just want to play around and improve your skills, people have created all kinds of interesting datasets and made them available online (GitHub, Kaggle, FiveThirtyEight) – go ahead and download a dataset full of Netflix movie reviews, sportsball stats, Twitter posts with the latest controversial hashtag, anything that is interesting to you!
- Practice, practice, practice. Learning to program means learning a language. Classroom training is only part of this – it can give you the foundation, but to really excel you must put in the time. If you were trying to learn a new spoken language, even after taking a traditional class, you might spend an hour a day on Duolingo, listen to songs in that language, or read books and articles in that language, looking up any words you don’t know along the way…it’s the same thing here. The more practice you get, especially with real data, the more you will have to draw from when you encounter something new. You don’t have to know everything (I certainly don’t), but you will get better at doing things from memory that you have done dozens of times, and you will also remember unique examples that forced you to learn something new. Much as reading expands your vocabulary, practice will add techniques to your toolbox that you can adapt when faced with similar tasks later on.
- Focus on other related skills such as problem solving, communication, and critical thinking – it’s not ALL about programming. Even if you are starting out in a position where you don’t have a lot of input, you can still exercise these skills as an analyst. The best data analysts I have worked with are detail-oriented people who take the time to ask questions (even of themselves) and carefully evaluate their own work. In some ways, it’s helpful as an analyst to be a bit of a pessimist (at least that’s my excuse) – I spend a lot of my time anticipating things that could go wrong so that I can prevent and identify data quality issues. As a hiring manager, I think programming and analysis skills are necessary, but I also think it is important to give equal weight to other skills like subject matter expertise, creativity, communication skills, attention to detail, etc. I would be more likely to hire a well-rounded person who has experience working on a real study than someone with numerous technical skills and certifications on their resume who has only ever done this type of work in a classroom.
If you want to be a great analyst, in some ways you will always need to consider yourself a student – I learn something new every day doing this type of work. Spend some time early in your program figuring out what appeals to you most in terms of software and data and then get to it!
Nicole Luisi (MPH, MS) is a Director of Data Analytics and Informatics Projects in the Rollins School of Public Health, and an instructor in the Department of Epidemiology. She also teaches several courses for the Executive MPH program and serves as the EMPH Applied Epidemiology thesis advisor. Featured here is her dog, Doug, doing some casual data analysis.