Fall Course Offering, Info 550: Data Science Toolkit

Instructor 

Dr. David Benkeser 

Course Description

This course is an elective for master's and PhD students who want to learn fundamental tools used in modern data science. Together, the tools covered in the course make it possible to build fully reproducible pipelines for data analysis, from data processing and cleaning, through analysis, to result tables and summaries. By the end of the course, students will have learned the tools necessary to: develop reproducible workflows collaboratively (using version control based on Git/GitHub), execute these workflows on a local computer (using command line operations, R Markdown, and GNU Make), execute the workflows in a containerized environment that allows end-to-end reproducibility (using Docker), and execute the workflows in a cloud environment (using the Amazon Web Services EC2 and S3 services). Along the way, we will cover a few other data science topics, including best coding practices, basic Python, software unit testing, and continuous integration services.

Pre-Requisites 

Many of the topics covered involve the R programming language, so familiarity with R is required (e.g., BIOS 544/545 or a similar level of competency). Necessary skills include: reading data into R, basic data cleaning in R (e.g., subsetting data, finding missing values, merging data), operating on data frames (e.g., changing column names and row names, summarizing rows or columns of data using simple statistics), and basic graphics (e.g., plot or ggplot2).

Given the similarities between Python and R, students with a background in Python programming should also be equipped to succeed in the class, but they may need to put in extra effort to get up to speed with R.
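
As a rough self-check for students arriving from Python, the prerequisite skills listed above correspond to operations like those in the sketch below. It is only an illustration and not course material: the records, field names, and values are made up, and the R functions named in the comments are the usual base R equivalents rather than anything prescribed by the course.

```python
from statistics import mean

# Hypothetical records standing in for a small data set (made-up values)
patients = [
    {"id": 1, "age": 34, "sbp": 120.0},
    {"id": 2, "age": 51, "sbp": None},  # missing measurement
    {"id": 3, "age": 47, "sbp": 135.0},
]
sites = [
    {"id": 1, "site": "A"},
    {"id": 2, "site": "B"},
    {"id": 3, "site": "A"},
]

# Find missing values (in R: is.na(patients$sbp))
missing_ids = [p["id"] for p in patients if p["sbp"] is None]

# Subset rows (in R: patients[patients$age > 40, ])
over_40 = [p for p in patients if p["age"] > 40]

# Merge two tables on a shared key (in R: merge(patients, sites, by = "id"))
site_by_id = {s["id"]: s["site"] for s in sites}
merged = [{**p, "site": site_by_id[p["id"]]} for p in patients]

# Summarize a column while skipping missing values
# (in R: mean(patients$sbp, na.rm = TRUE))
mean_sbp = mean(p["sbp"] for p in patients if p["sbp"] is not None)

print(missing_ids, [p["id"] for p in over_40], merged[0]["site"], mean_sbp)
```

Students who can read this sketch and name the R equivalent of each step are likely close to the expected starting level.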

Course Learning Objectives 

  • Understand why automation is a key element of reproducible data science.
  • Develop reproducible workflows for data cleaning, analysis and report generation using the suite of tools learned in the class.
  • Understand the importance of version control and best practices for collaborative coding projects.
  • Use Docker to develop containerized workflows.
  • Set up and work in Jupyter notebooks.
  • Utilize cloud computing services for computation and storage.

Click here to view a PDF of the course syllabus! 

