Fall Course Offering, Info 550: Data Science Toolkit

Instructor 

Dr. David Benkeser 

Course Description

This course is an elective for Master's and PhD students interested in learning fundamental tools used in modern data science. Together, the tools covered in the course make it possible to develop fully reproducible pipelines for data analysis, from data processing and cleaning, through analysis, to result tables and summaries. By the end of the course, students will have learned the tools necessary to: develop reproducible workflows collaboratively (using version control based on Git/GitHub); execute these workflows on a local computer (using command line operations, RMarkdown, and GNU Make); execute the workflows in a containerized environment, allowing end-to-end reproducibility (using Docker); and execute the workflows in a cloud environment (using Amazon Web Services EC2 and S3). Along the way, we will cover a few other data science topics, including best coding practices, basic Python, software unit testing, and continuous integration services.

Pre-Requisites 

Many of the topics covered involve the R programming language, so familiarity with R is required (e.g., BIOS 544/545 or a similar level of competency). Necessary skills include reading data into R, basic data cleaning in R (e.g., subsetting data, finding missing values, merging data), operating on data.frames (e.g., changing column names and row names, summarizing rows/columns of data using simple statistics), and basic graphics (e.g., plot or ggplot2).

Given the similarities between Python and R, students with a background in Python programming should also be equipped to succeed in the class, though they may need extra effort to get up to speed with R.

Course Learning Objectives 

  • Understand why automation is a key element of reproducible data science.
  • Develop reproducible workflows for data cleaning, analysis and report generation using the suite of tools learned in the class.
  • Understand the importance of version control and best practices for collaborative coding projects.
  • Use Docker to develop containerized workflows.
  • Set up and work in Jupyter notebooks.
  • Utilize cloud computing services for computation and storage.

Click here to view a PDF of the course syllabus! 

