Fall Course Offering, Info 550: Data Science Toolkit
Instructor
Dr. David Benkeser
Course Description
This course is an elective for Master's and PhD students interested in learning fundamental tools used in modern data science. Together, the tools covered in the course make it possible to develop fully reproducible pipelines for data analysis, from data processing and cleaning, to analysis, to result tables and summaries. By the end of the course, students will have learned the tools necessary to: develop reproducible workflows collaboratively (using version control based on Git/GitHub); execute these workflows on a local computer (using command line operations, RMarkdown, and GNU Makefiles); execute the workflows in a containerized environment that allows end-to-end reproducibility (using Docker); and execute the workflows in a cloud environment (using Amazon Web Services EC2 and S3). Along the way, we will cover a few other data science topics, including best coding practices, basic Python, software unit testing, and continuous integration services.
Pre-Requisites
Many of the topics covered involve the R programming language, so familiarity with R is required (e.g., BIOS 544/545 or a similar level of competency). Necessary skills include: reading data into R, basic data cleaning in R (e.g., subsetting data, finding missing values, merging data), operating on data.frames (e.g., changing column names and row names, summarizing rows/columns of data using simple statistics), and basic graphics (e.g., plot or ggplot2).
Given the similarities between Python and R, students with a background in Python programming should also be equipped to succeed in the class, but may need extra effort to get up to speed with R.
Course Learning Objectives
- Understand why automation is a key element of reproducible data science.
- Develop reproducible workflows for data cleaning, analysis and report generation using the suite of tools learned in the class.
- Understand the importance of version control and best practices for collaborative coding projects.
- Use Docker to develop containerized workflows.
- Set up and work in Jupyter notebooks.
- Utilize cloud computing services for computation and storage.
Click here to view a PDF of the course syllabus!