Instructor & Course Logistics
Instructor: Prof. Gil Salu
Meeting Times: Tuesday / Thursday, 9:00 AM – 10:20 AM
Location: Library 209
Office Hours: 12:00 PM – 2:00 PM
Office Location: TBD
Appointments via Zoom are available outside of office hours. If you need to meet, send me a note and we will find a time.
Course Description
Distributed Systems for Data Science is a hands-on course that teaches students to build and manage data pipelines at scale using cloud and distributed computing technologies. Topics include AWS services (Lambda, S3, DynamoDB, SQS), Git/GitHub workflows, Apache Spark with Databricks, Snowflake data warehousing, streaming with Kafka, and data engineering tools such as dbt and Airflow. Students complete labs in which they deploy serverless APIs, build distributed data pipelines, and work with modern file and table formats (Parquet, Delta, Iceberg).
Meeting Times
Tuesdays and Thursdays, 9:00 AM – 10:20 AM, in person in Library 209.
Key Dates (per campus calendar)
- Mon, Jan 26 — Spring classes begin
- Mon, Feb 16 — Presidents Day (No classes; offices closed)
- Mon–Fri, Mar 16–20 — Spring Break (No classes)
- Tue, Mar 31 — Advising Day (No classes)
- Mon–Wed, Apr 27–29 — Reading Days (No classes)
- Fri, May 15 — Spring classes end
- Mon–Fri, May 18–22 — Final Exam week
Weekly Topics (Weeks 1–15)
- Week 1 (Tue Jan 27, Thu Jan 29) — Git fundamentals; AWS onboarding (IAM, CLI, budgets). Lab: branch conflict + GitHub Pages.
- Week 2 (Tue Feb 3, Thu Feb 5) — S3 static sites; CloudFront overview. Lab: publish static site to S3 (+ optional CloudFront).
- Week 3 (Tue Feb 10, Thu Feb 12) — Serverless: API Gateway + Lambda with DynamoDB. Lab: deploy Python API.
- Week 4 (Tue Feb 17, Thu Feb 19) — Scaling fundamentals + AWS patterns (SQS/SNS/EventBridge). Short design critique.
- Week 5 (Tue Feb 24, Thu Feb 26) — AWS Lab Week: Distributed Cipher (Lambda + SQS + DynamoDB + S3) design and implementation.
- Week 6 (Tue Mar 3, Thu Mar 5) — Data at scale: pandas limits → Polars; benchmarking.
- Week 7 (Tue Mar 10, Thu Mar 12) — Spark 101 + Databricks notebooks. Midterm opens Thu evening.
- Week 8 (Mar 16–20) — Spring Break (no class).
- Week 9 (Tue Mar 24, Thu Mar 26) — File formats (CSV/Avro/Parquet); table formats (Delta/Iceberg). Midterm due Tue night.
- Week 10 (Tue Mar 31 — no class; Thu Apr 2) — Snowflake fundamentals.
- Week 11 (Tue Apr 7, Thu Apr 9) — Load from S3; operating warehouses & cost control.
- Week 12 (Tue Apr 14, Thu Apr 16) — Streaming with Redpanda/Kafka; producer/consumer lab.
- Week 13 (Tue Apr 21, Thu Apr 23) — Project/buffer week; integration time.
- Week 14 (Tue Apr 28 — no class; Thu Apr 30) — Snowpark for Python + ML.
- Week 15 (Tue May 5, Thu May 7) — Project studio + final presentations; final short quiz released.
Assessments & Weights
- Weekly assignments (labs/quizzes): 35%
- Project 1 (Serverless Data App on AWS): 15%
- Midterm (Test on Canvas): 20%
- Project 2 (Data-at-Scale + AI Pipeline): 25%
- Final short quiz: 5%
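As a rough illustration of how these weights combine, the sketch below computes a weighted course score in Python. The weights come from the list above; the function name, the dictionary keys, and the example scores are hypothetical and are not an official grade calculator.

```python
# Weights from the "Assessments & Weights" list, in percent.
WEIGHTS = {
    "weekly_assignments": 35,
    "project_1": 15,
    "midterm": 20,
    "project_2": 25,
    "final_quiz": 5,
}


def course_score(component_scores: dict) -> float:
    """Return the weighted course score on a 0-100 scale.

    component_scores maps each component name to a percentage (0-100).
    Components that are missing from the mapping count as zero.
    """
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    weighted = sum(
        WEIGHTS[name] * component_scores.get(name, 0) for name in WEIGHTS
    )
    return weighted / total_weight


# Hypothetical scores, for illustration only.
example = {
    "weekly_assignments": 90,
    "project_1": 80,
    "midterm": 70,
    "project_2": 85,
    "final_quiz": 100,
}
print(course_score(example))  # 83.75
```

Because the weights sum to 100, each component contributes its percentage score times its weight divided by 100.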
Tooling & Accounts
- AWS: standard student accounts with credits; budgets and tear-down checklists are enforced.
- Databricks: paid account with strict auto-termination and small clusters.
- Snowflake: smallest warehouses with auto-suspend; time-boxed work.
- GitHub Classroom for repos/tests; Canvas for quizzes and submissions.
Using Git in this Class
We will use GitHub Classroom to distribute starter code and run autograding. In Week 1 you will complete a guided merge-conflict exercise and publish a GitHub Pages site from a dedicated branch.
Submission, Deadlines, and Late Work
This course is designed around steady progress rather than one-off high-stakes deadlines. You are expected to submit work on time, but the policies below are meant to support learning rather than penalize recovery.
- On-time submissions: Assignments submitted by the posted due date receive full consideration.
- Late submissions: Late work is accepted for full credit if submitted before the relevant cutoff date.
- Cutoff rules:
  - Work assigned before the midterm must be submitted before the midterm closes.
  - Work assigned after the midterm must be submitted before the end-of-semester deadline.
- After these cutoffs, missing work will be recorded as zero unless prior arrangements have been made.
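The cutoff rules above can be read as a small decision procedure. The following Python sketch is one possible reading, not an official grading tool: the function name, its parameters, and the two example timestamps are hypothetical placeholders, and the actual cutoff dates are the ones posted on Canvas.

```python
from datetime import datetime


def late_work_status(
    submitted_at: datetime,
    assigned_before_midterm: bool,
    midterm_close: datetime,
    semester_deadline: datetime,
    prior_arrangement: bool = False,
) -> str:
    """Classify a submission under the cutoff rules in this section."""
    # Pre-midterm work is due before the midterm closes; everything else
    # is due before the end-of-semester deadline.
    cutoff = midterm_close if assigned_before_midterm else semester_deadline
    if submitted_at < cutoff:
        return "full credit"
    if prior_arrangement:
        return "per arrangement"
    return "zero"


# Placeholder timestamps, assumed for illustration only.
MIDTERM_CLOSE = datetime(2026, 3, 24, 23, 59)
SEMESTER_DEADLINE = datetime(2026, 5, 15, 23, 59)

print(
    late_work_status(
        datetime(2026, 3, 20, 18, 0), True, MIDTERM_CLOSE, SEMESTER_DEADLINE
    )
)  # full credit
```

Note that a late submission made before its cutoff is treated the same as an on-time one, which matches the full-credit policy stated above.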
If you fall behind, the correct move is to submit incomplete or imperfect work rather than nothing at all. The goal is to keep you engaged with the material.
Use of AI and External Tools
AI tools are part of modern data science practice, and their thoughtful use is allowed and encouraged in this course.
- You may use AI tools to:
  - Clarify concepts
  - Debug errors
  - Explore alternative approaches
  - Ask “why does this work?” questions
- You may not submit AI-generated work that you do not understand or cannot explain.
- Any submitted work must reflect your own understanding and decision-making.
If you are unsure whether a particular use of AI is appropriate for an assignment, ask. Asking first is always the right call and will never count against you.
Course Flexibility and Adjustments
This syllabus reflects the intended structure of the course, but distributed systems is a fast-moving field and class pacing varies by cohort.
- Topics, labs, or tools may shift slightly to better support learning outcomes.
- Any changes will be communicated clearly in class and on Canvas.
- No changes will be made retroactively in a way that disadvantages students.
Your responsibility is to stay engaged with Canvas announcements and in-class guidance.
Attendance and Participation
This course is built around in-class labs, walkthroughs, and design discussion. Attendance and active participation are expected and directly support your ability to complete graded work.
- Regular attendance is important. Many labs and explanations are difficult to replicate outside of class.
- Participation includes asking questions, engaging in lab work, and contributing to design or troubleshooting discussions.
- There is no separate attendance grade; however, most weekly assignments and labs assume in-class participation.
- Missing class does not exempt you from assignments or deadlines.
If you must miss class, you are responsible for catching up on material and announcements. Reach out early if attendance becomes an ongoing issue.
Required and Optional Resources
There is no required textbook for this course. This is intentional and based on prior student feedback.
Instead, we will rely on a mix of instructor material, official documentation, and selected online references. Any required readings or videos will be clearly linked in Canvas.
Frequently Useful References
- AWS Documentation: https://docs.aws.amazon.com/
- AWS Well-Architected Framework: https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html
- GitHub Documentation: https://docs.github.com/
- Python (official docs): https://docs.python.org/3/
- pandas Documentation: https://pandas.pydata.org/docs/
- Polars Documentation: https://pola.rs/
- Apache Spark Documentation: https://spark.apache.org/docs/latest/
- Databricks Documentation: https://docs.databricks.com/
- Snowflake Documentation: https://docs.snowflake.com/
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
You are not expected to read all of these end-to-end. They are reference material to support labs, projects, and independent exploration.