CAP 5768: Fall 2019
Introduction to Data Science - Course Homepage
MW 6:25 - 7:40, PG6 114
Office: ECS 254B; Phone: (305) 348-3748;
Office Hours: By Appointment Only
e-mail: giri@cs.fiu.edu
ANNOUNCEMENTS
- Nov 29: Your webpages are due tonight by midnight. Read the instructions below before you submit.
- Nov 30: Your presentations will be held on Dec 2, Dec 4, and Dec 11. On Dec 2 and 4 they will be during
regular class time. On Dec 11, they will be during the scheduled final exam time. Attendance at the talks is
mandatory. Email me your presentation before the class in which you are scheduled to present.
- Nov 30: The following evaluation form
(Link) will be used for your evaluations.
All of you will also help with the evaluations.
Each of you will fill out the form for every talk. The evaluations will not be seen by your classmates, but will be used
by me to assess your ability to critically understand a presentation.
I will collect anonymized comments, filter the language, and mail them to the speakers.
- Nov 30: Your final reports are due on Dec 7 for ALL TEAMS via Canvas.
Teams presenting on Dec 11 will be given one opportunity to
make minor updates to their final reports by emailing me before the end of Dec 11.
SEMESTER PROJECTS
COURSE SYLLABUS
LECTURE TRANSPARENCIES
- Aug 26: Lecture #1 - Introduction [pdf]
- Aug 28: Lecture #2 - Python [pdf];
Python Notebooks:
Basics,
Strings,
Sets,
Dictionaries.
- Sep 02: LABOR DAY
- Sep 04: Lecture #3 - More Python Notebooks:
Data Frames,
Movie Lens;
R [pdf];
R Notebooks:
R Basics,
- Sep 09: Lecture #4 - Summarization
[pdf];
- Sep 11: Lecture #5 - Visualization with matplotlib
[pdf];
Python Notebooks:
simple plots,
advanced plots,
MDCPS example.
- Sep 16: Lecture #6 - Statistics; text mining
[pdf]
- Sep 18: Lecture #7 -
- Sep 23: Lecture #8 - Map Reduce
[pdf]; Read all the material on
MapReduce provided below in Additional Reading
- Sep 25: Lecture #9 - More Map Reduce
[pdf]
- Sep 30: Lecture #10 - APRIORI, Frequent Itemsets
[pdf]
- Oct 02: Lecture #11 - Similarity, MinHash
[pdf]
- Oct 07: Lecture #12 - Streams
[pdf];
- Oct 09: Lecture #13 - review
[pdf]
- Oct 14: Lecture #14 - Visiting lecture by Camilo Valdes on AWS &
Cloud Computing [pdf]
- Oct 16: Lecture #15 - Midterm Exam
- Oct 21: Lecture #16 - Bloom Filters
[pdf]
- Oct 23: Lecture #17 - Clustering
[pdf];
Python code for clustering [clustering]
- Oct 28: Lecture #18 - Class Canceled!
- Oct 30: Lecture #19 - Clustering (slides from Lec 17)
- Nov 04: Lecture #20 - Time Series
[pdf]
See
Engineering Statistics Handbook: Notes on Time Series Analysis;
Also see
Time Series Analysis with R, and
ARIMA models.
- Nov 06: Lecture #21 - Outliers
[pdf]
- Nov 11: VETERANS DAY
- Nov 13: Lecture #22 - Normality Testing; PCA, Matrices, Spectral Decomposition,
Eigenvalues and eigenvectors, dimensionality reduction;
[pdf]
- Nov 18: Lecture #23 - PageRank
[pdf]
- Nov 20: Lecture #24 - Causality
[pdf];
Extra slides revisiting Time Series Analysis
[pdf];
- Nov 25: Lecture #25 - NLP & ML
[pdf];
Text Analytics [pdf];
[Chapters 4, 5, 6, 8 from Anandarajan]
- Nov 27: Lecture #26 - Exam Review
[pdf]
- Dec 02: Lecture #27 - Class Project Presentations
Real Estate Analysis (Roger, Adrian, Miguel) [1];
Florida Education (Qiang, Pablo, Sakib, Haiming) [2];
MoneyBall (Lijing, Andres, Gonzalo, Luis) [3]
- Dec 04: Lecture #28 - Class Project Presentations
Energy Market (Poonam, Rocio, Richard, Marcus) [4];
Digital Market (Luis, Lucas, Faraz) [5];
Student Success (Jimeng, Chengwei, Liping, Sandeep) [6]
- Dec 11: 5:00 PM -- 7:00 PM
Restaurant Placement (Hongjing, Linlin, Zheya, Peng, Yan) [7];
School Shootings (Emmanuel, Kyra, Matthew) [8];
Healthcare (Shahid, Priyanka, Maurice, Prasad) [9];
Take Home FINAL EXAM (tentatively: 9:00 PM - 11:15 PM)
HANDOUTS AND HOMEWORK ASSIGNMENTS
- HW #1
- HW #2
- HW #3
- Project Webpage Creation: Instructions for creating
project website: due 11:59 PM, Friday, Nov 29, 2019.
On Canvas look for "WebPg" and submit a zip file.
- Final Project Reports: due on Dec 7, 11:59 PM for all teams.
On Canvas look for "ProjReport" and submit a zip file.
A RUBRIC
file for evaluating the report acts as instructions for
what goes in the report.
RECOMMENDED TEXT
- Mining of Massive Datasets, by Leskovec, Rajaraman,
and Ullman, Cambridge University Press, 2014,
2nd Edition, ISBN-13: 978-1107077232; ISBN-10: 1107077230
Paperback: ISBN-13: 978-1316638491; ISBN-10: 1316638499
OTHER TEXTS
- Foundations of Data Science, by Blum, Hopcroft, and
Kannan,
PDF
- Python for Data Analysis, by McKinney, O'Reilly
Publishers, 2012.
- R Programming for Data Science, Roger Peng, Lean Publishing, 2014.
- R for Data Science, G. Grolemund, H. Wickham, O'Reilly Publishers, 2017.
- Text Mining with R, J. Silge, D. Robinson, O'Reilly Publishers, 2017.
- Data Intensive Science, Eds., Critchlow, van Dam,
CRC Press, 2013.
- Practical Text Analytics, Anandarajan, M., Hill, C. and Nolan, T.,
Springer Link,
"Advances in Analytics and Data Science" Series, 2019,
DOI: 10.1007/978-3-319-95663-3
- Forecasting Principles and Practice, Rob J Hyndman and George Athanasopoulos,
Online Text, 2nd Edition.
TOOLS AND USEFUL LINKS
ADDITIONAL READING
- Hadoop-MapReduce-Python tutorials (all require prior Hadoop
installation):
by M. Noll;
by Princeton Research Computing; by
T. Henson;
on Apache Spark in Python
- Original paper on MapReduce by Dean and Ghemawat, 2004;
Updated version from CACM 2008; Also read
Further thoughts on MapReduce by Dean and Ghemawat, CACM, 2010;
MapReduce Examples;
- Coursera offers a 5-course sequence designed by the
University of Michigan in
Applied Data Science with Python Specialization. This series contains the following courses:
Introduction to Data Science in Python; Applied Plotting, Charting & Data Representation in Python;
Applied Machine Learning in Python; Applied Text Mining in Python;
and Applied Social Network Analysis in Python. These and the one below can be helpful
for your semester project as well as for the Capstone course in the MS in Data Science.
- Coursera offers an IBM-approved
specialization in
Advanced Data Science. It includes 4 courses: Fundamentals of Scalable Data Science,
Advanced Machine Learning and Signal Processing, Applied AI with DeepLearning, and
Advanced Data Science Capstone.
- Useful Blog sites: PlanetPython;
Dataskeptic
- Useful Data Repository:
Kaggle datasets;
Global Health Data Exchange;
- Goodhart's Law:
Funny Podcast
- DataCamp Courses:
Forecasting using R
- News Items:
Dr. Rodrigo Guerrero wins Roux Prize
for using data to address violence as a public health crisis in Cali, Colombia
- Centers and Institutes:
Institute for Health Metrics and Evaluation (IHME);
Center for Health Trends and Forecasts (CHTF);
Giri Narasimhan
Last modified: Mon Dec 2 16:09:54 EST 2019