CAP 5768: Fall 2018
Introduction to Data Science - Course Homepage
MW 7:50 - 9:05, ECS 132
Office: ECS 254B; Phone: (305) 348-3748;
Office Hours: By Appointment Only
e-mail: giri@cs.fiu.edu
ANNOUNCEMENTS
- Aug 20: Homework 1 is ready and is due Aug 27.
- Aug 22: Please bring your laptop to class. We will need
to do a survey in order to use an educational App called
Momentos. Before the survey, please read the following
Consent Form. Note that I will not see the results of the survey
until the semester ends. DO NOT START SURVEY UNTIL CLASS TIME. Survey link:
- Aug 24: E-mail all homework to
cap5768-f18@cs.fiu.edu
- Aug 24: Teams of 3-5 individuals would be
acceptable for the class project.
- Sep 01: It has been pointed out that question #5 in HW1 does not
ask a question that needs answering. This is true and I will not expect
a response on this question for HW1.
However, the questions will appear in HW2. It was also pointed out that the URLs for
the data are wrong. The correct ones, if you want to explore, are:
https://users.cs.fiu.edu/~giri/teach/5768/F18/epaData/RD_501_88101_1999-0.txt and
https://users.cs.fiu.edu/~giri/teach/5768/F18/epaData/RD_501_88101_2012-0.txt
- Sep 05: There will be no class today. Instead, you are asked to review 3
Python notebooks. See below for details.
- Sep 10 and 12: No class. Python notebooks and ppt slides
will be uploaded here.
- Sep 26: As mentioned in Lec 1 (Aug 20), you do have to
plan for a class project. You are required to submit a brief project
proposal by Oct 11. It should provide a concise problem
statement, goals/questions, motivation, hypotheses (if any),
approach/methodology, data source(s), available software tools,
and final deliverables. Treat this as HW #3.
- Sep 30: There are several teams still looking for a possible teammate.
If you are still not part of a team or you need one more team member,
email me immediately with your plan for a project (if you have
one).
- Oct 24:
Prepare a 5-minute presentation for your
project on Oct 29. There will be 1 presentation per team.
Each of you must have a 1-minute part in the presentation so
that no one is left out. Prepare a
5-slide deck with (1) title and participants; (2) motivation;
(3) background; (4) questions/aims/goals; (5) methodology. You
can use fewer than 5 slides if you wish. The above 5 are merely
suggestions and do not need to be strictly adhered to. You may
use more than 5 slides, but there will be a strict 5-minute
limit for the presentation, followed by questions. Send me the
slides by 7:00 PM on Oct 29. You will be evaluated by your
classmates on the quality of slides, quality of presentation,
answers to questions, and creativity of project. If you
are going to miss class for any reason, you must email
me with a reason. There will be no make-up presentations.
- Nov 17:
Project Presentations will happen in class on Nov 26 and 28.
Attendance on both days is mandatory. Each team presentation will
be 15 minutes long with 2-3 minutes for questions. All of you have been
given some feedback. Make sure you address the provided feedback.
Presentations will be evaluated by me and by all of you. It is vital
that you understand every presentation. The main objective of the
presentation is for the whole class to understand the project as
best as possible. If your project cannot be explained in 15 minutes then
present the highlights and leave the rest for the report. All feedback
and questions during your presentation should be incorporated into
a final report. The main objective of the report is to make it possible
for someone else to replicate the work. You will be graded separately
for the report. Final reports (one per team) should be mailed to me
by noon on Dec 3 at the latest.
- Nov 25: Presentation Order:
- Monday, Nov 26: NBA Analytics (Giancarlo, Joshua, Daniel, Gorav);
Streaming Analytics (Hector, Hao, Andrew); Time Usage (Dewan, Alberto, Mona, Sanjay);
Avocado Prices (Kierstin, Marco, Alekhya);
- Wednesday, Nov 28: Chicago Crime Analytics (Abhishek, Arjuna, Soujanya);
School Funding (Constanza, Lisbet, Paulo); Drug Trends (Alejandro, Bernardo, Fu, Roy);
Cats Vs Dogs (Elianna, Grace, Muhammad, Vitalii)
- Nov 25: Project Final Reports are due Monday, Dec 3, by noon.
To know what to include in your final report: Click here.
- Nov 26: All the presentations today looked at interesting data
sets. Several of the projects have great potential for interesting and
valuable results. However, several mistakes were made, and if you are
scheduled to present on Wednesday, it is in your best interest not to
make the same mistakes. The following are some takeaways that you might
consider:
- Come with a laptop to use for the presentation. Bring a VGA adapter
for your laptop.
- The 15-minute time limit is very strict; be prepared to finish
within the time limit.
- Plan on each team member presenting at least one important
component of the presentation. It does not help the team if only
one person dominates. If one of you ends up talking for less than 3
minutes, it is not such a big deal. What is more important is what you
say during those brief minutes. Remember that even if one person knows
more than other people on the team, if the whole team does not learn
the details of the project well, then the whole team looks bad. Team
coordination is important. Poor coordination can be easily spotted and
will hurt the grade for the whole team.
- Practice your presentations with each other. Most of you have not
learnt how to say things succinctly. Too many speakers today wasted
time on stating the obvious or reading what is on the slide. The
introduction is important, but it loses its impact if it is not used
effectively. Provide only the background that people may not know.
- Show schematics of your analysis so that the big picture of the
questions becomes clear.
- Explain your graphs/results. Simple trends are usually
uninteresting. Focus on the anomalies and find ways to explain where
the trend gets bucked.
- Look for ways to describe the transition from one topic or one
speaker to the next. Learn how to make an impact, not how to fill up
the allotted time.
- Send me your presentations before 5 PM on Wednesday (even those who
presented today) so that I can grade them.
- Avoid Python demos unless you have something important to
show. These things take time to set up and waste valuable time.
- Leave one minute at the end to wrap up your presentation with the
most impactful of conclusions.
COURSE SYLLABUS
LECTURE TRANSPARENCIES
- Nov 21: Review Questions
- Nov 14: Lec 23: Time Series Analysis;
See
Engineering Statistics Handbook: Notes on Time Series Analysis;
Also see
Time Series Analysis with R, and
ARIMA models.
- Nov 12: Holiday
- Nov 07: Lecture was canceled.
- Nov 05: Lec 22: Lecture by Prof. Mark Finlayson on Text Mining and NLP;
Chapters 4, 5, 6, 8 from Anandarajan
- Oct 31: Lec 21: Lecture by Camilo Valdes on Spark Architecture
- Oct 29: Lec 20: Lecture by Prof. Miguel Alonso on Machine Learning;
Types of ML: Supervised, Unsupervised, Reinforcement;
Linear Regression;
- Oct 24: Lec 19: PageRank
- Oct 22: Lec 18: Outliers
- Oct 17: Lec 17: PCA, SVD,
dimensionality reduction;
- Oct 15: Lec 16: PCA, Matrices, Spectral Decomposition,
Eigenvalues and eigenvectors, dimensionality reduction;
- Oct 10: Lec 15: High-Dimensional Space;
Python code for clustering and normality tests [clustering,
normality testing]
- Oct 08: Lec 14: More on Clustering
- Oct 03: Lec 13: Clustering
- Oct 01: Lec 12: Clustering (White-board Lecture for the most part; Watch video)
- Sep 26: Lec 11: Streams
- Sep 24: Lec 10:
Similarity, MinHash
- Sep 19: Lec 9; Class slides: [Market Baskets,
Frequent Itemsets, Association Rules, Support, Confidence];
- Sep 17: Lec 8; Class slides: [Python Visualization with
matplotlib];
Download and study the following python notebooks:
simple plots,
advanced plots,
an
example (Data for this example are
here).
For an outstanding tutorial on visualization, click here; We also discussed
MapReduce during this lecture and went over the last lecture.
- Sep 12: Lec 7; No Class today. Class slides: [Big Data and
MapReduce] (Updated); Read all the material on
MapReduce provided below in Additional Reading
- Sep 10: Lec 6; No Class today. Class slides: [Python DataFrames and SQL];
Download and study the following python notebooks:
SQL and Python,
Some hints on the EPA dataset,
- Sep 05: Lec 5; No Class today. Please download the following 3
python notebooks and run them after reading the comments:
numPy-randomWalk,
WineReviews-150k,
headlines.
The data for "WineReviews" are
here.
The data for "headlines" are
here.
- Sep 03: Holiday for Labor Day. No Class
- Aug 29: Lec 4 [More Python
Features]; [Statistical Preliminaries]
Download and study the following python notebooks:
Files,
Maps,
- Aug 27: Lec 3 [General Python
Features]; Case Study (MovieLens-1M):
MovieLens1M;
Download and study the following python notebooks:
nd.Arrays,
Series
- Aug 22: Lec 2 [pdf];
Python Notebooks and data can be found in the
Notebook and Data Directory;
In particular download and study the following notebooks:
Basics,
Sets,
Strings,
Dictionaries,
DataFrames
- Aug 20: Lec 1 [pdf]
[Introduction & Motivation];
Reading:
HANDOUTS AND HOMEWORK ASSIGNMENTS
- As mentioned above, you are required to submit a brief project
proposal by Oct 10. It should provide a concise problem
statement, goals/questions, motivation, hypotheses (if any),
approach/methodology, data source(s), available software tools,
and final deliverables.
- HW 2 Due Sep 17, at the
start of class. E-mail homework to cap5768-f18@cs.fiu.edu
- HW 1 Due Aug 27, at the
start of class. E-mail homework to cap5768-f18@cs.fiu.edu
RECOMMENDED TEXT
- Mining of Massive Datasets, by Leskovec, Rajaraman,
and Ullman, Cambridge University Press, 2014,
2nd Edition, ISBN-13: 978-1107077232; ISBN-10: 1107077230
Paperback: ISBN-13: 978-1316638491; ISBN-10: 1316638499
OTHER TEXTS
- Foundations of Data Science, by Blum, Hopcroft, and
Kannan,
PDF
- Python for Data Analysis, by McKinney, O'Reilly
Publishers, 2012.
- R Programming for Data Science, Roger Peng, Lean
Publishing, 2014.
- Data Intensive Science, Eds., Critchlow, van Dam,
CRC Press, 2013.
- Practical Text Analytics, Anandarajan, M., Hill, C. and Nolan, T.,
Springer Link,
"Advances in Analytics and Data Science" Series, 2019,
DOI: 10.1007/978-3-319-95663-3
- Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos,
Online Text, 2nd Edition.
TOOLS AND USEFUL LINKS
ADDITIONAL READING
- Hadoop-MapReduce-Python tutorials (all require prior Hadoop
installation):
by M. Noll;
by Princeton Research Computing; by
T. Henson;
on Apache Spark in Python
- Original Paper on MapReduce by Dean and Ghemawat, 2004;
Updated version from CACM 2008; Also read
Further thoughts on MapReduce by Dean and Ghemawat, CACM, 2010;
MapReduce Examples;
- Coursera offers a 5-course sequence designed by the
University of Michigan in
Applied Data Science with Python Specialization. This series contains the following courses:
Introduction to Data Science in Python, Applied Plotting, Charting & Data Representation in Python,
Applied Machine Learning in Python, Applied Text Mining in Python,
and Applied Social Network Analysis in Python. These and the one below can be helpful for you
to work on the semester project as well as the Capstone course in the MS in Data Science.
- Coursera offers an IBM-approved
specialization in
Advanced Data Science. It includes 4 courses: Fundamentals of Scalable Data Science,
Advanced Machine Learning and Signal Processing, Applied AI with DeepLearning, and
Advanced Data Science Capstone.
- Useful Blog sites: PlanetPython;
Dataskeptic
- Useful Data Repository: Kaggle datasets
- Goodhart's Law:
Funny Podcast
- DataCamp Courses:
Forecasting using R
Giri Narasimhan
Last modified: Wed Oct 24 21:28:09 EDT 2018