CSE 5095




KNS 301, Mon/Wed 11am-12:15pm




ITEB 138, Mon 9/20, 10/18, 11/15, 11am-12:15pm



Jinbo Bi

Phone: 486-1458

Email: jinbo@engr.uconn.edu

Office hours: Mon/Wed 3pm-4pm

Office: ITEB 233


Jiangwen (Javon) Sun

Phone: 486-0510

Email: jiangwen.sun@uconn.edu

Office hours: to be determined

Office: ITEB 213


The purpose of this course is to introduce students to the general topics and techniques of data mining and machine learning, with a specific application focus on medical informatics. The course presents multiple real-world medical problems with real patient data and shows how multiple analytic algorithms have been used in an integrated fashion to cope with these problems. It covers cutting-edge data mining technology that can tackle problems that are complex, high-dimensional, and/or ambiguous in labeling. General data mining topics such as clustering, classification, regression, and dimension reduction will be described. Attention will also be given to more advanced and recent topics, including multiple instance learning, multi-task learning, collaborative filtering, and clustering with dimension reduction. Throughout the course, practical medical/healthcare problems will be used as examples to demonstrate the adoption and effectiveness of data mining methods.

The course will consist of lectures, labs, paper reviews, and projects. Lectures will introduce new material. Labs will reinforce the material given in lectures, and student paper reviews will be used to study state-of-the-art work from researchers in the field. Participation is encouraged during class.

As part of the course, students will work on a term project with the goal of applying any of the studied techniques to a problem selected from a list of projects. Students are also encouraged to propose and design their own problems, which must be approved by the instructor for class suitability. Teams of two to three students will be formed for each project. Each team is required to present in class and submit a project report of 15-20 pages that includes the definition of the problem, the techniques used to solve it, and the experimental results obtained. This exercise will help the team gain a hands-on understanding of the material studied in this course and promote collaboration among team members.


  • Get to know some general topics in medical informatics
  • Focus on some high-demand medical informatics problems, with hands-on experience applying data mining techniques
  • Equip students with knowledge about the basic concepts of machine learning and the state-of-the-art literature in data mining/machine learning


  1. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, ISBN-10: 0321321367
  2. Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart and David G. Stork, ISBN-10: 0471056693
  3. Pattern Recognition and Machine Learning (Information Science and Statistics) by Christopher M. Bishop, ISBN-10: 0387310738


  1. In-Class Lab Assignments (3): 30%
  2. Paper review (2): 30%
  3. Term Project (1): 30%
  4. Participation: 10%


  • 90.0 – 100.0 — A
  • 85.0 – 89.99 — A-
  • 80.0 – 84.99 — B+
  • 75.0 – 79.99 — B
  • 70.0 – 74.99 — B-
  • 65.0 – 69.99 — C+
  • 60.0 – 64.99 — C
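
The weights and grade scale above can be combined into a single calculation. The following is a minimal sketch, not an official grading tool: it assumes each component is averaged onto a 0-100 scale before weighting, and the names `WEIGHTS`, `SCALE`, `final_score`, and `letter_grade` are hypothetical. The scale lists no letter below 60, so the sketch returns `None` there rather than guess.

```python
# Weights from the grading breakdown (fractions of the final score).
WEIGHTS = {
    "labs": 0.30,
    "paper_reviews": 0.30,
    "term_project": 0.30,
    "participation": 0.10,
}

# Lower bounds from the grading scale, highest first.
SCALE = [
    (90.0, "A"), (85.0, "A-"), (80.0, "B+"), (75.0, "B"),
    (70.0, "B-"), (65.0, "C+"), (60.0, "C"),
]


def final_score(component_scores):
    """Weighted sum; each component is assumed to be an average on a 0-100 scale."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter; no grade below 60 is listed, so return None."""
    for lower_bound, letter in SCALE:
        if score >= lower_bound:
            return letter
    return None


# Illustrative (made-up) component averages for one student.
example = {"labs": 88.0, "paper_reviews": 92.0, "term_project": 85.0, "participation": 100.0}
total = final_score(example)
print(round(total, 2), letter_grade(total))
```

With these made-up averages the weighted total is about 89.5, which falls in the 85.0 to 89.99 band.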


Week | Lecture date | Topics | Notes
1 | 8/30 | Introduction |
  | 9/1 | Review of basic algebra and probability | Hand-outs will be given at most of the class meetings
2 | 9/6 | Labor Day | Due: survey
  | 9/8 | General clustering topics, k-means | Topics of papers for review (1)
3 | 9/13 | Hierarchical clustering: traditional techniques |
  | 9/15 | Spectral clustering: modern techniques | Due: list of papers selected by students for review session (1)
4 | 9/20 | Medical problem 1: Cardiac Ultrasound Image Categorization (dealing with medical images) |
  | 9/22 | Lab (1): Matlab introduction and clustering assignments on cardiac image data |
5 | 9/27 | Student paper review presentation (1); 5-10 min discussion of lab (1) | Discussion of term projects, form teams
  | 9/29 | Student paper review presentation (1) cont. |
6 | 10/4 | General classification and regression | Due: lab (1) assignment
  | 10/6 | Linear models for regression |
7 | 10/11 | Back-propagation neural nets |
  | 10/13 | Support vector machines | Due: list of papers selected by students for review session (2)
8 | 10/18 | Medical problem 2: Computerized Decision Support for Trauma Patients (dealing with physiological data) |
  | 10/20 | Lab (2): classification assignments on physiological features |
9 | 10/25 | Student paper review presentation (2) | Due: lab (2) assignment
  | 10/27 | Student paper review presentation (2) cont. |
10 | 11/1 | General topics on dimension reduction | Round-table discussion about final project topics
  | 11/3 | Unsupervised dimension reduction: PCA, CCA, ICA |
11 | 11/8 | Supervised dimension reduction: LASSO, group LASSO, or 1-norm SVM |
  | 11/10 | Medical problem 3: Computerized Diagnostic Coding (dealing with natural language text data) |
12 | 11/15 | Lab (3): dimension reduction assignments on diagnostic coding data |
  | 11/17 | State-of-the-art research: multiple instance learning |
13 | 11/29 | State-of-the-art research: multi-task learning, collaborative filtering | Due: lab (3) assignment
  | 12/1 | State-of-the-art research: uncertainty in expert labeling/annotation |
14 | 12/6 | Presentation of final term projects |
  | 12/8 | Presentation of final term projects (cont.) |
15 | 12/13 | Presentation of final term projects (cont.) | Due: term project report
  | 12/15 | Final exam (for make-up) |

The final project report is due on the Wednesday of final exam week, 12/15/2010.



  1. Computers are allowed in classroom for taking notes or any activity related to the current class meeting.
  2. Assignments must be submitted electronically via HuskyCT. If an assignment is handed in late, 10 credits will be deducted for each additional day.
  3. Participation in the paper review itself earns 80 credits for each review assignment. Paper review presentation slides must be turned in via HuskyCT before the class in which the presentation is scheduled. The quality of your paper review presentation will be judged by the instructor (10 credits) and by the scores given by peer students in the class (10 credits).
  4. Assignments and paper reviews will be graded by the teaching assistant assigned to this course, under the guidance of and in consultation with the instructor.
  5. Final term projects will be graded by the instructor based on the clarity and creativity of the project report and a comparison of the final presentations of all teams.
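
The credit arithmetic in items 2 and 3 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the grading procedure actually used: the function names are hypothetical, a late score is assumed never to drop below zero, a missed review is assumed to earn zero, and peer scores (each on a 0-10 scale) are assumed to be averaged, since the policy does not say how they are combined.

```python
def late_adjusted(raw_credits, days_late, penalty_per_day=10):
    """Deduct 10 credits per day late; flooring at 0 is an assumption."""
    return max(0, raw_credits - penalty_per_day * max(0, days_late))


def paper_review_credits(participated, instructor_credits, peer_ratings):
    """80 credits for participating, plus up to 10 from the instructor and up to 10 from peers.

    Averaging the peer ratings and returning 0 for a missed review are
    assumptions; the policy does not specify either.
    """
    if not participated:
        return 0
    peer_credits = sum(peer_ratings) / len(peer_ratings) if peer_ratings else 0
    return 80 + min(10, instructor_credits) + min(10, peer_credits)


# A lab scored 95 and handed in two days late.
print(late_adjusted(95, 2))
# A review with 8 instructor credits and three peer ratings.
print(paper_review_credits(True, 8, [7, 9, 8]))
```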


  1. If a lab assignment or a paper review presentation is missed, there will be a take-home final exam to make up the credits.
  2. If two of the lab assignments or paper reviews are missed, there will be an additional assignment and a take-home exam to make up each of the two items.


A HuskyCT site has been set up for the class. You can access it by logging in with your NetID and password. You must use HuskyCT for submitting assignments and check it regularly for class materials, grades, problem clarifications, changes in class schedule, and other class announcements.



Possible projects (below are some websites for benchmark data):

  1. UCI Machine learning repository
  2. SIGKDD CUP 2006
  3. SIGKDD CUP 2007
  4. SIGKDD CUP 2008
  5. Challenges in Machine Learning
  6. CMC NLP Challenge on ICD-9-CM automatic coding
  7. PhysioNet: Physiologic signal archives for biomedical research


Tools that may help with course projects (to be completed)

  1. Matlab Optimization Toolbox
  2. SVM_Light (support vector machines)
  3. LIBSVM (support vector machines)
  4. Bayesian Knowledge Discoverer (BKD): computer program able to learn Bayesian Belief Networks from databases
  5. Bayes net toolbox for Matlab
  6. TSP Demo
  7. LeNet (neural networks)
  8. Neural networks demo
  9. Neural networks flash demo
  10. GAUL (genetic algorithm)
  11. Java genetic algorithm demo
  12. A complete notebook GA
  13. A system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW
  14. Tools for mining large databases C5.0 and See5
  15. Description of the SLIPPER rule learner, a system that learns sets of rules from data, based on the original RIPPER rule learner
  16. Information about Data Mining and knowledge discovery in Databases
  17. Clustering Algorithms


You are expected to adhere to the highest standards of academic honesty. Unless otherwise specified, collaboration on assignments is not allowed. Use of published materials is allowed, but the sources must be explicitly stated in your solutions. Violations will be reviewed and sanctioned according to the University Policy on Academic Integrity. Collaboration among team members is allowed only for the selected final term projects.

Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work for another person or work previously used without informing the instructor, or tampering with the academic work of other students.


If you have a documented disability for which you are or may be requesting an accommodation, you are encouraged to contact the instructor and the Center for Students with Disabilities or the University Program for College Students with Learning Disabilities as soon as possible to better ensure that such accommodations are implemented in a timely fashion.

Jinbo Bi  2010/8-2010/12
Last revised: 8/27/2010