COURSE SYLLABUS AND OUTLINE
CSE-5095 and Engineering-SEC001-1128
COMPUTATIONAL BIOMEDICAL INFORMATICS
LECTURE:
CAST 201, Tue/Thu 3:30pm - 4:45pm
LAB:
ITEB 138, TBD
INSTRUCTOR:
Jinbo Bi
Phone: 486-1458
Email: jinbo@engr.uconn.edu
Office hours: Tue. 2:30pm - 3:15pm
Office: ITEB 233
TEACHING ASSISTANT:
Tingyang Xu
Phone: 486-3654
Email: tingyang.xu@engr.uconn.edu
Office hours: to be determined
Office: ITEB 211
PURPOSE AND APPROACH:
The purpose of this course is to introduce students to the general topics and techniques of data mining and machine learning, with a specific application focus on biomedical informatics. The course introduces multiple real-world medical problems with real patient data and shows how multiple analytic algorithms have been used in an integrated fashion to cope with these problems. It covers some cutting-edge data mining technology that can successfully tackle problems that are complex, high-dimensional, and/or ambiguous in labeling. General topics of data mining, such as clustering, classification, regression, and dimension reduction, will be described. Attention will also be given to more advanced and recent topics. In particular, imprecisely supervised learning problems will be discussed, including multiple instance learning, metric learning, and learning with multi-labeler annotations. Throughout the course, practical medical and healthcare problems will be used as examples to demonstrate the adoption and effectiveness of data mining methods. Invited lectures are planned to motivate a deeper understanding of specific biological, medical, nursing, or pharmaceutical concepts.
The course will consist of lectures, labs, paper reviews, and projects. Lectures will serve as the vehicle to introduce concepts and knowledge to students. Labs will be used to reinforce the material given in lectures, and student paper reviews will be used to study the state of the art from researchers in the field. Participation is encouraged during class.
As part of the course, students will work on a term project with the goal of applying any of the studied techniques to a problem selected from a list of projects. Students are also encouraged to propose and design their own problems, which must be approved by the instructor for class suitability. Teams of two to three students will be formed for each project. Each team is required to present in the classroom and submit a project report of 15-20 pages that includes the definition of the problem, the techniques used to solve it, and the experimental results obtained. The related software package must also be submitted so that the report can be properly graded. This exercise will help the team gain hands-on experience in applying the material studied in this course and will promote collaboration among team members.
COURSE OBJECTIVES:
- Get to know some general topics in biomedical informatics
- Focus on some high-demand medical informatics problems, with hands-on experience in applying data mining techniques
- Equip students with knowledge about the basic concepts of machine learning and the state-of-the-art literature in data mining/machine learning
TEXTBOOKS:
- Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, ISBN-10: 0321321367
- Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart and David G. Stork, ISBN-10: 0471056693
- Pattern Recognition and Machine Learning (Information Science and Statistics) by Christopher M. Bishop, ISBN-10: 0387310738
GRADING:
- In-Class Lab Assignments (3): 30%
- Paper review (2): 20%
- Term Project (1): 40%
- Participation: 10%
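As a rough illustration of how the weights above combine, the following minimal sketch computes a weighted final score. It assumes every component is scored on a 0-100 scale and that each component score is already an average over its items (for example, the three labs); neither assumption is stated in the syllabus, and the names `WEIGHTS` and `final_grade` are hypothetical.

```python
# Weights (percent of the final grade) copied from the grading list above.
WEIGHTS = {
    "labs": 30,
    "paper_reviews": 20,
    "term_project": 40,
    "participation": 10,
}


def final_grade(scores):
    """Return the weighted final score.

    Assumption: each value in `scores` is on a 0-100 scale and is already
    averaged over the items in that component.
    """
    # The weights must account for the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical component scores, for illustration only.
example = {"labs": 90, "paper_reviews": 85, "term_project": 80, "participation": 100}
print(final_grade(example))
```

The term project carries the largest weight, so in this sketch a 10-point change in the project score moves the final score by 4 points.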
TENTATIVE SCHEDULE:
Week | Date | Lecture Notes | HW | Announcement |
1 | 8/27 | Introduction | | |
1 | 8/30 | Review of probability and linear algebra basics | | |
2 | 9/04 | General clustering topics, k-means | | |
2 | 9/06 | Hierarchical clustering: traditional techniques | | |
3 | 9/11 | Spectral clustering: modern techniques | | |
3 | 9/13 | Medical problem 1: Cardiac Ultrasound Image Categorization (dealing with medical images) | Check HuskyCT, lab assignment 1 has been loaded | |
4 | 9/18 | Lab (1): introduction to Matlab and homework assignment on clustering of cardiac image data | | |
4 | 9/20 | Student paper review presentation (1) session 1 | | |
5 | 9/25 | Student paper review presentation (1) session 2 | | |
5 | 9/27 | Invited Lecture: EEG analyses for neurophysiological disorders by a psychologist, Prof. Chen | | |
6 | 10/02 | Student paper review presentation (1) session 3 | | |
6 | 10/04 | Student paper review presentation (1) session 4 | | |
7 | 10/09 | General classification and regression | | |
7 | 10/11 | Invited Lecture: nursing informatics on neonatal care (baby care) by a Registered Nurse, Prof. Cong | | |
8 | 10/16 | Linear models for regression | | |
8 | 10/18 | Support vector machines | | |
9 | 10/23 | Medical problem 2: Clinical Decision Support Systems | | |
9 | 10/25 | Lab (2): homework assignment on classification using physiological features | A make-up class meeting may be scheduled later in October and may be used for student presentations | |
10 | 10/30 | Student paper review presentation (2) session 1 | | |
10 | 11/01 | Student paper review presentation (2) session 2 | | |
11 | 11/06 | Student paper review presentation (2) session 3 | | |
11 | 11/08 | General topics on dimension reduction | | |
12 | 11/13 | Unsupervised dimension reduction: PCA, CCA, ICA | | |
12 | 11/15 | Supervised dimension reduction: LASSO, group LASSO, or 1-norm SVM | | |
13 | 11/27 | Medical problem 3: Computerized Diagnostic Coding (dealing with natural language text data) | | |
13 | 11/29 | Lab (3): dimension reduction assignments on diagnostic coding data | | |
14 | 12/04 | Presentation of final term projects (cont.) | | |
14 | 12/06 | Presentation of final term projects (cont.) | | |
15 | 12/10-12/14 | Final Exam Week: presentation of final term projects (cont.); make-up exam; term project reports are due on Friday | | |
COURSE POLICY:
- Computers are allowed in classroom for taking notes or any activity related to the current class meeting.
- Assignments must be submitted electronically via HuskyCT. If an assignment is handed in late, 10 credits will be deducted for each day it is late.
- Participating in a paper review earns 80% of the credits for that review assignment. Paper review presentation slides must be turned in via HuskyCT before the class in which the presentation is scheduled. The quality of your paper review presentation will be judged by the instructor (10 credits) and by the scores given by peer students in the class (10 credits).
- Assignments and paper reviews will be graded by the teaching assistant assigned to this course, under the guidance of and in consultation with the instructor.
- Final term projects will be graded by the instructor based on the clarity and creativity of the project report and on a comparison of the final presentations of all teams.
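The credit arithmetic in the policy above can be sketched as follows. This is only an illustration under stated assumptions: each assignment is taken to be worth 100 credits, peer scores are averaged, and a late score cannot drop below zero. None of these details is specified in the syllabus, and the function names are hypothetical.

```python
def paper_review_credits(presented, instructor_score, peer_scores):
    """Credits for one paper review, assumed to be out of 100.

    80 credits for participating, plus up to 10 from the instructor and
    up to 10 from peers (peer scores are averaged, which is an assumption).
    """
    if not presented:
        return 0.0
    peer_average = sum(peer_scores) / len(peer_scores) if peer_scores else 0.0
    # Clamp both quality scores to the 0-10 range named in the policy.
    instructor_part = min(max(instructor_score, 0), 10)
    peer_part = min(max(peer_average, 0), 10)
    return 80 + instructor_part + peer_part


def late_adjusted_credits(raw_credits, days_late):
    """Deduct 10 credits per late day; flooring at 0 is an assumption."""
    return max(0, raw_credits - 10 * days_late)


print(paper_review_credits(True, 8, [7, 9, 8]))
print(late_adjusted_credits(90, 2))
```

Under these assumptions, most of the paper review credit comes from presenting at all, while the instructor and peer scores separate presentations by quality.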
MAKEUP PLAN:
- If a lab assignment or a paper review presentation is missed, there will be a take-home final exam to make up the credits.
- If two of the lab assignments or paper reviews are missed, there will be an additional assignment and a take-home exam to make up each of the two items.
HUSKYCT:
A HuskyCT site has been set up for the class. You can access it by logging in with your NetID and password. You must use HuskyCT for submitting assignments and check it regularly for class materials, grades, problem clarifications, changes in class schedule, and other class announcements.
PROJECT MATERIALS:
The instructor strongly encourages students to develop interesting term projects on their own. In addition, the following important projects in the bioinformatics and medical informatics domains are provided for consideration. Our lab has obtained substantial results on these projects that validate their significance, and the projects are at the frontier of the research field.
Algorithm-related:
- Genotype-phenotype association for complex human diseases, such as psychiatric disorders, drug dependence, or alcoholism. There is evidence that these disease traits are heritable, but efforts to identify genes contributing to risk for these disorders have been hampered by their complex etiology and variable clinical manifestations. In our lab, we have decomposed the complex disease phenotype into homogeneous subtypes in the hope that these subtypes can help to identify the genetic factors that underlie the disease. Potential term projects are as follows:
- Soft clustering techniques that group subjects into subtypes with probabilistic scores. In the current subtyping method, a subject either belongs to a subtype or does not. In the new method, a scoring function will be constructed automatically during cluster analysis to calculate the likelihood that a subject belongs to a subtype. For example, a subject may belong to a subtype with a probability of 80%. Students who choose this project will need to read a paper about a soft clustering algorithm and implement the algorithm in Matlab. Students will work with our lab to deploy the implemented program on a dataset for evaluation.
- Many complex diseases are due to gene-gene interactions and gene-environment interactions. Because single-SNP (single nucleotide polymorphism) association studies explain only a very limited part of the heritability of complex diseases, researchers have started exploring multi-SNP interactions, which may reveal more significant associations with the disease. Multi-SNP interactions are also called “epistatic interactions”. Students who choose this project will need to read a few related papers and apply existing state-of-the-art algorithms to detect epistatic interactions for subtypes.
- Longitudinal analysis of risky alcohol use patterns among adolescents and young adults, especially college students. The U.S. Department of Health and Human Services has identified heavy episodic consumption of alcohol among college students as a major public health problem. With reports of binge drinking among college students increasing every year, what was once considered a “harmless rite of passage” has been reframed as a top public health problem. It is estimated that roughly 90% of the alcohol consumed by youth under the age of 21 in the US is consumed in the form of binge drinking. The Federal government has called for efforts targeting college students to reduce the rate of frequent binge drinking episodes. A study has been performed at the UConn Alcohol Research Center in which recruited students reported their daily activities and daily alcohol use over a period of 30 days. Novel longitudinal methods are required to analyze the day-to-day dynamics and identify which daily activities, moods, or stressors are the major risk factors contributing to binge drinking. Students who take this project will need to read a few papers on group-LASSO and implement the algorithm in Matlab. Please talk to the instructor for details, and work with our lab to deploy the implemented algorithm.
- Natural language processing techniques based on multi-instance multi-annotation learning methods. For example, a text document is annotated with a topic, but not all of its paragraphs may discuss that topic, so which topic does each paragraph discuss? Answering this requires paragraph-level annotation, but paragraph-level topic labeling is often time-consuming and intractable. A technical question is how to obtain paragraph-level topics based only on document-level annotation. A further difficulty arises when a document has been annotated by multiple human experts with varying expertise: how to use the topics assigned by all these labelers to predict paragraph topics more accurately is a difficult problem. We have designed an algorithm called the multi-instance multi-annotation learning method. Students who take this project will need to read a few related papers, understand our algorithm, and implement it in Matlab. Students will work with our lab to deploy the implemented program on a dataset for evaluation.
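To make the soft clustering idea in the list above concrete, here is a minimal sketch of a membership scoring function written in Python for illustration (the project itself calls for Matlab). It is not the lab's method or the algorithm from any assigned paper: it simply turns squared distances to fixed subtype centers into probabilities with a softmax, as in the assignment step of soft k-means. The function name, the `stiffness` parameter, and the example centers are all hypothetical.

```python
import math


def soft_memberships(point, centers, stiffness=1.0):
    """Return one membership probability per center; the values sum to 1.

    Closer centers receive higher probability. `stiffness` controls how
    sharply probability concentrates on the nearest center (an assumed knob).
    """
    squared = [
        sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers
    ]
    # Subtract the smallest distance before exponentiating for stability.
    smallest = min(squared)
    weights = [math.exp(-stiffness * (d - smallest)) for d in squared]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical subtype centers in a two-feature space.
centers = [(0.0, 0.0), (4.0, 0.0)]
print(soft_memberships((1.0, 0.0), centers))
```

A subject lying exactly halfway between two centers receives a probability of 0.5 for each, while a subject near one center receives most of its probability mass there, in contrast with the hard either-or assignment described for the current subtyping method.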
Software-related:
- Clinical decision support system for asthma patient management. Our lab has built a web-based system to computerize a successful asthma disease management program called Easy Breathing. Easy Breathing was created by a pulmonologist, an expert in asthma, and was originally paper-and-pencil based. We have translated the logic flow of the program into software and built a web-based system. Please check our lab website http://www.healthinfo.lab.uconn.edu/EasyBreathing/. The website requires a secure login, but it has a demo account (login: Robert, and password: 123). We now need to translate the entire computerized program into an Electronic Medical Record system. Please talk to the instructor to work out the details of this project if you would like to work on it.
- Nursing Informatics Platform for perinatal care. Please check our website for the Nursing Informatics Platform: http://www.labhealthinfo.uconn.edu/Nursing/ This is an unfinished website. The description on the Home page needs to be restated and clarified. The research component is the Live Nursing Aid component of the website, where we would like to simulate a real nurse’s reaction to a phone call made by a patient who needs urgent help, and to automate the nurse’s suggested steps for responding to patients’ urgent nursing needs. Students who would like to work on this project will need to talk to one of our invited speakers from the School of Nursing and look at our current software design. They will then need to implement more software components for Live Nursing Aid.
REVIEW MATERIALS:
Paper selection by students can be found here, and the order of presentations will be randomly generated.
Paper Review 1:
(Please select one paper from the following biomedical informatics topics to present)
- Mining Electronic Health Records For Adverse Drug Effects Using Regression
Rave Harpaz, Krystl Haerian, Herbert Chase and Carol Friedman
- From Netflix to Heart Attacks: Collaborative Filtering in Medical Datasets
Shahzaib Hassan and Zeeshan Syed
- Analysis of an Online Health Social Network
Xiaoxiao Ma and Guanling Chen
- Conditional Random Fields for Activity Recognition in Smart Environments
Ehsan Nazerfard, Barnan Das, Lawrence Holder and Diane Cook
- The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts
Ted Pedersen
- A Machine Learning Approach for Identifying Subtypes of Autism
Rocio Guillen and Curtis Jensen
- Trust in eHealth: A Review of an Emerging Field
Laurian Vega, Enid Montague and Tom DeHart
- Boosting-based discovery of multi-component physiological indicators: Applications to express diagnostics and personalized treatment optimization
Valeriy Gavrishchaka, Mark Koepke and Olga Ulyanova
- Automatic integration of drug indications from multiple health resources
Aurelie Neveol and Zhiyong Lu
- Mobile Interface Design for Low-Literacy Populations
Beenish Chaudry, Katherine Connelly, Katie Siek and Janet Welch
- Deploying an Interactive Machine Learning System in an Evidence-Based Practice Center
Byron Wallace, Kevin Small, Carla Brodley, Joseph Lau and Thomas Trikalinos
- Robust Discovery of Local Patterns: Subsets and Stratification In Adverse Drug Reaction Surveillance
Johan Hopstadius and G. Niklas Noren
- A Pattern Mining Approach for Classifying Multivariate Temporal Data
Iyad Batal, Hamed Valizadegan, Gregory Cooper, and Milos Hauskrecht
- Reconstruction of Large-scale Gene Regulatory Networks using Bayesian Model Averaging
Haseong Kim and Erol Gelenbe
- Ontology Graph based Query Expansion for Biomedical Information Retrieval
Liang Dong and James Z. Wang
- Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA
Yan Chen, Xiaoshi Yin, Zhoujun Li, and Xiaohua Tony Hu
- Automated Medical Decision Support System
Rebeck Carvalho, Rahul Isola and Ameya Tripahty
- Leveraging Semantic Networks for Personalized Content in Health Recommender Systems
Martin Wiesner, Stefan Rotter and Daniel Pfeifer
- A Cloud Computing Framework for Real-time Rural and Remote Service of Critical Care
Carolyn McGregor
- An Informatics Architecture for the Virtual Pediatric Intensive Care Unit
Daniel Crichton, Chris Mattman, Andrew Hart, David Kale, Robinder Khemani, Patrick Ross, Sarah Rubin, Paul Veeravatanayothin, Amy Braverman, Cameron Goodale and Randall Wetzel
- The Freetext Matching Algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records
Anoop D Shah, Carlos Martinez, Harry Hemingway
- Decision-making in healthcare: a practical application of partial least square path modelling to coverage of newborn screening programmes
Katharina E Fischer
- Telemedicine: Technology mediated service relationship, encounter, or something else?
Cynthia LeRouge, Monica J. Garfield, Rosann Webb Collin
- Telecardiology through ubiquitous Internet services
Carlos Costa, Jose Luis Oliveira
- Information Quality of a Nursing Information System depends on the nurses: A combined quantitative and qualitative evaluation
Margreet B. Michel-Verkerke
- A support vector machine based test for incongruence between sets of trees in tree space
David C Haws, Peter Huggins, Eric M O’Neill, David W Weisrock and Ruriko Yoshida (BMC Bioinformatics)
- A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools
Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter (BMC Bioinformatics)
- Mandatory alcohol intervention for alcohol-abusing college students: A systematic review
Nancy P. Barnett and Jennifer P. Read
Paper Review 2:
(Please select one paper from the following machine learning / data mining papers to present)
The following papers are from International Conference on Machine Learning 2011-2012.
1. A Co-training Approach for Multi-view Spectral Clustering
Abhishek Kumar, Hal Daume III, University of Maryland
2. Information-Theoretic Co-clustering
Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
3. Learning with Whom to Share in Multi-task Feature Learning
Zhuoliang Kang, Kristen Grauman, Fei Sha
4. Automatic Feature Decomposition for Single View Co-training
Minmin Chen, Kilian Weinberger, Yixin Chen
5. A Unified Probabilistic Model for Global and Local Unsupervised Feature Selection
Yue Guan, Jennifer Dy, Michael Jordan
6. On Random Weights and Unsupervised Feature Learning
Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Ng
7. The Constrained Weight Space SVM: Learning with Ranked Features
Kevin Small, Byron Wallace, Carla Brodley, Thomas Trikalinos
8. Support Vector Machines as Probabilistic Models
Vojtech Franc, Alexander Zien, Bernhard Scholkopf
9. A Graph-based Framework for Multi-Task Multi-View Learning
Jingrui He, Rick Lawrence
10. TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings
Chao Liu, Yi-Min Wang
11. Convex Multitask Learning with Flexible Task Clusters
Wenliang Zhong, James Kwok
12. Multi-level Lasso for Sparse Multi-task Regression
Aurelie Lozano, Grzegorz Swirszcz
13. Output Space Search for Structured Prediction
Janardhan Rao Doppa, Alan Fern, Prasad Tadepalli
14. A Complete Analysis of the l_1,p Group-Lasso
Julia Vogt, Volker Roth
15. Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation
Yuan Shi, Fei Sha
16. A convex relaxation for weakly supervised classifiers
Armand Joulin, Francis Bach
17. Inferring Latent Structure From Mixed Real and Categorical Relational Data
Esther Salazar, Lawrence Carin
18. Learning Task Grouping and Overlap in Multi-task Learning
Abhishek Kumar, Hal Daume III
The following papers are from ACM Special Interest Group on Knowledge Discovery and Data Mining 2011-2012
19. A Sparsity-Inducing Formulation for Evolutionary Co-Clustering
Shuiwang Ji*, Old Dominion Univ; Wenlu Zhang, Old Dominion University; Jun Liu, Siemens Corporate Research at Princeton
20. On Socio-Spatial Group Query for Location-Based Social Networks
De-Nian Yang, Academia Sinica; Chih-Ya Shen*, National Taiwan University; Wang-Chien Lee, Pennsylvania State University; Ming-Syan Chen, NTU
21. Robust Multi-Task Feature Learning
Pinghua Gong*, Tsinghua University; Jieping Ye, Arizona State University; Changshui Zhang, Tsinghua University
22. Unsupervised Feature Selection for Linked Social Media Data
Jiliang Tang*, Arizona State University; Huan Liu, Arizona State University
23. Learning from Crowds in the Presence of Schools of Thought
Yuandong Tian*, Carnegie Mellon University; Jun Zhu, Tsinghua University
24. Event-based Social Networks: Linking the Online and Offline Social Worlds
Xingjie Liu*, The Pennsylvania State Univ; Qi He, IBM Almaden Research Center; Yuanyuan Tian, IBM Almaden Research; Wang-Chien Lee, Pennsylvania State University; John McPherson, IBM Almaden Research Center; Jiawei Han, University of Illinois at Urbana-Champaign
25. On the Semantic Annotation of Places in Location-based Social Networks
Mao Ye; Dong Shou; Wang-Chien Lee; Peifeng Yin; Krzysztof Janowicz
26. Two-locus association mapping in subquadratic runtime
Panagiotis Achlioptas; Bernhard Scholkopf, Max Planck Institute; Karsten Borgwardt, Max Planck Institutes
27. Differentially Private Data Release for Data Mining
Noman Mohammed*, Concordia University; Rui Chen, Concordia University; Benjamin Fung, Concordia University; Mourad Debbabi, Concordia University; Philip Yu, University of Illinois at Chicago
28. Collaborative Topic Models for Recommending Scientific Articles
Chong Wang*, Princeton University; David Blei, Princeton Univ
TOOLS:
Tools that may help with course projects (to be completed)
- Matlab Optimization Toolbox
- SVM_Light (support vector machines)
- LIBSVM (support vector machines)
- Bayesian Knowledge Discoverer (BKD): computer program able to learn Bayesian Belief Networks from databases
- Bayes net toolbox for Matlab
- TSP Demo
- LeNet (neural networks)
- Neural networks demo
- Neural networks flash demo
- GAUL (genetic algorithm)
- Java genetic algorithm demo
- A complete notebook GA
- A system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW
- Tools for mining large databases C5.0 and See5
- Description of the SLIPPER rule learner, a system that learns sets of rules from data, based on the original RIPPER rule learner
- Information about Data Mining and knowledge discovery in Databases
- Clustering Algorithms
ACADEMIC INTEGRITY:
You are expected to adhere to the highest standards of academic honesty. Unless otherwise specified, collaboration on assignments is not allowed. Use of published materials is allowed, but the sources should be explicitly stated in your solutions. Violations will be reviewed and sanctioned according to the University Policy on Academic Integrity. Collaborations among team members are only allowed for the final term projects that are selected.
Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work for another person or work previously used without informing the instructor, or tampering with the academic work of other students.
DISABILITY STATEMENT:
If you have a documented disability for which you are or may be requesting an accommodation, you are encouraged to contact the instructor and the Center for Students with Disabilities or the University Program for College Students with Learning Disabilities as soon as possible to better ensure that such accommodations are implemented in a timely fashion.
Jinbo Bi 2012/8-2012/12
Last revised: 8/27/2012