
Introduction to Machine Learning

Ethem ALPAYDIN


The MIT Press, October 2004, ISBN 0-262-01211-1

 


 

I am no longer maintaining this page; please refer to the second edition.

 

 


The book can be ordered through The MIT Press, Amazon (CA, DE, FR, JP, UK, US), Barnes&Noble (US), Pandora (TR).

Find in a Library


Foreign Editions:

  • Prentice-Hall of India (IN) has published a reprint of the book in the Eastern Economy Edition series. ISBN: 81-203-2791-8 (2005).
  • Maschinelles Lernen (German language edition, translated by Simone Linke), Oldenbourg Verlag, ISBN-13: 978-3-486-58114-0, May 2008.
  • 机器学习导论 (Chinese simplified character edition, translated by Ming Fan), Huazhang Press, ISBN: 7111265246/9787111265245, June 2009 (Amazon China).
  • A Turkish language edition will be published by Bogazici University Press.

Description, Reviews, Table of Contents, Courses, Figures, Lecture Slides, Errata, Solutions to Exercises

Description: The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, recognize faces or spoken speech, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. It discusses many methods based in different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining, in order to present a unified treatment of machine learning problems and solutions. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The book can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods.

After an introduction that defines machine learning and gives examples of machine learning applications, the book covers supervised learning, Bayesian decision theory, parametric methods, multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, combining multiple learners, and reinforcement learning.


Reviews:


Table of Contents

  • Series Foreword xiii
  • Figures xv
  • Tables xxiii
  • Preface xxv
  • Acknowledgments xxvii
  • Notations xxix
  • 1 Introduction 1
    • 1.1 What Is Machine Learning? 1
    • 1.2 Examples of Machine Learning Applications 3
      • 1.2.1 Learning Associations 3
      • 1.2.2 Classification 4
      • 1.2.3 Regression 8
      • 1.2.4 Unsupervised Learning 10
      • 1.2.5 Reinforcement Learning 11
    • 1.3 Notes 12
    • 1.4 Relevant Resources 14
    • 1.5 Exercises 15
    • 1.6 References 16
  • 2 Supervised Learning 17
    • 2.1 Learning a Class from Examples 17
    • 2.2 Vapnik-Chervonenkis (VC) Dimension 22
    • 2.3 Probably Approximately Correct (PAC) Learning 24
    • 2.4 Noise 25
    • 2.5 Learning Multiple Classes 27
    • 2.6 Regression 29
    • 2.7 Model Selection and Generalization 32
    • 2.8 Dimensions of a Supervised Machine Learning Algorithm 35
    • 2.9 Notes 36
    • 2.10 Exercises 37
    • 2.11 References 38
  • 3 Bayesian Decision Theory 39
    • 3.1 Introduction 39
    • 3.2 Classification 41
    • 3.3 Losses and Risks 43
    • 3.4 Discriminant Functions 45
    • 3.5 Utility Theory 46
    • 3.6 Value of Information 47
    • 3.7 Bayesian Networks 48
    • 3.8 Influence Diagrams 55
    • 3.9 Association Rules 56
    • 3.10 Notes 57
    • 3.11 Exercises 57
    • 3.12 References 58
  • 4 Parametric Methods 61
    • 4.1 Introduction 61
    • 4.2 Maximum Likelihood Estimation 62
      • 4.2.1 Bernoulli Density 62
      • 4.2.2 Multinomial Density 63
      • 4.2.3 Gaussian (Normal) Density 64
    • 4.3 Evaluating an Estimator: Bias and Variance 64
    • 4.4 The Bayes' Estimator 67
    • 4.5 Parametric Classification 69
    • 4.6 Regression 73
    • 4.7 Tuning Model Complexity: Bias/Variance Dilemma 76
    • 4.8 Model Selection Procedures 79
    • 4.9 Notes 82
    • 4.10 Exercises 82
    • 4.11 References 83
  • 5 Multivariate Methods 85
    • 5.1 Multivariate Data 85
    • 5.2 Parameter Estimation 86
    • 5.3 Estimation of Missing Values 87
    • 5.4 Multivariate Normal Distribution 88
    • 5.5 Multivariate Classification 92
    • 5.6 Tuning Complexity 98
    • 5.7 Discrete Features 99
    • 5.8 Multivariate Regression 100
    • 5.9 Notes 102
    • 5.10 Exercises 102
    • 5.11 References 103
  • 6 Dimensionality Reduction 105
    • 6.1 Introduction 105
    • 6.2 Subset Selection 106
    • 6.3 Principal Components Analysis 108
    • 6.4 Factor Analysis 116
    • 6.5 Multidimensional Scaling 121
    • 6.6 Linear Discriminant Analysis 124
    • 6.7 Notes 127
    • 6.8 Exercises 130
    • 6.9 References 130
  • 7 Clustering 133
    • 7.1 Introduction 133
    • 7.2 Mixture Densities 134
    • 7.3 k-Means Clustering 135
    • 7.4 Expectation-Maximization Algorithm 139
    • 7.5 Mixtures of Latent Variable Models 144
    • 7.6 Supervised Learning after Clustering 145
    • 7.7 Hierarchical Clustering 146
    • 7.8 Choosing the Number of Clusters 149
    • 7.9 Notes 149
    • 7.10 Exercises 150
    • 7.11 References 150
  • 8 Nonparametric Methods 153
    • 8.1 Introduction 153
    • 8.2 Nonparametric Density Estimation 154
      • 8.2.1 Histogram Estimator 155
      • 8.2.2 Kernel Estimator 157
      • 8.2.3 k-Nearest Neighbor Estimator 158
    • 8.3 Generalization to Multivariate Data 159
    • 8.4 Nonparametric Classification 161
    • 8.5 Condensed Nearest Neighbor 162
    • 8.6 Nonparametric Regression: Smoothing Models 164
      • 8.6.1 Running Mean Smoother 165
      • 8.6.2 Kernel Smoother 166
      • 8.6.3 Running Line Smoother 167
    • 8.7 How to Choose the Smoothing Parameter 168
    • 8.8 Notes 169
    • 8.9 Exercises 170
    • 8.10 References 170
  • 9 Decision Trees 173
    • 9.1 Introduction 173
    • 9.2 Univariate Trees 175
      • 9.2.1 Classification Trees 176
      • 9.2.2 Regression Trees 180
    • 9.3 Pruning 182
    • 9.4 Rule Extraction from Trees 185
    • 9.5 Learning Rules from Data 186
    • 9.6 Multivariate Trees 190
    • 9.7 Notes 192
    • 9.8 Exercises 195
    • 9.9 References 195
  • 10 Linear Discrimination 197
    • 10.1 Introduction 197
    • 10.2 Generalizing the Linear Model 199
    • 10.3 Geometry of the Linear Discriminant 200
      • 10.3.1 Two Classes 200
      • 10.3.2 Multiple Classes 202
    • 10.4 Pairwise Separation 204
    • 10.5 Parametric Discrimination Revisited 205
    • 10.6 Gradient Descent 206
    • 10.7 Logistic Discrimination 208
      • 10.7.1 Two Classes 208
      • 10.7.2 Multiple Classes 211
    • 10.8 Discrimination by Regression 216
    • 10.9 Support Vector Machines 218
      • 10.9.1 Optimal Separating Hyperplane 218
      • 10.9.2 The Nonseparable Case: Soft Margin Hyperplane 221
      • 10.9.3 Kernel Functions 223
      • 10.9.4 Support Vector Machines for Regression 225
    • 10.10 Notes 227
    • 10.11 Exercises 227
    • 10.12 References 228
  • 11 Multilayer Perceptrons 229
    • 11.1 Introduction 229
      • 11.1.1 Understanding the Brain 230
      • 11.1.2 Neural Networks as a Paradigm for Parallel Processing 231
    • 11.2 The Perceptron 233
    • 11.3 Training a Perceptron 236
    • 11.4 Learning Boolean Functions 239
    • 11.5 Multilayer Perceptrons 241
    • 11.6 MLP as a Universal Approximator 244
    • 11.7 Backpropagation Algorithm 245
      • 11.7.1 Nonlinear Regression 246
      • 11.7.2 Two-Class Discrimination 248
      • 11.7.3 Multiclass Discrimination 250
      • 11.7.4 Multiple Hidden Layers 252
    • 11.8 Training Procedures 252
      • 11.8.1 Improving Convergence 252
        • Momentum 253
        • Adaptive Learning Rate 253
      • 11.8.2 Overtraining 253
      • 11.8.3 Structuring the Network 254
      • 11.8.4 Hints 257
    • 11.9 Tuning the Network Size 259
    • 11.10 Bayesian View of Learning 262
    • 11.11 Dimensionality Reduction 263
    • 11.12 Learning Time 266
      • 11.12.1 Time Delay Neural Networks 266
      • 11.12.2 Recurrent Networks 267
    • 11.13 Notes 268
    • 11.14 Exercises 270
    • 11.15 References 271
  • 12 Local Models 275
    • 12.1 Introduction 275
    • 12.2 Competitive Learning 276
      • 12.2.1 Online k-Means 276
      • 12.2.2 Adaptive Resonance Theory 281
      • 12.2.3 Self-Organizing Maps 282
    • 12.3 Radial Basis Functions 284
    • 12.4 Incorporating Rule-Based Knowledge 290
    • 12.5 Normalized Basis Functions 291
    • 12.6 Competitive Basis Functions 293
    • 12.7 Learning Vector Quantization 296
    • 12.8 Mixture of Experts 296
      • 12.8.1 Cooperative Experts 299
      • 12.8.2 Competitive Experts 300
    • 12.9 Hierarchical Mixture of Experts 300
    • 12.10 Notes 301
    • 12.11 Exercises 302
    • 12.12 References 302
  • 13 Hidden Markov Models 305
    • 13.1 Introduction 305
    • 13.2 Discrete Markov Processes 306
    • 13.3 Hidden Markov Models 309
    • 13.4 Three Basic Problems of HMMs 311
    • 13.5 Evaluation Problem 311
    • 13.6 Finding the State Sequence 315
    • 13.7 Learning Model Parameters 317
    • 13.8 Continuous Observations 320
    • 13.9 The HMM with Input 321
    • 13.10 Model Selection in HMM 322
    • 13.11 Notes 323
    • 13.12 Exercises 325
    • 13.13 References 325
  • 14 Assessing and Comparing Classification Algorithms 327
    • 14.1 Introduction 327
    • 14.2 Cross-Validation and Resampling Methods 330
      • 14.2.1 K-Fold Cross-Validation 331
      • 14.2.2 5x2 Cross-Validation 331
      • 14.2.3 Bootstrapping 332
    • 14.3 Measuring Error 333
    • 14.4 Interval Estimation 334
    • 14.5 Hypothesis Testing 338
    • 14.6 Assessing a Classification Algorithm's Performance 339
      • 14.6.1 Binomial Test 340
      • 14.6.2 Approximate Normal Test 341
      • 14.6.3 Paired t Test 341
    • 14.7 Comparing Two Classification Algorithms 341
      • 14.7.1 McNemar's Test 342
      • 14.7.2 K-Fold Cross-Validated Paired t Test 342
      • 14.7.3 5x2 cv Paired t Test 343
      • 14.7.4 5x2 cv Paired F Test 344
    • 14.8 Comparing Multiple Classification Algorithms: Analysis of Variance 345
    • 14.9 Notes 348
    • 14.10 Exercises 349
    • 14.11 References 350
  • 15 Combining Multiple Learners 351
    • 15.1 Rationale 351
    • 15.2 Voting 354
    • 15.3 Error-Correcting Output Codes 357
    • 15.4 Bagging 360
    • 15.5 Boosting 360
    • 15.6 Mixture of Experts Revisited 363
    • 15.7 Stacked Generalization 364
    • 15.8 Cascading 366
    • 15.9 Notes 368
    • 15.10 Exercises 369
    • 15.11 References 370
  • 16 Reinforcement Learning 373
    • 16.1 Introduction 373
    • 16.2 Single State Case: K-Armed Bandit 375
    • 16.3 Elements of Reinforcement Learning 376
    • 16.4 Model-Based Learning 379
      • 16.4.1 Value Iteration 379
      • 16.4.2 Policy Iteration 380
    • 16.5 Temporal Difference Learning 380
      • 16.5.1 Exploration Strategies 381
      • 16.5.2 Deterministic Rewards and Actions 382
      • 16.5.3 Nondeterministic Rewards and Actions 383
      • 16.5.4 Eligibility Traces 385
    • 16.6 Generalization 387
    • 16.7 Partially Observable States 389
    • 16.8 Notes 391
    • 16.9 Exercises 393
    • 16.10 References 394
  • A Probability 397
    • A.1 Elements of Probability 397
      • A.1.1 Axioms of Probability 398
      • A.1.2 Conditional Probability 398
    • A.2 Random Variables 399
      • A.2.1 Probability Distribution and Density Functions 399
      • A.2.2 Joint Distribution and Density Functions 400
      • A.2.3 Conditional Distributions 400
      • A.2.4 Bayes' Rule 401
      • A.2.5 Expectation 401
      • A.2.6 Variance 402
      • A.2.7 Weak Law of Large Numbers 403
    • A.3 Special Random Variables 403
      • A.3.1 Bernoulli Distribution 403
      • A.3.2 Binomial Distribution 404
      • A.3.3 Multinomial Distribution 404
      • A.3.4 Uniform Distribution 404
      • A.3.5 Normal (Gaussian) Distribution 405
      • A.3.6 Chi-Square Distribution 406
      • A.3.7 t Distribution 407
      • A.3.8 F Distribution 407
    • A.4 References 407
  • Index 409

Courses: The book is used in the following courses, either as the main textbook or as a reference book. I will be happy to hear of others.


Figures: The complete set of figures can be retrieved as a pdf file (2 MB). Instructors using the book are welcome to use these figures in their lecture slides as long as the use is non-commercial and the source is cited.


Lecture Slides: The following lecture slides (pdf and ppt) are made available for instructors using the book.


Errata:

  • (p. 20-22): S and G need not be unique. (Luc de Raedt)

Depending on the training set and the hypothesis class, there may be several S_i and G_j which respectively make up the S-set and the G-set. Every member of the S-set is consistent with all the instances, and there are no consistent hypotheses that are more specific. Similarly, every member of the G-set is consistent with all the instances, and there are no consistent hypotheses that are more general. These two make up the boundary sets, and any hypothesis between them is consistent and is part of the version space. The candidate elimination algorithm incrementally updates the S- and G-sets as it sees training instances one by one; see (Mitchell, 1997; Russell and Norvig, 1995). A small illustrative sketch of the S-set for the rectangle hypothesis class is given after the errata list below.

  • (p. 30): Eq. 2.15: w_1 x + w_0 should be w_1 x^t + w_0 (Mike Colagrosso)
  • (p. 30): Eq. 2.15: Not needed, but the summation should be multiplied by 1/N to match Eq. 2.12 (Mike Colagrosso)
  • (p. 35): Eq. 2.19: Missing closing ')' (Mike Colagrosso)
  • (p. 58): Ref (Agrawal et al., 1996): The second author's name should be "Mannila." (Kai Puolamäki)
  • (p. 59): Ref. The title of Pearl's 1988 book is "Probabilistic Reasoning in INTELLIGENT Systems." (Yang-feng Ji)
  • (p. 62): Eq. 4.1: l(\theta) should be l(\theta|X) (Chris Mansley)
  • (p. 63): Eq. 4.5: p(x_1, x_2, \dots, x_K) should be P(x_1, x_2, \dots, x_K). That is, P should be uppercase. (Mike Colagrosso)
  • (p. 86): Eq. 5.3: '[' missing after the first 'E'. (Hakan Haberdar)
  • (p. 89): Equation at the bottom of the page: The + (plus) before 2 should be a – (minus). (Barış Can Daylık)
  • (p. 90): Figure 5.2: The upper-left figure should be a circle, but the plot is squashed; the x_1 axis is longer than the x_2 axis. (Mike Colagrosso)
  • (p. 118): Equation at the bottom of the page: In the second equality, the last C should be transposed. (Chulhong Min)
  • (p. 124): Eq. 6.31: It should be x^t. (Stijn Vanderlooy)
  • (p. 157): Figure 8.2: h values are twice the actual values. The titles should read 2h=2, 2h=1 and 2h=0.5. (Onder Eker, Alex Kogan)
  • (p. 160): The first sentence of the second paragraph should read: "For example, the use of the Euclidean norm in equation 8.11 implies that ..." (Stijn Vanderlooy)
  • (p. 176): Second line of fourth paragraph should read: "... number of bits needed to encode the class code of an instance" (Stijn Vanderlooy)
  • (p. 178): Eq. 9.8: log should be base 2. (Ismail Ari)
  • (p. 187 and 196): The name of the author for the Irep reference is F{\"u}rnkranz. (Stijn Vanderlooy)
  • (p. 189): Third paragraph, line 5 from top: "instances of all other classes are taken as [negative] examples." (Ismail Ari)
  • (p. 191): Figure 9.8: w_{11} x_1 + w_{12} x_2 + w_{10} = 0 should be w_{11} x_1 + w_{12} x_2 + w_{10} > 0 (Mike Colagrosso)
  • (p. 198): Fourth line from the bottom of the page: "magnitude" is misspelt. (Michael Dominguez)
  • (p. 203): Eq. 10.7: w_{i0} shouldn't be bold. It's a scalar, not a vector, as in the sentence above and Eq. 10.6. (Mike Colagrosso)
  • (p. 209): Eq. 10.23: E(w, w_o | X) should be E(w, w_0 | X). That is, the subscript should be a zero, not an "oh." (Mike Colagrosso)
  • (p. 210): Fig 10.6. The 11th line (that starts with \Delta w_j) should also be enclosed in a for loop of j=0,\ldots,d. (Didem Unat)
  • (p. 222): Seventh line from the bottom of the page: The number of misclassifications is \#\{\xi^t \ge 1\}. (Ming Fan)
  • (p. 227): First sentence of 10.10: Change "discriminant" to "discrimination" (Mike Colagrosso)
  • (p. 227): Exercise 1: change "function" to "functions" (Mike Colagrosso)
  • (p. 235): The Fig. 11.2 caption mentions w_{ij} but there is no w_{ij} in the figure. w_{ij} is the weight of the connection from input x_j to output y_i; w_i (the vector of weights to output y_i) is what is shown in the figure. (Stijn Vanderlooy)
  • (p. 236): The first line after eq. 11.6: ..., the linear model can also be used ... (Ming Fan, Michael Orlov)
  • (p. 238): In the first cross-entropy eq on the top of the page, the summation over i and all i subscripts should be omitted. (David Warde-Farley)
  • (p. 239): First word in the Figure 11.3 narrative should be "Perceptron" instead of "Percepton." (Mike Colagrosso)
  • (p. 240): In the line below the equation, it should read: "Note that y=s(x_1+x_2-1.5) satisfies ..." (Ming Fan)
  • (p. 245): On the third line from the bottom of the page, it should read z_h and not h_j. (Joel Kammet)
  • (p. 252): sigmoid() is missing from the second terms on the right-hand side of the equations defining z_{1h} and z_{2l}.
  • (p. 257): Insert "is" before "as" in the last sentence of the first paragraph to read "..., it is as if the training set ..." (Tunga Gungor)
  • (p. 267): Fig. 11.20: The input units x^{t-\tau},...,x^{t-1},x^t should be labeled in the opposite order; or equivalently, the arrows should point to the left. x^t is the current input seen (the latest) and x^{t-\tau} is the input seen \tau steps in the past (delayed \tau times).
  • (p. 279): Fig 12.2: On line 5 of the pseudocode, m_j should be m_i (Murat Semerci)
  • (p. 282): Eq. 12.9: On the third line, x should be x^t. (Ismail Ari)
  • (p. 288): Remove the extra "the" in the first sentence. (Tunga Gungor)
  • (p. 308): Eq. 13.8: The denominator should read "#{sequences}"; "number of" in the curly brackets is redundant. (Can Kavaklioglu)
  • (p. 313): Fig 13.3, legend: "...computation of \alpha_{t+1}(j)..." (Ismail Ari)
  • (p. 317): Fig. 13.4: Below the node for state j, '1' should follow the line O_{t+}; that is, the observation is named O_{t+1}.
  • (p. 319): Eq. 13.32: In estimating b_j(m), t should range from 1 to T_k (and not T_k-1) in both the numerator and the denominator. (Cem Keskin)
  • (p. 320): Eq. 13.35: Drop j in P(G_{jl}). (Cem Keskin)
  • (p. 327): On the second line from the bottom of the page, “to” is missing before “say which one …” (Hussein Issa).
  • (p. 329): On line 6 from the bottom of the page, “of” is missing between “both” and “these.” (Hussein Issa).
  • (p. 330): "than" on line 16 should be changed to "then." (Tunga Gungor)
  • (p. 340): Eq. 14.12: The summation should start from j=0. (Alex Kogan)
  • (p. 343): Eq. 14.17: In the first term to the right, division by \sigma is missing in the numerator. It should be changed to \frac{p^{(1)}_1 / \sigma}{\sqrt{M/5}}. (Alex Kogan)
  • (p. 362): Fig 15.2: On line 11, "Then" is missing after the condition. It should read: If y^t_j=r^t Then p^t_{j+1}\leftarrow \beta_j p^t_j Else p^t_{j+1}\leftarrow p^t_j (Stijn Vanderlooy)
  • (p. 375): First paragraph of 16.2: classification is misspelled. (Mike Colagrosso)
  • (p. 379): Eq. 16.9: V*(s_t) should be changed to V*(s_{t+1}). (Omer Korcak)
  • (p. 380): Fig 16.3, first line: Initialize a policy \Pi' arbitrarily (Stijn Vanderlooy)
  • (p. 381): Eqs. 16.10 and 16.11: Replace the denominators as \sum_{b\in{\cal A}} \exp ... (Stijn Vanderlooy)
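
To make the S-set/G-set erratum above concrete, here is a minimal Python sketch (not taken from the book or the solutions manual) that computes one member of the S-set for the axis-aligned rectangle hypothesis class of Chapter 2 and checks it for consistency. The toy data, function names, and feature interpretation are illustrative assumptions only.

    # Minimal sketch: one member of the S-set for the axis-aligned rectangle
    # hypothesis class, namely the tightest rectangle enclosing all positive
    # instances. Toy data and names are illustrative, not from the book.

    def most_specific_rectangle(X, y):
        """Return (x1_min, x1_max, x2_min, x2_max) of the tightest rectangle
        containing every positive instance, i.e. one member of the S-set."""
        pos = [x for x, label in zip(X, y) if label == 1]
        x1s, x2s = zip(*pos)
        return min(x1s), max(x1s), min(x2s), max(x2s)

    def consistent(rect, X, y):
        """A hypothesis is consistent if it covers all positive instances
        and none of the negative ones."""
        x1_lo, x1_hi, x2_lo, x2_hi = rect
        covers = lambda x: x1_lo <= x[0] <= x1_hi and x2_lo <= x[1] <= x2_hi
        return all(covers(x) == (label == 1) for x, label in zip(X, y))

    # Toy training set: (engine power, price) pairs labeled family car (1) or not (0).
    X = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (0.2, 0.2), (5.0, 5.0)]
    y = [1, 1, 1, 0, 0]

    S = most_specific_rectangle(X, y)
    print("S =", S, "consistent:", consistent(S, X, y))

Any rectangle between S and a member of the G-set (the largest rectangles consistent with the data) lies in the version space; as the erratum notes, neither boundary set need contain a single hypothesis.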

Solutions to Exercises: Available as a gzipped tar file or a zip archive for instructors who have adopted the book for course use. Please contact The MIT Press for the user name and password. The manual contains solutions to the exercises and example Matlab programs.


Created on Oct 24, 2004 by E. Alpaydin (my_last_name AT boun DOT edu DOT tr)

  • Jan 14, 2005: Added links to more online booksellers.
  • Jan 31, 2005: Added link to the pdf file of figures.
  • Apr 3, 2005: Added Errata.
  • June 1, 2005: Further errata.
  • July 12, 2005: Added another bookseller link.
  • July 20, 2005: Added more bookseller links and the lecture slides of Chapters 1, 2 and 11.
  • July 28, 2005: Added all lecture slides.
  • Sep 26, 2005: Added ppt of all lecture slides. This new version (V1-1) is the same as the previously available V1-0 except that I retyped all equations using Microsoft Equation Editor.
  • Oct 25, 2005: Further errata.
  • Nov 12, 2005: Added reviews and courses.
  • Dec 14, 2005: Added links to MIT Press for sample pdfs of Foreword, Preface, and Chapter 1.
  • Feb 1, 2006: Added links to 2006 courses.
  • Apr 27, 2006: Added new course links and errata.
  • Jul 4, 2006: Added errata.pdf
  • Sep 1, 2006: Added links to Fall 2006 courses.
  • Nov 14, 2006: Added info on Foreign Editions.
  • Jan 12, 2007: Added Solutions to Exercises. Thanks to Oya Aran, our web admin, for her help in making the file protected.
  • Feb 5, 2007: Added links to Find-In-A-Library and new courses.
  • Jul 16, 2007: Some more errata.
  • Sep 18, 2007: Added links to 2007 courses.
  • May 1, 2008: Added an erratum and a review.
  • August 20, 2009: Added info about the Chinese edition.