!["View large cover" [IMG]](0262012111-medium.jpg)
The MIT Press, October 2004, ISBN 0-262-01211-1
The book can be ordered
through The MIT
Press, Amazon (CA, DE, FR, JP, UK, US), Barnes&Noble (US), Pandora (TR).
Find in a Library
Foreign
Editions:
- Prentice-Hall of India (IN) has published
a reprint of the book in the Eastern Economy Edition series. ISBN:
81-203-2791-8 (2005).
- Maschinelles Lernen
(German language edition, translated by Simone Linke),
Oldenbourg Verlag,
ISBN-13: 978-3-486-58114-0, May 2008.
- Chinese simplified character edition is in
preparation and will be published by China Machine Press in conjunction
with Beijing Huazhang Graphics & Information
Co., Ltd.
- Turkish language edition will be published by Bogazici University Press.
Description, Reviews, Table of Contents, Courses, Figures, Lecture Slides, Errata,
Solutions to Exercises
Description:
The goal of machine learning is to program computers to use example data or past
experience to solve a given problem. Many successful applications of machine
learning exist already, including systems that analyze past sales data to
predict customer behavior, recognize faces or spoken speech, optimize robot
behavior so that a task can be completed using minimum resources, and extract
knowledge from bioinformatics data. Introduction to Machine Learning is
a comprehensive textbook on the subject, covering a broad array of topics not
usually included in introductory machine learning texts. It discusses many
methods based in different fields, including statistics, pattern recognition,
neural networks, artificial intelligence, signal processing, control, and data
mining, in order to present a unified treatment of machine learning problems and
solutions. All learning algorithms are explained so that the student can easily
move from the equations in the book to a computer program. The book can be used
by advanced undergraduates and graduate students who have completed courses in
computer programming, probability, calculus, and linear algebra. It will also
be of interest to engineers in the field who are concerned
with the application of machine learning methods.
After an introduction that
defines machine learning and gives examples of machine learning applications,
the book covers supervised learning, Bayesian decision theory, parametric
methods, multivariate methods, dimensionality reduction, clustering,
nonparametric methods, decision trees, linear discrimination, multilayer perceptrons, local models, hidden Markov models, assessing
and comparing classification algorithms, combining multiple learners, and
reinforcement learning.
Reviews:
Table
of Contents
- Series
Foreword xiii
- Figures xv
- Tables xxiii
- Preface
xxv
- Acknowledgments xxvii
- Notations xxix
- 1
Introduction 1
- 1.1 What Is Machine Learning? 1
- 1.2 Examples of Machine Learning
Applications 3
- 1.2.1 Learning Associations 3
- 1.2.2 Classification 4
- 1.2.3 Regression 8
- 1.2.4 Unsupervised Learning 10
- 1.2.5 Reinforcement Learning 11
- 1.3 Notes 12
- 1.4 Relevant Resources 14
- 1.5 Exercises 15
- 1.6 References 16
- 2 Supervised Learning 17
- 2.1 Learning a Class from
Examples 17
- 2.2 Vapnik-Chervonenkis
(VC) Dimension 22
- 2.3 Probably Approximately
Correct (PAC) Learning 24
- 2.4 Noise 25
- 2.5 Learning Multiple Classes 27
- 2.6 Regression 29
- 2.7 Model Selection and
Generalization 32
- 2.8 Dimensions of a Supervised
Machine Learning Algorithm 35
- 2.9 Notes 36
- 2.10 Exercises 37
- 2.11 References 38
- 3 Bayesian Decision Theory 39
- 3.1 Introduction 39
- 3.2 Classification 41
- 3.3 Losses and Risks 43
- 3.4 Discriminant
Functions 45
- 3.5 Utility Theory 46
- 3.6 Value of Information 47
- 3.7 Bayesian Networks 48
- 3.8 Influence Diagrams 55
- 3.9 Association Rules 56
- 3.10 Notes 57
- 3.11 Exercises 57
- 3.12 References 58
- 4 Parametric Methods 61
- 4.1 Introduction 61
- 4.2 Maximum Likelihood Estimation
62
- 4.2.1 Bernoulli Density 62
- 4.2.2 Multinomial Density 63
- 4.2.3 Gaussian (Normal) Density 64
- 4.3 Evaluating an Estimator: Bias
and Variance 64
- 4.4 The Bayes'
Estimator 67
- 4.5 Parametric Classification 69
- 4.6 Regression 73
- 4.7 Tuning Model Complexity:
Bias/Variance Dilemma 76
- 4.8 Model Selection Procedures 79
- 4.9 Notes 82
- 4.10 Exercises 82
- 4.11 References 83
- 5 Multivariate Methods 85
- 5.1 Multivariate Data 85
- 5.2 Parameter Estimation 86
- 5.3 Estimation of Missing Values
87
- 5.4 Multivariate Normal
Distribution 88
- 5.5 Multivariate Classification
92
- 5.6 Tuning Complexity 98
- 5.7 Discrete Features 99
- 5.8 Multivariate Regression 100
- 5.9 Notes 102
- 5.10 Exercises 102
- 5.11 References 103
- 6 Dimensionality Reduction 105
- 6.1 Introduction 105
- 6.2 Subset Selection 106
- 6.3 Principal Components Analysis
108
- 6.4 Factor Analysis 116
- 6.5 Multidimensional Scaling 121
- 6.6 Linear Discriminant
Analysis 124
- 6.7 Notes 127
- 6.8 Exercises 130
- 6.9 References 130
- 7 Clustering 133
- 7.1 Introduction 133
- 7.2 Mixture Densities 134
- 7.3 k-Means Clustering 135
- 7.4 Expectation-Maximization
Algorithm 139
- 7.5 Mixtures of Latent Variable
Models 144
- 7.6 Supervised Learning after
Clustering 145
- 7.7 Hierarchical Clustering 146
- 7.8 Choosing the Number of
Clusters 149
- 7.9 Notes 149
- 7.10 Exercises 150
- 7.11 References 150
- 8 Nonparametric Methods 153
- 8.1 Introduction 153
- 8.2 Nonparametric Density
Estimation 154
- 8.2.1 Histogram Estimator 155
- 8.2.2 Kernel Estimator 157
- 8.2.3 k-Nearest Neighbor
Estimator 158
- 8.3 Generalization to
Multivariate Data 159
- 8.4 Nonparametric Classification
161
- 8.5 Condensed Nearest Neighbor
162
- 8.6 Nonparametric Regression:
Smoothing Models 164
- 8.6.1 Running Mean Smoother 165
- 8.6.2 Kernel Smoother 166
- 8.6.3 Running Line Smoother 167
- 8.7 How to Choose the Smoothing
Parameter 168
- 8.8 Notes 169
- 8.9 Exercises 170
- 8.10 References 170
- 9 Decision Trees 173
- 9.1 Introduction 173
- 9.2 Univariate
Trees 175
- 9.2.1 Classification Trees 176
- 9.2.2 Regression Trees 180
- 9.3 Pruning 182
- 9.4 Rule Extraction from Trees
185
- 9.5 Learning Rules from Data 186
- 9.6 Multivariate Trees 190
- 9.7 Notes 192
- 9.8 Exercises 195
- 9.9 References 195
- 10 Linear Discrimination 197
- 10.1 Introduction 197
- 10.2 Generalizing the Linear
Model 199
- 10.3 Geometry of the Linear Discriminant 200
- 10.3.1 Two Classes 200
- 10.3.2 Multiple Classes 202
- 10.4 Pairwise
Separation 204
- 10.5 Parametric Discrimination
Revisited 205
- 10.6 Gradient Descent 206
- 10.7 Logistic Discrimination 208
- 10.7.1 Two Classes 208
- 10.7.2 Multiple Classes 211
- 10.8 Discrimination by Regression
216
- 10.9 Support Vector Machines 218
- 10.9.1 Optimal Separating Hyperplane 218
- 10.9.2 The Nonseparable
Case: Soft Margin Hyperplane 221
- 10.9.3 Kernel Functions 223
- 10.9.4 Support Vector Machines
for Regression 225
- 10.10 Notes 227
- 10.11 Exercises 227
- 10.12 References 228
- 11 Multilayer Perceptrons
229
- 11.1 Introduction 229
- 11.1.1 Understanding the Brain
230
- 11.1.2 Neural Networks as a
Paradigm for Parallel Processing 231
- 11.2 The Perceptron
233
- 11.3 Training a Perceptron 236
- 11.4 Learning Boolean Functions
239
- 11.5 Multilayer Perceptrons 241
- 11.6 MLP as a Universal Approximator 244
- 11.7 Backpropagation
Algorithm 245
- 11.7.1 Nonlinear Regression 246
- 11.7.2 Two-Class Discrimination
248
- 11.7.3 Multiclass
Discrimination 250
- 11.7.4 Multiple Hidden Layers
252
- 11.8 Training Procedures 252
- 11.8.1 Improving Convergence 252
- Momentum 253
- Adaptive Learning Rate 253
- 11.8.2 Overtraining 253
- 11.8.3 Structuring the Network
254
- 11.8.4 Hints 257
- 11.9 Tuning the Network Size 259
- 11.10 Bayesian View of Learning
262
- 11.11 Dimensionality Reduction
263
- 11.12 Learning Time 266
- 11.12.1 Time Delay Neural
Networks 266
- 11.12.2 Recurrent Networks 267
- 11.13 Notes 268
- 11.14 Exercises 270
- 11.15 References 271
- 12 Local Models 275
- 12.1 Introduction 275
- 12.2 Competitive Learning 276
- 12.2.1 Online k-Means 276
- 12.2.2 Adaptive Resonance Theory
281
- 12.2.3 Self-Organizing Maps 282
- 12.3 Radial Basis Functions 284
- 12.4 Incorporating Rule-Based
Knowledge 290
- 12.5 Normalized Basis Functions
291
- 12.6 Competitive Basis Functions
293
- 12.7 Learning Vector Quantization
296
- 12.8 Mixture of Experts 296
- 12.8.1 Cooperative Experts 299
- 12.8.2 Competitive Experts 300
- 12.9 Hierarchical Mixture of
Experts 300
- 12.10 Notes 301
- 12.11 Exercises 302
- 12.12 References 302
- 13 Hidden Markov Models 305
- 13.1 Introduction 305
- 13.2 Discrete Markov Processes
306
- 13.3 Hidden Markov Models 309
- 13.4 Three Basic Problems of HMMs 311
- 13.5 Evaluation Problem 311
- 13.6 Finding the State Sequence
315
- 13.7 Learning Model Parameters
317
- 13.8 Continuous Observations 320
- 13.9 The HMM with Input 321
- 13.10 Model Selection in HMM 322
- 13.11 Notes 323
- 13.12 Exercises 325
- 13.13 References 325
- 14 Assessing and Comparing Classification Algorithms
327
- 14.1 Introduction 327
- 14.2 Cross-Validation and Resampling Methods 330
- 14.2.1 K-Fold
Cross-Validation 331
- 14.2.2 5x2 Cross-Validation 331
- 14.2.3 Bootstrapping 332
- 14.3 Measuring Error 333
- 14.4 Interval Estimation 334
- 14.5 Hypothesis Testing 338
- 14.6 Assessing a Classification
Algorithm's Performance 339
- 14.6.1 Binomial Test 340
- 14.6.2 Approximate Normal Test 341
- 14.6.3 Paired t Test 341
- 14.7 Comparing Two Classification
Algorithms 341
- 14.7.1 McNemar's
Test 342
- 14.7.2 K-Fold
Cross-Validated Paired t Test 342
- 14.7.3 5x2 cv
Paired t Test 343
- 14.7.4 5x2 cv
Paired F Test 344
- 14.8 Comparing Multiple
Classification Algorithms: Analysis of Variance 345
- 14.9 Notes 348
- 14.10 Exercises 349
- 14.11 References 350
- 15 Combining Multiple Learners 351
- 15.1 Rationale 351
- 15.2 Voting 354
- 15.3 Error-Correcting Output
Codes 357
- 15.4 Bagging 360
- 15.5 Boosting 360
- 15.6 Mixture of Experts Revisited
363
- 15.7 Stacked Generalization 364
- 15.8 Cascading 366
- 15.9 Notes 368
- 15.10 Exercises 369
- 15.11 References 370
- 16 Reinforcement Learning 373
- 16.1 Introduction 373
- 16.2 Single State
Case: K-Armed Bandit 375
- 16.3 Elements of Reinforcement
Learning 376
- 16.4 Model-Based Learning 379
- 16.4.1 Value Iteration 379
- 16.4.2 Policy Iteration 380
- 16.5 Temporal Difference Learning
380
- 16.5.1 Exploration Strategies
381
- 16.5.2 Deterministic Rewards and
Actions 382
- 16.5.3 Nondeterministic Rewards
and Actions 383
- 16.5.4 Eligibility Traces 385
- 16.6 Generalization 387
- 16.7 Partially Observable States
389
- 16.8 Notes 391
- 16.9 Exercises 393
- 16.10 References 394
- A Probability 397
- A.1 Elements of Probability 397
- A.1.1 Axioms of Probability 398
- A.1.2 Conditional Probability
398
- A.2 Random Variables 399
- A.2.1 Probability Distribution
and Density Functions 399
- A.2.2 Joint Distribution and
Density Functions 400
- A.2.3 Conditional Distributions
400
- A.2.4 Bayes'
Rule 401
- A.2.5 Expectation 401
- A.2.6 Variance 402
- A.2.7 Weak Law of Large Numbers
403
- A.3 Special Random Variables 403
- A.3.1 Bernoulli Distribution 403
- A.3.2 Binomial Distribution 404
- A.3.3 Multinomial Distribution
404
- A.3.4 Uniform Distribution 404
- A.3.5 Normal (Gaussian) Distribution 405
- A.3.6 Chi-Square Distribution
406
- A.3.7 t Distribution 407
- A.3.8 F Distribution 407
- A.4 References 407
- Index
409
Courses:
The book is used in the following courses, either as the main textbook, or as a
reference book. I will be happy to be told of others.
- Textbook:
- Reference book:
Figures:
The complete set of figures can be retrieved as a pdf file (2 MB). Instructors using the book are welcome
to use these figures in their lecture slides as long as the use is
non-commercial and the source is cited.
Lecture
Slides: The following lecture slides (pdf and ppt) are made available
for instructors using the book.
Errata:
- (p. 20-22): S and G need not be unique. (Luc de Raedt)
Depending
on the training set and the hypothesis class, there may be several S_i and G_j which respectively
make up the S-set and the G-set. Every member of the S-set is consistent with
all the instances and there are no consistent hypotheses that are more
specific. Similarly, every member of the G-set is consistent with all the
instances and there are no consistent hypotheses that are more general. These
two make up the boundary sets and any hypothesis between them is consistent and
is part of the version space. There is an algorithm called candidate
elimination that incrementally updates the S- and G-sets as it sees training
instances one by one. See (Mitchell, 1997; Russell and Norvig;
1995).
- (p. 30): Eq. 2.15: w_1 x +
w_0 should be w_1 x^t + w_0 (Mike Colagrosso)
- (p. 30): Eq. 2.15: Not
needed, but the summation should be multiplied by 1/N to match Eq. 2.12 (Mike Colagrosso)
- (p. 35): Eq. 2.19: Missing
closing ')' (Mike Colagrosso)
- (p. 58): Ref (Agrawal et
al., 1996): The second author's name should be "Mannila."
(Kai Puolamäki)
- (p. 59): Ref. The title of Pearl's 1988 book is "Probabilistic
Reasoning in INTELLIGENT Systems." (Yang-feng
Ji)
- (p. 62): Eq. 4.1: l(\theta)
should be l(\theta|X) (Chris Mansley)
- (p. 63): Eq. 4.5: p(x_1, x_2, \dots, x_K)
should be P(x_1, x_2, \dots, x_K). That is, P
should be uppercase. (Mike Colagrosso)
- (p. 86): Eq. 5.3: '['
missing after the first 'E'. (Hakan Haberdar)
- (p. 90): Figure 5.2: Upper-left figure should be a
circle, but the plot is squashed. x_1 axis is
longer than the x_2 axis. (Mike Colagrosso)
- (p. 118): Equation at the bottom: In the second
equality, the last C is to transposed. (Chulhong Min)
- (p. 124): Eq. 6.31: It
should be x^t. (Stijn Vanderlooy)
- (p. 157): Figure 8.2: h values are twice the actual
values. The titles should read 2h=2, 2h=1 and 2h=0.5. (Onder
Eker, Alex Kogan)
- (p. 160): The first sentence of the second paragraph
should read: "For example, the use of the Euclidean norm in equation
8.11 implies that ..." (Stijn Vanderlooy)
- (p. 176): Second line of fourth paragraph should
read: "... number of bits needed to encode the class code of an
instance" (Stijn Vanderlooy)
- (p. 178): Eq. 9.8: log
should be base 2. (Ismail Ari)
- (p. 187 and 196): The name of the author for the Irep reference is F{\"u}rnkranz. (Stijn Vanderlooy)
- (p. 189): Third paragraph, line 5 from top:
"instances of all other classes are taken as [negative]
examples." (Ismail Ari)
- (p. 191): Figure 9.8: w_{11} x_1 + w_{12} x_2 +
w_{10} = 0 should be w_{11} x_1 + w_{12} x_2 + w_{10} > 0 (Mike Colagrosso)
- (p. 198): Fourth line from the bottom of the page:
"magnitude" is misspelt. (Michael
Dominguez)
- (p. 203): Eq. 10.7: w_{i0} shouldn't be bold. It's a scalar, not a vector,
as in the sentence above and Eq. 10.6. (Mike Colagrosso)
- (p. 209): Eq. 10.23: E(w, w_o | X) should be E(w,
w_0 | X). That is, the subscript should be a zero, not an "oh."
(Mike Colagrosso)
- (p. 210): Fig 10.6. The 11th line (that starts with
\Delta w_j) should also be enclosed in a for
loop of j=0,\ldots,d. (Didem Unat)
- (p. 222): Seventh line from the bottom of the page:
The number of misclassifications is \#\{|xi^t \ge 1\}. (Ming Fan)
- (p. 227): First sentence of 10.10: Change "discriminant" to "discrimination" (Mike
Colagrosso)
- (p. 227): Exercise 1: change "function" to
"functions" (Mike Colagrosso)
- (p. 235): Fig. 11.2 caption mentions w_{ij} but there is no w_{ij} in the figure. w_{ij} is the weight of the connection from input x_j to output y_i. w_i (the vector of weights to output y_i) are shown in the figure. (Stijn
Vanderlooy)
- (p. 236): The first line after eq.
11.6: ..., the linear model can also be used ... (Ming Fan, Michael Orlov)
- (p. 238): In the first cross-entropy eq on the top of the page, the summation over i and all i subscripts
should be omitted. (David Warde-Farley)
- (p. 239): First word in the Figure 11.3 narrative
should be "Perceptron" instead of
"Percepton." (Mike Colagrosso)
- (p. 240): In the line below the equation, it should
read: Note that y=s(x_1+x_2-1.5) satisfies ..." (Ming Fan)
- (p. 245): On the third line from the bottom of the
page, it should read z_h and not h_j. (Joel Kammet)
- (p. 252): sigmoid() missing
in the second terms to the right of eqs defining
z_1h and z_2l.
- (p. 257): Insert "is" before "as"
in the last sentence of the first paragraph to read "..., it is as if
the training set ..." (Tunga Gungor)
- (p. 267): Fig. 11.20: The input units x^{t-\tau},...,x^{t-1},x^t should
be labeled in the opposite order; or equivalently, the arrows should point
to the left. x^t is the
current input seen (the latest) and x^{t-\tau}
is the input seen \tau steps in the past
(delayed \tau times).
- (p. 279): Fig 12.2: On line 5 of the psudocode, m_j should be m_i (Murat Semerci)
- (p. 282): Eq. 12.9: On the
third line, x should be x^t. (Ismail Ari)
- (p. 288): Remove the extra "the" in the
first sentence. (Tunga Gungor)
- (p. 308): Eq. 13.8: The
denominator should read "#{sequences}";
"number of" in the curly brackets is redundant. (Can Kavaklioglu)
- (p. 313): Fig 13.3, legend: "...computation of
\alpha_{t+1}(j)..." (Ismail
Ari)
- (p.317): Fig. 13.4: Below the node for state j, '1'
should follow the line O_{t+}; that is, the
observation is named O_{t+1}.
- (p.319): Eq. 13.32: In
estimating b_j(m), t should range from 1 to T_k
(and not T_k-1) in both the numerator and the denominator. (Cem Keskin)
- (p. 320): Eq. 13.35: Drop j
in P(G_{jl}). (Cem Keskin)
- (p. 330): "than" on line 16 should be
changed to "then." (Tunga Gungor)
- (p. 340): Eq. 14.12: The summation
should start from j=0. (Alex Kogan)
- (p. 343): Eq. 14.17: In the
first term to the right, division by \sigma is missing in the numerator.
It should be changed to: "\frac{p^{(1)}_1 / \sigma} / {\sqrt{M/5}}.
(Alex Kogan)
- (p. 362): Fig 15.2: On line 11, "Then" is
missing after the condition. It should read: If y^t_j=r^t Then p^t_{j+1}\leftarrow \beta_j p^t_j Else p^t_{j+1}\leftarrow p^t_j (Stijn Vanderlooy)
- (p. 375): First paragraph of 16.2: classification is
misspelled. (Mike Colagrosso)
- (p. 379): Eq. 16.9: V*(s_t) should be changed to V*(s_{t+1}).
(Omer Korcak)
- (p. 380): Fig 16.3, first line: Initialize a policy
\Pi' arbitrarily (Stijn Vanderlooy)
- (p. 381): Eqs. 16.10 and
16.11: Replace the denominators as \sum_{b\in{\cal A}} \exp ... (Stijn Vanderlooy)
Solutions
to Exercises: Available as a gzipped tar or compressed
zipped folder file for instructors who have adopted the book for course
use. Please contact The MIT Press for user name and password. The manual
contains solutions to exercises and example Matlab
programs.
Created on
Oct 24, 2004 by E. Alpaydin
(my_last_name AT boun DOT edu DOT tr)
- Jan 14, 2005: Added links to more online booksellers.
- Jan 31, 2005: Added link to the pdf
file of figures.
- Apr, 3, 2005: Added Errata.
- June 1, 2005: Further errata.
- July 12, 2005: Added more bookseller link.
- July 20, 2005: Added more bookseller links and the
lecture slides of Chapters 1, 2 and 11.
- July 28, 2005: Added all lecture slides.
- Sep 26, 2005: Added ppt of
all lecture slides. This new version (V1-1) is the same as the previously
available V1-0 except that I retyped all equations using Microsoft
Equation Editor.
- Oct 25, 2005: Further errata.
- Nov 12, 2005: Added reviews and courses.
- Dec 14, 2005: Added links to MIT Press for sample pdfs of Foreword, Preface, and Chapter 1.
- Feb 1, 2006: Added links to 2006 courses.
- Apr 27, 2006: Added new course links and errata.
- Jul 4, 2006: Added errata.pdf
- Sep 1, 2006: Added links to Fall
2006 courses.
- Nov 14, 2006: Added info on Foreign Editions.
- Jan 12, 2007: Added Solutions to Exercises. Thanks to
Oya Aran, our web
admin, for her help in making the file protected.
- Feb 5, 2007: Added links to Find-In-A-Library and new
courses.
- Jul 16, 2007: Some more errata.
- Sep 18, 2007: Added links to 2007 courses.
- May 1, 2008: Added an erratum and a review.