
original features (e.g., squared features, products of features). A data scientist would
become familiar with the different alternatives for kernel functions (linear, polynomial,
and others).
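As a concrete (if simplified) illustration of what such a kernel buys you, the short Python sketch below (the function names are ours, not from any particular SVM library) shows that a degree-2 polynomial kernel returns exactly the dot product that would be obtained by explicitly constructing the squared features and products of features:

import numpy as np

def poly_kernel(x, z, degree=2):
    # Computed directly on the original features; no expansion needed.
    return np.dot(x, z) ** degree

def explicit_expansion(x):
    # The implied feature space for degree 2 with two original features:
    # the squared features plus a (scaled) product of the features.
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_kernel(x, z))                                     # 16.0
print(np.dot(explicit_expansion(x), explicit_expansion(z)))  # 16.0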
Neural networks also implement complex nonlinear numeric functions, based on the
fundamental concepts of this chapter. Neural networks offer an intriguing twist. One
can think of a neural network as a “stack” of models. On the bottom of the stack are the
original features. From these features are learned a variety of relatively simple models.
Let’s say these are logistic regressions. Then, each subsequent layer in the stack applies
a simple model (let’s say, another logistic regression) to the outputs of the next layer
down. So in a two-layer stack, we would learn a set of logistic regressions from the
original features, and then learn a logistic regression using as features the outputs of the
first set of logistic regressions. We could think of this very roughly as first creating a set
of “experts” in different facets of the problem (the first-layer models), and then learning
how to weight the opinions of these different experts (the second-layer model).6

6. Compare this with the notion of ensemble methods described in Chapter 12.
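To make the "stack of logistic regressions" picture concrete, here is a minimal sketch in Python with NumPy (the layer sizes and random weights are illustrative assumptions, not anything prescribed by the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # the original features (bottom of the stack)

# First layer: three simple "expert" logistic regressions over the features.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
expert_opinions = sigmoid(W1 @ x + b1)  # one probability per expert

# Second layer: one logistic regression that weights the experts' opinions
# to produce the final output.
w2, b2 = rng.normal(size=3), 0.0
prediction = sigmoid(w2 @ expert_opinions + b2)
print(prediction)  # the stack's final probability estimate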
The idea of neural networks gets even more intriguing. We might ask: if we are learning those lower-layer logistic regressions (the different experts), what would be the target variable for each? While some practitioners build stacked models where the lower-layer experts are built to represent specific things using specific target variables (e.g., Perlich et al., 2013), more generally with neural networks target labels for training are provided only for the final layer (the actual target variable). So how are the lower-layer logistic regressions trained? We can understand by returning to the fundamental concept of this chapter. The stack of models can be represented by one big parameterized numeric function. The parameters now are the coefficients of all the models, taken together. So once we have decided on an objective function representing what we want to optimize (e.g., the fit to the training data, based on some fitting function), we can then apply an optimization procedure to find the best parameters of this very complex numeric function. When we’re done, we have the parameters of all the models, and thereby have learned the “best” set of lower-level experts and also the best way to combine them, all simultaneously.
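As a rough sketch of this idea (not any particular library's training procedure), the following Python code represents a two-layer stack as one big parameterized numeric function, defines an objective (log loss on training data), and applies a generic optimization procedure to all the coefficients at once. Everything here, from the toy data to the layer sizes, is an illustrative assumption:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unpack one flat parameter vector into the coefficients of both models:
# layer 1 is three expert logistic regressions over two features; layer 2
# is one logistic regression over the three experts.
def unpack(theta):
    return theta[:6].reshape(3, 2), theta[6:9], theta[9:12], theta[12]

def predict(theta, X):
    # The whole stack viewed as one big parameterized numeric function.
    W1, b1, w2, b2 = unpack(theta)
    return sigmoid(sigmoid(X @ W1.T + b1) @ w2 + b2)

def objective(theta, X, y):
    # The fitting function: average log loss on the training data.
    p = np.clip(predict(theta, X), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data whose label depends nonlinearly on the features (XOR-like).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)

theta = rng.normal(scale=0.5, size=13)  # coefficients of ALL the models

# A generic optimization procedure: gradient descent, with gradients
# estimated by finite differences purely for illustration (real neural
# network training computes them efficiently via backpropagation).
eps, lr = 1e-5, 0.5
for step in range(1500):
    base = objective(theta, X, y)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        t = theta.copy()
        t[i] += eps
        grad[i] = (objective(t, X, y) - base) / eps
    theta -= lr * grad

print(objective(theta, X, y))  # loss after fitting both layers at once

Note that nothing in the optimization loop distinguishes first-layer coefficients from second-layer coefficients: the experts and the weighting of their opinions are learned simultaneously, exactly as described above.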
Note: Neural networks are useful for many tasks

This section describes neural networks for classification and regression. The field of neural networks is broad and deep, with a long history. Neural networks have found wide application throughout data mining. They are commonly used for many other tasks mentioned in Chapter 2, such as clustering, time series analysis, profiling, and so on.