ml5.js Classification and Regression Analysis

 

GitHub Repository: https://github.com/Ashot72/ml5-spfx-extension

Video link: https://youtu.be/NbO_ZIVHdus

 

ml5.js https://ml5js.org/ aims to make machine learning approachable for a broad audience of artists, creative coders, and students. The library provides access to machine learning

algorithms and models in the browser, building on top of TensorFlow.js with no other external dependencies.

I built an ml5 SPFx extension for classification and regression analysis. I had already built a KNN app (k-nearest neighbors, a machine learning algorithm)

https://github.com/Ashot72/knn-tensorflowjs-spfx-extension for classification and regression analysis using TensorFlow.js https://www.tensorflow.org/js.

You may want to read the KNN post first, as there I described how to import .csv data into SharePoint lists, demonstrated some TensorFlow.js operations and

normalization, and talked about features and labels, which are also present in this app.

There are actually two sets of TensorFlow.js APIs. The first is the low-level linear algebra API that I used in the KNN app, and the second is a higher-level API that makes it

pretty easy to build more advanced machine learning algorithms. We use ml5.js, which is built on top of TensorFlow.js and makes working with

TensorFlow.js even easier.

 

Figure 1

Deep Learning is a subset of machine learning in which artificial neural networks adapt and learn from vast amounts of data. With TensorFlow.js you can build a neural network with the

help of layers and a model.

A model is a data structure that consists of layers and defines inputs and outputs.

In the context of neural networks, a layer is a transformation of the data representation. It behaves like a mathematical function: given an input, it emits an output. A layer can have

state, captured by its weights. The weights can be altered during the training of the neural network. With ml5.js we usually do not define layers ourselves, though it is possible.
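As an aside, here is a minimal sketch of what defining a model from layers looks like in plain TensorFlow.js (the layer size and learning rate are illustrative only, not the ones this app uses; ml5.js builds an equivalent model for us behind the scenes):

import * as tf from '@tensorflow/tfjs';

// A model is a stack of layers with defined inputs and outputs.
const model = tf.sequential();

// One dense layer: it transforms a 1-number input (e.g. horsepower)
// into a 1-number output (e.g. mpg). Its state is the weights m and b.
model.add(tf.layers.dense({ inputShape: [1], units: 1 }));

// Compiling wires up the optimizer and the loss function that
// training will use to alter the weights.
model.compile({ optimizer: tf.train.sgd(0.001), loss: 'meanSquaredError' });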

Before running an analysis, we should understand whether we are solving a classification or a regression problem.

A classification problem has a discrete value as its output. The output does not have to be 0/1 or true/false in the classification case. It can, for example, be colors such as

red, green, blue, yellow, etc. - some set of discrete values.

A regression problem does not deal with discrete values. For example, finding the price of a house based on its location, number of bedrooms, etc. Price is a continuous value: it can be $20000 or $200001, etc.

 

Multilinear Regression Analysis

Figure 2

This is our cars SharePoint list.

Mpg - miles per gallon is the efficiency of the car, telling you how many miles it can travel per gallon of fuel.

Displacement - more or less, the size of the engine.

Horsepower - the power of the engine.

We are going to figure out the relationship between dependent and independent variables using a linear regression approach. With linear regression we first do

some initial training of our model, which takes some amount of time (seconds, minutes, or hours, depending on the size of the dataset). Once we have a model that has

been trained, we can then use it to make predictions very quickly.

So, what is the goal of linear regression?

 

Figure 3

We are going to find an equation that relates some independent variable to the variable we want to predict, usually referred to as the dependent variable. In this case we might try to figure out some type of

mathematical equation: for some particular horsepower value, we predict the mpg (miles per gallon). We are going to take some input data and use some algorithm to predict some output data.

 

 

Figure 4

The goal of linear regression is to find a relationship between those variables that takes the form of an equation like MPG = m * HorsePower + b (e.g. MPG = 200 * HorsePower + 20).

We have to figure out m and b: m is the slope of the line and b is the bias. This equation is represented by a line that can be used to predict, or fit between, all of our different data points.

 

 

Figure 5

With linear regression we are not restricted to having just one independent variable. We can very easily have multiple independent variables.

 

Figure 6

I did a linear regression analysis using an Excel scatterplot chart. I plotted an independent variable (horsepower) against the dependent variable (mpg) and found a mathematical relation

between the two: y = -0.1578x + 39.936.
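Just to make the fitted line concrete, here is a tiny sketch that uses the coefficients Excel produced above (the horsepower value of 100 is only an example):

// Prediction with the line Excel fitted: y = -0.1578x + 39.936,
// where x is horsepower and y is the predicted mpg.
const predictMpg = (horsepower: number): number =>
  -0.1578 * horsepower + 39.936;

console.log(predictMpg(100)); // ≈ 24.16 mpg for a 100-horsepower car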

Before going ahead, I would like to say just a few words about residual analysis. The analysis of residuals plays an important role in validating the regression model.

A residual refers to the difference between the observed value and the predicted value, and every data point has one residual.

 

Figure 7

Right here we have a regression line and its corresponding residual plot (you can produce a residual plot in Excel).

It looks like these residuals are pretty evenly scattered above and below the line, so we can say that the linear model here, the regression line, is a good model for this data.

 

Figure 8

When you look at this residual plot, the residuals do not look evenly scattered. It looks like there is some type of trend here.

 

Figure 9

When you see something like this, where the residual plot goes below the x-axis and then above it, the linear model might not be appropriate. Maybe some type of non-linear model,

some type of non-linear curve, might better fit the data, and the relationship between y and x is nonlinear.

 

How to solve a linear regression problem?

There are many different approaches to solving linear regression problems: Ordinary Least Squares, Generalized Least Squares, Gradient Descent, etc. Gradient Descent

is an approach that is also used in many other, far more complicated machine learning algorithms. I do not want to go into the details, as there are tons of articles about it, but in general

Gradient Descent is a general technique for minimizing a function; in the regression case, the Mean Squared Error cost function.
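As a rough illustration (not the app's implementation), here is what gradient descent looks like for fitting y = m * x + b by minimizing the Mean Squared Error. The data points are made up (they lie on y = 2x + 1), and the learning rate and epoch count are arbitrary:

const xs = [1, 2, 3, 4];
const ys = [3, 5, 7, 9]; // hypothetical points on y = 2x + 1

let m = 0;
let b = 0;
const learningRate = 0.05;

for (let epoch = 0; epoch < 5000; epoch++) {
  let dm = 0; // d(MSE)/dm
  let db = 0; // d(MSE)/db
  for (let i = 0; i < xs.length; i++) {
    const error = m * xs[i] + b - ys[i]; // Guess - Actual
    dm += (2 / xs.length) * error * xs[i];
    db += (2 / xs.length) * error;
  }
  m -= learningRate * dm; // step both parameters a little way downhill
  b -= learningRate * db;
}

console.log(m.toFixed(3), b.toFixed(3)); // ≈ 2.000 and 1.000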

 

Figure 10

The mean squared error (MSE) equation essentially produces a value that tells you how wrong or how bad your guess is. It basically tells you how close a regression line is to a set of points.

It does this by taking the distances from the points to the regression line (these distances are the 'errors') and squaring them. The squaring is necessary to remove any negative signs.

It also gives more weight to larger differences. It is called the mean squared error because you are finding the average of a set of squared errors. (Guess - Actual can be Actual - Guess, as it is squared.)

The Σ summation symbol means: take every one of your guesses and every one of your actual values, find the difference between the two, square the result, and then

sum them all together.

 

Figure 11

Let's imagine that the actual values are (43,41), (44,45), etc., the green points are the guessed values, and the equation is y = 0.8 * x + 9.2. Guess - Actual is essentially the

dotted-line distance, the 'error'. Let's calculate the MSE for the following set of actual values: (43,41), (44,45), (45,49), (46,47), (47,44).

First, find the guesses based on the y = 0.8 * x + 9.2 equation.

0.8 * 43 + 9.2 = 43.6

0.8 * 44 + 9.2 = 44.4

0.8 * 45 + 9.2 = 45.2

0.8 * 46 + 9.2 = 46

0.8 * 47 + 9.2 = 46.8

Figure 12

Now, calculate Actual - Guess (we can use Actual - Guess instead of Guess - Actual, because the result is squared at the next step).

41 - 43.6 = -2.6

45 - 44.4 = 0.6

49 - 45.2 = 3.8

47 - 46 = 1

44 - 46.8 = -2.8

 

Figure 13

Figure 14

Add all of the squared errors up: 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4

Figure 15

Find the mean squared error: 30.4 / 5 = 6.08
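A quick sketch to double-check the arithmetic, using the same five points and the same guess equation as above:

// MSE for the guess y = 0.8x + 9.2 against the actual points
// (43,41), (44,45), (45,49), (46,47), (47,44).
const meanSquaredError = (actual: number[], guess: number[]): number =>
  actual.reduce((sum, y, i) => sum + (guess[i] - y) ** 2, 0) / actual.length;

const xs = [43, 44, 45, 46, 47];
const actual = [41, 45, 49, 47, 44];
const guesses = xs.map(x => 0.8 * x + 9.2); // [43.6, 44.4, 45.2, 46, 46.8]

console.log(meanSquaredError(actual, guesses)); // ≈ 6.08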

 

You might be curious: 6.08 what? What does that number actually mean? The MSE in isolation is not actually that useful. We cannot look at this number alone

and say whether it is a good guess or a bad guess. In order to quantify this guess and say whether 6.08 is a good or a bad guess, we have to run the MSE again

with some other guess. We have to come up with a new equation for the relationship between the X and Y variables and run that equation again.

Once we have the second value for MSE, we can say whether or not 6.08 was good. In other words, MSE only produces a value that we can compare

in relation to other values to say whether a particular guess is good or bad.

For the first equation, y = 0.8 * x + 9.2, the MSE is 6.08. Just imagine that for another equation, y = -1.8 * x - 20.2, the MSE is 4.2.

Since the MSE of 4.2 is smaller than 6.08, the second equation is closer to the line of best fit.

You might be thinking that if we could ever get our MSE down to zero, then we must have a perfect guess. But even if we come up with an extremely good guess

(Figure 11 looks like probably as good a guess as we could possibly get with a straight line), there are still some distances in there (the dotted lines).

So the MSE is unlikely to ever be exactly zero. If we find a very low value of MSE, we have a very good equation, a very good guess.

 

Natural Binary Classification - Logistic Regression

With linear regression we are predicting continuous values. For example, given the horsepower of a vehicle, we predict its miles per gallon (mpg) value, which can be 13, 24, 45, etc.

With logistic regression we are going to use the algorithm to predict discrete values. Logistic regression is used for classification-type problems. A binary classification problem is when we

take an observation and put it into one of two categories. For example, given a user's age, are they likely to prefer an Apple phone or an Android phone? It is either option A or option B, with no other options.

 

Figure 16

Here is the problem we want to solve. Given a person's age, do they wear an M t-shirt or an L? Note that we have just one feature, Age, and there are only two possible label values that we

could apply to a single person; a person wears either an M shirt or an L. This means it is a binary classification problem.

Our goal is to find a mathematical relationship (a formula) that relates a person's age to whether they are likely to wear an M shirt or an L shirt.

We have a dataset of 6 items. Let's assume that people from age 58 to 60 wear an M shirt while people over 60 wear L size - just an assumption.

In the preferred size = m * age + b formula, preferred size should be a number, and for that reason we replace M with 0 and L with 1, as there is no way to

multiply a number by the string M or L.

 

Figure 17

I did a linear regression analysis with the independent variable Age and the dependent variable T Shirt Size and plotted it out in Microsoft Excel.

The equation is y = 0.2571 * x - 15.057, where x is the Age. You may notice that someone who is 58 years old was predicted to have a T Shirt Size value of -0.1452.

Someone who is 63 was predicted to have a T Shirt Size value of 1.1403. In some cases we got values greater than one (1.1403); in some cases we got predicted values that are

negative (-0.1452). In other places we got values that are between 0 and 1, such as 0.369.

Let's assume for a moment that predicted values between 0 and 1 are OK, while values outside the 0 to 1 range are not good results; they have no meaning to us. The thing is that if we use an equation of the form m * x + b, we

will never be able to satisfy the requirement that predicted values fall inside the 0 to 1 range. We need to find some other equation that somehow gives us a relationship

between Age and our predicted T Shirt Size and that is not of the form m * x + b.

 

Figure 18

We are going to guess different values of m and b and plug different x values into this equation. e in this equation is Euler's number, which is approximately equal to 2.718.

Figure 19

The equation looks like this if you replace e with 2.718 in the formula.

 

Figure 20

The general form of this equation is what we call the sigmoid equation, which always produces a value between 0 and 1. That really fits our problem, since we said that only values between 0 and 1

give us some meaningful output.

 

 

Figure 21

If we plug some value of z into the formula (Figure 20), what we always get out is a value that ranges from 0 to positive 1.
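In code the sigmoid function is a one-liner; the sample z values below are arbitrary:

const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));

console.log(sigmoid(-10)); // ≈ 0.0000454 - very close to 0
console.log(sigmoid(0));   // 0.5 - the middle of the range
console.log(sigmoid(10));  // ≈ 0.99995 - very close to 1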

 

Figure 22

With our sigmoid equation I plugged in different values of ages and got predicted T Shirt Sizes. You can see that this time the predicted values are between 0 and positive 1. For age 25 I got a predicted value of

0.000178722, which is close to zero and indicates that someone aged 25 will wear an M shirt size.

Let's understand a little bit more why we only care about values coming out of that equation that range between 0 and 1.

 

Figure 23

The closer the line gets to zero, the more likely it is that someone will wear an M shirt, and as the line gets closer to one, it is more likely that someone wears an L shirt.

 

Figure 24

There is an area where the line crosses over from 0 to 1 and it is a relatively gradual shift.

 

Figure 26

People start with a predicted value very close to zero, which means they are likely to wear an M shirt; then, as people get older,

57, 59, etc. years old, the predicted values become 0.40, 0.52, etc., which shows the gradual shift toward wanting to wear an L shirt. What do the values

that are not close to zero or one really mean for us? The output of the sigmoid equation in this example is the probability of someone wearing an L shirt.

Now we can say that someone who is 25 years old has a probability of 0.000178722 (or 0.0178722 percent) of wearing an L shirt - no chance of wearing an L shirt.

When someone is 51 years old, their probability of wearing an L shirt is 12.51 percent.

The point of logistic regression is not to get a probability; we care about classification. We should just say these people wear an M shirt and those wear an L shirt.

We say that an observation with a probability less than some threshold is assigned to label zero, otherwise to label one.

We refer to that threshold as the decision boundary.

 

Figure 27

 

A common decision boundary to use would be 0.5. This means that people with a probability greater than 0.5 will be assigned to the one label and the others to zero.

For some problems the decision boundary may not be 0.5; say, 0.99 might make a lot of sense. An example of this is safety analysis: causes a threat to human life

versus does not cause a threat to human life. You want to make sure you are not causing a threat to human life 99% of the time.
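Putting the sigmoid and the decision boundary together, a classification sketch might look like this (the m and b values are the ones from the Excel equation in Figure 17; the helper names are my own):

const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));

// Returns the predicted shirt size for an age, given a decision boundary.
const classify = (age: number, boundary = 0.5): 'M' | 'L' => {
  const probabilityOfL = sigmoid(0.2571 * age - 15.057); // P(L shirt)
  return probabilityOfL > boundary ? 'L' : 'M';
};

console.log(classify(25)); // 'M' - probability ≈ 0.000178
console.log(classify(63)); // 'L' - probability ≈ 0.76, above the 0.5 boundary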

 

Multi-Value Classification - Multinomial Logistic Regression

With multinomial logistic regression we have the ability to apply multiple different label values to classify a given observation. For example, given a person's age, what type of phone

do they prefer - Android, Apple, or Windows phone? There is no single either/or (e.g. Android or Apple); it might be a wide range of phones. So, a binary classification problem can be

turned into a multi-class classification problem by adding options or label values.

 

Figure 28

This is a dataset similar to what we already had for binary classification (age, shirt): people's ages and their preferred phones.

 

Figure 29

We are going to encode the values differently. We go through the data three separate times, once for each of the possible classification values. There are three distinct possible

label values: Android phones, Apple phones, and Windows phones. We use all these encoded values to produce a new, differently encoded label set.

This type of encoding is called one hot encoding, which is a process by which categorical variables are converted into a form that can be provided to

machine learning algorithms so they can do a better job in prediction.
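A minimal sketch of one hot encoding for our three phone labels:

const labels = ['Android', 'Apple', 'Windows'];

// Each label becomes a vector with a single 1 in its class position.
const oneHot = (label: string): number[] =>
  labels.map(l => (l === label ? 1 : 0));

console.log(oneHot('Android')); // [1, 0, 0]
console.log(oneHot('Apple'));   // [0, 1, 0]
console.log(oneHot('Windows')); // [0, 0, 1]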

 

Marginal vs Conditional Probability

A Marginal Probability Distribution is when we have probabilities that consider each possible output case in isolation.

A Conditional Probability Distribution considers all possible output cases together when putting together a probability.

We are calculating a marginal probability distribution when we make use of the sigmoid function.

Suppose we do our analysis with a different dataset and get as a result a .30 probability of someone using an Android phone, .40 for Apple, and .45 for Windows phone.

We take the highest probability, .45, and can say that this person uses a Windows phone.

 

Figure 30

What are those probabilities telling us in terms of the sigmoid function?

Each probability value we see is the probability of some observation using, say, an Android phone (.30), but it makes no claim about this person's probability

of using Apple or Windows phones.

With a marginal probability distribution, we get probabilities that only inform the probability of a single output, a single characteristic.

In general, it is possible that someone uses an Android, an Apple, and a Windows phone all at once, but in a classification analysis we do not really want to see a person

spread across a wide spectrum. We do not care whether someone might use an Android phone, an Apple, or a Windows phone in isolation. We only care about the phone

that they are most likely to use. So we do not want a marginal probability distribution for the probabilities we are calculating, because a marginal distribution

essentially gives us the probability of just a single outcome in isolation: just using an Android phone, just using an Apple phone, just using a Windows phone.

When you are working with a marginal probability distribution, you can sum all these probabilities together:

.30 + .40 + .45 = 1.15

If you sum these probabilities and see that the total does not add up to 1, it probably means you are working with a marginal probability distribution.

Now imagine that the probability of using an Android phone is 0.3, Apple is 0.5, and Windows is 0.2.

0.3 + 0.5 + 0.2 = 1

This probably means we are working with a conditional probability distribution, as the sum adds up to 1 and the probabilities have some interconnected meaning with each other.

 

Sigmoid vs Softmax

The sigmoid equation always results in a marginal probability distribution. If we want to move over to a conditional one, all we have to do is use a slightly different equation.

Instead of sigmoid we move over to a different equation called the Softmax equation. The Softmax equation is specifically written to consider the different output classes

and give us probabilities that do not isolate one output by itself. Instead, it takes the different outputs and relates them to each other.

 

Figure 31

In the denominator we take the e^(mx + b) value for each of our different classifications and sum them all together.

 

Figure 32

For using an Android phone, we have an equation with mx + b = 75. We take the mx + b values and put them into the equation (Figure 31). With this Softmax equation we are considering

all the other possible outcomes, because it uses the outputs of those different classification values.

If we total up the probabilities, we get (approximately) 1:

0.0011 + 0.0002 + 0.988 ≈ 1
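Here is a small sketch of the Softmax equation; the three mx + b outputs are made-up numbers, not the ones from the figure:

// Each class's e^z divided by the sum of e^z over all classes.
const softmax = (zs: number[]): number[] => {
  const exps = zs.map(z => Math.exp(z));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map(e => e / total);
};

// Hypothetical mx + b outputs for Android, Apple and Windows.
const probabilities = softmax([2, 1, 3]);
console.log(probabilities); // ≈ [0.245, 0.090, 0.665]
console.log(probabilities.reduce((sum, p) => sum + p, 0)); // 1 (up to floating point)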

 

Figure 33

I would like to show the difference between Sigmoid and Softmax in Excel.

You can see the sigmoid function (Figure 20) in the formula bar. For Dataset 1, the mx + b part is 0.2571 * $A2 - 15.057, where $A2 is a cell reference with an absolute column.

 

Figure 34

For Dataset 2 we assigned different values for m and b in the mx + b equation, and yet another set is assigned for Dataset 3.

 

Figure 35

The sum of those three probabilities is not 1.

 

Figure 36

We have the same mx + b equations, this time plugged into the Softmax equation (Figure 31).

 

Figure 37

For Dataset 2.

 

Figure 38

The sum is always 1.

 

Neural Networks

Figure 39

Machine learning is a huge topic and it is simply not possible to cover everything.

Option A illustrates a linear regression model. Option B is a simple two-layer network, and the difference is that option B includes a nonlinear

activation function (e.g. the sigmoid that you already know).

Activation functions are mathematical equations that determine the output of a neural network and are applied at the last stage of a neural network layer. Activation functions can be linear

or nonlinear. Nonlinear activation functions can be used to increase the representation power of a neural network. Examples of nonlinear activations include sigmoid, hyperbolic

tangent (tanh), and the rectified linear unit (ReLU) function.

The sigmoid function is a squashing nonlinearity, in the sense that it squashes all real values from -Infinity to +Infinity into a much smaller range: 0 to 1.

 

 

Figure 40

On the left is the sigmoid function S(x) = 1 / (1 + e^-x), and on the right is the relu function relu(x) = max(0, x), that is, 0 for x < 0 and x for x >= 0.

How does nonlinearity improve the accuracy of the model? Many relations in the world are linear. For example, a linear relationship between production hours and output in a factory

means that a 10 percent increase or decrease in hours will result in a 10 percent increase or decrease in the output. Many other relations are not linear, such as the relation between a person's height

and his/her age. Height varies roughly linearly with age only up to a certain point. A purely linear model cannot accurately model the height/age relation, while the sigmoid

nonlinearity is much better suited to modeling it. To prove it, you can create, say, a two-layer model in TensorFlow.js and run the app with a sigmoid activation function

and without it (just comment out the activation: 'sigmoid' line), as in the sketch below. You will see that the one without the sigmoid activation leads to higher final loss values during training.
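A sketch of that experiment with the TensorFlow.js Layers API (the hidden layer size, optimizer, and learning rate are arbitrary choices for illustration):

import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
model.add(tf.layers.dense({
  inputShape: [1],      // e.g. age
  units: 10,            // hidden layer size (arbitrary)
  activation: 'sigmoid' // comment this line out to compare final losses
}));
model.add(tf.layers.dense({ units: 1 })); // e.g. predicted height
model.compile({ optimizer: tf.train.adam(0.01), loss: 'meanSquaredError' });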

Another question you may ask: by replacing a linear activation with a nonlinear one like sigmoid, do we lose the ability to learn any linear relations that might be present in the data?

The answer is no.

 

Figure 41

Part of the sigmoid function (the part closer to the center) is fairly close to being a straight line. Other frequently used nonlinear activation functions, such as tanh and relu,

also contain linear or close-to-linear parts. If the relation between certain elements of the input and output is approximately linear, it is entirely possible for a layer with a nonlinear

activation to learn the proper weights and biases to utilize the close-to-linear part of the activation function.

Another important way in which nonlinear functions differ from linear ones is that passing the output of one function as the input to another

function (cascading) leads to a richer set of nonlinear functions.

Let's see it in action.

 

Figure 42

Navigate to https://www.wolframalpha.com/

 

Figure 43

Input the linear function f(x) = 2 * x to see the plot.

 

Figure 44

g(x) = 1 - x is the second function.

 

Figure 45

Let's cascade these two functions to define a new function.

The first function is f(x) = 2 * x, and the second one is g(x) = 1 - x.

The new function is h(x) = g(f(x)) = g(2 * x) = 1 - 2 * x.

As you can see, h is still a linear function. It just has a different slope and bias. Cascading any number of linear functions always results in a linear function.

 

Figure 46

Now let's cascade two nonlinear relu functions.

The first function is f(x) = relu(2 * x).

 

Figure 47

The second one is g(x) = relu(1 - x)

Figure 48

The first function is f(x) = relu(2 * x), and the second one is g(x) = relu(1 - x).

The new function is h(x) = g(f(x)) = relu(1 - relu(2 * x)).

By cascading two scaled relu functions, we get a function that does not look like relu at all; it has a new shape. Further cascading the resulting function with other relu functions

will give you an even more diverse set of functions. In essence, neural networks are cascading functions. Each layer of a neural network can be viewed as a function. Nonlinear activation

functions increase the range of input-output relations the model is capable of learning.
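You can verify the new shape numerically; a few sample points are enough to see that h is neither linear nor a plain relu:

const relu = (x: number): number => Math.max(0, x);

const f = (x: number): number => relu(2 * x);
const g = (x: number): number => relu(1 - x);
const h = (x: number): number => g(f(x)); // relu(1 - relu(2 * x))

for (const x of [-1, 0, 0.25, 0.5, 1]) {
  console.log(x, h(x)); // -1 → 1, 0 → 1, 0.25 → 0.5, 0.5 → 0, 1 → 0
}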

 

Application

 

Figure 49

We select three features, displacement, horsepower, and weight, with passedemissions as the label.

Given a vehicle's weight, horsepower, and engine displacement, will it PASS or NOT PASS a smog emissions check?

That is the problem we want to solve.

Figure 50

Notice that here for smog emissions we have two possible label values, so it is definitely a binary classification problem:

a given car can either pass or not pass a smog emissions check.

 

Figure 51

The SharePoint Cars list has been imported from the cars.csv file, and SharePoint converts a true/false value to Yes/No.

 

Figure 52

You may notice that there are some options we specified that are required for training. These are called hyperparameters.

Hyperparameters are adjustable parameters of the model that must be tuned in order to obtain a model with optimal performance

(lowest validation loss after training). The process of selecting good hyperparameter values is referred to as hyperparameter optimization or

hyperparameter tuning. Unfortunately, there is currently no definitive algorithm that can determine the best hyperparameters given a dataset and

the machine learning task involved.

Learning rate - is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.

Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in

learning a sub-optimal set of weights too fast or an unstable learning process. The learning rate may be the most important hyperparameter when configuring

your neural network.

Batch size - this hyperparameter defines the number of samples that will be propagated through the network at a time. The higher the batch size, the more memory space

you will need.

Epochs - one epoch is when the entire dataset is passed both forward and backward through the neural network exactly once.
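To show where these hyperparameters go, here is a hedged sketch of the ml5.js calls (option names follow the ml5 neuralNetwork API; the values, and the assumption that training data has already been added, are illustrative):

// assumes the ml5 library is available (loaded via a script tag or npm)
declare const ml5: any;

const options = {
  task: 'classification',
  learningRate: 0.1, // how strongly each update changes the weights
  debug: true        // shows the loss vs. epochs plot while training
};
const nn = ml5.neuralNetwork(options);

// ...add training data with nn.addData(...) and normalize it first...

// Batch size and epochs are supplied when training starts.
nn.train({ epochs: 50, batchSize: 32 }, () => console.log('done training'));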

 

 

Figure 53

Once you click the train button you will see a loss vs. epochs plot.

A loss function is an error measurement. It is how the network measures its performance on the training data and steers itself in the right direction. Lower loss is better. As we train,

we should be able to plot the loss over time and see it going down. If our model trains for a long while and the loss is not decreasing, it could mean that our model is not learning to

fit the data.

The loss function that we use for a binary classification task is binary cross entropy, which corresponds to the binaryCrossEntropy configuration in the model.

Our problem is a binary classification problem, as we stated: vehicles will either PASS or NOT PASS a smog emissions check.

The configuration in the multiclass or multinomial classification case is categoricalCrossentropy, which categorizes data points into more than two options (blue, green, yellow).

 

 

Figure 54

ml5.js prefers the categoricalCrossentropy option for modelLoss over the binaryCrossEntropy one, even for our binary classification problem.

 

Figure 55

It knows that there are two options, but categoricalCrossentropy is the choice anyway.

 

Figure 56

Given a vehicle's weight, horsepower, and engine displacement, we want to predict its miles per gallon (mpg) value. This is a multilinear regression problem.

A regression problem does not deal with discrete values, as the mpg here can be 13, 14, 25, etc.

 

 

Figure 57

For the regression case, the modelLoss is set to meanSquaredError, which we already discussed.

 

Figure 58

 

If you navigate to the ml5 web site you will see the default options that are required for neural networks. You can modify them, which we already did for the learning rate, batch size, and epochs.