This content originally appeared on DEV Community and was authored by Aman Kr Pandey
Linear regression is one of the fundamental algorithms in machine learning. It models the relationship between a dependent variable (the target) and one or more independent variables (the features) by fitting a linear equation (in the simplest case, a straight line) to observed data. This blog walks through the mathematics behind linear regression models.
What is Linear Regression?
Linear regression finds a linear function that best predicts the target variable y from the predictor variables x. The model equation is:

y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n

where m_0 is the intercept and m_1, m_2, …, m_n are the coefficients (weights) for the features x_1, x_2, …, x_n. In linear regression we aim to find the optimal values of m_0, m_1, m_2, …, m_n that minimize the cost function.
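To get a concrete feel for the model equation, here is a tiny Python sketch of a single prediction; the numbers are purely illustrative:

```python
# Prediction for one data point: y = m0 + m1*x1 + ... + mn*xn
m0 = 1.5                      # intercept (illustrative value)
m = [2.0, -0.5, 0.75]         # coefficients m1, m2, m3 (illustrative values)
x = [4.0, 10.0, 2.0]          # features x1, x2, x3

y_hat = m0 + sum(mi * xi for mi, xi in zip(m, x))
print(y_hat)                  # 1.5 + 8.0 - 5.0 + 1.5 = 6.0
```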
But, What is a Cost Function?
The cost function in linear regression is a mathematical measure of how well the model's predicted values match the actual target values in the training data. It quantifies the difference (error) between the predicted values ŷ and the true values y, condensing that error into a single number that the learning algorithm tries to minimize. Now, how does the learning algorithm minimize it? The answer is gradient descent.
Gradient Descent
Gradient descent is used to find the global minimum of the cost function; the lower the cost, the better the model fits the data set. But how does it find the global minimum? Remember functions, differentiation, partial derivatives and so on? We use exactly these mathematical concepts to reach the minimum. Let's understand the mathematics behind it.
Let's consider a simple cost function, the parabola y = x^2 - 4x - 3.
If you plot this parabola, it is clearly visible that the global minimum lies at x = 2, y = -7.
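To see gradient descent find this minimum numerically, here is a tiny Python sketch; the starting point, learning rate and step count are arbitrary illustrative choices. The derivative of y = x^2 - 4x - 3 is 2x - 4, and we repeatedly step against it:

```python
def grad(x):
    # derivative of y = x^2 - 4x - 3
    return 2 * x - 4

x = 10.0          # arbitrary starting point
alpha = 0.1       # learning rate (illustrative choice)
for _ in range(100):
    x = x - alpha * grad(x)   # step against the slope

print(x, x**2 - 4*x - 3)      # x approaches 2, y approaches -7
```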
But it is not possible to plot the graph and read off the global minimum for every cost function, because a cost function can be as complex as the mean squared error, with y depending not just on a single x but on multiple independent variables x_1, x_2, …, x_n. For example,

J = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2

where ŷ_i is the predicted value, y_i is the true value, J is the cost function, N is the number of data points and n is the number of features in each data point. You can see that it is very difficult to plot this cost function and observe its minimum, and this is where mathematics comes to our rescue.
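In code, the MSE above is straightforward to compute; here is a minimal NumPy sketch with made-up predictions and targets:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # true values y_i (illustrative)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values y_hat_i

# J = (1/N) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(mse)                                  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```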
In gradient descent, the core idea is to move in the direction of steepest descent, which calculus gives us through the gradient (the slope at a point). So, by walking down the hill step by step, we will reach the global minimum. Now, how do we find this mathematically? Let's take the Mean Squared Error (MSE) function, our cost function for linear regression:

J(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
For the regression line y = mx + b, our goal is to find the optimal values of m (slope) and b (intercept) that minimize this cost function. So let's take the partial derivatives of J(m, b). Substituting ŷ_i = m x_i + b and taking the partial derivative with respect to m, we get

\frac{\partial J(m,b)}{\partial m} = -\frac{2}{N} \sum_{i=1}^{N} x_i \left( y_i - (m x_i + b) \right)
Again, taking the partial derivative of J(m, b) with respect to b, we get

\frac{\partial J(m,b)}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)
Let m_curr and b_curr be the current values of the slope and intercept, and let α be the learning rate. Then our new slope and intercept will be

m_{\text{new}} = m_{\text{curr}} - \alpha \frac{\partial J(m,b)}{\partial m}, \qquad b_{\text{new}} = b_{\text{curr}} - \alpha \frac{\partial J(m,b)}{\partial b}
We keep iterating over this process until the cost function converges, and once it does, we have our optimal values for the slope (m) and intercept (b).
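Putting the two partial derivatives and the update rule together, the whole loop is only a few lines of Python. This is a minimal sketch with a small synthetic data set and an arbitrary learning rate, just to illustrate the idea:

```python
import numpy as np

# Synthetic data that roughly follows y = 3x + 2 (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.0, 8.2, 10.9, 14.1])

m, b = 0.0, 0.0     # initial slope and intercept
alpha = 0.05        # learning rate (illustrative choice)
N = len(x)

for _ in range(2000):
    y_hat = m * x + b
    # Partial derivatives of J(m, b) derived above
    dJ_dm = -(2.0 / N) * np.sum(x * (y - y_hat))
    dJ_db = -(2.0 / N) * np.sum(y - y_hat)
    # Update rule: step against the gradient
    m = m - alpha * dJ_dm
    b = b - alpha * dJ_db

print(m, b)         # should end up close to 3 and 2
```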
Now, this was just to fit a simple straight line with one variable x. What if we have n variables? In that case we make use of linear algebra together with multivariate calculus.
The General Case
In real life you will rarely get data whose outcome depends on a single parameter; instead there will be n independent parameters on which the outcome depends. So, how do we express this as a mathematical equation? Here linear algebra and vectors come to the rescue. For our regression line y = mx + b, we can express each term in the form of a matrix, for example

M = \begin{bmatrix} b \\ m \end{bmatrix}, \qquad X = \begin{bmatrix} 1 \\ x \end{bmatrix}

and with Y being our outcome matrix, we can now express our line as the dot product of the two matrices,

Y = M \cdot X \qquad \text{(Eq. 1)}
Expanding our idea further, we can now express a multivariate regression line y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n with the matrices (a small code sketch follows below)

M = \begin{bmatrix} m_0 \\ m_1 \\ \vdots \\ m_n \end{bmatrix}, \qquad X = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}
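To make the matrix view concrete, here is a minimal NumPy sketch; the variable names follow the notation above and the numbers are purely illustrative:

```python
import numpy as np

# Coefficient vector M = [m0, m1, ..., mn] (intercept first), illustrative values
M = np.array([2.0, 0.5, -1.0])   # m0 = 2, m1 = 0.5, m2 = -1

# Feature vector X = [1, x1, ..., xn]; the leading 1 pairs with the intercept m0
X = np.array([1.0, 3.0, 4.0])    # x1 = 3, x2 = 4

# The regression line as a dot product (Eq. 1): y = m0 + m1*x1 + m2*x2
y = np.dot(M, X)
print(y)                          # 2 + 0.5*3 - 1*4 = -0.5
```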
To express our line, we again take the dot product of M and X, just as in Eq. 1 above. Now let's express all our calculations in terms of matrices. Stacking the feature vectors of all N data points as the rows of X and the targets into the vector Y, the gradient of the cost function becomes

\nabla_M J(M) = -\frac{2}{N} X^{T} \left( Y - X M \right)

where \nabla_M J(M) is the Jacobian (gradient) of J with respect to M: it collects the partial derivatives with respect to m_0, m_1, …, m_n, playing the same role that \frac{\partial J(m,b)}{\partial m} and \frac{\partial J(m,b)}{\partial b} played above. Similarly, our new M would be

M_{\text{new}} = M_{\text{curr}} - \alpha \, \nabla_M J(M)
We keep iterating over this process until the cost function converges, and once it does, we have our optimal value for M.
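As a rough sketch of how the matrix form translates to code, here is a minimal vectorized NumPy implementation; it assumes X already carries a leading column of ones for the intercept, and the learning rate and iteration count are just illustrative defaults:

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.05, iterations=2000):
    """Vectorized gradient descent for linear regression.

    X : (N, n+1) design matrix with a leading column of ones for the intercept
    Y : (N,) vector of target values
    """
    N, d = X.shape
    M = np.zeros(d)                            # start with all coefficients at zero
    for _ in range(iterations):
        residuals = Y - X @ M                  # (Y - X M)
        grad = -(2.0 / N) * (X.T @ residuals)  # gradient: -(2/N) X^T (Y - X M)
        M = M - alpha * grad                   # M_new = M_curr - alpha * gradient
    return M

# Tiny usage example with made-up data that follows y = 1 + 2*x1
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])
Y = 1.0 + 2.0 * x1
print(gradient_descent(X, Y))                  # should approach [1.0, 2.0]
```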
So, that is it: we have covered the mathematics behind gradient descent and how to apply it to optimize the cost function of linear regression models. We will discuss its implementation in Python in a follow-up post; till then, take care!