Basic terminology for understanding dimension reduction techniques: PCA and t-SNE
In this post we will learn what dimension reduction is, why we should learn it, and where to use it.
What is dimension reduction?
From Wikipedia: dimension reduction (or dimensionality reduction) is the transformation of data from a high-dimensional space into a low-dimensional space. We can visualize data in 2-D and 3-D using scatter plots, and even in 4-D, 5-D, or 6-D using pair plots. But as the number of dimensions d increases, pair plots stop working well. Dimension reduction is common in fields that deal with a large number of observations and/or a large number of variables, such as signal processing, speech recognition, etc.
There are various methods for dimensionality reduction; here we will discuss the two most widely used techniques:
- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
Before learning dimension reduction techniques, let's understand some basic terminology:
1. Row vectors and column vectors: In linear algebra, a column vector is an n x 1 matrix, that is, a matrix consisting of a single column of n elements.
Similarly, a row vector is a 1 x n matrix, that is, a matrix consisting of a single row of n elements.
If it is not explicitly stated which kind of vector it is, then by default it is a column vector.
The transpose of a column vector is a row vector.
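The column-vector/row-vector convention above can be sketched with NumPy (the values here are illustrative only):

```python
import numpy as np

# A column vector is an n x 1 matrix; a row vector is 1 x n.
col = np.array([[1], [2], [3]])  # shape (3, 1) -- a column vector
row = col.T                      # transpose gives shape (1, 3) -- a row vector

print(col.shape)  # (3, 1)
print(row.shape)  # (1, 3)
```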
2. Representing a dataset as a data matrix:
One common way of representing a dataset is as a matrix:
each data point is a row, and each feature is a column.
3. Data Preprocessing:
Pre-processing refers to the mathematical operations and transformations that we apply to the data itself before we build models or do anything else.
(A) Column-Normalization: Normalization is a technique for rescaling measurements to a standard scale.
Q. How to do column-normalization?
We show it for one feature; the rest follow the same procedure.
Step 1: take all the values corresponding to feature fⱼ:
a₁, a₂, a₃, …, aᵢ, …, aₙ → n values of feature fⱼ.
Step 2: compute
aₘₐₓ = max[aᵢ's] and aₘᵢₙ = min[aᵢ's].
Then a₁, a₂, …, aᵢ, …, aₙ after column-normalization are written as a₁′, a₂′, …, aᵢ′, …, aₙ′
such that aᵢ′ = (aᵢ − aₘᵢₙ)/(aₘₐₓ − aₘᵢₙ), which guarantees aᵢ′ ∈ [0, 1].
Q. Why do column-normalization?
Suppose we have two features: f1 = 'height', with data points collected in cm, and f2 = 'weight', with data points collected in kg. If instead the height were collected in inches and the weight in pounds, the relationship between height and weight would look different. After column-normalization we no longer care whether our data was collected in feet, inches, or pounds.
Column-normalization brings both features into a standard format where all values lie in [0, 1].
In other words, column-normalization gets rid of scale.
Geometrically: data points anywhere in n-d space are, after column-normalization, squashed into the unit hypercube in n-d space.
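The steps above can be sketched with NumPy; the heights/weights below are made-up values just to illustrate the min-max formula applied per column:

```python
import numpy as np

# Toy data matrix: rows are data points, columns are features
# (heights in cm, weights in kg -- illustrative values only).
X = np.array([[150.0, 50.0],
              [160.0, 60.0],
              [170.0, 70.0],
              [180.0, 80.0]])

# Column-normalization: a_i' = (a_i - a_min) / (a_max - a_min), per column
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_norm)  # every entry now lies in [0, 1]
```

Note that `axis=0` computes the min/max down each column, so each feature is rescaled independently, regardless of its original units.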
(B) Column-Standardization: This is a technique similar to column-normalization, but with a change.
Column-standardization transforms a feature coming from any distribution so that it has zero mean and unit variance (standard deviation).
Q. How to do column-standardization?
We show it for one feature; the rest follow the same procedure.
Step 1: take all the values corresponding to feature fⱼ:
a₁, a₂, a₃, …, aᵢ, …, aₙ → n values of feature fⱼ.
Step 2: compute ā = mean[aᵢ] and s = std_dev[aᵢ] (the per-feature means, taken together, form the mean vector).
Then a₁, a₂, …, aᵢ, …, aₙ after column-standardization are written as a₁′, a₂′, …, aᵢ′, …, aₙ′
such that aᵢ′ = (aᵢ − ā)/s, which guarantees mean[aᵢ′] = 0 and std_dev[aᵢ′] = 1.
Geometric intuition of column-standardization:
- Move the mean vector to the origin.
- Squash/expand the data points so that the variance = 1.
So column-standardization is mean centering + scaling (std_dev = 1) for all features.
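Mean centering + scaling can likewise be sketched per column with NumPy (same illustrative toy data as before):

```python
import numpy as np

# Toy data matrix: rows are data points, columns are features.
X = np.array([[150.0, 50.0],
              [160.0, 60.0],
              [170.0, 70.0],
              [180.0, 80.0]])

# Column-standardization: a_i' = (a_i - mean) / std_dev, per column
mean = X.mean(axis=0)  # the mean vector (one mean per feature)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # each column now has std_dev 1
```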
4. Covariance Matrix:
According to Wikipedia, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector; it is a symmetric matrix, and its main diagonal contains the variances.
Let F1 and F2 be two random vectors (features); then their covariance is denoted Cov(F1, F2).
If F1 and F2 are column-standardized, this implies mean(F1) = 0
and std_dev(F1) = 1, and similarly for F2.
Then Cov(F1, F2) = (F1ᵀ · F2)/n iff F1 and F2 are column-standardized.
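This identity for standardized columns can be checked numerically; the random data below is purely illustrative, and we compare the formula against NumPy's `np.cov` (with `bias=True` so it also divides by n rather than n − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # 100 data points, 2 features F1 and F2

# Column-standardize first, so each column has mean 0 and std_dev 1.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

n = Xs.shape[0]
f1, f2 = Xs[:, 0], Xs[:, 1]

# Cov(F1, F2) = (F1^T . F2) / n, valid because both columns are standardized
cov_manual = (f1 @ f2) / n

# NumPy's covariance matrix; [0, 1] is the F1-F2 entry
cov_np = np.cov(Xs, rowvar=False, bias=True)[0, 1]

print(np.isclose(cov_manual, cov_np))  # True
```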
Now that we have learned some basic terminology for dimension reduction, let's learn about and apply PCA and t-SNE, the two dimension reduction techniques, in the next part.