{"id":16544,"date":"2020-07-30T20:14:10","date_gmt":"2020-07-30T14:44:10","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/"},"modified":"2024-09-03T18:00:17","modified_gmt":"2024-09-03T12:30:17","slug":"data-preprocessing","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/","title":{"rendered":"Data Preprocessing Introduction, Concepts and Definition?"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"what-is-data-preprocessing\"><strong>What is data preprocessing?<\/strong><\/h2>\n\n\n\n<p>For machine learning, we need data. Lots of it. More data generally means a better model, because machine learning algorithms are data-hungry. But there\u2019s a catch. They need data in a specific format.<\/p>\n\n\n\n<p>In the real world, terabytes of data are generated by many sources, but not all of it is directly usable. Audio, video, images, text, charts, and logs all contain data, but that data has to be cleaned and converted into a usable format before machine learning algorithms can produce meaningful results.<\/p>\n\n\n\n<p>The process of cleaning raw data so that it can be used for machine learning is known as data preprocessing. It\u2019s the first step in a machine learning project, and generally the most time-consuming one as well.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-data-preprocessing\"><strong>Why data preprocessing?<\/strong><\/h2>\n\n\n\n<p>Real-world data is often noisy and incomplete, with missing entries, and more often than not it is unsuitable for building models or solving complex data-related problems directly. There might be erroneous data, or the data might be unordered, unstructured, and unformatted.&nbsp;<\/p>\n\n\n\n<p>For these reasons, collected data is often unusable for machine learning purposes as it stands. 
It\u2019s seen that the same data, once formatted and cleaned, produces more accurate and reliable results in machine learning models than its unprocessed counterpart.<\/p>\n\n\n\n<p><strong>Data preprocessing steps<\/strong><\/p>\n\n\n\n<p>Data preprocessing involves several stages. The steps are listed below:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data collection<\/li>\n\n\n\n<li>Data import<\/li>\n\n\n\n<li>Data inspection<\/li>\n\n\n\n<li>Data encoding<\/li>\n\n\n\n<li>Data interpolation<\/li>\n\n\n\n<li>Data splitting into train and test sets<\/li>\n\n\n\n<li>Feature scaling<\/li>\n<\/ul>\n\n\n\n<p>Check: <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/data-preprocessing\">Free Data Preprocessing Course<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-collection\"><strong>Data Collection<\/strong><\/h2>\n\n\n\n<p>Data collection is the stage when we collect data from various sources. Data might be lying across several storage systems or servers, and we need to gather it all in one location for ease of access.<\/p>\n\n\n\n<p>Data is present in many formats, so we need to settle on a common format for data collection. All the required data should be converted to that format so that the same operations can be applied to all of it. Data from chat servers is often in JSON, while data from business applications is generally tabular. So, if we want to use both kinds of data, we need to convert all of it either into JSON or into a tabular format such as CSV or xlsx. Sometimes data is also present in the form of HTML text, and such text needs to be cleaned as well.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-import\"><strong>Data Import&nbsp;<\/strong><\/h2>\n\n\n\n<p>Data import is the process of loading data into software such as R or Python for cleaning. Sometimes the data is so large that we have to take special care when importing it into the processing server\/software. 
Libraries like pandas, Dask, and NumPy are handy when operating on such large volumes of data, and Matplotlib helps with inspecting it visually.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"pandas\"><strong>Pandas&nbsp;<\/strong><\/h3>\n\n\n\n<p>pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"installing-pandas\"><strong>Installing pandas<\/strong><\/h4>\n\n\n\n<p>Download <a href=\"https:\/\/www.anaconda.com\/distribution\/\">Anaconda<\/a> for your operating system and the latest Python version, run the installer, and follow the steps. Please note:<\/p>\n\n\n\n<p>It is not necessary (and is discouraged) to install Anaconda as root or administrator. When asked if you wish to initialize Anaconda3, answer yes. Restart the terminal after completing the installation. Detailed instructions on how to install Anaconda can be found in the <a href=\"https:\/\/docs.anaconda.com\/anaconda\/install\/\">Anaconda documentation<\/a>.<\/p>\n\n\n\n<p>In the Anaconda prompt (or terminal in Linux or macOS), start JupyterLab with the <code>jupyter lab<\/code> command.<br><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"importing-pandas\"><strong>Importing pandas<\/strong><\/h4>\n\n\n\n<p>In JupyterLab, create a new (Python 3) notebook.<br><\/p>\n\n\n\n<p>In the first cell of the notebook, you can import pandas with <code>import pandas as pd<\/code> and check the version with <code>pd.__version__<\/code>.<br><\/p>\n\n\n\n<p>Now we are ready to use pandas, and you can write your code in the next cells.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"numpy\"><strong>NumPy<\/strong><\/h3>\n\n\n\n<p>NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"importing-numpy\"><strong>Importing NumPy<\/strong><\/h4>\n\n\n\n<p>To import NumPy and check that it\u2019s installed, use the following 
code.<\/p>\n\n\n\n<p>The usual import, <code>import numpy as np<\/code>, gives NumPy the alias np. The alias np is then used to refer to NumPy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"matplotlib\"><strong>Matplotlib<\/strong><\/h3>\n\n\n\n<p>Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural \"pylab\" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"importing-matplotlib\"><strong>Importing matplotlib<\/strong><\/h4>\n\n\n\n<p>Importing matplotlib and printing <code>matplotlib.__version__<\/code> is a good check that matplotlib is installed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"dask\"><strong>Dask<\/strong><\/h3>\n\n\n\n<p>Analysts often use tools like pandas, scikit-learn, NumPy, and the rest of the Python ecosystem to analyze data on their personal computers. They like these tools because they are efficient, intuitive, and widely trusted. However, when they choose to apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. And so, the analyst rewrites their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.<\/p>\n\n\n\n<p>Dask provides ways to scale pandas, scikit-learn, and NumPy workflows more natively, with minimal rewriting. It integrates well with these tools: it copies most of their API and uses their data structures internally.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"dask-installation\"><strong>Dask Installation<\/strong><\/h3>\n\n\n\n<p>Dask can be installed into our existing conda environment. 
We open the Anaconda prompt as before and run the following command. We need to install Dask explicitly if it is not already present in the environment (a minimal installation such as Miniconda does not include it).<\/p>\n\n\n\n<p><strong>conda install dask<\/strong><\/p>\n\n\n\n<p><strong>Having installed all these libraries, let\u2019s load a sample dataset and see how it\u2019s done in Python.<\/strong><\/p>\n\n\n\n<p><strong>We use a common dataset, Boston.csv.<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# read the CSV file from the current directory into a DataFrame\ndf = pd.read_csv(\"Boston.csv\")\nprint(df)<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> &nbsp; &nbsp; Unnamed: 0 &nbsp; &nbsp; crim&nbsp; &nbsp; zn&nbsp; indus&nbsp; chas&nbsp; ...&nbsp; tax&nbsp; ptratio &nbsp; black&nbsp; lstat&nbsp; medv\n0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; 0.00632&nbsp; 18.0 &nbsp; 2.31 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 296 &nbsp; &nbsp; 15.3&nbsp; 396.90 &nbsp; 4.98&nbsp; 24.0\n1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2&nbsp; 0.02731 &nbsp; 0.0 &nbsp; 7.07 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 242 &nbsp; &nbsp; 17.8&nbsp; 396.90 &nbsp; 9.14&nbsp; 21.6\n2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3&nbsp; 0.02729 &nbsp; 0.0 &nbsp; 7.07 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 242 &nbsp; &nbsp; 17.8&nbsp; 392.83 &nbsp; 4.03&nbsp; 34.7\n3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4&nbsp; 0.03237 &nbsp; 0.0 &nbsp; 2.18 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 222 &nbsp; &nbsp; 18.7&nbsp; 394.63 &nbsp; 2.94&nbsp; 33.4\n4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5&nbsp; 0.06905 &nbsp; 0.0 &nbsp; 2.18 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 222 &nbsp; &nbsp; 18.7&nbsp; 396.90 &nbsp; 5.33&nbsp; 36.2\n..&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; ... &nbsp; ...&nbsp; &nbsp; ... &nbsp; ...&nbsp; ...&nbsp; ...&nbsp; &nbsp; &nbsp; ... 
&nbsp; &nbsp; ...&nbsp; &nbsp; ... &nbsp; ...\n501 &nbsp; &nbsp; &nbsp; &nbsp; 502&nbsp; 0.06263 &nbsp; 0.0&nbsp; 11.93 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 273 &nbsp; &nbsp; 21.0&nbsp; 391.99 &nbsp; 9.67&nbsp; 22.4\n502 &nbsp; &nbsp; &nbsp; &nbsp; 503&nbsp; 0.04527 &nbsp; 0.0&nbsp; 11.93 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 273 &nbsp; &nbsp; 21.0&nbsp; 396.90 &nbsp; 9.08&nbsp; 20.6\n503 &nbsp; &nbsp; &nbsp; &nbsp; 504&nbsp; 0.06076 &nbsp; 0.0&nbsp; 11.93 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 273 &nbsp; &nbsp; 21.0&nbsp; 396.90 &nbsp; 5.64&nbsp; 23.9\n504 &nbsp; &nbsp; &nbsp; &nbsp; 505&nbsp; 0.10959 &nbsp; 0.0&nbsp; 11.93 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 273 &nbsp; &nbsp; 21.0&nbsp; 393.45 &nbsp; 6.48&nbsp; 22.0\n505 &nbsp; &nbsp; &nbsp; &nbsp; 506&nbsp; 0.04741 &nbsp; 0.0&nbsp; 11.93 &nbsp; &nbsp; 0&nbsp; ...&nbsp; 273 &nbsp; &nbsp; 21.0&nbsp; 396.90 &nbsp; 7.88&nbsp; 11.9<\/pre>\n\n\n\n<p>[506 rows x 15 columns]<\/p>\n\n\n\n<p>First, we import pandas. Then we use the read_csv() function of pandas to read the file into memory.<\/p>\n\n\n\n<p>Inside the read_csv() function, we have passed only the file name as an argument. This works because the dataset is in the same directory as the Python file. Had they been in different locations, we would have passed the full path to the file.<\/p>\n\n\n\n<p>Once we execute the line containing read_csv(), the file is read and the contents of Boston.csv are loaded into a data frame called df.<\/p>\n\n\n\n<p>To verify that the file has been loaded correctly, we can also use the df.head() function, which displays the first 5 rows of the dataset by default; pass a number, as in df.head(10), to see more.<\/p>\n\n\n\n<p>In a similar manner, pandas.read_json() can be used to read a dataset in JSON format, and pandas.read_table() can be used to read a dataset from a delimited text file.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-inspection\"><strong>Data Inspection<\/strong><\/h2>\n\n\n\n<p>After the data is imported, it is inspected for missing values, and several sanity checks are done to ensure that the data is consistent. 
Domain knowledge comes in handy in such scenarios.<\/p>\n\n\n\n<p><strong>Checking for missing data<\/strong><\/p>\n\n\n\n<p>To check for missing data, we look for rows and columns that contain null values or no data.<\/p>\n\n\n\n<p>If any are found, we have to decide how to handle them based on the situation and on our understanding of the data.<\/p>\n\n\n\n<p>Again, domain knowledge comes in handy in deciding the importance of certain columns.<\/p>\n\n\n\n<p>A common rule of thumb is to discard a column completely if more than about 40 percent of its data is missing, although the right threshold depends on the dataset and on how important the column is.<\/p>\n\n\n\n<p>If the percentage of missing data is lower than that, various interpolation and replacement techniques can be employed to fill in the gaps. The most common is replacing nulls with a measure of central tendency: the mean, median, or mode.<\/p>\n\n\n\n<p>Statistical significance tests can also be used to determine which columns to keep and which to drop while building a model, but that\u2019s a story for another time.<\/p>\n\n\n\n<p><strong>Implementation<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># isna() is used to check for null values in pandas\nimport pandas as pd\nimport numpy as np\n\n# for a NumPy array, isna() returns an array of booleans\narray = np.array(&#91;&#91;1, np.nan, 3], &#91;4, 5, np.nan]])\nprint(array)\nprint(pd.isna(array))<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> [[ 1. nan&nbsp; 3.]\n&nbsp;[ 4.&nbsp; 5. 
nan]]\n[[False&nbsp; True False]\n&nbsp;[False False&nbsp; True]]<\/pre>\n\n\n\n<p>For indexes, an array of booleans is returned.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>index = pd.DatetimeIndex(&#91;\"2017-07-05\", \"2017-07-06\", None,\n                          \"2017-07-08\"])\nprint(index)\n<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;dtype='datetime64[ns]', freq=None)<\/pre>\n\n\n\n<p>#checking for nulls in indexes<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pd.isna(index)<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> array([False, False,&nbsp; True, False])<\/pre>\n\n\n\n<p>#checking for nulls in Series and DataFrames<\/p>\n\n\n\n<p>For a Series or a DataFrame, an object of the same type is returned, containing booleans.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = pd.DataFrame(&#91;&#91;'ant', 'bee', 'cat'], &#91;'dog', None, 'fly']])\nprint(df)<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> &nbsp;&nbsp;0 &nbsp; &nbsp; 1&nbsp; &nbsp; 2\n0&nbsp; ant &nbsp; bee&nbsp; cat\n1&nbsp; dog&nbsp; None&nbsp; fly<\/pre>\n\n\n\n<p>#use of isna() on a DataFrame<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pd.isna(df)<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> &nbsp;0&nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; 2\n0&nbsp; False&nbsp; False&nbsp; False\n1&nbsp; False &nbsp; True&nbsp; False<\/pre>\n\n\n\n<p>#checking for nulls in the column labelled 1 (the second column)<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pd.isna(df&#91;1])<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> 0&nbsp; &nbsp; False\n1 &nbsp; &nbsp; True\nName: 
1, dtype: bool<\/pre>\n\n\n\n<p>Now let's inspect the Boston dataset.<br><\/p>\n\n\n\n<p>First of all, we do a quantitative analysis. The descriptive statistics of each column give us a good idea of the dataset. We use the describe() function of pandas for this.<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\ndf = pd.read_csv(\"Boston.csv\")\n\n# summary statistics (count, mean, std, min, quartiles, max) per column\nprint(df.describe())<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> Unnamed: 0&nbsp; &nbsp; &nbsp; &nbsp; crim&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; zn&nbsp; ... &nbsp; &nbsp; &nbsp; black &nbsp; &nbsp; &nbsp; lstat&nbsp; &nbsp; &nbsp; &nbsp; medv\ncount&nbsp; 506.000000&nbsp; 506.000000&nbsp; 506.000000&nbsp; ...&nbsp; 506.000000&nbsp; 506.000000&nbsp; 506.000000\nmean &nbsp; 253.500000&nbsp; &nbsp; 3.613524 &nbsp; 11.363636&nbsp; ...&nbsp; 356.674032 &nbsp; 12.653063 &nbsp; 22.532806\nstd&nbsp; &nbsp; 146.213884&nbsp; &nbsp; 8.601545 &nbsp; 23.322453&nbsp; ... 
&nbsp; 91.294864&nbsp; &nbsp; 7.141062&nbsp; &nbsp; 9.197104\nmin&nbsp; &nbsp; &nbsp; 1.000000&nbsp; &nbsp; 0.006320&nbsp; &nbsp; 0.000000&nbsp; ...&nbsp; &nbsp; 0.320000&nbsp; &nbsp; 1.730000&nbsp; &nbsp; 5.000000\n25%&nbsp; &nbsp; 127.250000&nbsp; &nbsp; 0.082045&nbsp; &nbsp; 0.000000&nbsp; ...&nbsp; 375.377500&nbsp; &nbsp; 6.950000 &nbsp; 17.025000\n50%&nbsp; &nbsp; 253.500000&nbsp; &nbsp; 0.256510&nbsp; &nbsp; 0.000000&nbsp; ...&nbsp; 391.440000 &nbsp; 11.360000 &nbsp; 21.200000\n75%&nbsp; &nbsp; 379.750000&nbsp; &nbsp; 3.677082 &nbsp; 12.500000&nbsp; ...&nbsp; 396.225000 &nbsp; 16.955000 &nbsp; 25.000000\nmax&nbsp; &nbsp; 506.000000 &nbsp; 88.976200&nbsp; 100.000000&nbsp; ...&nbsp; 396.900000 &nbsp; 37.970000 &nbsp; 50.000000\n\n[8 rows x 15 columns]<\/pre>\n\n\n\n<p>The count for each column is 506, which means there are 506 non-null values in each column.<\/p>\n\n\n\n<p>Similarly, the mean, standard deviation, minimum and maximum values, and the first, second, and third quartiles of each column are also printed.<br><\/p>\n\n\n\n<p>We can also use these values to eliminate bad data manually. For example, if we know the valid range of values for each column beforehand, we can check the values for consistency and eliminate the erroneous ones.<br><\/p>\n\n\n\n<p>But this has to be done manually. In our case, the Boston dataset is a standard dataset, so we can use the given values without worrying about the quality or correctness of the data. Real-world datasets are more complex, and all of these checks have to be carried out.<br><\/p>\n\n\n\n<p>In this dataset, there are no missing values. 
Had there been missing values, we could have done something like this:<br><\/p>\n\n\n\n<p>#Filling null values with a single value<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># importing pandas as pd \nimport pandas as pd \n  \n# importing numpy as np \nimport numpy as np \n  \n# dictionary of lists (named data to avoid shadowing the built-in dict) \ndata = {'A':&#91;100, 90, np.nan, 95], \n        'B': &#91;30, 45, 56, np.nan], \n        'C':&#91;np.nan, 40, 80, 98]} \n  \n# creating a dataframe from dictionary \ndf = pd.DataFrame(data) \n\nprint(df)\n# filling missing values using fillna(); it returns a new dataframe \nprint(df.fillna(0)) \n<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> &nbsp;A &nbsp; &nbsp; B &nbsp; &nbsp; C\n0&nbsp; 100.0&nbsp; 30.0 &nbsp; NaN\n1 &nbsp; 90.0&nbsp; 45.0&nbsp; 40.0\n2&nbsp; &nbsp; NaN&nbsp; 56.0&nbsp; 80.0\n3 &nbsp; 95.0 &nbsp; NaN&nbsp; 98.0\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A &nbsp; &nbsp; B &nbsp; &nbsp; C\n0&nbsp; 100.0&nbsp; 30.0 &nbsp; 0.0\n1 &nbsp; 90.0&nbsp; 45.0&nbsp; 40.0\n2&nbsp; &nbsp; 0.0&nbsp; 56.0&nbsp; 80.0\n3 &nbsp; 95.0 &nbsp; 0.0&nbsp; 98.0<\/pre>\n\n\n\n<p>In the above example, we have a dictionary with three keys: A, B, and C. We use the dictionary to create a dataframe.<br><\/p>\n\n\n\n<p>We see that the data frame shows its null values as NaN. 
So we replace all null values with 0.<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-encoding\"><strong>Data Encoding<\/strong><\/h2>\n\n\n\n<p>Data is in general of two types, quantitative and qualitative.<\/p>\n\n\n\n<p>Quantitative data deals with numbers and things that can be measured:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Dimensions (height, width, and length)<\/li>\n\n\n\n<li>Temperature<\/li>\n\n\n\n<li>Humidity<\/li>\n\n\n\n<li>Prices<\/li>\n\n\n\n<li>Area and volume<\/li>\n<\/ul>\n\n\n\n<p>There are many more examples of quantitative data.<br><\/p>\n\n\n\n<p>Qualitative data deals with characteristics and descriptors that can't be easily measured but can be observed subjectively, such as smells, tastes, textures, attractiveness, and color.&nbsp;<br><\/p>\n\n\n\n<p>Broadly speaking, when we measure something and give it a numeric value, we generate quantitative data. When we classify or judge something, we generate qualitative data.<\/p>\n\n\n\n<p>There are also different types of quantitative and qualitative data.<br><\/p>\n\n\n\n<p>The type of data we are concerned with here is categorical data. Categorical data sorts observations into classes by assigning a label to each class.&nbsp;<br><\/p>\n\n\n\n<p>Since most machine learning algorithms work on numeric data, we have to convert these labels into numbers. This can be done in primarily two ways:<br><\/p>\n\n\n\n<p><strong>Label Encoding<\/strong> - In label encoding, we assign a numeric label to each category. 
There are as many numeric labels as there are categories.<\/p>\n\n\n\n<p><strong>One Hot Encoding<\/strong> - One-hot encoding creates one extra column per category. Each row has a 1 in the column of the category it belongs to and a 0 in all the others.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\nfrom sklearn.preprocessing import LabelEncoder, OneHotEncoder\n\ndf_iris = pd.read_csv(\"iris.csv\")\nprint(df_iris.columns)\n\n# one-hot encoding: one column per species, returned as a sparse matrix\nonehot_encoder = OneHotEncoder()\nX = onehot_encoder.fit_transform(df_iris&#91;\"species\"].values.reshape(-1, 1))\nprint(X)\n\n# label encoding: replace each species name with an integer label\nlabel_encoder_x = LabelEncoder()\ndf_iris&#91;\"species\"] = label_encoder_x.fit_transform(df_iris&#91;\"species\"])\nprint(df_iris)<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'species'],\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;dtype='object')\n&nbsp;&nbsp;(0, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(1, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(2, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(3, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(4, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(5, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(6, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(7, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(8, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(9, 0)&nbsp; &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(10, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(11, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(12, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(13, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(14, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(15, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(16, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(17, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(18, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(19, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(20, 0) &nbsp; 
&nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(21, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(22, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(23, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(24, 0) &nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;: &nbsp; &nbsp; :\n&nbsp;&nbsp;(125, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(126, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(127, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(128, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(129, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(130, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(131, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(132, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(133, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(134, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(135, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(136, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(137, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(138, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(139, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(140, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(141, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(142, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(143, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(144, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(145, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(146, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(147, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(148, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;(149, 2)&nbsp; &nbsp; &nbsp; 1.0\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sepal_length&nbsp; sepal_width&nbsp; petal_length&nbsp; petal_width&nbsp; species\n0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.5 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2&nbsp; &nbsp; &nbsp; &nbsp; 0\n1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2&nbsp; &nbsp; &nbsp; &nbsp; 0\n2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.2 &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; 1.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2&nbsp; &nbsp; &nbsp; &nbsp; 0\n3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2&nbsp; &nbsp; &nbsp; &nbsp; 0\n4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2&nbsp; &nbsp; &nbsp; &nbsp; 0\n..&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ... &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; ...\n145 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.3&nbsp; &nbsp; &nbsp; &nbsp; 2\n146 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.5 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.9&nbsp; &nbsp; &nbsp; &nbsp; 2\n147 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.0&nbsp; &nbsp; &nbsp; &nbsp; 2\n148 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.3&nbsp; &nbsp; &nbsp; &nbsp; 2\n149 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8&nbsp; &nbsp; &nbsp; &nbsp; 2\n\n[150 rows x 5 columns]<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-interpolation\"><strong>Data Interpolation<\/strong><\/h2>\n\n\n\n<p>Interpolation is the process of using known data values to estimate unknown data values. Various interpolation techniques are often used in the atmospheric sciences. 
One of the simplest methods, linear interpolation, requires knowledge of two points and the constant rate of change between them.<\/p>\n\n\n\n<p>Data interpolation is used to fill in the cells of columns that have missing values.<\/p>\n\n\n\n<p>Many different strategies can be used for this. The most prominent are filling with the column mean (average) and k-nearest-neighbours (KNN) imputation.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\nimport pandas as pd\nfrom sklearn.impute import SimpleImputer\n\n# dictionary of lists (named data to avoid shadowing the built-in dict)\ndata = {'A':&#91;100, 90, np.nan, 95], \n        'B': &#91;30, 45, 56, np.nan], \n        'C':&#91;np.nan, 40, 80, 98]} \n\n# creating a dataframe from dictionary\ndf = pd.DataFrame(data)\n\n# independent NumPy copies: one to fit on, one to fill in\ntrainingData = df.to_numpy(copy=True)\ndataset = df.to_numpy(copy=True)\n\n# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;\n# SimpleImputer replaces it and always works column by column\nimputer = SimpleImputer(missing_values=np.nan, strategy=\"mean\")\n\n# fit on column B only (index 1) and fill its missing value with the mean\nimputer = imputer.fit(trainingData&#91;:, 1:2])\ndataset&#91;:, 1:2] = imputer.transform(dataset&#91;:, 1:2])\n\nprint(dataset)<\/code><\/pre>\n\n\n\n<p><strong>Output&nbsp;<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> [[100.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 30.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; nan]\n&nbsp;[ 90.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 45.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 40.&nbsp; &nbsp; &nbsp; &nbsp; ]\n&nbsp;[ &nbsp; &nbsp; &nbsp; &nbsp; nan&nbsp; 56.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 80.&nbsp; &nbsp; &nbsp; &nbsp; ]\n&nbsp;[ 95.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 43.66666667&nbsp; 98.&nbsp; &nbsp; &nbsp; &nbsp; ]]<\/pre>\n\n\n\n<p>#Filling null values in every column with the column mean<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\nimport pandas as pd\nfrom sklearn.impute import SimpleImputer\n\n# dictionary of lists (named data to avoid shadowing the built-in dict)\ndata = {'A':&#91;100, 90, np.nan, 95], \n        'B': &#91;30, 45, 56, np.nan], \n        'C':&#91;np.nan, 40, 
80, 98]} \n\n# creating a dataframe from dictionary\ndf = pd.DataFrame(data)\n\n# learn the mean of every column, then fill each NaN with its column mean\nimp = SimpleImputer(missing_values=np.nan, strategy='mean')\nimp.fit(df)\nprint(imp.transform(df))<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> [[100.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 30.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 72.66666667]\n&nbsp;[ 90.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 45.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 40.&nbsp; &nbsp; &nbsp; &nbsp; ]\n&nbsp;[ 95.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 56.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 80.&nbsp; &nbsp; &nbsp; &nbsp; ]\n&nbsp;[ 95.&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 43.66666667&nbsp; 98.&nbsp; &nbsp; &nbsp; &nbsp; ]]<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-splitting\"><strong>Data Splitting<\/strong><\/h2>\n\n\n\n<p>Before being fed into machine learning algorithms, data is divided into training and test sets.<\/p>\n\n\n\n<p>The scikit-learn library provides the train_test_split() function for this. 
We can specify the fraction of the data we want in the test set, and the function divides the given data into training and test sets.<\/p>\n\n\n\n<p>It returns four values, in this order: the training independent variables, the testing independent variables, the training dependent variable, and the testing dependent variable.<br><\/p>\n\n\n\n<p>Here is an example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\nfrom sklearn.model_selection import train_test_split\n \ndf_iris = pd.read_csv(\"iris.csv\")\nprint(df_iris.columns)\n \n# independent variables (features) and dependent variable (target)\nx = df_iris&#91;&#91;'sepal_length', 'sepal_width', 'petal_length', 'petal_width']]\ny = df_iris&#91;&#91;'species']]\n \n# hold out 20% of the rows for testing; random_state makes the split repeatable\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)\n \nprint(x_train, y_train)\n \nprint(x_test, y_test)<\/code><\/pre>\n\n\n\n<p><strong>Output<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'species'],\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;dtype='object')\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sepal_length&nbsp; sepal_width&nbsp; petal_length&nbsp; petal_width\n137 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8\n84&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n27&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.5 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n127 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8\n132 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
2.2\n..&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ... &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...\n9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.1\n103 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8\n67&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.7 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.0\n117 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 7.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.2\n47&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n&nbsp;\n[120 rows x 4 columns] &nbsp; &nbsp; &nbsp; &nbsp; species\n137 &nbsp; virginica\n84 &nbsp; versicolor\n27 &nbsp; &nbsp; &nbsp; setosa\n127 &nbsp; virginica\n132 &nbsp; virginica\n..&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ...\n9&nbsp; &nbsp; &nbsp; &nbsp; setosa\n103 &nbsp; virginica\n67 &nbsp; versicolor\n117 &nbsp; virginica\n47 &nbsp; &nbsp; &nbsp; setosa\n&nbsp;\n[120 rows x 1 columns]\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sepal_length&nbsp; sepal_width&nbsp; 
petal_length&nbsp; petal_width\n114 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.4\n62&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.0\n33&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n107 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 7.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8\n7 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n100 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.5\n40&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.5 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.3\n86&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n76&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4\n71&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.3\n134 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4\n51&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n73&nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; 6.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.2\n54&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n63&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4\n37&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.1 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.1\n78&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n90&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.2\n45&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.3\n16&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.4\n121 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.0\n66&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.0 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.5\n24&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n126 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.2&nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; 2.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.8\n22&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.2\n44&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.8 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.4\n97&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6.2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.9 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.3\n93&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.0\n26&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1.6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.4 &nbsp; &nbsp; &nbsp; &nbsp; species\n114 &nbsp; virginica\n62 &nbsp; versicolor\n33 &nbsp; &nbsp; &nbsp; setosa\n107 &nbsp; virginica\n7&nbsp; &nbsp; &nbsp; &nbsp; setosa\n100 &nbsp; virginica\n40 &nbsp; &nbsp; &nbsp; setosa\n86 &nbsp; versicolor\n76 &nbsp; versicolor\n71 &nbsp; versicolor\n134 &nbsp; virginica\n51 &nbsp; versicolor\n73 &nbsp; versicolor\n54 &nbsp; versicolor\n63 &nbsp; versicolor\n37 &nbsp; &nbsp; &nbsp; setosa\n78 &nbsp; versicolor\n90 &nbsp; versicolor\n45 &nbsp; &nbsp; &nbsp; setosa\n16 &nbsp; &nbsp; &nbsp; setosa\n121 &nbsp; virginica\n66 &nbsp; versicolor\n24 &nbsp; &nbsp; &nbsp; setosa\n8&nbsp; &nbsp; &nbsp; &nbsp; setosa\n126 &nbsp; virginica\n22 &nbsp; &nbsp; &nbsp; setosa\n44 &nbsp; &nbsp; &nbsp; setosa\n97 &nbsp; versicolor\n93 &nbsp; versicolor\n26 &nbsp; &nbsp; &nbsp; setosa<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"feature-scaling\"><strong>Feature Scaling<\/strong><\/h2>\n\n\n\n<p>Feature scaling is standard normalization of data. 
This is done so that no independent variable dominates the others simply because it is measured on a larger scale.<\/p>\n\n\n\n<p>Each column is standardized individually to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training. This is the last step in data preprocessing.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.preprocessing import StandardScaler\n\n# Learn each column's mean and standard deviation from the training set only\nst_x = StandardScaler()\nx_train = st_x.fit_transform(x_train)\n# Reuse the training statistics to scale the test set\nx_test = st_x.transform(x_test)\nprint(x_train, y_train)\nprint(x_test, y_test)<\/code><\/pre>\n\n\n\n<p>Output<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> [[ 0.61303014&nbsp; 0.10850105&nbsp; 0.94751783&nbsp; 0.73603967]\n&nbsp;[-0.56776627 -0.12400121&nbsp; 0.38491447&nbsp; 0.34808318]\n&nbsp;[-0.80392556&nbsp; 1.03851009 -1.30289562 -1.3330616 ]\n&nbsp;[ 0.25879121 -0.12400121&nbsp; 0.60995581&nbsp; 0.73603967]\n&nbsp;[ 0.61303014 -0.58900572&nbsp; 1.00377816&nbsp; 1.25331499]\n&nbsp;[-0.80392556 -0.82150798&nbsp; 0.04735245&nbsp; 0.21876435]\n&nbsp;[-0.21352735&nbsp; 1.73601687 -1.19037495 -1.20374277]\n&nbsp;[ 0.14071157 -0.82150798&nbsp; 0.72247648&nbsp; 0.47740201]\n&nbsp;[ 0.02263193 -0.12400121&nbsp; 0.21613346&nbsp; 0.34808318]\n&nbsp;[-0.09544771 -1.05401024&nbsp; 0.10361279 -0.03987331]\n&nbsp;[ 1.0853487&nbsp; -0.12400121&nbsp; 0.94751783&nbsp; 1.12399616]\n&nbsp;[-1.39432376&nbsp; 0.34100331 -1.41541629 -1.3330616 ]\n&nbsp;[ 1.20342834&nbsp; 0.10850105&nbsp; 0.72247648&nbsp; 1.38263382]\n&nbsp;[-1.04008484&nbsp; 1.03851009 -1.24663528 -0.81578628]\n&nbsp;[-0.56776627&nbsp; 1.50351461 -1.30289562 -1.3330616 ]\n&nbsp;[-1.04008484 -2.4490238&nbsp; -0.1776889&nbsp; -0.29851096]\n&nbsp;[ 0.73110978 -0.12400121&nbsp; 0.94751783&nbsp; 0.73603967]\n&nbsp;[ 0.96726906&nbsp; 0.57350557&nbsp; 1.0600385 &nbsp; 1.64127148]\n&nbsp;[ 0.14071157 -1.98401928&nbsp; 0.66621615&nbsp; 0.34808318]\n&nbsp;[ 0.96726906 -1.2865125 &nbsp; 1.11629884&nbsp; 0.73603967]\n&nbsp;[-0.33160699 -1.2865125 &nbsp; 0.04735245 -0.16919214]\n&nbsp;[ 
2.14806547 -0.12400121&nbsp; 1.28507985&nbsp; 1.38263382]\n&nbsp;[ 0.49495049&nbsp; 0.57350557&nbsp; 0.49743514&nbsp; 0.47740201]\n&nbsp;[-0.44968663 -1.51901476 -0.00890789 -0.16919214]\n&nbsp;[ 0.49495049 -0.82150798&nbsp; 0.60995581&nbsp; 0.73603967]\n&nbsp;[ 0.49495049 -0.58900572&nbsp; 0.72247648&nbsp; 0.34808318]\n&nbsp;[-1.15816448 -1.2865125 &nbsp; 0.38491447&nbsp; 0.60672084]\n&nbsp;[ 0.49495049 -1.2865125 &nbsp; 0.66621615&nbsp; 0.8653585 ]\n&nbsp;[ 1.32150798&nbsp; 0.34100331&nbsp; 0.49743514&nbsp; 0.21876435]\n&nbsp;[ 0.73110978 -0.12400121&nbsp; 0.77873682&nbsp; 0.99467733]\n&nbsp;[ 0.14071157&nbsp; 0.80600783&nbsp; 0.38491447&nbsp; 0.47740201]\n&nbsp;[-1.27624412&nbsp; 0.10850105 -1.24663528 -1.3330616 ]\n&nbsp;[-0.09544771 -0.82150798&nbsp; 0.72247648&nbsp; 0.8653585 ]\n&nbsp;[-0.33160699 -0.82150798&nbsp; 0.21613346&nbsp; 0.08944552]\n&nbsp;[-0.33160699 -0.35650346 -0.12142856&nbsp; 0.08944552]\n&nbsp;[-0.44968663 -1.2865125 &nbsp; 0.10361279&nbsp; 0.08944552]\n&nbsp;[ 0.25879121 -0.12400121&nbsp; 0.4411748 &nbsp; 0.21876435]\n&nbsp;[ 1.55766726&nbsp; 0.34100331&nbsp; 1.22881951&nbsp; 0.73603967]\n&nbsp;[-0.68584591&nbsp; 1.50351461 -1.30289562 -1.3330616 ]\n&nbsp;[-1.86664232 -0.12400121 -1.52793696 -1.46238043]\n&nbsp;[ 0.61303014 -0.82150798&nbsp; 0.83499716&nbsp; 0.8653585 ]\n&nbsp;[-0.21352735 -0.12400121&nbsp; 0.21613346 -0.03987331]\n&nbsp;[-0.56776627&nbsp; 0.80600783 -1.19037495 -1.3330616 ]\n&nbsp;[-0.21352735&nbsp; 3.13103043 -1.30289562 -1.07442394]\n&nbsp;[ 1.20342834&nbsp; 0.10850105&nbsp; 0.60995581&nbsp; 0.34808318]\n&nbsp;[-1.5124034 &nbsp; 0.10850105 -1.30289562 -1.3330616 ]\n&nbsp;[ 0.02263193 -0.12400121&nbsp; 0.72247648&nbsp; 0.73603967]\n&nbsp;[-0.9220052&nbsp; -1.2865125&nbsp; -0.45899058 -0.16919214]\n&nbsp;[-1.5124034 &nbsp; 0.80600783 -1.35915595 -1.20374277]\n&nbsp;[ 0.37687085 -1.98401928&nbsp; 0.38491447&nbsp; 0.34808318]\n&nbsp;[ 1.55766726&nbsp; 1.27101235&nbsp; 1.28507985&nbsp; 1.64127148]\n&nbsp;[-0.21352735 
-0.35650346&nbsp; 0.21613346&nbsp; 0.08944552]\n&nbsp;[-1.27624412 -0.12400121 -1.35915595 -1.46238043]\n&nbsp;[ 1.43958762 -0.12400121&nbsp; 1.17255917&nbsp; 1.12399616]\n&nbsp;[ 1.20342834&nbsp; 0.34100331&nbsp; 1.0600385 &nbsp; 1.38263382]\n&nbsp;[ 0.73110978 -0.12400121&nbsp; 1.11629884&nbsp; 1.25331499]\n&nbsp;[ 0.61303014 -0.58900572&nbsp; 1.00377816&nbsp; 1.12399616]\n&nbsp;[-0.9220052 &nbsp; 1.73601687 -1.24663528 -1.3330616 ]\n&nbsp;[-1.27624412&nbsp; 0.80600783 -1.24663528 -1.3330616 ]\n&nbsp;[ 0.73110978&nbsp; 0.34100331&nbsp; 0.72247648&nbsp; 0.99467733]\n&nbsp;[ 0.96726906&nbsp; 0.57350557&nbsp; 1.0600385 &nbsp; 1.12399616]\n&nbsp;[-1.63048304 -1.75151702 -1.41541629 -1.20374277]\n&nbsp;[ 0.37687085&nbsp; 0.80600783&nbsp; 0.89125749&nbsp; 1.38263382]\n&nbsp;[-1.15816448 -0.12400121 -1.35915595 -1.3330616 ]\n&nbsp;[-0.21352735 -1.2865125 &nbsp; 0.66621615&nbsp; 0.99467733]\n&nbsp;[ 1.20342834&nbsp; 0.10850105&nbsp; 0.89125749&nbsp; 1.12399616]\n&nbsp;[-1.74856268&nbsp; 0.34100331 -1.41541629 -1.3330616 ]\n&nbsp;[-1.04008484&nbsp; 1.27101235 -1.35915595 -1.3330616 ]\n&nbsp;[ 1.55766726 -0.12400121&nbsp; 1.11629884&nbsp; 0.47740201]\n&nbsp;[-0.9220052 &nbsp; 1.03851009 -1.35915595 -1.20374277]\n&nbsp;[-1.74856268 -0.12400121 -1.41541629 -1.3330616 ]\n&nbsp;[-0.56776627&nbsp; 1.96851913 -1.19037495 -1.07442394]\n&nbsp;[-0.44968663 -1.75151702&nbsp; 0.10361279&nbsp; 0.08944552]\n&nbsp;[ 1.0853487 &nbsp; 0.34100331&nbsp; 1.17255917&nbsp; 1.38263382]\n&nbsp;[ 2.02998583 -0.12400121&nbsp; 1.56638153&nbsp; 1.12399616]\n&nbsp;[-0.9220052 &nbsp; 1.03851009 -1.35915595 -1.3330616 ]\n&nbsp;[-1.15816448&nbsp; 0.10850105 -1.30289562 -1.46238043]\n&nbsp;[-0.80392556&nbsp; 0.80600783 -1.35915595 -1.3330616 ]\n&nbsp;[-0.21352735 -0.58900572&nbsp; 0.38491447&nbsp; 0.08944552]\n&nbsp;[ 0.84918942 -0.12400121&nbsp; 0.32865413&nbsp; 0.21876435]\n&nbsp;[-1.04008484&nbsp; 0.34100331 -1.47167663 -1.3330616 ]\n&nbsp;[-0.9220052 &nbsp; 0.57350557 -1.19037495 
-0.94510511]\n&nbsp;[ 0.61303014 -0.35650346&nbsp; 0.27239379&nbsp; 0.08944552]\n&nbsp;[-0.56776627&nbsp; 0.80600783 -1.30289562 -1.07442394]\n&nbsp;[ 2.14806547 -1.05401024&nbsp; 1.73516253&nbsp; 1.38263382]\n&nbsp;[-1.15816448 -1.51901476 -0.29020957 -0.29851096]\n&nbsp;[ 2.38422475&nbsp; 1.73601687&nbsp; 1.45386085&nbsp; 0.99467733]\n&nbsp;[ 0.96726906&nbsp; 0.10850105&nbsp; 0.32865413&nbsp; 0.21876435]\n&nbsp;[-0.80392556&nbsp; 2.43352365 -1.30289562 -1.46238043]\n&nbsp;[ 0.14071157 -0.12400121&nbsp; 0.55369548&nbsp; 0.73603967]\n&nbsp;[-0.09544771&nbsp; 2.20102139 -1.47167663 -1.3330616 ]\n&nbsp;[ 2.14806547 -0.58900572&nbsp; 1.62264186&nbsp; 0.99467733]\n&nbsp;[-0.9220052 &nbsp; 1.73601687 -1.30289562 -1.20374277]\n&nbsp;[-1.39432376&nbsp; 0.34100331 -1.24663528 -1.3330616 ]\n&nbsp;[ 1.79382654 -0.58900572&nbsp; 1.28507985&nbsp; 0.8653585 ]\n&nbsp;[-1.04008484&nbsp; 0.57350557 -1.35915595 -1.3330616 ]\n&nbsp;[ 0.49495049&nbsp; 0.80600783&nbsp; 1.00377816&nbsp; 1.51195265]\n&nbsp;[-0.21352735 -0.58900572&nbsp; 0.15987312&nbsp; 0.08944552]\n&nbsp;[-0.09544771 -0.82150798&nbsp; 0.04735245 -0.03987331]\n&nbsp;[-0.21352735 -1.05401024 -0.1776889&nbsp; -0.29851096]\n&nbsp;[ 0.61303014&nbsp; 0.34100331&nbsp; 0.83499716&nbsp; 1.38263382]\n&nbsp;[ 0.96726906 -0.12400121&nbsp; 0.77873682&nbsp; 1.38263382]\n&nbsp;[ 0.49495049 -1.2865125 &nbsp; 0.60995581&nbsp; 0.34808318]\n&nbsp;[ 0.96726906 -0.12400121&nbsp; 0.66621615&nbsp; 0.60672084]\n&nbsp;[-1.04008484 -0.12400121 -1.24663528 -1.3330616 ]\n&nbsp;[-0.44968663 -1.51901476 -0.06516822 -0.29851096]\n&nbsp;[ 0.96726906&nbsp; 0.10850105&nbsp; 1.00377816&nbsp; 1.51195265]\n&nbsp;[-0.09544771 -0.82150798&nbsp; 0.72247648&nbsp; 0.8653585 ]\n&nbsp;[-0.9220052 &nbsp; 0.80600783 -1.30289562 -1.3330616 ]\n&nbsp;[ 0.84918942 -0.35650346&nbsp; 0.4411748 &nbsp; 0.08944552]\n&nbsp;[-0.33160699 -0.12400121&nbsp; 0.15987312&nbsp; 0.08944552]\n&nbsp;[ 0.02263193&nbsp; 0.34100331&nbsp; 0.55369548&nbsp; 0.73603967]\n&nbsp;[ 0.49495049 
-1.75151702&nbsp; 0.32865413&nbsp; 0.08944552]\n&nbsp;[-0.44968663&nbsp; 1.03851009 -1.41541629 -1.3330616 ]\n&nbsp;[-0.9220052 &nbsp; 1.50351461 -1.30289562 -1.07442394]\n&nbsp;[-1.15816448&nbsp; 0.10850105 -1.30289562 -1.46238043]\n&nbsp;[ 0.49495049 -0.35650346&nbsp; 1.00377816&nbsp; 0.73603967]\n&nbsp;[-0.09544771 -0.82150798&nbsp; 0.15987312 -0.29851096]\n&nbsp;[ 2.14806547&nbsp; 1.73601687&nbsp; 1.62264186&nbsp; 1.25331499]\n&nbsp;[-1.5124034 &nbsp; 0.34100331 -1.35915595 -1.3330616 ]]&nbsp; &nbsp; &nbsp; species\n137&nbsp; &nbsp; &nbsp; &nbsp; 2\n84 &nbsp; &nbsp; &nbsp; &nbsp; 1\n27 &nbsp; &nbsp; &nbsp; &nbsp; 0\n127&nbsp; &nbsp; &nbsp; &nbsp; 2\n132&nbsp; &nbsp; &nbsp; &nbsp; 2\n.. &nbsp; &nbsp; &nbsp; ...\n9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0\n103&nbsp; &nbsp; &nbsp; &nbsp; 2\n67 &nbsp; &nbsp; &nbsp; &nbsp; 1\n117&nbsp; &nbsp; &nbsp; &nbsp; 2\n47 &nbsp; &nbsp; &nbsp; &nbsp; 0\n&nbsp;\n[120 rows x 1 columns]\n[[-0.09544771 -0.58900572&nbsp; 0.72247648&nbsp; 1.51195265]\n&nbsp;[ 0.14071157 -1.98401928&nbsp; 0.10361279 -0.29851096]\n&nbsp;[-0.44968663&nbsp; 2.66602591 -1.35915595 -1.3330616 ]\n&nbsp;[ 1.6757469&nbsp; -0.35650346&nbsp; 1.39760052&nbsp; 0.73603967]\n&nbsp;[-1.04008484&nbsp; 0.80600783 -1.30289562 -1.3330616 ]\n&nbsp;[ 0.49495049&nbsp; 0.57350557&nbsp; 1.22881951&nbsp; 1.64127148]\n&nbsp;[-1.04008484&nbsp; 1.03851009 -1.41541629 -1.20374277]\n&nbsp;[ 0.96726906&nbsp; 0.10850105&nbsp; 0.49743514&nbsp; 0.34808318]\n&nbsp;[ 1.0853487&nbsp; -0.58900572&nbsp; 0.55369548&nbsp; 0.21876435]\n&nbsp;[ 0.25879121 -0.58900572&nbsp; 0.10361279&nbsp; 0.08944552]\n&nbsp;[ 0.25879121 -1.05401024&nbsp; 1.00377816&nbsp; 0.21876435]\n&nbsp;[ 0.61303014&nbsp; 0.34100331&nbsp; 0.38491447&nbsp; 0.34808318]\n&nbsp;[ 0.25879121 -0.58900572&nbsp; 0.49743514 -0.03987331]\n&nbsp;[ 0.73110978 -0.58900572&nbsp; 0.4411748 &nbsp; 0.34808318]\n&nbsp;[ 0.25879121 -0.35650346&nbsp; 0.49743514&nbsp; 0.21876435]\n&nbsp;[-1.15816448&nbsp; 0.10850105 -1.30289562 
-1.46238043]\n&nbsp;[ 0.14071157 -0.35650346&nbsp; 0.38491447&nbsp; 0.34808318]\n&nbsp;[-0.44968663 -1.05401024&nbsp; 0.32865413 -0.03987331]\n&nbsp;[-1.27624412 -0.12400121 -1.35915595 -1.20374277]\n&nbsp;[-0.56776627&nbsp; 1.96851913 -1.41541629 -1.07442394]\n&nbsp;[-0.33160699 -0.58900572&nbsp; 0.60995581&nbsp; 0.99467733]\n&nbsp;[-0.33160699 -0.12400121&nbsp; 0.38491447&nbsp; 0.34808318]\n&nbsp;[-1.27624412&nbsp; 0.80600783 -1.07785427 -1.3330616 ]\n&nbsp;[-1.74856268 -0.35650346 -1.35915595 -1.3330616 ]\n&nbsp;[ 0.37687085 -0.58900572&nbsp; 0.55369548&nbsp; 0.73603967]\n&nbsp;[-1.5124034 &nbsp; 1.27101235 -1.5841973&nbsp; -1.3330616 ]\n&nbsp;[-0.9220052 &nbsp; 1.73601687 -1.07785427 -1.07442394]\n&nbsp;[ 0.37687085 -0.35650346&nbsp; 0.27239379&nbsp; 0.08944552]\n&nbsp;[-1.04008484 -1.75151702 -0.29020957 -0.29851096]\n&nbsp;[-1.04008484&nbsp; 0.80600783 -1.24663528 -1.07442394]]&nbsp; &nbsp; &nbsp; species\n114&nbsp; &nbsp; &nbsp; &nbsp; 2\n62 &nbsp; &nbsp; &nbsp; &nbsp; 1\n33 &nbsp; &nbsp; &nbsp; &nbsp; 0\n107&nbsp; &nbsp; &nbsp; &nbsp; 2\n7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0\n100&nbsp; &nbsp; &nbsp; &nbsp; 2\n40 &nbsp; &nbsp; &nbsp; &nbsp; 0\n86 &nbsp; &nbsp; &nbsp; &nbsp; 1\n76 &nbsp; &nbsp; &nbsp; &nbsp; 1\n71 &nbsp; &nbsp; &nbsp; &nbsp; 1\n134&nbsp; &nbsp; &nbsp; &nbsp; 2\n51 &nbsp; &nbsp; &nbsp; &nbsp; 1\n73 &nbsp; &nbsp; &nbsp; &nbsp; 1\n54 &nbsp; &nbsp; &nbsp; &nbsp; 1\n63 &nbsp; &nbsp; &nbsp; &nbsp; 1\n37 &nbsp; &nbsp; &nbsp; &nbsp; 0\n78 &nbsp; &nbsp; &nbsp; &nbsp; 1\n90 &nbsp; &nbsp; &nbsp; &nbsp; 1\n45 &nbsp; &nbsp; &nbsp; &nbsp; 0\n16 &nbsp; &nbsp; &nbsp; &nbsp; 0\n121&nbsp; &nbsp; &nbsp; &nbsp; 2\n66 &nbsp; &nbsp; &nbsp; &nbsp; 1\n24 &nbsp; &nbsp; &nbsp; &nbsp; 0\n8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0\n126&nbsp; &nbsp; &nbsp; &nbsp; 2\n22 &nbsp; &nbsp; &nbsp; &nbsp; 0\n44 &nbsp; &nbsp; &nbsp; &nbsp; 0\n97 &nbsp; &nbsp; &nbsp; &nbsp; 1\n93 &nbsp; &nbsp; &nbsp; &nbsp; 1\n26 &nbsp; &nbsp; &nbsp; &nbsp; 0<\/pre>\n\n\n\n<pre 
 class=\"wp-block-code\"><code>import pandas as pd\nimport numpy as np\n\n# Load the dataset and inspect its column names\ndf_iris = pd.read_csv(\"iris.csv\")\nprint(df_iris.columns)<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n\n# One-hot encoding of the species column (shown for illustration; X is not used below)\nonehot_encoder = OneHotEncoder()\nX = onehot_encoder.fit_transform(df_iris&#91;\"species\"].values.reshape(-1, 1))\nprint(X)\n\n# Label encoding: replace each species name with an integer code\nlabel_encoder_x = LabelEncoder()\ndf_iris&#91;\"species\"] = label_encoder_x.fit_transform(df_iris&#91;\"species\"])\n\n# Separate the features (x) from the target (y)\nx = df_iris&#91;&#91;'sepal_length', 'sepal_width', 'petal_length', 'petal_width']]\ny = df_iris&#91;&#91;'species']]<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>from sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\n# Hold out 20% of the rows as the test set\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)\n\n# Feature scaling: fit on the training set, then apply to both sets\nst_x = StandardScaler()\nx_train = st_x.fit_transform(x_train)\nx_test = st_x.transform(x_test)\nprint(x_train, y_train)\nprint(x_test, y_test)<\/code><\/pre>\n\n\n\n<p>The code above can be reused as a starting template for preprocessing in your future machine learning projects.<\/p>\n\n\n\n<p>To conclude, data preprocessing is a crucial step in machine learning and should be performed diligently.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is data preprocessing? For machine learning, we need data. Lots of it. The more we have, the better our model. Machine learning algorithms are data-hungry. But there\u2019s a catch. They need data in a specific format. In the real world, several terabytes of data is generated by multiple sources. 
But all of it is [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":17844,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[2],"tags":[],"content_type":[],"class_list":["post-16544","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Data Preprocessing Introduction, Concepts and Definition?<\/title>\n<meta name=\"description\" content=\"Data Preprocesing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. 
Know all about it in this article.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Preprocessing Introduction, Concepts and Definition?\" \/>\n<meta property=\"og:description\" content=\"Data Preprocesing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. Know all about it in this article.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-07-30T14:44:10+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T12:30:17+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1183\" \/>\n\t<meta property=\"og:image:height\" content=\"887\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"Data Preprocessing Introduction, Concepts and Definition?\",\"datePublished\":\"2020-07-30T14:44:10+00:00\",\"dateModified\":\"2024-09-03T12:30:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/\"},\"wordCount\":2070,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/iStock-1186776025.jpg\",\"articleSection\":[\"AI and Machine Learning\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/\",\"name\":\"Data Preprocessing Introduction, Concepts and 
Definition?\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/iStock-1186776025.jpg\",\"datePublished\":\"2020-07-30T14:44:10+00:00\",\"dateModified\":\"2024-09-03T12:30:17+00:00\",\"description\":\"Data Preprocesing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. Know all about it in this article.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/iStock-1186776025.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/iStock-1186776025.jpg\",\"width\":1183,\"height\":887,\"caption\":\"Digital background depicting innovative technologies in (AI) artificial systems, neural interfaces and internet machine learning technologies\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-preprocessing\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI and Machine 
Learning\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/artificial-intelligence\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Data Preprocessing Introduction, Concepts and Definition?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great 
Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Data Preprocessing Introduction, Concepts and Definition?","description":"Data Preprocessing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. Know all about it in this article.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/","og_locale":"en_US","og_type":"article","og_title":"Data Preprocessing Introduction, Concepts and Definition?","og_description":"Data Preprocessing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. 
Know all about it in this article.","og_url":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2020-07-30T14:44:10+00:00","article_modified_time":"2024-09-03T12:30:17+00:00","og_image":[{"width":1183,"height":887,"url":"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"Data Preprocessing Introduction, Concepts and Definition?","datePublished":"2020-07-30T14:44:10+00:00","dateModified":"2024-09-03T12:30:17+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/"},"wordCount":2070,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg","articleSection":["AI and Machine 
Learning"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/","url":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/","name":"Data Preprocessing Introduction, Concepts and Definition?","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg","datePublished":"2020-07-30T14:44:10+00:00","dateModified":"2024-09-03T12:30:17+00:00","description":"Data Preprocessing: the process of cleaning raw data for it to be used for machine learning activities is known as data preprocessing. 
Know all about it in this article.","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg","width":1183,"height":887,"caption":"Digital background depicting innovative technologies in (AI) artificial systems, neural interfaces and internet machine learning technologies"},{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/data-preprocessing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"AI and Machine Learning","item":"https:\/\/www.mygreatlearning.com\/blog\/artificial-intelligence\/"},{"@type":"ListItem","position":3,"name":"Data Preprocessing Introduction, Concepts and Definition?"}]},{"@type":"WebSite","@id":"https:\/\/www.mygreatlearning.com\/blog\/#website","url":"https:\/\/www.mygreatlearning.com\/blog\/","name":"Great Learning Blog","description":"Learn, Upskill &amp; Career Development Guide and Resources","publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"alternateName":"Great Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.mygreatlearning.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization","name":"Great 
Learning","url":"https:\/\/www.mygreatlearning.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","width":900,"height":900,"caption":"Great Learning"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/GreatLearningOfficial\/","https:\/\/x.com\/Great_Learning","https:\/\/www.instagram.com\/greatlearningofficial\/","https:\/\/www.linkedin.com\/school\/great-learning\/","https:\/\/in.pinterest.com\/greatlearning12\/","https:\/\/www.youtube.com\/user\/beaconelearning\/"],"description":"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.","email":"info@mygreatlearning.com","legalName":"Great Learning Education Services Pvt. 
Ltd","foundingDate":"2013-11-29","numberOfEmployees":{"@type":"QuantitativeValue","minValue":"1001","maxValue":"5000"}},{"@type":"Person","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad","name":"Great Learning Editorial Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","caption":"Great Learning Editorial Team"},"description":"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.","sameAs":["https:\/\/www.mygreatlearning.com\/","https:\/\/in.linkedin.com\/school\/great-learning\/","https:\/\/x.com\/https:\/\/twitter.com\/Great_Learning","https:\/\/www.youtube.com\/channel\/UCObs0kLIrDjX2LLSybqNaEA"],"award":["Best EdTech Company of the Year 2024","Education Economictimes Outstanding Education\/Edtech Solution Provider of the Year 2024","Leading E-learning Platform 
2024"],"url":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"}]}},"uagb_featured_image_src":{"full":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",1183,887,false],"thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025-150x150.jpg",150,150,true],"medium":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025-300x225.jpg",300,225,true],"medium_large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025-768x576.jpg",768,576,true],"large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025-1024x768.jpg",1024,768,true],"1536x1536":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",1183,887,false],"2048x2048":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",1183,887,false],"web-stories-poster-portrait":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",640,480,false],"web-stories-publisher-logo":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",96,72,false],"web-stories-thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/iStock-1186776025.jpg",150,112,false]},"uagb_author_info":{"display_name":"Great Learning Editorial Team","author_link":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"},"uagb_comment_info":0,"uagb_excerpt":"What is data preprocessing? For machine learning, we need data. Lots of it. The more we have, the better our model. Machine learning algorithms are data-hungry. But there\u2019s a catch. They need data in a specific format. In the real world, several terabytes of data is generated by multiple sources. 
But all of it is&hellip;","_links":{"self":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/16544","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/comments?post=16544"}],"version-history":[{"count":10,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/16544\/revisions"}],"predecessor-version":[{"id":111571,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/16544\/revisions\/111571"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media\/17844"}],"wp:attachment":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media?parent=16544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/categories?post=16544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/tags?post=16544"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/content_type?post=16544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}