Data visualization is a graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Contributed by: Dinesh
Benefits of good data visualization
Our eyes are drawn to colours and patterns. We can quickly tell red from blue and a square from a circle. Our culture is visual, including everything from art and advertisements to TV and movies.
Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend, you know how much more effective a visualization can be. The uses of data visualization are as follows:
- It is a powerful way to explore data and produce presentable results.
- Its primary use is in the pre-processing stage of the data mining process.
- It supports the data cleaning process by revealing incorrect and missing values.
- It helps with variable derivation and selection, that is, deciding which variables to include in the analysis and which to discard.
- It also plays a role in combining categories as part of the data reduction process.
Data Visualization Techniques
- Box plots
- Histograms
- Heat maps
- Charts (line, bar, pie, scatter, bubble, timeline)
- Tree maps
- Word Cloud/Network diagram
A box plot is a standardized way of displaying the distribution of data based on a five-number summary: “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”. It shows which values are outliers and what those values are. It can also tell you whether your data is symmetrical, how tightly it is grouped, and whether and how it is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are spread out. Although box plots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets. For some distributions/datasets, you will find that you need more information than the measures of central tendency (median, mean, and mode). You need to have information on the variability or dispersion of the data.
Five Number Summary of Box Plot
|Term|Definition|
|---|---|
|“Minimum”|Q1 - 1.5*IQR|
|First quartile (Q1 / 25th percentile)|The middle number between the smallest number (not the “minimum”) and the median of the dataset|
|Median (Q2 / 50th percentile)|The middle value of the dataset|
|Third quartile (Q3 / 75th percentile)|The middle value between the median and the highest value (not the “maximum”) of the dataset|
|“Maximum”|Q3 + 1.5*IQR|
|Interquartile range (IQR)|The range from the 25th to the 75th percentile (Q3 - Q1)|
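The quantities a box plot draws can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the function name `box_plot_summary`, the sample data, and the choice of the "inclusive" quartile method are assumptions made for illustration (other quartile conventions give slightly different values). The lower fence, Q1 - 1.5*IQR, is the usual counterpart of the “maximum” fence Q3 + 1.5*IQR.

```python
import statistics


def box_plot_summary(values):
    """Return the quantities a box plot draws for a list of numbers."""
    data = sorted(values)
    # Quartiles from the standard library. method="inclusive" is one of
    # several conventions; it is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # the box plot "minimum"
    upper_fence = q3 + 1.5 * iqr  # the box plot "maximum"
    # Anything beyond the fences is drawn as an individual outlier point.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data with one extreme value
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

With this sample, the value 30 lies above the upper fence and is reported as an outlier, while the smallest value, 2, is well inside the lower fence.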
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.
A histogram lets you discover and show the underlying frequency distribution (shape) of a set of continuous data, so you can inspect the data for its underlying distribution (e.g., normal distribution), outliers, skewness, and so on. It represents the distribution of a single numerical variable. To build one, divide the entire range of values into a series of intervals, called bins or buckets, and then count how many values fall into each interval.
Bins are consecutive, non-overlapping intervals of a variable. Because adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
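The binning step can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library routine: the function name `histogram_counts`, its parameters, and the height data are all assumptions. Each bin is a half-open interval, so adjacent bins share an edge but never overlap.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Bin i covers the half-open interval
    [start + i * width, start + (i + 1) * width).
    Values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


heights_cm = [151, 158, 160, 162, 165, 167, 168, 171, 174, 179, 183]  # made-up data
counts = histogram_counts(heights_cm, start=150, width=10, n_bins=4)

# Print a text histogram: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    low = 150 + 10 * i
    print(f"[{low}, {low + 10}) {'#' * count}")
```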
Histograms are based on area, not height of bars
In a histogram, the height of a bar does not necessarily indicate how many values fall within its bin. It is the area of the bar, that is, its height multiplied by the width of the bin, that indicates the frequency of occurrences within that bin. Height is often mistaken for frequency because many histograms use equally wide bins, and under these circumstances the height of each bar does reflect the frequency.
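A small numeric sketch makes the area rule concrete. It assumes bins of unequal width and a bar height defined as count divided by bin width, so that height times width gives back the count; the function name `density_heights` and the age data are illustrative assumptions.

```python
def density_heights(values, edges):
    """Return (counts, heights) for bins that may have unequal widths.

    Each bar's height is count / width, so height * width (the bar's
    area) recovers the count for that bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    heights = [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]
    return counts, heights


ages = [3, 7, 12, 15, 18, 22, 25, 31, 38, 44, 52, 67]  # made-up data
edges = [0, 10, 20, 40, 80]  # bin widths: 10, 10, 20, 40
counts, heights = density_heights(ages, edges)

for i, (count, height) in enumerate(zip(counts, heights)):
    width = edges[i + 1] - edges[i]
    print(f"[{edges[i]}, {edges[i + 1]}): count={count}, "
          f"height={height:.3f}, area={height * width:.1f}")
```

Here the bin from 20 to 40 holds the most values (4), yet its bar is no taller than the bar for the bin from 0 to 10, which holds only 2, because it is twice as wide.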
Histogram Vs Bar Chart
The major difference is that a histogram is only used to plot the frequency of score occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for a lot of other types of variables including ordinal and nominal data sets.
A heat map uses colour the way a bar graph uses height and width: as a way to encode data values.
For example, if you’re looking at a web page and you want to know which areas get the most attention, a heat map shows you in a visual way that’s easy to take in and make decisions from. More generally, a heat map is a graphical representation of data in which the individual values contained in a matrix are represented as colours. Heat maps are useful for two purposes: visualizing correlation tables and visualizing missing values in the data. In both cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they are not a replacement for more precise graphical displays, such as bar charts, because colour differences cannot be perceived accurately.
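To illustrate the correlation-table use case, the sketch below computes a Pearson correlation matrix in plain Python and prints it as a text "heat map", with five characters standing in for a colour scale. The column names, the data, and the shading scheme are all made up for illustration; a real heat map would map values to colours instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return (names, matrix) with the correlation of every column pair."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix


SHADES = " .:*#"  # light to dark; a stand-in for a colour scale


def shade(value):
    """Map the strength |value| in [0, 1] to one of five text shades."""
    index = min(int(abs(value) * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


columns = {  # hypothetical measurements
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 79],
    "hours_gaming": [9, 7, 6, 4, 2],
}
names, matrix = correlation_matrix(columns)
for name, row in zip(names, matrix):
    print(f"{name:>14} " + " ".join(shade(v) for v in row))
```

Because the shading uses the absolute value, strong negative and strong positive correlations look equally dark here; a diverging colour scale would be needed to tell them apart.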
The simplest technique, a line plot, is used to plot the relationship or dependence of one variable on another. In a typical plotting library, plotting the relationship between the two variables takes a single call to a plot function.
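The article does not say which plotting library it has in mind. As one possible illustration, the sketch below assumes Matplotlib, where the call is `Axes.plot`; the sales figures and the output file name are made up.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 145, 170, 190]  # made-up illustrative data

fig, ax = plt.subplots()
# One call draws the line: x values first, then the dependent y values.
(line,) = ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly sales")
fig.savefig("line_plot.png")  # hypothetical output file name
```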
Bar charts are used for comparing the quantities of different categories or groups. Each category’s value is represented by a bar, drawn either vertically or horizontally, with the length or height of the bar representing the value.
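As a rough sketch of the idea, the code below counts occurrences of a categorical variable and renders each count as a horizontal text bar whose length is proportional to the value. The function name `text_bar_chart` and the order data are illustrative assumptions.

```python
from collections import Counter


def text_bar_chart(counts, width=30):
    """Render category counts as horizontal bars scaled to `width` chars."""
    largest = max(counts.values())
    lines = []
    for category, value in counts.most_common():
        bar = "#" * round(value / largest * width)
        lines.append(f"{category:<8} {bar} {value}")
    return lines


# Made-up categorical data: one entry per order.
orders = ["tea", "coffee", "coffee", "juice", "tea", "coffee", "tea", "coffee"]
counts = Counter(orders)
chart = text_bar_chart(counts, width=20)
print("\n".join(chart))
```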
A pie chart is a circular statistical graph divided into slices to illustrate numerical proportion. The arc length of each slice is proportional to the quantity it represents. As a rule, pie charts are used to compare the parts of a whole and are most effective when there are few components and when text and percentages are included to describe the content. However, they can be difficult to interpret because the human eye has a hard time estimating areas and comparing visual angles.
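The proportionality is simple arithmetic: each part's share of the total, multiplied by 360 degrees, gives the angle its slice sweeps. A minimal sketch, with a made-up budget and a hypothetical helper `pie_slices`:

```python
def pie_slices(values):
    """Return (label, percent, sweep_degrees) for each part of the whole."""
    total = sum(values.values())
    return [
        (label, 100 * v / total, 360 * v / total)
        for label, v in values.items()
    ]


budget = {"rent": 1200, "food": 600, "transport": 300, "other": 300}  # made-up data
slices = pie_slices(budget)
for label, percent, degrees in slices:
    print(f"{label:<10} {percent:5.1f}%  {degrees:6.1f} degrees")
```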
Another common visualization technique is the scatter plot, a two-dimensional plot representing the joint variation of two data items. Each marker (a symbol such as a dot, square, or plus sign) represents an observation, and the marker’s position indicates the values for that observation. When you assign more than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible pairing of the measures assigned to the visualization. Scatter plots are used for examining the relationship, or correlation, between X and Y variables.
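The "every possible pairing" step can be sketched with `itertools.combinations`. The measure names and values below are made up, and the sketch lists each unordered pair once; a full scatter plot matrix often also shows the mirrored panels with the axes swapped.

```python
from itertools import combinations


def scatter_pairs(measures):
    """Yield (x_name, y_name, points) for every unordered pair of measures."""
    for x_name, y_name in combinations(measures, 2):
        # One (x, y) marker position per observation.
        points = list(zip(measures[x_name], measures[y_name]))
        yield x_name, y_name, points


measures = {  # hypothetical observations, one list entry per person
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 58, 69, 80],
    "age_years": [31, 24, 45, 38],
}
panels = list(scatter_pairs(measures))
for x_name, y_name, points in panels:
    print(f"{x_name} vs {y_name}: {points}")
```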
A bubble chart is a variation of the scatter chart in which the data points are replaced with bubbles, and an additional dimension of the data is represented by the size of the bubbles.
Timeline charts illustrate events in chronological order, for example the progress of a project, an advertising campaign, or an acquisition process, in whatever unit of time the data was recorded (for example week, month, quarter, or year). A timeline shows the chronological sequence of past or future events on a timescale.
A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles, with parent elements tiled by their child elements. The sizes and colours of the rectangles correspond to the values of the data points they represent. A leaf node rectangle has an area proportional to the specified dimension of the data. Depending on the choice, the leaf node is coloured, sized, or both according to the chosen attributes. Treemaps make efficient use of space and can therefore display thousands of items on the screen simultaneously.
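The text does not name a layout algorithm. As one simple possibility, the sketch below assumes a "slice-and-dice" style layout: the parent rectangle is cut into side-by-side strips for the top-level groups, and each strip is then cut the other way for its children, so every leaf's area is proportional to its value. The function name `slice_layout` and the two-level data are illustrative assumptions.

```python
def slice_layout(items, x, y, width, height, vertical=True):
    """Split a rectangle into strips with areas proportional to values.

    items is a list of (label, value) pairs; the result is a list of
    (label, x, y, width, height) rectangles that tile the input rectangle.
    """
    total = sum(value for _, value in items)
    rects = []
    offset = 0.0
    for label, value in items:
        share = value / total
        if vertical:  # side-by-side strips, full height
            rects.append((label, x + offset, y, width * share, height))
            offset += width * share
        else:  # stacked strips, full width
            rects.append((label, x, y + offset, width, height * share))
            offset += height * share
    return rects


tree = {  # made-up two-level hierarchy: group -> leaf -> value
    "Fruit": {"apples": 30, "pears": 10},
    "Dairy": {"milk": 40, "cheese": 20},
}

# Parent rectangles are sized by the sum of their children.
parents = [(name, sum(children.values())) for name, children in tree.items()]
layout = []
for name, px, py, pw, ph in slice_layout(parents, 0, 0, 100, 50, vertical=True):
    children = list(tree[name].items())
    # Tile each parent with its children, cutting in the other direction.
    layout.extend(slice_layout(children, px, py, pw, ph, vertical=False))

for label, rx, ry, rw, rh in layout:
    print(f"{label:<7} x={rx:6.2f} y={ry:6.2f} w={rw:6.2f} h={rh:6.2f} area={rw * rh:8.2f}")
```

Slice-and-dice can produce long, thin rectangles when values differ greatly, which is one reason practical treemap tools often use other tiling strategies.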
Word Clouds and Network Diagrams for Unstructured Data
The variety of big data brings challenges because semi-structured and unstructured data require new visualization techniques. A word cloud represents the frequency of a word within a body of text by its relative size in the cloud. This technique is used on unstructured data as a way to display high- or low-frequency words.
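The core of a word cloud is a word count mapped to a display size. The sketch below is a minimal illustration under stated assumptions: a linear mapping from count to font size, no stop-word filtering, and a made-up sentence; the function name `word_sizes` and its parameters are hypothetical.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=40, top=5):
    """Map the `top` most frequent words to font sizes.

    Sizes scale linearly from min_size (least frequent of the selected
    words) to max_size (most frequent). The linear scale is an assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top)
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts tie
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts
    }


text = "Data tells a story and good data makes the story clear because data is everywhere"
sizes = word_sizes(text, top=3)
for word, size in sizes.items():
    print(f"{word}: {size:.0f}pt")
```

In practice, common words such as "a" and "the" are usually filtered out first so that they do not dominate the cloud.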
Another visualization technique that can be used for semi-structured or unstructured data is the network diagram. Network diagrams represent relationships as nodes (individual actors within the network) and ties (relationships between the individuals). They are used in many applications, for example for analysis of social networks or mapping product sales across geographic areas.
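Before a network diagram can be drawn, the nodes and ties have to be represented as data. A minimal sketch, assuming undirected ties stored as an adjacency map; the names and relationships are made up, and `build_network` is a hypothetical helper.

```python
from collections import defaultdict


def build_network(ties):
    """Build an undirected adjacency map from (node, node) ties."""
    neighbours = defaultdict(set)
    for a, b in ties:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


# Each tie is a relationship between two individuals (made-up data).
ties = [("Ana", "Ben"), ("Ana", "Cho"), ("Ben", "Cho"), ("Cho", "Dev")]
network = build_network(ties)

# Degree = number of ties per node; a diagram might use it for node size.
degree = {node: len(adjacent) for node, adjacent in network.items()}
most_connected = max(degree, key=degree.get)
print(degree)
print("Most connected:", most_connected)
```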
These are some of the visualization techniques used to represent data effectively so that it is easier to understand and interpret. We hope this article was useful.