Free downloadable datasets

A dataset is one of the main ingredients of a machine learning model. Before starting with any algorithm, we need a proper understanding of the data. The machine learning datasets listed here are mostly used for research and learning purposes. Most of the datasets are homogeneous in nature.

We use a dataset to train and evaluate our model, so it plays a vital role in the whole process. If the dataset is structured, low in noise, and properly cleaned, the model is more likely to reach good accuracy at evaluation time.

Top 20 datasets which are easily available online to train your Machine Learning Algorithm

  1. ImageNet
  2. Coco dataset
  3. Iris Flower dataset
  4. Breast cancer Wisconsin (Diagnostic) Dataset
  5. Twitter sentiment Analysis Dataset
  6. MNIST dataset (handwritten data)
  7. Fashion MNIST dataset
  8. Amazon review dataset
  9. Spam SMS classifier dataset
  10. Spam-Mails Dataset
  11. Youtube Dataset
  12. CIFAR-10
  13. IMDB reviews
  14. Sentiment 140
  15. Facial image Dataset
  16. Wine Quality Dataset
  17. The Wikipedia corpus
  18. Free Spoken digit dataset
  19. Boston House price dataset
  20. Pima Indian Diabetes dataset

1. ImageNet:

The ImageNet dataset was built by a group of researchers, and its images are organized according to the WordNet hierarchy. It can be used for machine learning and for computer vision research. In the WordNet hierarchy, each concept is described by a synonym set, or synset, which consists of one or more words or word phrases. WordNet contains more than 100,000 synsets.

Size of the Dataset: ~ 150 GB

  • Each record consists of an image with bounding boxes and the respective class labels
  • ImageNet aims to provide about 1,000 images on average for each synset
  • URLs of the images are provided by ImageNet
  • Because it is such a large-scale image dataset, it is widely used by researchers

Download the Dataset


2. Coco dataset:

COCO stands for Common Objects in Context. It is a large-scale object detection, segmentation, and captioning dataset with 1.5 million object instances across 80 object categories.

COCO provides five types of annotation:

  • object detection
  • keypoint detection
  • stuff segmentation
  • panoptic segmentation
  • image captioning

In the COCO dataset, annotations are stored in JSON files.
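Because the annotations are plain JSON, they can be explored with the Python standard library alone. The sketch below assumes the usual COCO layout with top-level "images", "categories", and "annotations" lists; the sample values are hand-written placeholders, not real data, and the helper name count_instances_per_category is made up for this example.

```python
import json
from collections import Counter

# Tiny hand-written sample in the COCO layout (hypothetical values, not real data).
sample_json = """
{
  "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,  "bbox": [10, 20, 100, 200]},
    {"id": 11, "image_id": 1, "category_id": 18, "bbox": [300, 250, 80, 60]},
    {"id": 12, "image_id": 1, "category_id": 1,  "bbox": [400, 30, 90, 210]}
  ]
}
"""


def count_instances_per_category(coco):
    """Count annotated object instances per category name."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


coco = json.loads(sample_json)
instance_counts = count_instances_per_category(coco)
print(instance_counts)
```

For a downloaded annotation file, the same function should work on the result of json.load applied to the opened file.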

Features provided by the COCO dataset:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Download the Dataset

3. Iris Flower Dataset:

The Iris flower dataset is well suited to beginners who are just starting to learn machine learning techniques and algorithms. With this data you can build a simple first machine learning project. The dataset is small and needs little or no pre-processing. It covers three types of iris flowers, Setosa, Versicolour, and Virginica, with their petal and sepal lengths and widths stored in a 150×4 numpy.ndarray.

Features

  • The dataset consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm.
  • This dataset has three classes.
  • Each class has 50 instances; the classes are Virginica, Setosa, and Versicolor.
  • The dataset is multivariate.
  • All of the attributes are real-valued.

Download the Dataset


4. Breast cancer Wisconsin (Diagnostic) Dataset:

Breast cancer Wisconsin (Diagnostic) Dataset is one of the most popular datasets for classification problems in machine learning. This dataset is based on breast cancer analysis. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe the characteristics of the cell nuclei present in the image.

Features

  • Three types of attributes are in the dataset: an ID, a diagnosis, and 30 real-valued input features.
  • For each cell nucleus, ten real-valued features are calculated (radius, texture, perimeter, area, etc.); the mean, standard error, and worst value of each feature give the 30 input features.
  • Two classes are to be predicted: benign and malignant.
  • The dataset has 569 instances in total: 357 benign and 212 malignant.

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3-32. Ten real-valued features computed for each cell nucleus:

  • radius (mean of distances from the centre to points on the perimeter)
  • texture (standard deviation of grey-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area – 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” – 1)
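To make one of these definitions concrete, the compactness formula from the list can be evaluated for a simple shape. This is only an illustrative sketch: the circular "nucleus" and its radius are hypothetical values chosen for the arithmetic, not measurements from the dataset.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# Hypothetical, perfectly circular nucleus of radius 5 (arbitrary units).
radius = 5.0
circle_perimeter = 2 * math.pi * radius
circle_area = math.pi * radius ** 2

# For any circle the ratio perimeter^2 / area equals 4*pi, independent of radius.
circle_compactness = compactness(circle_perimeter, circle_area)
print(round(circle_compactness, 3))
```

A circle gives the smallest possible value of this feature, so more irregular nuclei produce larger compactness values.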

Download the Dataset


5. Twitter sentiment Analysis Dataset:

Sentiment analysis is one of the most popular applications of natural language processing (NLP), and this dataset will help you build a sentiment analysis model. It is a text dataset, and with its help you can start building your first NLP model.

Structure of the dataset:

There are three main columns in this dataset:

  • ItemID – ID of the tweet
  • Sentiment – sentiment label
  • SentimentText – text of the tweet

Features

  • The data covers three tones: neutral, positive, and negative.
  • The format of the dataset is CSV (comma-separated values).
  • The dataset is divided into two parts: train.csv and test.csv.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • All you need to do is build your model using train.csv and evaluate it using test.csv.
  • Apart from the label, there are two data fields: ItemID (ID of the tweet) and SentimentText (text of the tweet).

Download the Dataset

6. MNIST dataset (handwritten data):

The MNIST dataset is built on images of handwritten digits and is one of the most popular image classification datasets in deep learning. It can be used for classical machine learning as well. It has 60,000 examples for training and 10,000 examples for model evaluation. The dataset is beginner-friendly and helps you understand pattern recognition techniques in deep learning on real-world data. It takes little time to preprocess, so beginners who are keen to learn deep learning or machine learning can start their first project with it.

Size: ~50 MB

Number of Records: 70,000 images in 10 classes (including train and test part)

Features

  • MNIST is one of the best datasets for understanding and learning ML techniques and pattern recognition methods in deep learning on real-world data.
  • The dataset consists of four files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.
  • The dataset is divided into two parts: a training set (the train-* files) and a test set (the t10k-* files).
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • All you need to do is build your model using the training set and evaluate it using the test set.
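The four .gz files use the IDX binary format rather than CSV. The following is a minimal sketch of reading that format with the standard library, assuming the commonly documented layout: a big-endian header (magic number 2051 for images, 2049 for labels) followed by raw unsigned bytes. The function names are made up for this example, and it runs on a tiny synthetic stand-in instead of the real files.

```python
import gzip
import struct


def parse_idx_images(raw):
    """Parse decompressed IDX3 bytes into (rows, cols, list of flat images)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 0x00000803: unsigned-byte data with 3 dimensions
        raise ValueError("not an IDX3 image file")
    size = rows * cols
    images = [raw[16 + i * size: 16 + (i + 1) * size] for i in range(count)]
    return rows, cols, images


def parse_idx_labels(raw):
    """Parse decompressed IDX1 bytes into a list of integer labels."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:  # 0x00000801: unsigned-byte data with 1 dimension
        raise ValueError("not an IDX1 label file")
    return list(raw[8:8 + count])


# Synthetic stand-in for the real files: two 2x2 "images" and two labels.
fake_images_gz = gzip.compress(struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8)))
fake_labels_gz = gzip.compress(struct.pack(">II", 2049, 2) + bytes([7, 3]))

rows, cols, images = parse_idx_images(gzip.decompress(fake_images_gz))
labels = parse_idx_labels(gzip.decompress(fake_labels_gz))
print(rows, cols, len(images), labels)
```

With the downloaded files, the bytes would instead come from something like gzip.open("train-images-idx3-ubyte.gz", "rb").read(), and each image would hold 28 × 28 pixel values.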

Download the Dataset


7. Fashion MNIST dataset:

Fashion MNIST is another widely used dataset, built on images of clothing. It can be used for deep learning image classification problems and for classical machine learning as well. It has 60,000 examples for training and 10,000 examples for model evaluation. Like MNIST, it is beginner-friendly, takes little time to preprocess, and is a good basis for a first deep learning or machine learning project. Fashion MNIST was created as a replacement for the MNIST dataset. All the images are in grayscale and belong to one of 10 classes.

Size: 30 MB

Number of Records: 70,000 images in 10 classes

Features

  • Fashion MNIST is one of the best datasets for understanding and learning ML techniques and pattern recognition methods in deep learning on real-world data.
  • The dataset is divided into two parts: a training set and a test set.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • All you need to do is build your model using the training set and evaluate it using the test set.
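The labels in Fashion MNIST are integers from 0 to 9, so a small lookup table is handy when inspecting predictions. The class names below follow the order given in the dataset's own documentation; the helper name label_name and the example predictions are made up for this sketch.

```python
# Class names in label order 0-9, as listed in the Fashion MNIST documentation.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_name(label):
    """Return the human-readable class name for an integer label."""
    return FASHION_MNIST_CLASSES[label]


# Hypothetical model outputs, used only to show the lookup.
predicted_labels = [9, 0, 6]
predicted_names = [label_name(label) for label in predicted_labels]
print(predicted_names)
```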

Download the Dataset


8.  Amazon review dataset:

The Amazon review dataset is also used for natural language processing. Sentiment analysis is one of the most popular applications of NLP, and this dataset will help you build a sentiment analysis model; you can start building your first NLP model with it. The dataset contains ratings, review text, helpfulness votes, and product metadata (descriptions, category information, price, brand, and image features), as well as links between products such as also-viewed and also-bought graphs. In total it contains 142.8 million reviews spanning May 1996 to July 2014. It gives you a feel for a real business problem and helps you understand sales trends over the years.

Features

  • The Amazon review dataset consists of Amazon product reviews.
  • It includes both product and user information, ratings, and review text.
  • Official Paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
  • The raw data also contains duplicate entries.

Download the Dataset

9. Spam SMS classifier dataset:

Detecting spam messages is an important task today. Data scientists therefore train models on labelled data so that the model can predict whether a message is spam, and this dataset will help you do that. A machine learning classification algorithm can be used to build the model, and the dataset is beginner-friendly and easy to understand. The Spam SMS classifier dataset is a set of labelled SMS messages collected for SMS spam analysis.

Features

  • The Spam SMS classifier dataset has 5,574 messages.
  • The messages are written in English.
  • Each line of the dataset contains one message.
  • The dataset has two columns: one holds the label (spam or not spam) and the other the raw text.
  • The dataset is in CSV format (comma-separated values).

Download the Dataset

10. Spam-Mails Dataset: 

Detecting spam mail is an important task today. Data scientists therefore train models on tagged data so that the model can predict whether a mail is spam, and this dataset will help you do that. A machine learning classification algorithm can be used to build the model, and the dataset is beginner-friendly and easy to understand. The Spam-Mails dataset is a set of tagged messages, put together from the following sources:

  • A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site. This is a UK forum where cell phone users make public claims about SMS spam messages. Most of them were receiving a huge number of spam messages every day. Identifying those spam messages was a hard and time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is http://www.grumbletext.co.uk/.
  • A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
  • A list of 450 SMS ham messages collected from Caroline Tag’s PhD Thesis.

Features

  • About 86% of the messages in the dataset are not spam.
  • The dataset does not come with a train and test division, so you need to split the data yourself.
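Since the data has to be split by hand and the classes are imbalanced, it usually helps to keep the spam/ham ratio the same in both parts. The sketch below shows one way to do such a stratified split with the standard library only; the function name, the 20% test fraction, the seed, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (label, text) rows into train/test, keeping the class ratio in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy rows mimicking the roughly 86% ham / 14% spam balance.
rows = [("ham", f"message {i}") for i in range(86)]
rows += [("spam", f"offer {i}") for i in range(14)]

train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```

Libraries such as scikit-learn offer ready-made helpers for the same task; the hand-written version is only meant to show the idea.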

Download the Dataset

11. Youtube Dataset: 

The YouTube video dataset is based on information about videos on YouTube and helps you build a video classification model with a machine learning algorithm. YouTube-8M is a video dataset which consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a vocabulary of visual entities and audio-visual features from billions of frames and audio segments. It is useful for learning both machine learning and computer vision. The dataset has improved-quality machine-generated labels and 6.1 million URLs, labelled with a vocabulary of 3,862 visual entities. All the videos are annotated with one or more labels (an average of 3 labels per video).

Features

  • This is a large-scale labelled dataset with high-quality machine-generated annotations.
  • The videos in this dataset are sampled uniformly.
  • Each video in the YouTube dataset is associated with at least one entity from the target vocabulary.
  • The vocabulary of the dataset is available in CSV format (comma-separated values).

Download the Dataset

12. CIFAR-10:

CIFAR-10 is also an image classification dataset, consisting of images of various objects. With its help, we can carry out many machine learning and deep learning experiments. CIFAR stands for Canadian Institute For Advanced Research. It is one of the most commonly used datasets for machine learning research. The CIFAR-10 dataset has 60,000 32×32 color images in 10 different classes. The classes are:

  1. aeroplanes
  2. cars
  3. birds
  4. cats
  5. deer
  6. dogs
  7. frogs
  8. horses
  9. ships
  10. trucks

Each class has 6,000 images. CIFAR-10 is used to train image recognition algorithms in deep learning, teaching a computer how to recognize objects. The 32×32 resolution of the images is low, which lets learners try out different algorithms in little time. The CIFAR-10 dataset is beginner-friendly as well, and it is a popular benchmark for convolutional neural networks.

Features:

  • CIFAR-10 is one of the best datasets for understanding and learning ML techniques and object recognition methods in deep learning on real-world data.
  • The dataset is divided into two parts: train and test.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • All you need to do is build your model using the train data and evaluate it using the test data.
  • In total, there are 50,000 training images and 10,000 test images.
  • The dataset is divided into 6 parts: 5 training batches and 1 test batch.
  • Each batch has 10,000 images.
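In the Python version of the dataset, each batch stores every image as one flat row of 3,072 values. The sketch below assumes the layout described on the dataset's download page (1,024 red values, then 1,024 green, then 1,024 blue, each plane in row-major order) and turns such a row into a 32×32 grid of RGB tuples. The function name and the synthetic row are made up for this example.

```python
def row_to_pixels(row):
    """Turn one 3072-value CIFAR-10 row into a 32x32 grid of (r, g, b) tuples."""
    if len(row) != 3 * 32 * 32:
        raise ValueError("expected 3 * 32 * 32 = 3072 values")
    # Assumed layout: one full colour plane after another, each in row-major order.
    red, green, blue = row[0:1024], row[1024:2048], row[2048:3072]
    return [
        [(red[y * 32 + x], green[y * 32 + x], blue[y * 32 + x]) for x in range(32)]
        for y in range(32)
    ]


# Synthetic single-colour image instead of real data: red=10, green=20, blue=30.
fake_row = bytes([10] * 1024 + [20] * 1024 + [30] * 1024)
pixels = row_to_pixels(fake_row)
print(len(pixels), len(pixels[0]), pixels[0][0])
```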

Size: 170 MB

Number of Records: 60,000 images in 10 classes

Download the Dataset

13.  IMDB reviews: 

The IMDB dataset is also known as the Large Movie Review Dataset. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the IMDB movie review dataset will help you build a sentiment analysis model. It contains highly polar movie reviews, each either positive or negative, and is often used for sentiment analysis with machine learning or deep learning algorithms. The dataset was prepared by Stanford researchers in 2011 and comes with a 50/50 split for training and testing, with 25,000 reviews in each part. An accuracy of 88.89% was reported on this dataset. IMDB data was also used for a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” from 2014 to early 2015, in which accuracy above 97% was achieved, with winners reaching 99%. IMDB is popular with movie lovers as well, and the dataset is mostly used for binary sentiment classification. In addition to the training and test review examples, there is further unlabeled data for use.

Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing

Features:

  • IMDB is one of the best datasets for understanding and learning ML techniques and deep learning methods on real-world data.
  • The dataset is divided into two parts: train and test.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • All you need to do is build your model using the train data and evaluate it using the test data.

Download the Dataset

14. Sentiment 140:

The Sentiment 140 dataset is built on Twitter data. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the Sentiment 140 dataset will help you build a sentiment analysis model. It is a text dataset that is beginner-friendly, and you can start building your first NLP project with it. Emoticons have been removed from the tweets in advance, and the data has six fields altogether:

  • polarity of the tweet
  • id of the tweet
  • date of the tweet
  • the query
  • username of the tweeter
  • text of the tweet

Features:

  • It has 1,600,000 tweets, which were extracted using the Twitter API.
  • The tweets are annotated with a polarity (0 = negative, 2 = neutral, 4 = positive).
  • These annotations are used to detect the sentiment of a particular tweet.

Fields in the dataset:

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: the id of the tweet (2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: The query (lyx). If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)
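The field list above maps directly onto a CSV reader. The sketch below assumes the file has no header row and that each line holds the six quoted fields in the order listed; the sample line is assembled from the example values given in the list, and the names FIELDS, POLARITY, and read_tweets are made up for illustration.

```python
import csv
import io

# Field order as given in the list above (assumed to match the file's column order).
FIELDS = ["target", "ids", "date", "flag", "user", "text"]
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(lines):
    """Yield one dict per tweet, adding a readable 'sentiment' key."""
    for row in csv.reader(lines):
        record = dict(zip(FIELDS, row))
        record["sentiment"] = POLARITY[int(record["target"])]
        yield record


# One sample line built from the example values in the field list.
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)
tweets = list(read_tweets(sample))
print(tweets[0]["user"], tweets[0]["sentiment"])
```

For the downloaded file, an opened file object can be passed in place of the StringIO sample; the right text encoding for the real file is something to check, as it is not stated here.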

Size: 80 MB (Compressed)

Number of Records: 1,600,000 tweets

Download the Dataset

15. Facial image Dataset:

The facial image dataset consists of face images of both male and female subjects. Machine learning and deep learning algorithms can be trained on it to detect gender and emotion. The data varies in background, scale, and facial expression.

Information about the dataset:

  • Total number of individuals: 395
  • Number of images per individual: 20
  • Total number of images: 7900
  • Gender:  contains images of male and female subjects
  • Race:  contains images of people of various racial origins
  • Age Range:  the images are mainly of first year undergraduate  students, so the majority of individuals are between 18-20 years old but some older individuals are also present.

Features

  • The dataset has four directories.
  • You can download the version of the dataset that suits your system requirements and needs.
  • Every version of the data is available as a zipped archive.
  • There are 395 individuals in total, each with 20 images.
  • The images have a resolution of 180 × 200 pixels and are stored in 24-bit RGB JPEG format.

Download the Dataset


16. Red Wine Quality Dataset:

The red wine quality dataset is popular and interesting for machine learning and deep learning enthusiasts. It is beginner-friendly, and you can easily apply machine learning algorithms to it. With this dataset you can train a model to predict wine quality from the wine's physicochemical properties. Both regression and classification approaches can be used. The data relate to red and white variants of the Portuguese “Vinho Verde” wine. Because of privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

Information about input variables based on physicochemical tests:

1 – Fixed acidity
2 – Volatile acidity
3 – Citric acid
4 – Residual sugar
5 – Chlorides
6 – Free sulfur dioxide
7 – Total sulfur dioxide
8 – Density
9 – pH
10 – Sulphates
11 – Alcohol

Output variable (based on sensory data):

12 – Quality (score between 0 and 10)

Features

  • There are two types of variables in the dataset: input and output variables.
  • Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, and so forth.
  • The output variable is quality.
  • There are 12 attributes, and the attribute characteristics are real.
  • The white wine file has 4,898 records; the red wine file has 1,599.
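Because the output is a 0 to 10 score, the same data can feed a regression model directly or be turned into classes first. The sketch below shows the second option by binning the score into three ordered labels. The thresholds (4 and 7) and the label names are illustrative assumptions, not part of the dataset, and the scores are made-up examples.

```python
def quality_to_class(score, low=4, high=7):
    """Map a 0-10 quality score to an ordered class label.

    The thresholds are illustrative assumptions, not part of the dataset.
    """
    if not 0 <= score <= 10:
        raise ValueError("quality score must be between 0 and 10")
    if score <= low:
        return "poor"
    if score >= high:
        return "excellent"
    return "normal"


# Hypothetical scores, used only to show the mapping.
scores = [5, 6, 7, 3, 5, 8]
classes = [quality_to_class(score) for score in scores]
print(classes)
```

Given the class imbalance mentioned above, most wines would be expected to land in the middle class with thresholds like these.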

Download the Dataset


17. The Wikipedia corpus:

The Wikipedia corpus consists of Wikipedia data only. It is a collection of the full text of Wikipedia and contains almost 1.9 billion words from more than 4 million articles. It is mainly used for natural language processing. It is a very powerful dataset, and you can search it by word, phrase, or part of a paragraph.

Size: 20 MB

Number of Records: 4,400,000 articles containing 1.9 billion words

Features

  • This is a large-scale dataset that can be used for machine learning and natural language processing.
  • Because the dataset is so big, it provides plenty of text to train a model on.
  • It has 4,400,000 articles containing 1.9 billion words

Download the Dataset

18. Free Spoken digit dataset:

The Free Spoken Digit Dataset is a simple audio (speech) dataset consisting of recordings of spoken English digits. The files are in WAV format at 8 kHz. All the recordings are trimmed to have near-minimal silence at the beginning and end. The dataset was created to solve the task of identifying spoken digits in audio. Importantly, the dataset is open, so anyone can contribute to the repository, and it is expected to grow over time.

Characteristics of the Dataset:

  • 4 speakers
  • 2,000 recordings (50 of each digit per speaker)
  • English pronunciations

File name format: {digitLabel}_{speakerName}_{index}.wav (example: 7_jackson_32.wav)
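Since the label is encoded in the file name, it can be recovered without opening the audio at all. The sketch below parses names that follow the pattern above; the regular expression, the Recording tuple, and the function name are choices made for this example rather than anything shipped with the dataset.

```python
import re
from collections import namedtuple

Recording = namedtuple("Recording", ["digit", "speaker", "index"])

# {digitLabel}_{speakerName}_{index}.wav, with a single digit 0-9 as the label.
_NAME_PATTERN = re.compile(r"^(\d)_([^_]+)_(\d+)\.wav$")


def parse_recording_name(file_name):
    """Extract the digit label, speaker name, and index from a file name."""
    match = _NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    digit, speaker, index = match.groups()
    return Recording(int(digit), speaker, int(index))


recording = parse_recording_name("7_jackson_32.wav")
print(recording)
```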

Features:

  • Open source
  • Helps with the problem of recognizing spoken digits
  • Allows anyone to contribute

Download the Dataset

19. Boston House price dataset: 

The Boston House price dataset was collected by the U.S. Census Service and concerns housing in the area of Boston, Mass. It is used to predict house prices from a few attributes, which makes it a machine learning regression problem. The dataset has 506 cases in total.

Columns in the dataset:

  • crim – per capita crime rate by town.
  • zn – proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus – proportion of non-retail business acres per town.
  • chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox – nitrogen oxides concentration (parts per 10 million).
  • rm – average number of rooms per dwelling.
  • age – proportion of owner-occupied units built prior to 1940.
  • dis – weighted mean of distances to five Boston employment centres.
  • rad – index of accessibility to radial highways.
  • tax – full-value property-tax rate per $10,000.
  • ptratio – pupil-teacher ratio by town.
  • black – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat – lower status of the population (percent).
  • medv – median value of owner-occupied homes in $1000s.

Features:

  • The dataset has 506 cases in total.
  • Each case has 14 attributes, such as CRIM, AGE, TAX, and so forth.
  • The format of the dataset is CSV (comma-separated values).
  • The dataset suits machine learning regression problems.

Download the Dataset

20. Pima Indian Diabetes dataset:

Artificial Intelligence is now widely used in the healthcare and medical industry as well. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes is one of the most common and dangerous diseases, and it is becoming increasingly widespread. It is a chronic condition in which the body either does not produce enough insulin or develops a resistance to it; insulin is the hormone that lets cells take up glucose from the blood. Diabetes affects many people worldwide and comes in two main forms, Type 1 and Type 2, which have different characteristics. The Pima Indian Diabetes dataset is used to predict diabetes based on certain diagnostic measurements. A machine learning model built on it can help both society and patients by detecting diabetes quickly. This is one of the best datasets for building a diabetes prediction model. All patients here are females at least 21 years old of Pima Indian heritage. There are a total of nine columns in the dataset:

  1. Pregnancies
  2. Glucose
  3. Blood pressure
  4. Skin thickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age
  9. Outcome

Features:

  • The format of the dataset is CSV (comma-separated values).
  • All of the patients in this dataset are female and at least 21 years old.
  • The dataset has several predictor variables, such as number of pregnancies, BMI, insulin level, and age, and one target variable.
  • It has a total of 768 rows and 9 columns.
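With eight predictor columns and one target column, the first step in most workflows is to separate the two. The sketch below does this with the standard library; it assumes a header row with the target column named Outcome and single-word column names as in the commonly distributed CSV, and the two data rows are made-up values, not records from the dataset.

```python
import csv
import io

TARGET = "Outcome"  # target column name as listed above


def load_features_and_target(lines):
    """Read CSV rows and separate the predictor columns from the target."""
    features, targets = [], []
    for row in csv.DictReader(lines):
        targets.append(int(row.pop(TARGET)))
        features.append({name: float(value) for name, value in row.items()})
    return features, targets


# Hypothetical rows for illustration only; the header spelling is an assumption.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,100,28.5,0.45,33,0\n"
    "5,160,80,35,150,34.0,0.60,51,1\n"
)
X, y = load_features_and_target(sample_csv)
print(len(X), len(X[0]), y)
```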

Download the Dataset

A dataset is the base and first step in building a machine learning application. Datasets are available in different formats such as .txt, .csv, and many more. For supervised machine learning, a labelled training dataset is used, as the labels act as a supervisor for the model. For unsupervised learning algorithms, no labels are required: the unsupervised model learns structure from the data itself rather than from labels.

Please read the full article to understand which dataset is preferable for your machine learning algorithm.

I hope this article helps you thoroughly understand the best 20 freely available datasets.

For free upskilling courses on Machine Learning and data science, visit GL Academy.

Happy Learning!
