Machine learning becomes engaging when we face varied challenges, so finding datasets relevant to the use case is essential. A dataset is characterised by its flexibility and size. Flexibility refers to the number of tasks it supports. For example, Microsoft’s COCO (Common Objects in Context) is used for object classification, detection, and segmentation; add captions for the same images, and it serves as a dataset for an image caption generator as well. That’s the power of a robust dataset. When we are just starting out, we will typically work with small, standard machine learning datasets such as CIFAR-10, MNIST, and Iris. These datasets come preloaded in many libraries, including Keras and scikit-learn, and can be loaded in a few lines of code.
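As a quick sketch of how little code this takes, here is the Iris dataset loaded via scikit-learn (it ships with the library, so no download is needed) and split into train and test sets:

```python
# Load a small standard dataset bundled with scikit-learn; no download needed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target          # 150 samples, 4 features, 3 classes

# Hold out 20% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)     # (120, 4) (30, 4)
```

The Keras loaders (`tensorflow.keras.datasets`) follow a similar one-call pattern for CIFAR-10 and MNIST, though those download the data on first use.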
Machine Learning: Important Dataset Sources
Let us begin by finding machine learning datasets that are problem-specific, and hopefully cleaned and pre-processed.
It surely is a strenuous task to find specific datasets like MS-COCO for every variety of problem, so we need to be intelligent about how we use the datasets that do exist. Wikipedia, for example, is probably the best corpus there is for many NLP tasks. In this article, we discuss some of the major sources of machine learning datasets and how to proceed with them. A word of caution: read the terms and conditions that each of these datasets imposes, and follow them accordingly. This is in everyone’s best interest.
1. Google Dataset Search
Google, the search engine giant, has helped all the ML practitioners out there by doing what it is best at: helping us find things, in this case datasets. Google Dataset Search does a fabulous job of surfacing datasets related to our keywords from various sources, including government websites, Kaggle, and other open-source repositories.
2. .gov Datasets:
With the United States, China, and many other countries becoming AI superpowers, data is being democratised. The rules and regulations attached to these datasets are usually stringent, since they contain actual data collected from various sectors of a nation, so cautious use is recommended. Below are some of the countries that openly share their datasets:
Indian Government Dataset
Australian Government Dataset
EU Open Data Portal
New Zealand’s Government Dataset
Singapore Government Dataset
3. Kaggle
Kaggle is known for hosting machine learning and deep learning challenges. Its relevance here is that it provides datasets and, at the same time, a community of learners and ML practitioners whose work can guide our own progress. Each challenge has a specific dataset, usually already cleaned, so we can skip much of the tedious cleaning work and focus on refining the algorithm. The datasets are easily downloadable. Under the resources section of a challenge, there are prerequisites and links to learning material, which help whenever we are stuck with either the algorithm or the implementation. Kaggle is a fantastic website for beginners venturing into applications of machine learning and deep learning, and a detailed resource pool for intermediate practitioners.
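Downloads can also be scripted. A minimal sketch, assuming the official `kaggle` CLI is installed (`pip install kaggle`) and an API token is configured in `~/.kaggle/kaggle.json`; the dataset slug below is purely illustrative:

```python
# Build the official Kaggle CLI invocation for downloading a public dataset.
# Assumes `pip install kaggle` and a configured ~/.kaggle/kaggle.json token.
import subprocess

def kaggle_download_cmd(slug, out_dir="data"):
    """Return the CLI command to download and unzip a dataset by slug."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", out_dir, "--unzip"]

cmd = kaggle_download_cmd("uciml/iris")   # illustrative slug
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once the CLI is set up
```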
4. Amazon Datasets (Registry of Open Data on AWS)
Amazon has made some of the datasets hosted on its servers publicly accessible. When using AWS resources to calibrate and tweak models, using these datasets, which are already available within AWS, can speed up data loading considerably. The registry contains several datasets classified by field of application, such as satellite imagery, ecological resources, and more.
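Many of the registry’s buckets are readable without credentials. A sketch of pulling one down with the AWS CLI, assuming it is installed; the bucket path here is a hypothetical placeholder, not a real registry entry:

```python
# Build an unauthenticated `aws s3 sync` command for a public dataset bucket.
# Public buckets in the registry can be read with --no-sign-request.
import subprocess

def s3_sync_cmd(bucket_path, local_dir):
    """Return the CLI command to mirror a public S3 prefix locally."""
    return ["aws", "s3", "sync", f"s3://{bucket_path}", local_dir,
            "--no-sign-request"]

cmd = s3_sync_cmd("example-open-data/landsat", "landsat_local")  # placeholder path
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to run the actual sync
```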
5. UCI Machine Learning Repository
The UCI Machine Learning Repository provides easy-to-use, cleaned datasets. These have long been the go-to datasets in academia.
An exciting feature of this website is that it lists the papers that used each dataset, so research scientists and people from academia will find this resource handy. The datasets available cannot be used for commercial purposes; for details, check the webpage of each dataset.
6. Reddit
The datasets subreddit can be used as a secondary guide when all other options lead nowhere. People there discuss the various available datasets and how to repurpose existing datasets for new tasks, and a lot of insight into the tweaking needed to make a dataset work in a different environment can be picked up as well. Overall, treat this as the last resource point for datasets.
Datasets for other applications
Let’s focus on datasets specific to the major domains that have seen accelerated progress in the last two decades: computer vision, NLP, and data analytics. Having domain-specific datasets available enhances the robustness of the model, making more realistic and accurate results possible.
Computer Vision Datasets
There are several computer vision datasets available, and the choice depends on the level of competence we are working at. The datasets preloaded in Keras and scikit-learn are sufficient for learning, experimenting, and implementing new models. Their downside is that models can easily overfit them because of their low complexity. Intermediate ML practitioners and organisations solving specific problems can therefore refer to various sources:
https://computervisiononline.com/datasets: A variety of resources and datasets are available on this website. It lists most of the open-source datasets and redirects the user to each dataset’s webpage. The datasets can be used for classification, detection, segmentation, image captioning, and many more challenging tasks.
http://riemenschneider.hayko.at/vision/dataset/: This website lists almost all the available datasets and makes it easy to find relevant ones by allowing searches over the tags associated with each dataset. We highly recommend that our readers try this website out.
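As a concrete example of the preloaded, low-complexity datasets mentioned above, scikit-learn bundles a tiny image dataset of 8×8 handwritten digits that loads instantly:

```python
# A bundled low-complexity vision dataset: 8x8 grayscale handwritten digits.
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8): 1797 tiny grayscale images
print(digits.target[:5])     # integer class labels, 0 through 9
```

Datasets this small are great for checking that a pipeline runs end to end, but as noted above, models overfit them easily, so results on them say little about real-world performance.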
Natural Language Processing
NLP is growing at a phenomenal pace, and language modelling has recently had its ImageNet moment, wherein people can start building applications with state-of-the-art conversational NLP agents. Many NLP scenarios, such as sentiment analysis, audio processing, and translation, require task-specific, catered datasets, so it is necessary to have a comprehensive list of datasets:
https://github.com/niderhoff/nlp-datasets: The majority of the datasets in the domain are listed in this GitHub repository.
https://www.figure-eight.com/data-for-everyone/: The datasets on this website are cleaned and provide a vast database to choose from. The appealing, easy-to-use interface makes this a highly recommended choice.
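Many of the datasets listed on these pages ship as plain CSV files, for example a text column plus a sentiment label. A minimal sketch of loading such a file with Python’s standard library; the inline sample here is a stand-in for a real downloaded file:

```python
# Parse a sentiment-style CSV (text, label) with the standard library.
import csv
import io

# In practice this would be open("dataset.csv"); an inline sample stands in.
sample = io.StringIO(
    "text,sentiment\n"
    '"great movie, loved it",positive\n'
    '"terribly boring",negative\n'
)
rows = list(csv.DictReader(sample))
print(len(rows), rows[0]["sentiment"])  # 2 positive
```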
Statistics and Data Science
Data science covers a range of tasks, including building recommendation engines, predicting parameters from data (such as time-series data), and doing exploratory and analytical research. Small organisations and individual practitioners don’t have what the big giants have, namely the data, so open datasets such as these are a huge boon: they let us build models that reflect real data, not simulated data.
http://rs.io/100-interesting-data-sets-for-statistics/: There are various datasets available for specific tasks, and it’s a wonderful resource point.
http://deeplearning.net/datasets/: These are benchmark datasets and can be used for comparing the results of the model built with the benchmark results.
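As a small illustration of the time-series prediction task mentioned above, here is a naive moving-average baseline; the numbers are toy values, purely illustrative:

```python
# A naive time-series baseline: forecast the next value as the mean
# of the last k observations (a simple moving average).
def moving_average_forecast(series, k=3):
    """Predict the next point as the mean of the trailing window."""
    window = series[-k:]
    return sum(window) / len(window)

demand = [10, 12, 13, 12, 15, 16]            # toy monthly demand figures
print(moving_average_forecast(demand, k=3))  # (12 + 15 + 16) / 3
```

Baselines like this are exactly what benchmark datasets are for: a real model should beat them, and the benchmark results tell us by how much.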
This is an exhaustive list of datasets for machine learning, analytics, and other applications. We wish you the best of luck while implementing models. Also, we hope you come up with models that can match the benchmark results.
If you are interested in learning machine learning concepts and pursuing a career in the domain, upskill with Great Learning’s PG Program in Artificial Intelligence and Machine Learning.