Public Datasets for Machine Learning Prototypes

Publicly available datasets offer several advantages for machine learning projects, especially for individuals or organizations that are just starting out with machine learning or working on specific research topics.

Here are some of the common reasons why starting with a publicly available dataset makes sense:

Cost-Effectiveness:

Collecting and annotating data can be expensive and time-consuming. Publicly available datasets can be free or relatively inexpensive, making them a cost-effective way to start a project.

Quality and Reliability:

Public datasets often come from reputable sources and have undergone some level of quality control. As a result, many are clean, well-organized, and annotated, which can save considerable time and make the training process more reliable.

Benchmarking:

Public datasets often serve as benchmarks in the machine learning community. They allow for the comparison of different algorithms and models on a common ground, making the evaluation of model performance more objective and comparable.
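As a minimal sketch of what comparing on common ground can look like, the snippet below scores two deliberately simple classifiers on the same fixed held-out split. The handful of labeled reviews, the keyword list, and the function names are all hypothetical stand-ins written for illustration; with a real public benchmark you would use its published train/test split instead.

```python
from collections import Counter

# Hypothetical, hand-written examples standing in for a benchmark's
# published train/test split (not taken from any real dataset).
TRAIN = [
    ("great product, works well", "pos"),
    ("terrible, broke after a day", "neg"),
    ("love it", "pos"),
    ("awful support experience", "neg"),
    ("really great value", "pos"),
]
TEST = [
    ("great service", "pos"),
    ("awful and terrible", "neg"),
    ("love the design", "pos"),
    ("it broke", "neg"),
]

def majority_baseline(train):
    """Return a model that always predicts the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: most_common

def keyword_model(negative_words):
    """Return a model that predicts 'neg' if any listed word appears."""
    def predict(text):
        words = set(text.lower().replace(",", " ").split())
        return "neg" if words & negative_words else "pos"
    return predict

def accuracy(model, test):
    """Fraction of test examples the model labels correctly."""
    correct = sum(1 for text, label in test if model(text) == label)
    return correct / len(test)

# Every model is evaluated on the identical TEST split, so scores are comparable.
models = {
    "majority": majority_baseline(TRAIN),
    "keywords": keyword_model({"terrible", "awful", "broke"}),
}
scores = {name: accuracy(model, TEST) for name, model in models.items()}
print(scores)
```

The point of the design is that the test split is fixed and shared: any new model added to `models` is scored against exactly the same examples as the baseline.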

Speed:

Having access to pre-collected and pre-processed data can significantly speed up the development and experimentation process, allowing for quicker iterations and faster progress in developing machine learning models.

Ethical Considerations:

Some publicly available datasets have been collected and shared with consent and under ethical guidelines, which is crucial for responsible AI development.

Legal Compliance:

Public datasets often come with clear licensing terms, making it easier to ensure compliance with legal and regulatory requirements related to data usage.

Exploratory Analysis:

Public datasets can also be used for exploratory analysis to understand the data landscape of a particular domain before investing in data collection.
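A first exploratory pass often just means counting rows and checking how complete each column is. The sketch below does that with Python's standard library only. The function name `profile_csv` and the inline sample (with its column names) are hypothetical and are not taken from any of the datasets listed in this post.

```python
import csv
import io
from collections import Counter

def profile_csv(handle):
    """Summarize a CSV: row count, plus missing and distinct counts per column."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = Counter()
    distinct = {name: set() for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Short rows yield None for absent fields; treat them as empty.
            value = (row.get(name) or "").strip()
            if value:
                distinct[name].add(value)
            else:
                missing[name] += 1
    return {
        "rows": rows,
        "columns": {
            name: {"missing": missing[name], "distinct": len(distinct[name])}
            for name in columns
        },
    }

# Hypothetical sample data, used only so the sketch runs on its own.
SAMPLE = """respondent_id,rating,comment
1,5,Clean terminal
2,,Long security line
3,4,
4,5,Clean terminal
"""

summary = profile_csv(io.StringIO(SAMPLE))
print(summary["rows"])
for name, stats in summary["columns"].items():
    print(name, stats)
```

To profile a downloaded file instead, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at your local copy.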

Proof of Concept:

They can be used to create proofs of concept to showcase the potential of machine learning applications to stakeholders before investing more resources.

Sites to explore when starting your dataset search:

Below are some example dataset links, grouped by a few customer operations categories:

Customer Survey or Support Histories

SFO Customer Survey: Yearly surveys conducted by San Francisco International Airport. The collection site, DataSF, also hosts a variety of other data.

Bitext Synthetic Customer Support Dataset: Free synthetic data from Bitext.

Conversation logs from TripAdvisor: Travel-related customer service data from four sources.

Live call datasets: These are hard to come by as most are privately owned. See also this.

Product Reviews

Amazon Commerce Reviews: Dataset derived from customer reviews on the Amazon Commerce website.

Online Product Reviews: List of over 71,045 reviews from 1,000 different products.

Sentiment Analysis Study Datasets: Some of these require approval to access; see the bottom of the linked page.

Ecommerce Transactions

UK-based online retailer, UK-based online retailer II: These online retail datasets contain all the transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011 (dates in day/month/year order, that is, 1 December 2009 to 9 December 2011).

iMerit Summary of Retail, Sales, and Ecommerce Datasets: If you’re interested in training an ML model using retail datasets, then this iMerit-compiled list is a great place to start.

Data.gov: A large catalog of US Government datasets. Search for "Ecommerce" or keywords related to your area of interest.
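Transaction data such as the UK-based online retailer datasets above is usually aggregated before modeling. The sketch below totals revenue per calendar month using only the standard library. The column names (`InvoiceNo`, `Quantity`, `InvoiceDate`, `UnitPrice`), the day/month/year timestamp format, and the sample rows are assumptions made for illustration; check the header and date format of the file you actually download and adjust accordingly.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

def monthly_revenue(handle, date_format="%d/%m/%Y %H:%M"):
    """Sum Quantity * UnitPrice per calendar month.

    Column names and date_format are assumptions, not a confirmed schema.
    Rows that are missing a column or fail to parse are counted and skipped.
    """
    totals = defaultdict(float)
    skipped = 0
    for row in csv.DictReader(handle):
        try:
            when = datetime.strptime(row["InvoiceDate"], date_format)
            amount = int(row["Quantity"]) * float(row["UnitPrice"])
        except (KeyError, TypeError, ValueError):
            skipped += 1
            continue
        totals[when.strftime("%Y-%m")] += amount
    return dict(totals), skipped

# Hypothetical rows, written only so the sketch runs on its own.
SAMPLE = """InvoiceNo,Quantity,InvoiceDate,UnitPrice
A1,2,01/12/2009 08:30,2.50
A2,1,15/12/2009 10:00,4.00
A3,3,09/12/2011 12:15,1.00
A4,x,09/12/2011 12:20,1.00
"""

totals, skipped = monthly_revenue(io.StringIO(SAMPLE))
print(totals, skipped)
```

Passing the date format explicitly avoids silently reading day/month dates as month/day, and returning the skipped count makes it visible when the assumed schema does not match the real file.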

Product Search

Appen Pre-labeled Datasets: Large library of for-sale datasets spanning many data formats, use cases, and languages.

Datafiniti: Large datasets for product search, including from Amazon and Best Buy.

Email Corpora

Enron Emails: There appear to be very few public email corpora due to privacy concerns. Although the Enron case study does not pertain specifically to customer operations, the dataset may be useful in prototyping email-related use cases.
