Obtain Datasets¶

Validation of the model is always performed against specific data combined into datasets. To obtain trustworthy results, a dataset must satisfy the following requirements:

Format: the data should be compatible with the model domain (Computer Vision or Natural Language Processing).
Content: a dataset must be representative. The data needs to be aligned with the model use case (for example, images of people for face detection).
Size: a dataset should contain a sufficient number of items (100+).
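The size requirement can be checked before a dataset is imported. The following is a minimal sketch, not part of DL Workbench: the helper names, the list of image extensions, and the assumption that the dataset is a plain directory of image files are all illustrative.

```python
from pathlib import Path

# Assumption: these extensions are illustrative, not an official list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}
# The "100+" guideline from the requirements above.
MIN_ITEMS = 100


def count_images(dataset_dir):
    """Count files with a known image extension under dataset_dir, recursively."""
    root = Path(dataset_dir)
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def is_large_enough(dataset_dir, min_items=MIN_ITEMS):
    """Return True if the directory holds at least min_items images."""
    return count_images(dataset_dir) >= min_items
```

For example, `is_large_enough("my_dataset")` returns `False` for a folder with fewer than 100 images (`my_dataset` is a hypothetical path). The check covers only size; whether the data is representative still has to be judged against the model use case.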

Note

You can use Datumaro to make the process of creating your dataset easier. Datumaro is a free framework and CLI tool for building, transforming, and analyzing datasets and annotations.

Image Dataset¶

Image datasets can be either Annotated or Not Annotated:

A Not Annotated dataset contains only images. It lets you use most of the DL Workbench features, such as measuring performance and optimizing and visualizing the model.
An Annotated dataset contains images together with information about what each image shows. It expands what you can do with a model: you can measure accuracy and optimize the model within a controllable accuracy drop.

Text Dataset¶

A text dataset should be represented as a table in CSV/TSV format. For the Text Classification use case, the table needs at least two columns: Text and Label. The Textual Entailment task requires a CSV table with three columns: Premise, Hypothesis, and Label. HuggingFace’s datasets library provides access to a variety of text datasets.
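The column layout described above can be verified with a short script. This is a minimal sketch under stated assumptions: the function name and task keys are hypothetical, and it assumes the file has a header row whose column names match the ones listed here exactly, which the text does not require.

```python
import csv

# Assumption: header names mirror the column names used in the text above.
REQUIRED_COLUMNS = {
    "text_classification": ["Text", "Label"],
    "textual_entailment": ["Premise", "Hypothesis", "Label"],
}


def missing_columns(table_path, task, delimiter=","):
    """Return the required columns that are absent from the table header."""
    with open(table_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS[task] if name not in header]
```

An empty list means every expected column is present. For a TSV file, pass `delimiter="\t"`.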
