Obtain Datasets
Validation of the model is always performed against specific data combined into datasets. To obtain trustworthy results, a dataset must satisfy the following requirements:
Format: the data should be compatible with the model domain (Computer Vision or Natural Language Processing).
Content: a dataset must be representative. The data needs to be aligned with the model use case (for example, images of people for face detection).
Size: a dataset should contain a sufficient number of items (100+).
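The size requirement can be checked before uploading. The following is a minimal sketch, not part of DL Workbench: the helper names, the list of image extensions, and the default threshold of 100 are illustrative assumptions.

```python
from pathlib import Path

# Assumption: these extensions are illustrative, not an official list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(dataset_dir):
    """Count files under dataset_dir whose extension looks like an image."""
    root = Path(dataset_dir)
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def has_enough_items(dataset_dir, minimum=100):
    """Return True if the directory holds at least `minimum` images."""
    return count_images(dataset_dir) >= minimum
```

Calling `has_enough_items("my_dataset")` on a local folder (a hypothetical path) returns `False` when fewer than 100 images are found, which is a hint to collect or augment more data.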
Note
You can use Datumaro to make the process of creating your dataset easier. Datumaro is a free framework and CLI tool for building, transforming, and analyzing datasets and annotations.
Image Dataset
Image datasets can be either Annotated or Not Annotated:
A Not Annotated dataset contains only images. It supports most of the DL Workbench features, such as measuring performance, optimizing the model, and visualizing the model.
An Annotated dataset contains images and information about what each image shows. It expands what you can do with a model: you can measure accuracy and optimize the model within a controllable accuracy drop.
Upload Not Annotated Dataset
You can upload your images to create a not annotated dataset. If you do not have enough images or want to enlarge the dataset, use augmentation methods to increase the size of the dataset by generating modified image copies.
On the Create Project page, go to the Select a Validation Dataset tab and click Import Image Dataset.
The Create Dataset page opens, where you can add your own images and specify the dataset name.
After you click Import, you are redirected to the Create Project page where you can check the import status.
Not Annotated Dataset Augmentation
Apply different augmentation types to create variations of your images and improve the model performance. Extending your validation dataset also helps to avoid possible overfitting of a calibrated model. Augmentation methods include different image modifications, such as horizontal and vertical flips, random erase, noise injection, and color transformations.
Horizontal Flip
A horizontal flip mirrors the image left to right by reversing the order of its pixel columns. It usually does not change the object category.
Vertical Flip
A vertical flip mirrors the image top to bottom by reversing the order of its pixel rows. Apply this method with the selected images and model task in mind to avoid recognition issues.
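Both flips can be sketched with NumPy slicing. This is an illustration of the general technique, not the DL Workbench implementation; the function names are hypothetical, and the image is assumed to be an array with rows on axis 0 and columns on axis 1.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror left to right by reversing the column order (axis 1)."""
    return image[:, ::-1]


def vertical_flip(image):
    """Mirror top to bottom by reversing the row order (axis 0)."""
    return image[::-1, :]


# Tiny 2x3 single-channel "image" so the effect is easy to see.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])
flipped_h = horizontal_flip(image)  # [[3, 2, 1], [6, 5, 4]]
flipped_v = vertical_flip(image)    # [[4, 5, 6], [1, 2, 3]]
```

The same slicing works for color images stored as height x width x channels, because the channel axis is left untouched.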
Random Erase
Random Erase selects a random rectangular region of the image and replaces its pixels with random values. Note that this augmentation method might erase an object that is particularly important for your use case, so apply it with the selected images and model task in mind.
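A minimal sketch of the idea, assuming an 8-bit NumPy image. The function name and the `max_fraction` parameter are hypothetical; how DL Workbench chooses the rectangle size is not stated here.

```python
import numpy as np


def random_erase(image, max_fraction=0.3, rng=None):
    """Return a copy with one random rectangle overwritten by random values.

    max_fraction is a hypothetical knob: the largest share of each side
    that the rectangle may cover. The image is assumed to be uint8.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    # Pick the rectangle size (at least 1 pixel per side), then its position.
    erase_h = int(rng.integers(1, max(2, int(height * max_fraction) + 1)))
    erase_w = int(rng.integers(1, max(2, int(width * max_fraction) + 1)))
    top = int(rng.integers(0, height - erase_h + 1))
    left = int(rng.integers(0, width - erase_w + 1))

    erased = image.copy()
    patch_shape = erased[top:top + erase_h, left:left + erase_w].shape
    erased[top:top + erase_h, left:left + erase_w] = rng.integers(
        0, 256, size=patch_shape, dtype=np.uint8
    )
    return erased
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, makes the erased region reproducible, which helps when checking whether an important object was removed.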
Noise Injection
Noise Injection adds a matrix of random values to the image. The noise appears as random black and white pixels spread across the image. This method helps to avoid overfitting when your model concentrates on image patterns that occur frequently but may not be useful.
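Random black and white pixels of this kind are often called salt-and-pepper noise. The sketch below shows one way to produce them with NumPy for an 8-bit image; the function name and the `amount` parameter are hypothetical and do not correspond to a DL Workbench setting.

```python
import numpy as np


def inject_noise(image, amount=0.05, rng=None):
    """Return a copy where roughly `amount` of the pixels become black or white.

    `amount` is a hypothetical parameter. The image is assumed to be uint8.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    # One random value per pixel position decides what happens to that pixel.
    noise = rng.random(image.shape[:2])
    noisy[noise < amount / 2] = 0        # "pepper": black pixels
    noisy[noise > 1 - amount / 2] = 255  # "salt": white pixels
    return noisy
```

Because the mask has one entry per pixel position, all color channels of a selected pixel are set together, so the noise stays black or white rather than colored.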
Color Transformations
Color Transformations change the brightness and contrast of the image. You can select one or several presets with changed parameters. A brightness preset makes the augmented image lighter (+20%) or darker (-20%). Contrast is the degree to which light and dark colors in the image differ. A contrast preset makes the contrast of the augmented image higher (+20%) or lower (-20%).
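The exact formulas behind the presets are not given here. The sketch below shows one common interpretation for an 8-bit image: brightness scales every pixel value, and contrast scales each pixel's distance from the mean intensity. The function names and the `change` parameter are hypothetical.

```python
import numpy as np


def adjust_brightness(image, change=0.2):
    """Scale pixel values by (1 + change): 0.2 is +20%, -0.2 is -20%."""
    scaled = image.astype(np.float32) * (1.0 + change)
    # Round, keep values in the 8-bit range, and convert back to uint8.
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


def adjust_contrast(image, change=0.2):
    """Scale each pixel's distance from the mean intensity by (1 + change)."""
    pixels = image.astype(np.float32)
    mean = pixels.mean()
    scaled = (pixels - mean) * (1.0 + change) + mean
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
```

For example, with pixel values 100 and 200, `adjust_brightness(image, 0.2)` gives 120 and 240, while `adjust_contrast(image, 0.2)` pushes them apart around the mean of 150 to 90 and 210.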
After clicking Import, you are redirected to the Create Project page where you can check the import status. To remove an imported dataset from the list, click the bin icon in the Action column.
All images were taken from ImageNet, Pascal Visual Object Classes, and Common Objects in Context datasets for demonstration purposes only.
Upload Annotated Dataset
NOTE: Sample datasets must consist of a small sampling of images and be in ImageNet, Pascal Visual Object Classes (Pascal VOC), Common Objects in Context (COCO), Common Semantic Segmentation, Labeled Faces in the Wild (LFW), Visual Geometry Group Face 2 (VGGFace2), Wider Face, Open Images, Cityscapes or unannotated format. To learn more about the formats, refer to Dataset Types.
On the Create Project page, go to the Select a Validation Dataset tab and click Import Image Dataset.
To import a new dataset, click the Upload Dataset tab. Upload a .zip or .tar.gz archive with your dataset and specify the Dataset Name and Dataset Type.
Click Import. You are automatically directed back to the datasets table, where you can see the status of the import and select a dataset by clicking on it.
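The .zip or .tar.gz archive for the upload step can be created with the Python standard library. This is a minimal sketch with a hypothetical helper name; the folder layout inside the archive must still follow the rules of the chosen dataset format, which this code does not check.

```python
import shutil
from pathlib import Path


def pack_dataset(dataset_dir, archive_format="zip"):
    """Pack dataset_dir into <dataset_dir>.zip or <dataset_dir>.tar.gz.

    Returns the path of the created archive, placed next to the folder.
    """
    if archive_format not in ("zip", "gztar"):
        raise ValueError("use 'zip' for .zip or 'gztar' for .tar.gz")
    source = Path(dataset_dir).resolve()
    # base_name is given without an extension; make_archive appends it.
    return shutil.make_archive(str(source), archive_format, root_dir=str(source))
```

Calling `pack_dataset("my_dataset", "gztar")` on a local folder (a hypothetical path) produces `my_dataset.tar.gz` with the folder contents at the top level of the archive.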
Text Dataset
For the Text Classification use case, a text dataset should be a table in CSV/TSV format with at least two columns: Text and Label. The Textual Entailment task requires a CSV table with three columns: Premise, Hypothesis, and Label. HuggingFace’s datasets library provides access to different text datasets.
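The two table layouts can be produced with Python's `csv` module. The sketch below uses a hypothetical helper name, and the sample rows and label values are made up purely for illustration.

```python
import csv


def write_text_dataset(path, rows, header, delimiter=","):
    """Write rows to a CSV file, or to a TSV file if delimiter is a tab."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)


# Text classification: two columns (Text, Label). Made-up sample rows.
classification_rows = [
    ("The movie was wonderful.", "positive"),
    ("I would not recommend it.", "negative"),
]

# Textual entailment: three columns (Premise, Hypothesis, Label).
entailment_rows = [
    ("A man is playing a guitar.", "A person makes music.", "entailment"),
    ("A man is playing a guitar.", "The man is asleep.", "contradiction"),
]
```

For example, `write_text_dataset("entailment.csv", entailment_rows, ["Premise", "Hypothesis", "Label"])` writes a three-column file with a header row. The `csv` writer quotes any text that contains the delimiter, so commas inside sentences do not break the columns.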
Upload Text Dataset
Select the dataset file in CSV/TSV format. If you do not have a text dataset, download one of the files from the Text Dataset Tip. Alternatively, open a dataset from HuggingFace’s datasets library, find the Update on GitHub button, and download the dataset from the repository.
Specify the dataset name.
Specify file encoding. If you see incorrect symbols in the Raw Dataset Preview, try to select another encoding.
Select separator. If the file is incorrectly split into columns, try to select another separator.
Specify whether the file has a header so that it is excluded from the dataset.
Select the task type.
Textual entailment is the task of deciding whether the meaning of the Hypothesis (second text) can be inferred from the Premise (first text). The entailment relation is specified by the Label. Use the textual entailment pattern for classification tasks that require two texts as input. Specify the numbers of the dataset columns that contain Premise, Hypothesis, and Label.
Text classification is the task of assigning a sentence or document an appropriate category (also called label or class). Specify the numbers of the dataset columns that contain Text and Label for classification.
Click Import and proceed to create a project.
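The import settings above (encoding, separator, header, and column numbers) map directly onto how a CSV/TSV file is parsed. The sketch below mirrors those choices with Python's `csv` module so you can check a file locally before uploading. The function name is hypothetical, and the zero-based column numbering is an assumption of this sketch that may differ from the numbering used in the import form.

```python
import csv


def load_text_dataset(path, columns, encoding="utf-8", separator=",",
                      has_header=True):
    """Read the selected columns of each row as a tuple.

    columns: zero-based column indexes, e.g. (0, 1) for Text and Label,
    or (0, 1, 2) for Premise, Hypothesis, and Label.
    """
    with open(path, newline="", encoding=encoding) as handle:
        rows = list(csv.reader(handle, delimiter=separator))
    if has_header:
        rows = rows[1:]  # the header row is not part of the dataset
    return [tuple(row[index] for index in columns) for row in rows]
```

If the returned tuples contain garbled characters, the encoding is probably wrong. If every row comes back as a single long string or raises an `IndexError`, the separator is probably wrong.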