recognite.data package

class recognite.data.DataFrameDataset(df: DataFrame, label_key: str, image_key: str, label_to_int: Dict[str, int], transform: Callable | None = None)

A dataset based on a pandas DataFrame.

The provided DataFrame contains the path of each image and the corresponding label.

Parameters:

df – The DataFrame containing the image paths and labels.
label_key – The column in the DataFrame that contains the label of each image.
image_key – The column in the DataFrame that contains the image path of each image.
label_to_int – A dictionary that maps the label to a unique integer.
transform – A transform to apply to the image before returning it.

df: The DataFrame with the image paths and labels.

label_key: The column in the DataFrame that contains the label of each image.

image_key: The column in the DataFrame that contains the image path of each image.

label_to_int: The dictionary that maps the label to a unique integer.

transform: The transform applied to each image before returning it.

unique_labels: The unique labels present in this dataset.

__init__(df: DataFrame, label_key: str, image_key: str, label_to_int: Dict[str, int], transform: Callable | None = None)
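As a rough illustration of how the label_to_int and transform arguments fit together, the sketch below builds a label-to-integer mapping and wraps it in a toy dataset. It uses a plain list of dicts instead of a DataFrame to stay dependency-free. ToyDataset, the sample rows, and the (item, integer label) return value are assumptions made for this example, not the recognite implementation.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical rows standing in for a DataFrame with "image" and "label" columns.
rows = [
    {"image": "img/a1.jpg", "label": "cat"},
    {"image": "img/b1.jpg", "label": "dog"},
    {"image": "img/a2.jpg", "label": "cat"},
]

# Map each distinct label to a unique integer (sorted so the mapping is reproducible).
unique_labels = sorted({row["label"] for row in rows})
label_to_int: Dict[str, int] = {label: i for i, label in enumerate(unique_labels)}


class ToyDataset:
    """Hypothetical stand-in for illustration; not the recognite class."""

    def __init__(
        self,
        rows: List[dict],
        label_key: str,
        image_key: str,
        label_to_int: Dict[str, int],
        transform: Optional[Callable] = None,
    ):
        self.rows = rows
        self.label_key = label_key
        self.image_key = image_key
        self.label_to_int = label_to_int
        self.transform = transform

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        row = self.rows[idx]
        # A real dataset would load the image from this path; the sketch keeps the path.
        item = row[self.image_key]
        if self.transform is not None:
            item = self.transform(item)
        # Assumption: the label is returned as its integer index.
        return item, self.label_to_int[row[self.label_key]]


# str.upper plays the role of an image transform here.
dataset = ToyDataset(rows, "label", "image", label_to_int, transform=str.upper)
print(len(dataset), dataset[1])
```

Building the mapping from the sorted unique labels keeps the integer assignment stable across runs, which matters when the same mapping is reused for several datasets.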

recognite.data.train_val_datasets(data_csv_file: str, image_key: str = 'image', label_key: str = 'label', num_folds: int = 5, val_fold: int = 0, fold_seed: int = 0, num_refs: int = 1, ref_seed: int = 0, tfm_train: Callable | None = None, tfm_val: Callable | None = None) → Tuple[DataFrameDataset, DataFrameDataset, DataFrameDataset]

Creates training and validation datasets for recognition training.

The datasets are created based on a CSV file that contains all image paths with a corresponding label. The images are split up into a training and a validation subset. The data split is based on labels, so no label will occur both in the training and in the validation subset. We do this to reflect a real-world recognition scenario where the labels used during training typically do not occur during inference, emphasizing the need for generalizable embeddings.

For the train-validation split, we use a fold-based approach that allows for easy K-fold cross-validation. How the dataset is split is determined by num_folds (the number of folds), val_fold (which fold to use for validation) and fold_seed (the seed of the random generator that shuffles the labels before creating the folds). For example, with num_folds = 5, each value of val_fold from 0 to 4 yields a different validation set and a different training set. Training and evaluating on each of these five train-validation combinations amounts to 5-fold cross-validation.

The validation set is further split into a query and a gallery set. These sets contain the same labels but different images. The gallery set provides references with which the images in the query set should be compared. This is again to reflect a real-world scenario where a gallery of labeled references is used to identify a given query.

The query-gallery split is determined by num_refs (the number of references per label) and ref_seed (the seed for the random generator that shuffles the data before splitting).

Parameters:

data_csv_file – The path of the CSV file containing the images and corresponding labels.
image_key – The column in the CSV file that contains the image path of each image.
label_key – The column in the CSV file that contains the label of each image.
num_folds – The number of folds to use for splitting the dataset into training and validation. Note that the folds are label-based, not sample-based.
val_fold – The index of the fold to use for validation. The others will be used for training.
fold_seed – The random seed to use for k-fold splitting.
num_refs – The number of references per class in the gallery.
ref_seed – The seed of the random generator used for randomly choosing the gallery reference images.
tfm_train – The transform to apply to the training images.
tfm_val – The transform to apply to the validation images.

Returns:

The training dataset, the gallery dataset and the query dataset.
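Before calling this function it can help to check the layout of the CSV file. The sketch below assumes the default column names image and label and uses made-up file contents. It counts the images per label and lists the labels that have num_refs or fewer images. Under the query-gallery scheme described above, such a label would have no images left for the query set; how the library treats that case is not stated here. The helper images_per_label is hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content using the default column names "image" and "label".
csv_text = """image,label
img/a1.jpg,cat
img/a2.jpg,cat
img/b1.jpg,dog
img/c1.jpg,owl
img/c2.jpg,owl
img/c3.jpg,owl
"""


def images_per_label(csv_file, label_key="label"):
    """Count how many rows each label has in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_key] for row in reader)


# io.StringIO stands in for an opened file such as open("data.csv", newline="").
counts = images_per_label(io.StringIO(csv_text))

# Labels with num_refs or fewer images leave nothing over for the query set.
num_refs = 1
too_small = sorted(label for label, n in counts.items() if n <= num_refs)

print(dict(counts))
print(too_small)
```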

recognite.data.split_gallery_query(df: DataFrame, num_refs: int = 1, seed: int = 0, label_key: str = 'label') → Tuple[DataFrame, DataFrame]

Splits a DataFrame into a gallery and a query subset.

We randomly select a fixed number of samples per label to compose the gallery set. The other samples are put in the query set.

Parameters:

df – The DataFrame to split.
num_refs – The number of samples per label to use for the gallery.
seed – The seed of the random generator used for choosing the gallery samples.
label_key – The name of the column that contains the labels of the samples.

Returns:

A tuple with the gallery and query DataFrame.
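The selection described above can be sketched without pandas as follows. This is a minimal illustration on a list of dicts, not the library's implementation: the function name split_gallery_query_sketch, the use of random.Random, and the handling of labels with fewer than num_refs samples (all of them go to the gallery) are assumptions.

```python
import random
from collections import defaultdict


def split_gallery_query_sketch(rows, num_refs=1, seed=0, label_key="label"):
    """Pick num_refs random rows per label for the gallery; the rest are queries."""
    rng = random.Random(seed)

    # Group row indices by label.
    by_label = defaultdict(list)
    for idx, row in enumerate(rows):
        by_label[row[label_key]].append(idx)

    gallery_idx = set()
    # Iterate labels in sorted order so the seed fully determines the result.
    for label in sorted(by_label):
        indices = by_label[label]
        # Assumption: a label with fewer than num_refs rows contributes all of them.
        gallery_idx.update(rng.sample(indices, min(num_refs, len(indices))))

    gallery = [row for i, row in enumerate(rows) if i in gallery_idx]
    query = [row for i, row in enumerate(rows) if i not in gallery_idx]
    return gallery, query


# Hypothetical data: two labels with three images each.
rows = [
    {"image": f"{label}_{i}.jpg", "label": label}
    for label in ("cat", "dog")
    for i in range(3)
]
gallery, query = split_gallery_query_sketch(rows, num_refs=1, seed=0)
print(len(gallery), len(query))
```

With one reference per label, the gallery holds one image of each label and the query set holds the remaining images, so both sets cover the same labels without sharing any image.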

recognite.data.k_fold_trainval_split(df: DataFrame, num_folds: int = 5, val_fold: int = 0, seed: int = 0, label_key='label') → Tuple[DataFrame, DataFrame]

Splits the given DataFrame into a train and validation subset.

The subsets are composed by shuffling the labels in the DataFrame, with seed as the random seed. The labels are then split into num_folds folds, where each label belongs to exactly one fold. We then choose one fold for validation (as given by val_fold) and use the other num_folds - 1 folds for training.

Parameters:

df – The DataFrame to split.
num_folds – The number of folds to use for splitting the dataset into training and validation. Note that the folds are label-based, not sample-based.
val_fold – The index of the fold to use for validation. The others will be used for training.
seed – The random seed to use for k-fold splitting.
label_key – The column in the DataFrame that contains the label of each image.

Returns:

A tuple with the training DataFrame and the validation DataFrame.
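A label-based fold split of this kind can be sketched in plain Python as shown below. It is an illustration under stated assumptions rather than the library's code: the function name k_fold_trainval_split_sketch is hypothetical, rows are a list of dicts instead of a DataFrame, and assigning the label at shuffled position i to fold i % num_folds is only one possible way to form the folds.

```python
import random


def k_fold_trainval_split_sketch(rows, num_folds=5, val_fold=0, seed=0, label_key="label"):
    """Split rows into train and validation parts that share no labels."""
    # Shuffle the distinct labels reproducibly.
    labels = sorted({row[label_key] for row in rows})
    random.Random(seed).shuffle(labels)

    # Assumption: the label at shuffled position i belongs to fold i % num_folds.
    val_labels = set(labels[val_fold::num_folds])

    train = [row for row in rows if row[label_key] not in val_labels]
    val = [row for row in rows if row[label_key] in val_labels]
    return train, val


# Hypothetical data: ten labels with two images each.
rows = [
    {"image": f"id{k}_{i}.jpg", "label": f"id{k}"}
    for k in range(10)
    for i in range(2)
]

# Cycling val_fold over all folds gives a 5-fold cross-validation.
seen_val_labels = []
for val_fold in range(5):
    train, val = k_fold_trainval_split_sketch(rows, num_folds=5, val_fold=val_fold)
    train_labels = {row["label"] for row in train}
    val_labels = {row["label"] for row in val}
    # No label occurs in both subsets.
    assert train_labels.isdisjoint(val_labels)
    seen_val_labels.extend(sorted(val_labels))

print(len(seen_val_labels))
```

Because the split is over labels rather than samples, every image of a given label ends up on the same side of the split, and each label is used for validation in exactly one of the folds.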