阿掖山

Intellectual activity is a way of life. https://mountaye.github.io/blog/

.py | PyTorch data processing wrapper

The data structures and processing steps of general supervised learning, and how PyTorch encapsulates them.

Data Structures and Processing in General Supervised Learning

training set, validation set, test set

All of the data together constitutes one large set. Each element of this set contains an input and a target, denoted x and y respectively.

This large set is divided into three mutually disjoint subsets: the training set, the validation set, and the test set.

  • The training set and validation set are used during training.
When data from the training set is fed into the model, the model is in training mode: the derivatives of the model output with respect to the parameters are recorded. The model's parameters are updated according to the loss obtained by substituting the model output and the training target y into the loss function. At the same time, the result of substituting the model output and target y into the validation function is recorded, for later comparison against the validation set.
When data from the validation set is fed into the model, the model is in evaluation mode: it only computes the output from the input, and no derivatives with respect to the parameters are recorded. Whether training has fallen into overfitting is judged from the result of substituting the model output and target y into the validation function: a model can be considered overfit when the validation-function result on the training set keeps decreasing while the result on the validation set barely changes. (A sketch of the two modes follows this list.)
  • The test set is used after training is complete; when it is fed into the model, the model is in evaluation mode. It is used to evaluate the quality of the training result.
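
A minimal sketch of the two modes, assuming model, loss_function, metric, train_loader, and validation_loader already exist:

import torch

model.train()                        # training mode: derivatives are recorded
for x, y in train_loader:
    loss = loss_function(model(x), y)
    # ... backpropagate and update the parameters (see the next section) ...

model.eval()                         # evaluation mode: forward pass only
with torch.no_grad():                # no derivatives are recorded
    for x, y in validation_loader:
        score = metric(model(x), y)  # watch this value for overfitting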

epoch vs. batch

If all the data were used for training at once, the space required would generally exceed the computer's memory. The training set is therefore randomly divided into several batches, and the data of one batch is fed into the model together. Within a batch, the derivatives of each model output with respect to the parameters are accumulated; once the whole batch is done, the model parameters are updated and the derivatives are reset to zero. Because the concept of a batch is tied to memory, the batch size is generally chosen to be a power of 2.

Running through all batches of the training set once is called an epoch. A training run typically takes many epochs, until the loss function result is low enough, or the validation set shows overfitting.
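
Put together, the batch-wise update described above looks roughly like the following sketch; the tiny model, optimizer, and stand-in batches are assumptions for illustration:

import torch

model = torch.nn.Linear(5, 2)                     # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_function = torch.nn.MSELoss()
batches = [(torch.rand(8, 5), torch.rand(8, 2))]  # stand-in batches, batch size 8

for epoch in range(100):
    for x_batch, y_batch in batches:
        optimizer.zero_grad()              # reset the accumulated derivatives
        loss = loss_function(model(x_batch), y_batch)
        loss.backward()                    # derivatives accumulated over the batch
        optimizer.step()                   # one parameter update per batch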

How PyTorch Encapsulates the Above Structures and Processing

Dataset

As mentioned earlier, a data set consists of two parts: inputs and targets. The role of Dataset and its subclasses is to bundle the two together into a single indexable object.

If, before being loaded into the Dataset, the data already consists of two regular tensors —

import torch
from torch.utils.data import TensorDataset

# ... (train_x and train_y are two tensors with the same first dimension)

for x, y in zip(train_x, train_y):
    ...  # do something with x and y

trainset = TensorDataset(train_x, train_y)
for x, y in trainset:
    ...  # do something with x and y

— then this step, by itself, accomplishes very little.

The interesting part is that you can write a dataset class yourself: inherit from Dataset, then override the __getitem__() and __len__() methods. That way you can stuff data that does not fit naturally into tensors into a Dataset, and, when learning from images, add data augmentation steps here for further use in a DataLoader, as in the sketch below.
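
For instance, a minimal sketch of such a custom dataset; the image paths, labels, and the torchvision transform are all hypothetical:

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageFileDataset(Dataset):
    # hypothetical example: image files on disk + integer labels
    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)  # augmentation happens on the fly
        return image, self.labels[idx]

# an augmentation pipeline, applied inside __getitem__
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
trainset = ImageFileDataset(["a.png", "b.png"], [0, 1], transform=augment)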

DataLoader

DataLoader = Dataset + Sampler. In the typical example, the data set only needs a simple random split into batches, so only parameters such as batch_size are used and an explicit Sampler rarely appears; a sketch of this common case follows.
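
A minimal sketch of that common case, with a throwaway TensorDataset standing in for real data:

import torch
from torch.utils.data import DataLoader, TensorDataset

trainset = TensorDataset(torch.rand(100, 5), torch.rand(100, 2))
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)  # shuffle + batch, no explicit Sampler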

The most common case where one does need a sampler is WeightedRandomSampler. When training a classifier, one category sometimes has far less data than the others, which makes that category harder for the model to learn (because blindly ruling it out already yields a good accuracy), so the weights of the different groups need to be balanced.

from torch.utils.data import WeightedRandomSampler

list(WeightedRandomSampler(weights=[0.1, 0.9, 0.4, 0.7, 3.0, 0.6], num_samples=5, replacement=True))
# [4, 4, 1, 4, 5]
list(WeightedRandomSampler(weights=[0.9, 0.4, 0.5, 0.2, 0.3, 0.1], num_samples=5, replacement=False))
# [0, 1, 4, 3, 2]
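
Note that weights is one weight per sample, not per class. A minimal sketch of deriving per-sample weights from integer class labels (the labels tensor here is an assumption):

import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 1, 0])     # assumed class labels
class_counts = torch.bincount(labels)         # samples per class: tensor([5, 1])
class_weights = 1.0 / class_counts.float()    # the rare class gets a large weight
sample_weights = class_weights[labels]        # one weight per sample

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)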

After balancing, the sampled indices can be grouped into batches with BatchSampler:

from torch.utils.data import BatchSampler

list(BatchSampler(WeightedRandomSampler(weights=[0.1, 0.9, 0.4, 0.7, 3.0, 0.6], num_samples=5, replacement=True), batch_size=2, drop_last=False))
# [[4, 4], [1, 4], [5]]
list(BatchSampler(WeightedRandomSampler(weights=[0.1, 0.9, 0.4, 0.7, 3.0, 0.6], num_samples=5, replacement=True), batch_size=2, drop_last=True))
# [[0, 1], [4, 3]]

Summary

import torch
from torch.utils import data

train_x = torch.rand((100, 5))
train_y = torch.rand((100, 2))
trainset = data.TensorDataset(train_x, train_y)

weights = torch.rand(100)  # one weight per sample in the dataset

# either: give DataLoader a sampler and let it do the batching
trainloader = data.DataLoader(trainset,
                              batch_size=2,
                              drop_last=True,
                              sampler=data.WeightedRandomSampler(weights=weights,
                                                                 num_samples=100,
                                                                 replacement=True))
# or: hand DataLoader a ready-made batch sampler
trainloader = data.DataLoader(trainset,
                              batch_sampler=data.BatchSampler(
                                  data.WeightedRandomSampler(weights=weights,
                                                             num_samples=100,
                                                             replacement=True),
                                  batch_size=2,
                                  drop_last=True))

for epoch in range(100):
    for x, y in trainloader:
        train(model, x, y, loss_function)

Note that in for x, y in trainset, x and y have the dimensions of a single sample; in the simplest case they are vectors of dimensions P and Q. In for x, y in trainloader, they are instead matrices of dimensions (B, P) and (B, Q), where B is batch_size. The computation inside the train() function has to take this extra dimension into account.
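
As a quick check of these shapes, continuing the summary code above:

x, y = trainset[0]
print(x.shape, y.shape)        # torch.Size([5]) torch.Size([2])       -> (P,), (Q,)

x, y = next(iter(trainloader))
print(x.shape, y.shape)        # torch.Size([2, 5]) torch.Size([2, 2]) -> (B, P), (B, Q)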

CC BY-NC-ND 2.0
