Image classification is the task of assigning an input image one label from a fixed set of categories. Many other seemingly distinct computer vision tasks (such as object detection and segmentation) can be reduced to image classification.
Challenges
- Viewpoint variation: A single instance of an object can be oriented in many ways with respect to the camera.
- Scale variation: Visual classes often exhibit variation in their size (size in the real world, not only in terms of their extent in the image)
- Deformation: Many objects of interest are not rigid bodies and can be deformed in extreme ways.
- Occlusion: The objects of interest can be occluded. Sometimes only a small portion of an object (as little as a few pixels) is visible.
- Illumination conditions: The effects of illumination are drastic on the pixel level.
- Background clutter: The objects of interest may blend into their environment, making them hard to identify.
- Intra-class variation: The classes of interest can often be relatively broad, such as "chair". There are many different types of these objects, each with its own appearance.
Data-driven approach: provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class.
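Image classification pipeline: Input (a training set of labeled images) -> Learning (fit a classifier to the training set) -> Evaluation (predict labels for images the classifier has never seen and compare them with the true labels)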
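The three pipeline stages can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the notes do not name a classifier, so a nearest-neighbor rule is swapped in here, the function names `train`, `predict` and `evaluate` are hypothetical, and the "images" are synthetic 16-pixel vectors rather than real data.

```python
import numpy as np

def train(images, labels):
    # Learning: a nearest-neighbor classifier simply memorizes the training set.
    return images.astype(np.float64), labels

def predict(model, images):
    train_x, train_y = model
    # Squared L2 distance between every query image and every training image.
    diffs = images[:, None, :].astype(np.float64) - train_x[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Each query gets the label of its closest training image.
    return train_y[np.argmin(dists, axis=1)]

def evaluate(model, images, labels):
    # Evaluation: fraction of held-out images whose predicted label is correct.
    return float(np.mean(predict(model, images) == labels))

# Input: synthetic data (an assumption for the demo), two well-separated classes.
rng = np.random.default_rng(0)
dark = rng.integers(0, 50, size=(50, 16))       # class 0: dark images
bright = rng.integers(200, 256, size=(50, 16))  # class 1: bright images
x = np.vstack([dark, bright])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out 20 images that the classifier never sees in training.
perm = rng.permutation(len(x))
x, y = x[perm], y[perm]
x_train, y_train = x[:80], y[:80]
x_test, y_test = x[80:], y[80:]

model = train(x_train, y_train)
accuracy = evaluate(model, x_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The toy classes are separated far enough that the nearest neighbor is always from the right class; real images are nowhere near this easy.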
Cross-validation: In practice, people often avoid cross-validation in favor of a single validation split, since cross-validation can be computationally expensive. A typical split uses between 50% and 90% of the training data for training and the rest for validation. However, this depends on multiple factors. For example, if the number of hyperparameters is large, you may prefer a bigger validation split. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. Typical choices in practice are 3-fold, 5-fold or 10-fold cross-validation.
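A minimal sketch of k-fold cross-validation in NumPy, to make the procedure concrete. The helper names (`k_fold_indices`, `cross_validate`) and the majority-class "classifier" are hypothetical stand-ins chosen to keep the example short; any `fit`/`score` pair with the same shape could be plugged in.

```python
import numpy as np

def k_fold_indices(num_examples, num_folds, seed=0):
    # Shuffle the example indices once and cut them into near-equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_examples), num_folds)

def cross_validate(x, y, num_folds, fit, score):
    folds = k_fold_indices(len(x), num_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        # Fold i is the validation set; all remaining folds form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(x[train_idx], y[train_idx])
        scores.append(score(model, x[val_idx], y[val_idx]))
    # Average the validation score over all folds.
    return float(np.mean(scores))

# Hypothetical stand-in classifier: always predict the most common training label.
def fit_majority(x_train, y_train):
    return int(np.bincount(y_train).argmax())

def score_majority(model, x_val, y_val):
    return float(np.mean(y_val == model))

# Synthetic data (assumption): 100 examples, 60 of class 0 and 40 of class 1.
x = np.zeros((100, 4))
y = np.array([0] * 60 + [1] * 40)

mean_accuracy = cross_validate(x, y, num_folds=5, fit=fit_majority, score=score_majority)
print(f"5-fold mean accuracy: {mean_accuracy:.2f}")
```

To tune a hyperparameter, one would run `cross_validate` once per candidate value and keep the value with the best mean score. The cost grows with the number of folds, which is the expense the paragraph above refers to.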
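Pixel-wise distances (such as the L1 or L2 distance between raw pixel values) do not correspond at all to perceptual or semantic similarity. For example, shifting an image by a single pixel can move it far away in pixel space even though it looks the same to a human.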
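A small constructed example makes the mismatch visible. The images below are synthetic (an assumption for illustration): an 8x8 striped pattern, the same pattern shifted by one pixel, and a flat gray image that shares none of the structure.

```python
import numpy as np

def l2_distance(a, b):
    # Euclidean distance between two images, computed on raw pixel values.
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt(np.sum(diff ** 2)))

# 8x8 image of one-pixel-wide vertical stripes (columns alternate 0, 255).
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (8, 4))

# The same pattern shifted right by one pixel: perceptually the same texture.
shifted = np.roll(stripes, 1, axis=1)

# A uniform gray image: no stripes at all.
gray = np.full((8, 8), 128, dtype=np.uint8)

d_shifted = l2_distance(stripes, shifted)
d_gray = l2_distance(stripes, gray)
print(f"stripes vs shifted stripes: {d_shifted:.1f}")
print(f"stripes vs flat gray:       {d_gray:.1f}")
```

The shifted copy ends up about twice as far from the original as the featureless gray image, so a nearest-neighbor rule on raw pixels would consider the gray image the better match.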
If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA or even Random Projections.
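Both techniques can be sketched in plain NumPy. This is a minimal illustration, not a tuned implementation: `pca_reduce` and `random_projection` are hypothetical helper names, PCA is computed through an SVD of the mean-centered data, the random projection uses a Gaussian matrix, and the input is random data standing in for flattened images.

```python
import numpy as np

def pca_reduce(x, num_components):
    # Center the data so the principal directions describe variance around the mean.
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_components]
    # Project every example onto the leading directions.
    return centered @ components.T, components, mean

def random_projection(x, num_components, seed=0):
    # Gaussian random matrix; the 1/sqrt(k) scaling roughly preserves distances on average.
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((x.shape[1], num_components)) / np.sqrt(num_components)
    return x @ projection

# Synthetic stand-in for 100 flattened 16x16 grayscale images (assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 256))

x_pca, components, mean = pca_reduce(x, num_components=16)
x_rp = random_projection(x, num_components=16)
print(x_pca.shape, x_rp.shape)
```

PCA needs a pass over the training data to find the directions, and the same `components` and `mean` must then be applied to validation and test images. A random projection needs no fitting, only the same seed.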