Recently, OpenAI introduced a new multimodal neural network called CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs collected from the internet. It has zero-shot learning capabilities akin to the GPT-2 and GPT-3 language models, and it can perform image classification based on natural language.
“We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pretraining, including optical character recognition (OCR), geo-localization, action recognition, and many others.” - a quote from the OpenAI paper.
CLIP performs well across a wide variety of image classification tasks. However, it struggles with specialized domain tasks such as pulmonary nodule classification or plant pathogen detection.
Why is it so great?
Currently, the computer vision field faces many challenges with dataset labeling (which is costly and labor-intensive) and with adapting models to different tasks.
CLIP by OpenAI aims to solve those challenges.
The new model architecture allows training on detailed natural-language descriptions rather than single-word labels. In this particular case, the authors took the descriptions from internet image captions.
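The idea behind this training setup is a contrastive objective: for a batch of paired image and caption embeddings, the model maximizes the similarity of matching pairs while pushing apart mismatched ones, using a symmetric cross-entropy over the similarity matrix. The sketch below illustrates this scoring step in plain NumPy with random vectors standing in for the encoders' outputs; the shapes and temperature value are illustrative assumptions, not CLIP's training code.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by a temperature
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy in both directions; the correct "class" for row i is column i
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(l)), np.arange(len(l))].mean()

    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Mocked embeddings: aligned pairs should give a lower loss than random ones
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 512))
aligned_texts = images + 0.1 * rng.normal(size=(8, 512))
print(clip_contrastive_loss(images, aligned_texts))
```

The symmetric form (image-to-text plus text-to-image) means both encoders receive a learning signal from every batch.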
Thanks to “zero-shot” capabilities similar to those of GPT-2 and GPT-3, CLIP can be applied to any visual classification benchmark. The only thing you need to do is provide the names of the visual categories.
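Concretely, zero-shot classification works by embedding a prompt such as "a photo of a {label}" for each category and picking the class whose text embedding is most similar to the image embedding. The sketch below shows just this scoring step with mocked embedding vectors; in real use, the vectors would come from CLIP's image and text encoders, and the label set and temperature here are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is closest to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )

    # Cosine similarity of the image against every class-prompt embedding
    sims = class_text_embs @ image_emb

    # Softmax turns similarities into class probabilities
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

labels = ["shark", "dolphin", "bicycle"]
# Mocked embeddings: the image vector is made to resemble the "shark" prompt
rng = np.random.default_rng(1)
class_embs = rng.normal(size=(3, 16))
image_emb = class_embs[0] + 0.1 * rng.normal(size=16)
best, probs = zero_shot_classify(image_emb, class_embs, labels)
print(best)
```

Because only the text prompts change between benchmarks, the same trained model can be pointed at a new label set without any fine-tuning.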
To run the model, go to the repository and explore how zero-shot prediction with CLIP works in practice by running the notebook.
For the purpose of this quick tour, we tried to classify an image of a shark that is only partially visible, using the class names from CIFAR-100.