Multi-label classification is a challenging task because it requires identifying many kinds of objects at different scales. Global image features may discard information about small objects, whereas many studies have shown that attention mechanisms improve feature extraction and that label relations capture label co-occurrence; both benefit multi-label classification. In this work, we extract attended features from an image with a Transformer while simultaneously modeling label co-occurrence. We then use the attended features to generate a classifier that is applied in the semantic space to predict the labels. Experiments validate the effectiveness of the proposed method.
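To make the notion of label co-occurrence concrete, the following is a minimal sketch of one common way to estimate it: the conditional frequency with which label j appears given that label i appears, computed from binary training annotations. The abstract does not state how co-occurrence is modeled, so this estimator, the function name `cooccurrence_matrix`, and the toy labels are all illustrative assumptions rather than the proposed method.

```python
from typing import List


def cooccurrence_matrix(labels: List[List[int]]) -> List[List[float]]:
    """Estimate P(label j present | label i present) from binary label vectors.

    Illustrative assumption: co-occurrence is taken as a conditional
    frequency over the annotations; the paper may define it differently.
    """
    num_labels = len(labels[0])
    counts = [[0] * num_labels for _ in range(num_labels)]

    # Count, for every pair (i, j), the images in which both labels appear.
    # The diagonal counts[i][i] ends up holding the number of images with label i.
    for row in labels:
        present = [k for k, value in enumerate(row) if value]
        for i in present:
            for j in present:
                counts[i][j] += 1

    # Normalize each row by the occurrence count of label i.
    probs = []
    for i in range(num_labels):
        n_i = counts[i][i]
        probs.append(
            [counts[i][j] / n_i if n_i else 0.0 for j in range(num_labels)]
        )
    return probs


# Hypothetical toy annotations; columns are [person, dog, frisbee].
toy_labels = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
]
cooc = cooccurrence_matrix(toy_labels)
```

In this toy data every image containing a frisbee also contains a dog, so `cooc[2][1]` is 1.0, while `cooc[1][2]` is lower because dogs also appear without frisbees. This asymmetry is the kind of label relation a classifier can exploit alongside attended image features.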