This thesis uses VGG19 as the backbone model and adds a Squeeze-and-Excitation network together with a PatchNet branch to form a two-branch network. The Squeeze-and-Excitation module mainly strengthens channel-wise features, while PatchNet is designed to extract different features from different regions of the image via patches, such as part-level features of an object (ears, nose, mouth, body, and so on). These patch features are then combined with the features extracted by VGG19 through a bilinear operation, which relates the features from the two branches to each other, so that a more complete representation of the whole object can be extracted, rather than features of only the head or only the body. Finally, we evaluate the method on the AwA2, CUB, and SUN datasets. Although only the AwA2 results are comparable to, and slightly exceed, those reported in other papers, our main goal is to improve the completeness of object features; at the end of the thesis we visualise the extracted features with heat maps to confirm that the proposed method is effective.
We propose a new PatchNet structure for zero-shot learning (ZSL). In addition to the global features extracted by VGG19, the PatchNet features are intended to capture the overall region of interest. These two sets of features are fused via a bilinear operation, and the fused image feature is mapped to the semantic space by a fully connected layer. The structure adopts only a single cross-entropy loss, so it is easy to train. According to our experiments, this method extracts more complete features than well-known backbones do on some images, and on one dataset (AwA2) it is competitive with other state-of-the-art methods.
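The bilinear fusion of the two branches can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the thesis implementation: the channel counts, spatial size, and the signed-square-root/L2 post-processing (common in bilinear-pooling work) are placeholders chosen for the example, and `bilinear_fusion` is a hypothetical helper name.

```python
import numpy as np

def bilinear_fusion(feat_a, feat_b):
    """Fuse two branch feature maps via bilinear (outer-product) pooling.

    feat_a: array of shape (c1, h*w), e.g. a global-branch (VGG19) feature map
    feat_b: array of shape (c2, h*w), e.g. a patch-branch feature map
    Returns a 1-D fused descriptor of length c1 * c2.
    """
    assert feat_a.shape[1] == feat_b.shape[1], "spatial sizes must match"
    n = feat_a.shape[1]
    # Bilinear pooling: average of outer products over spatial locations,
    # relating every channel of one branch to every channel of the other.
    fused = feat_a @ feat_b.T / n            # shape (c1, c2)
    vec = fused.reshape(-1)                  # flatten to a single vector
    # Post-processing often used with bilinear features (an assumption here):
    # signed square root followed by L2 normalisation.
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Toy example: branches with 4 and 3 channels over a 5x5 spatial grid.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 25))
b = rng.standard_normal((3, 25))
z = bilinear_fusion(a, b)
print(z.shape)
```

In a full model, `z` would then pass through a fully connected layer into the semantic space, and training would use the cross-entropy loss described above.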