Memes are widely used to spread ideas through social networks. Although most memes are created for humor, some become hateful through the combination of image and text. Automatically detecting hateful memes can help reduce their harmful social impact. Compared with conventional multimodal tasks, in which visual and textual information is semantically aligned, hateful meme detection is more challenging because the image and text in a meme are weakly aligned or even unrelated. The model must therefore understand the content deeply and reason across modalities. This paper focuses on multimodal hateful meme detection and proposes a novel method that incorporates image captioning into the meme detection process. We conduct extensive experiments on multimodal meme datasets and demonstrate the effectiveness of our approach. Our model achieves promising results on the Hateful Memes Detection Challenge. Our code is publicly available on GitHub.