Facial action unit detection, which aims to detect facial muscle activities from face images, is an important task for enabling emotion recognition from facial movements. By coding facial muscle activities into a system of facial Action Units (AUs), facial expressions can be described precisely. However, predicting AUs from fine-grained facial appearances remains challenging, and one of the difficulties lies in handling the varied appearances of different subjects. In this thesis, we address this problem by introducing an auxiliary neutral face image to produce person-specific transformations for each subject. With the help of neutral faces, our method extracts effective features of facial muscle activities despite divergent individual appearances. We propose to combine an additional face clustering task with the AU detection task to form multi-task network cascades and to train the cascades jointly. First, to train the face clustering networks that produce the person-specific transformations, we utilize identity-annotated datasets containing numerous subjects, which alleviates the common problem that existing AU-annotated datasets cover only a few subjects. Second, we transform the facial features with the person-specific transformations to reduce individual differences before predicting AU labels. As a result, the proposed network cascades exploit not only visual but also identity information and thus detect AUs more effectively through personalized appearance normalization. Experimental results on the BP4D dataset show that our method outperforms state-of-the-art methods, and experiments under cross-dataset and cross-group scenarios further demonstrate the robustness of our approach.
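The pipeline described above — extracting features, deriving a person-specific transformation from a neutral face, normalizing the target features, and predicting per-AU probabilities — can be illustrated with a minimal sketch. This is not the thesis architecture: the feature extractor, the affine form of the transformation, and the linear AU head are all hypothetical stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image):
    # Hypothetical feature extractor: a fixed linear projection of the
    # flattened image, standing in for a CNN backbone.
    flat = image.reshape(-1)
    W = np.linspace(-1.0, 1.0, flat.size * 16).reshape(16, flat.size)
    return np.tanh(W @ flat)

def person_specific_transform(neutral_feat):
    # Hypothetical person-specific transformation derived from the
    # neutral-face feature: an affine map that shifts away the neutral
    # appearance and rescales each dimension.
    scale = 1.0 / (1.0 + np.abs(neutral_feat))
    shift = -neutral_feat
    return lambda feat: scale * (feat + shift)

def detect_aus(image, neutral_image, n_aus=12):
    # Normalize the target-face features with the subject's transform,
    # then apply a (hypothetical) linear AU head with sigmoid outputs.
    feat = extract_features(image)
    transform = person_specific_transform(extract_features(neutral_image))
    normalized = transform(feat)
    W_au = np.linspace(-0.5, 0.5, n_aus * normalized.size).reshape(n_aus, -1)
    logits = W_au @ normalized
    return 1.0 / (1.0 + np.exp(-logits))

face = rng.random((8, 8))
neutral = rng.random((8, 8))
probs = detect_aus(face, neutral)  # one probability per AU
```

Note that feeding the neutral face as its own reference zeroes the normalized features, so every AU probability collapses to 0.5 in this toy version — the normalization removes exactly the subject-specific appearance, which is the intuition behind using neutral faces as a reference.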