Studies in second language acquisition, especially in the area of language learning strategies, frequently employ the survey method alone as their means of investigation. Incongruent results are normally explained in terms other than the survey measure as such. One of our recent qualitative studies, however, revealed that respondents have different reference systems in mind when answering Likert-type questions. In this study, we call into question the ambiguities of the Likert-type five-point scale in learning strategy elicitation. Four parallel questionnaires consisting of the same batch of 20 items taken from Oxford (1990) were administered to a group of 120 tertiary-level non-English majors in China. Questionnaire 1 directly took Oxford's scale without specifying dimensions of reference; Questionnaire 2 told the respondents to choose their answers by comparing themselves with their peers in the same grade; Questionnaire 3 asked them to rate their present behavioral frequency as compared with their own past learning experience in secondary school; and in Questionnaire 4, subjects were told to tick the relevant frequency of a behavior by comparing its frequency of occurrence with that of other language skills. Data from the four questionnaires were subjected to a repeated measures MANOVA using SPSS/PC+. Results showed that, of the 20 items, 13 differed significantly across the four questionnaires. Methodological implications for questionnaire research are then discussed and suggestions for future research proposed.

The survey technique that uses an ordinal scale to measure the strength of an attitude, and several items to form an attitudinal construct, is usually referred to as a Likert scale (Shaw & Wright, 1967). Since Likert (1932) modified Thurstone's (1928) scaling method and made it an easy-to-use measurement technique, the Likert scale has flourished for decades in social and behavioral research.
By far it is most often applied to attitudinal measurement; fewer studies, however, employ the Likert scale as a yardstick for human behavior (Dunn …). When applied to attitudes, it usually takes the form of a five-point scale measuring the degree of agreement with a statement (from strongly disagree to strongly agree). When behaviors are the target of measurement, on the other hand, the scale becomes a measure of the frequency with which a behavior is thought to occur. Numerous problems have been reported concerning the validity and reliability of the Likert scale (see, for example, Busch, 1993; Keppel, 1991; Turner, 1993). Some of these problems result from the scale itself, others from its applications. For instance, one of the widely used formats for the elicitation of behavioral frequency (never, rarely, sometimes, often, always) is quite often dubious due to its semantic inexplicitness. Take the word "often," for example. Different individuals will almost certainly disagree on how frequently an action has to take place before it is regarded as "often." One solution to this problem is to spell out the frequency of occurrence of a behavior. Still, one needs to take meticulous care in how the specification is done, simply to avoid even more confusion. As an example, Oxford's (1990) explanation of "somewhat true of me" as "true of me for half of the time" (p. 293) may well be argued to have added more trouble than illumination. "What is half of the time?" Wen (1993) asked. "Half of the time when I am awake, half of all my time spent on learning, or what?" Another related problem does not quite lie in the scale as such. It is not unusual to see results from Likert-type questionnaires subjected to a statistical analysis that presumes a linear relationship between the psychological or behavioral construct tested by the scale and a criterion measure when in fact the relationship is other than linear.
Granted that simple correlations between each questionnaire item and the dependent variable measure may not greatly distort the actual picture, when a construct formed by averaging several items is correlated with the dependent measure, distortion is much more likely to occur if the relationship between some items in the construct and the criterion measure is linear while the relationship between the other items in the construct and the same criterion measure is not. Moreover, even if the whole construct does enjoy homogeneity in terms of its relationship with the criterion measure, confusion is still likely to result from more sophisticated statistical tests such as multiple regression, LISREL, or path analysis, where all constructs are put together for linear modeling. To be more specific, the relationship between anxiety as measured via Likert-type questionnaires and learning outcome is known to be non-linear, which by no means suggests that anxiety is unimportant in learning. However, a linear analysis of the two constructs would produce a result suggesting a weak relationship between them. The best way to prevent this from happening is to plot each questionnaire item and each construct against each criterion measure before subjecting them to further analysis. In addition, response sets, and especially cultural differences in response sets (a problem directly associated with Likert scaling), have long bothered social scientists (e.g., Hui & Triandis, 1989; Triandis & Triandis, 1962). For example, it has been repeatedly demonstrated that Asians differ from the British (Wright et al., 1978) and Hispanics (Hui & Triandis, 1989) in terms of what exactly they mean when they respond to Likert-type questions. Zax and Takahashi (1967) have also reported that Asians tend to use the middle of the scale and take it as an indication of their highly valued modesty, whereas Mediterranean people tend to use extreme responses to show their sincerity.
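The point about curvilinear relationships can be illustrated with a small simulation. The sketch below (hypothetical data, not from this study) generates an inverted-U relation between simulated anxiety scores and a learning outcome: a linear index (Pearson's r) reports almost no association, while a quadratic fit recovers a strong one, which is exactly why plotting items and constructs before linear modeling matters.

```python
import numpy as np

# Hypothetical illustration only: simulate an inverted-U relation between
# anxiety (1-5 Likert-style scores) and a learning outcome.
rng = np.random.default_rng(0)
anxiety = rng.uniform(1, 5, 500)                            # simulated scale scores
outcome = -(anxiety - 3.0) ** 2 + rng.normal(0, 0.3, 500)   # curvilinear + noise

# A linear summary (Pearson's r) suggests almost no relationship...
r = np.corrcoef(anxiety, outcome)[0, 1]

# ...while a quadratic fit recovers a strong one.
coeffs = np.polyfit(anxiety, outcome, 2)
pred = np.polyval(coeffs, anxiety)
ss_res = np.sum((outcome - pred) ** 2)
ss_tot = np.sum((outcome - outcome.mean()) ** 2)
r2_quad = 1 - ss_res / ss_tot

print(f"Pearson r = {r:.2f}, quadratic R^2 = {r2_quad:.2f}")
```

A scatter plot of the same data would reveal the curvature at a glance, which is the inexpensive diagnostic step recommended above.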
While these findings are fully justified, we nevertheless believe that even people from a homogeneous cultural background may differ in terms of what they really mean when they choose the same answer. In other words, individual respondents may well have very different subjective reference systems when presented with a relative scale. These problems are particularly relevant to research in SLA, as the bulk of work on language learning strategies, for instance, frequently employs the survey method alone as a means of investigation (e.g., Oxford & Nyikos, 1989; Oxford, Nyikos, & Ehrman, 1988; Politzer & McGroarty, 1985). Incongruent results are normally explained in terms other than the measurement as such (Gu, 1992). One of our recent qualitative studies (Wen, 1993), however, has revealed that respondents' different reference systems might have influenced the ways the Likert-type questions were answered. For example, some subjects complained about not knowing whom to compare themselves with when asked how often they performed a learning behavior. "What do you mean by often?" asked one. "Compared to my classmates, I seldom do it. Compared to myself several years ago in my secondary school, however, I'm doing it quite often." "Compared to listening and reading," said another, "I rarely do any speaking and writing at all." To make matters worse, some subjects reported that they might compare themselves with their classmates when answering one item, and with their own past learning experiences when answering another. Obviously, these subjective reference variations distort the interpretation of survey results to a considerable extent, so much so, in fact, that we began to doubt the reliability of any general survey measure that relies solely upon the Likert scale as its indicator of learning behaviors, short of backing it up with other means of data elicitation.
The present study was thus designed to confirm or dispel this doubt, and to see whether different questionnaires that specify different systems of reference would yield different results.