Qualitative text analysis with local LLMs: Part II

llm
R

This is Part II of a three-part note on analysing text with a locally running LLM. In this part, I provide a worked example of doing text analysis with a local LLM using R. The example problem comes from a free-text rating task in a recently published Psychological Science paper.

Author

Mark Andrews

Published

January 28, 2025

In Part I of this note, we introduced the topic of using local LLMs for qualitative text analysis. There, we also explained how to install a local LLM and use it with R via the ellmer package. In this part, we will use a local LLM to analyse and rate free-text survey responses. While this task is not as difficult as some other qualitative analyses, it is still a nontrivial problem, and it also allows us to compare the LLM's performance to that of human raters.

Example problem

The particular problem we will consider was part of the research study described in the following article:

  • Merrell, W. N., Choi, S., & Ackerman, J. M. (2024). When and why people conceal infectious disease. Psychological Science, 35(3), 215-225. https://doi.org/10.1177/09567976231221990.

The materials used in this study are made available on OSF. Part of this study was about the reasons people give for concealing infectious illnesses. Participants wrote their reasons for doing so in free-text survey responses, which were then rated by two people in terms of the reasons they gave. Here’s an example of a response:

I didn’t really want people to be afraid of me. I had taken a test and I knew I wasn’t positive for Covid, so I was just worried that people would think it was false and that they’d avoid me or get mad at me for attending when I was sick


There were 409 free-text responses like this. All of them are available in this spreadsheet on OSF.
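To work with the responses in R, the spreadsheet can be read into a data frame. The file and column names below are hypothetical stand-ins for the actual OSF spreadsheet; the small data frame is just for illustration of the assumed shape:

```r
# A sketch of loading the responses; the file and column names here are
# hypothetical stand-ins for the actual OSF spreadsheet:
# responses_df <- readxl::read_excel("concealment_responses.xlsx")

# For illustration, a data frame with the assumed shape:
responses_df <- data.frame(
  id = 1:2,
  response = c(
    "I didn't really want people to be afraid of me.",
    "I did not want to miss work or class."
  )
)
free_texts <- responses_df$response  # character vector of responses
length(free_texts)
# 2
```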

Two raters were asked to rate each response for the following motivations:

  1. (Self vs. other motivation) Did the participant mention motivations for concealment that were more related to the self or more related to others?
  2. (Illness harm motivation) Did the participant mention illness harm to others in their motivation response?
  3. (Stigma, rejection, isolation motivation) Did the participant discuss motivations for concealment that mentioned feeling stigmatized, rejected, excluded, or isolated?
  4. (Missing class/work motivation) Did the participant mention concealing because they did not want to miss work or class?

They were given detailed instructions about how to rate the responses with respect to these motivations, which are available here. For example, by way of general introduction to the task, they were told:

We have collected free response data from students at the University of Michigan and healthcare workers within Michigan Medicine. Your job will be to read through these free responses about why participants said they were motivated to hide signs of infectious illness and indicate where they fall on a number of different variables …

They were also provided with details about the format of the data and how to record their ratings on a spreadsheet. Then for the self vs. other motivation rating task, for example, the instructions were as follows:

Did the participant mention motivations for concealment that were more related to the self or more related to others?
Examples of self motivation include not wanting to miss out on in-person things like work or class, not wanting others to judge them, not wanting others to avoid them.
Examples of other motivation include not wanting to worry other people, not wanting to burden others by missing a work shift.
Coding scheme: please put a “1” if it is a self motivation, a “2” if it is an other motivation, and a “0” if it is neither/unclear.
Note: It is possible for a response to mention both self and other reasons, so it is ok for there to be both a 1 and a 2 for the same response.

The instructions for the other rating questions were similar in their level of detail.

Rating free-text responses using Llama

Using Ollama and R’s ellmer package, the basic code that is needed to do the rating task is quite simple. For example, for the self vs. other motivation rating, and using the example free response mentioned above, the following code is sufficient:

instructions <- '
We have collected free response data from students at the University of Michigan
and healthcare workers within Michigan Medicine.
Your job will be to read through these free responses about why participants said they
were motivated to hide signs of infectious illness and indicate where they fall on a 
number of different variables.

Does the participant mention motivations for concealment that were more related to
the self or more related to others?
Examples of self motivation include not wanting to miss out on in-person things
like work or class, not wanting others to judge them, not wanting others to avoid them.
Examples of other motivation include not wanting to worry other people, not wanting
to burden others by missing a work shift.

Coding scheme: please put a "1" if it is a self motivation, a "2" if it is an other
motivation, and a "0" if it is neither/unclear.
NOTE: It is possible for a response to mention both self and other reasons, so it is ok
for there to be both a 1 and a 2 for the same response, i.e. "1,2".

Your final answer should be either "0" or "1" or "2" or "1,2".
'

free_text <- "
I didn't really want people to be afraid of me. I had taken a test and 
I knew I wasn't positive for Covid, so I was just worried that people would think
it was false and that they'd avoid me or get mad at me for attending when I was sick
"

client <- chat_ollama(model = "llama3.3", system_prompt = instructions)
client$chat(free_text)
1 This response mentions motivations related to the self, such as not wanting others to judge
them ("get mad at me"), avoid them, or think negatively of them ("think it was false"). These
are all concerns about how others might perceive or react to the participant, which falls under
self-motivation. There is no mention of motivations related to others, such as not wanting to burden
or worry them.

There are a few important points to note here:

  • The output contains more information than we strictly asked for. We asked only for a numeric rating value but also received an explanation for the rating. That explanation is useful, but it is generally better to control the output and its format: we can explicitly ask for an explanation to accompany the rating, and we can then specify how all of the output should be formatted.
  • The output of the LLM is random and so each time the client$chat(free_text) command is run, a possibly different response is obtained.
  • The output above is an unformatted string. If we want to collect and store large numbers of LLM responses, especially if the responses can contain a lot of information, it is very helpful if we instruct the LLM to format the output in some way, for example as JSON or YAML.

Given these points, we can modify our instructions as follows:

instructions <- 'We have collected free response data from students at the University of Michigan
and healthcare workers within Michigan Medicine.
Your job will be to read through these free responses about why participants said they were motivated 
to hide signs of infectious illness and indicate where they fall on a number of different variables.

Does the participant mention motivations for concealment that were more related to the self or more 
related to others? Examples of self motivation include not wanting to miss out on in-person things
like work or class, not wanting others to judge them, not wanting others to avoid them.
Examples of other motivation include not wanting to worry other people, not wanting to burden
others by missing a work shift.

Coding scheme: please put a "1" if it is a self motivation, a "2" if it is an other motivation,
and a "0" if it is neither/unclear.
Note: It is possible for a response to mention both self and other reasons, so it is ok for there
to be both a 1 and a 2 for the same response, i.e. "1,2".

Before answering, explain your reasoning step by step, using example phrases or words.
Then provide the final answer.

Your final answer should be either "0" or "1" or "2" or "1,2".

Format your response as a JSON object literal with keys "reasoning" and "answer".
The value of "reasoning" is your step by step reasoning, using phrases or words.
The value of "answer" is the numerical score ("0", "1", "2", or "1,2") that you assign
to the free response text.

An example of a JSON object literal response is this:

  {"reasoning": "The text shows many elements of a self motivation.",
  "answer": "1"}

Please make sure that there is an opening (left) and closing (right) curly brace in what you return.
If not, this is not a proper JSON object literal.
'
client <- chat_ollama(model = "llama3.3", system_prompt = instructions)
results <- client$chat(free_text)

Now, we get JSON-formatted output, and we can use a JSON parser to return this as an R list:

jsonlite::fromJSON(results)
$reasoning
[1] "The participant mentions not wanting people to be afraid of them and not wanting others to avoid them or get mad at them, which suggests a motivation related to how others might perceive or react to them. This is indicative of a self motivation, as the participant is concerned about their own social interactions and potential judgments from others. Specifically, the phrases 'people would think it was false', 'avoid me', and 'get mad at me' imply that the participant is worried about being judged or ostracized, which is a self-related concern."

$answer
[1] "1"
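In practice, the model may occasionally wrap the JSON in extra text despite the instructions, which is why the prompt above insists on the opening and closing braces. A small helper, my own addition rather than anything provided by ellmer or jsonlite, can pull out the substring from the first opening brace to the last closing brace before parsing:

```r
# Extract the substring from the first "{" to the last "}" of a string,
# returning NA if either brace is missing. The result can then be passed
# to jsonlite::fromJSON(). This helper is an assumption of this sketch,
# not part of ellmer.
extract_json <- function(x) {
  start <- regexpr("\\{", x)
  stop_ <- tail(gregexpr("\\}", x)[[1]], 1)
  if (start == -1 || stop_ == -1) return(NA_character_)
  substr(x, start, stop_)
}

# A hypothetical response where the model adds text around the JSON:
raw <- 'Here is my answer: {"reasoning": "Self-focused concerns.", "answer": "1"} Hope that helps!'
extract_json(raw)
# '{"reasoning": "Self-focused concerns.", "answer": "1"}'
```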

Also, for each free-text response, we can re-run the above commands multiple times to get multiple ratings. This way, we can find the most common numeric rating for each text and use this as the final rating. The code for this analysis can be found in this website’s GitHub repo.
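The most-common-rating step can be sketched with a small helper of my own (not part of ellmer); the vector of ratings below stands in for the answers parsed from repeated runs of `client$chat(free_text)`:

```r
# Return the most frequent rating in a character vector of ratings.
# Ties are broken by whichever rating table() lists first.
majority_rating <- function(ratings) {
  counts <- table(ratings)
  names(counts)[which.max(counts)]
}

# e.g. ten hypothetical answers parsed from repeated runs:
ratings <- c("1", "1", "1,2", "1", "0", "1", "1", "1,2", "1", "1")
majority_rating(ratings)
# "1"
```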

LLM Evaluation

As mentioned, the data set we are using contains 409 free-text responses, each rated by two separate (human) raters according to four separate criteria. We provided the Llama LLM with the same rating instructions used by the human raters, albeit with the modifications described above to obtain formatted output, and so it too rated each of the 409 free-text responses on each of the four criteria. For each text and criterion, we re-ran the LLM analysis 10 times and used its most common response as its final rating.

To evaluate the LLM, for each of the four criteria, we can calculate how often its rating was identical to each of the two raters and then calculate its average agreement. For example, if its ratings for one criterion were identical to the first rater 70% of the time, and identical to the second rater 80% of the time, its average agreement with the raters for that criterion is simply 75%. For comparison, for each criterion, we can calculate how often the two human raters agreed. These results are shown in the following table:
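The agreement calculation just described can be written directly; the toy rating vectors here are illustrative and are not the study data:

```r
# Average agreement: the fraction of identical ratings with each human rater,
# averaged over the two raters.
avg_agreement <- function(llm, rater1, rater2) {
  mean(c(mean(llm == rater1), mean(llm == rater2)))
}

llm    <- c("1", "2", "1", "0", "1,2")
rater1 <- c("1", "2", "1", "1", "1,2")  # agrees with llm on 4/5 = 80%
rater2 <- c("1", "2", "0", "1", "1,2")  # agrees with llm on 3/5 = 60%
avg_agreement(llm, rater1, rater2)
# 0.7
```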

Criterion    LLM-rater average agreement    Inter-rater agreement
self-other   75.7%                          88.3%
harm         64.1%                          89.0%
stigma       85.5%                          90.5%
miss         91.3%                          89.7%

Overall, averaging over the four criteria, the agreement between the LLM and the human raters is 79.1%, compared with an overall agreement between the two humans of 89.4%. There is, however, some variability in performance across the four criteria, with a notably poor 64.1% agreement for the LLM on the "harm" criterion.

Discussion

In the rating task just described, the average agreement between the LLM and the human raters was approximately 80%, compared with approximately 90% agreement between the two human raters. What can we conclude from this evaluation and this result? First, I don't think we can conclude that much, certainly not about the general usefulness of LLMs for qualitative analysis. This is just one task, after all, and we would need a greater variety of tasks to evaluate LLMs properly. Second, the performance is not great, though not bad either. A great result would be one where the average agreement between the LLM and the human raters is around 90%, so that the LLM agrees with the human raters as much as they agree with each other. If 90% is the gold standard, however, 80% is a respectable and far from disastrous result. More importantly, it was obtained using an off-the-shelf LLM with no customization or fine-tuning. There was not even any so-called "prompt engineering", whereby we try to optimize the prompts we use. As described above, the prompt was just the set of instructions given to the human raters, with minor modifications to obtain formatted output. A final point of encouragement is that LLMs are evolving very rapidly: a performance of 80% using the models readily available in December 2024 may well be exceeded quickly as new models become available.

This concludes Part II of this note. In Part III, we will consider some additional practicalities of using LLMs, especially local LLMs, for qualitative text analysis.


Reuse

CC BY-SA 4.0

Citation

BibTeX citation:
@online{andrews2025,
  author = {Andrews, Mark},
  title = {Qualitative Text Analysis with Local {LLMs:} {Part} {II}},
  date = {2025-01-28},
  url = {https://www.mjandrews.org/notes/text_analysis_with_llms/part2.html},
  langid = {en}
}
For attribution, please cite this work as:
Andrews, Mark. 2025. “Qualitative Text Analysis with Local LLMs: Part II.” January 28, 2025. https://www.mjandrews.org/notes/text_analysis_with_llms/part2.html.

© Mark Andrews 2021-2025. CC BY-SA 4.0.