AI in Research: A Qualitative Man vs Machine Experiment

Feb 20, 2023

The latest advancements in publicly accessible artificial intelligence programmes have led to a lot of discussion about both the public and commercial applications of current-state AI. With some conversations predicting the end of the world and others damning the latest attempts at natural language processing software, it’s safe to say that while advancements have been made, it’s not time to start building the off-grid bunker yet.

Man vs. Machine in Context

In 2018, our Head of Insight (then Insight Manager), Charlotte Duff, ran an experiment dubbed ‘Man vs. Machine’ to determine the current state of artificial intelligence and its application to qualitative market research. Charlotte hypothesised that some artificial intelligence programmes might be intelligent enough to replace humans in the qualitative data collection and analysis process, but this theory needed thorough testing.

So, Charlotte created a research experiment to find out the capabilities of AI in 2018, using an online insight community to see whether a machine analysis element aids scaled qualitative insight generation. She then presented her findings at GreenBook’s IIEX Europe event in 2018, winning the Best New Speaker Award. From her experiment came five outcomes:

The machine was the clear winner when it came to the speed at which AI analysed qualitative data — but a combined approach resulted in more quality insights.
While the machine had pre-programmed algorithms to help it sift through the data quicker and easier than the human researcher, the machine failed to provide context behind the charts, which is severely detrimental to insight quality.
The machine was the most objective, with no confirmation bias at play, while the human researcher inevitably had subconscious biases, which manifested most often as expectations formed before analysing the dataset.
The human researcher’s deep dive sentiment analysis highlighted many occasions where the machine-based analysis was totally incorrect — mostly due to the lack of contextual knowledge on the machine’s part.
Traditionally, insight experts don’t commit to any sentiment in their analysis unless it is clear cut — in this experiment, Charlotte didn’t commit to an overall sentiment, whereas the machine did and added value in this area by going down a different path for deeper understanding.

In the end, when it comes to sentiment analysis, Charlotte found that the machine provided the ‘what’ and ‘how’, and uncovered lines of inquiry that Charlotte’s subconscious bias had led her to miss, but it failed to produce quality insights because of its lack of context.

With the qualitative analysis part of the experiment done, we decided to test out ChatGPT’s quantitative capabilities in the second take of this experiment.

What are the applications for current-state open artificial intelligence in market research? We tested ChatGPT using our Open Ideas insights community to see how well the AI can write a survey.

Man vs. Machine — Quantitative Experiment

In this new version of the experiment, we have harnessed ChatGPT’s capabilities and tested them to see if they’re capable of automating survey design and creation. We pitted ChatGPT against Charlotte Duff, to see who could write the best survey that was fit for purpose, created in a timely fashion and took into account the relevant context to get the best insights possible from our pool of respondents on Open Ideas, FlexMR’s dedicated online insights community.

We asked ChatGPT to “write a survey with 15 questions and a range of named question types on the topic of inflation rise within the UK and its impact on consumers”, while Charlotte worked from specific objectives, as our insight team does with clients, to create a survey on the same topic with the same number and a range of questions.

The surveys from both the AI and Charlotte were sent out to 640 respondents each in a blind study and completed by 12% of the sample. Our insight team then analysed the data generated and created two reports of top-line results, which highlight exactly how well both performed on exploring the topic at hand. This data will be explored in another blog and will showcase the differences between the two surveys (questions and language used).

During this process, we had two expert researchers from our team, Dr Matthew Farmer and Harriet Walton, evaluate the survey written by ChatGPT, and I will share their thoughts below.

Experiment Evaluation

In summary, the results of the experiment are:

ChatGPT took the prompt and designed/created a survey in a matter of minutes, while Charlotte took more time to design and write her survey.
The questions ChatGPT generated were on topic, but generic, covered a wide range of subtopics within the overarching topic and seemed random in their placement within the overall survey design — Charlotte’s questions were specific, relevant to the primary objectives and followed a natural narrative that made sense to respondents.
The answer options ChatGPT assigned to each question were random and, in some cases, illogical, while Charlotte chose logical answer options that also aided researchers in their analysis of the results.
The language ChatGPT used was simple and easy to understand, as was the case with Charlotte’s survey.

The request to generate the survey had to be specific, containing as much information as possible but still written in a logical way so that the AI understood exactly what we were after. The request used in this case was “Please can you write a survey with 15 questions and a range of named question types on the topic of inflation rise within the UK and its impact on consumers.”

Charlotte worked with written objectives and asked questions to get a better sense of what we were looking for more specifically within the general topic. This means that Charlotte started off with a better understanding of what we were looking for from the survey and was able to deliver a survey that benefitted all parties involved: stakeholders, researchers and respondents.

Expert Analysis and Commentary of AI Survey

During the survey programming stage, we had two resident insight experts, Harriet Walton and Matthew Farmer, evaluate the AI survey and provide their thoughts on the survey design. Overall, they both agreed that there were significant issues they would ideally fix before sending it out to respondents.

One issue they noted was that there wasn’t a logical flow to the questions ChatGPT generated: it oscillated between questions and topics that would normally be grouped together, and the feel of the survey was very rigid. They also questioned the usefulness of some of the questions, as ChatGPT asked some surface-level quiz-like questions that did very little to gain any understanding of the impact of inflation on the participant. Harriet noted that respondents would be tempted to google the answers, which would draw them away from the survey, and Matthew mentioned that these questions weren’t necessary or relevant to the brief, so they would be a waste of usable space in the survey.

Another issue our experts noted was that there were simple errors that insight experts would never make, such as overlapping numerical answer options that would lead to unclear results, inadequate middle points in scale questions, and the lack of “don’t know” or “prefer not to say” options for the more sensitive questions. The range of answer options was another aspect that our experts commented on: while there were a lot of different answer options, both Matthew and Harriet suggested alternatives that would yield better data.
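Some of these checks lend themselves to automation before a survey goes out. The Python sketch below is purely illustrative and was not part of the experiment: the function names, the example bands and the list of opt-out labels are all assumptions made for the example. It flags numerical answer options that overlap and checks whether a question offers an opt-out.

```python
from itertools import combinations
from typing import List, Tuple

Band = Tuple[int, int]  # inclusive (low, high) bounds of one answer option

# Hypothetical opt-out labels; a real survey platform may word these differently.
OPT_OUT_LABELS = {"don't know", "prefer not to say"}


def find_overlaps(bands: List[Band]) -> List[Tuple[Band, Band]]:
    """Return every pair of bands that share at least one value."""
    # After sorting, the first band of each pair starts no later than the second,
    # so the pair overlaps exactly when the second starts before the first ends.
    return [(a, b) for a, b in combinations(sorted(bands), 2) if b[0] <= a[1]]


def has_opt_out(options: List[str]) -> bool:
    """Check whether any answer option is a recognised opt-out."""
    return any(option.strip().lower() in OPT_OUT_LABELS for option in options)


# Example: a respondent aged 25 or 35 could tick two boxes here.
overlapping = find_overlaps([(18, 25), (25, 35), (35, 50)])
clean = find_overlaps([(18, 24), (25, 34), (35, 49)])
print(overlapping)
print(clean)
print(has_opt_out(["Yes", "No", "Prefer not to say"]))
```

A check like this only catches mechanical errors; judging whether a scale has an adequate middle point or whether a question is useful still needs a researcher.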

One example of this is the demographic questions that ChatGPT included — while it was good to include these questions, the answer options it chose were open text boxes. While respondents could input their answers exactly, this ended up being a pain for our researchers as open text can lead to very specific answers or answers that are too broad to categorise, which leads to time-consuming and sometimes impossible analysis. They would typically provide answers for demographic questions in groupings, such as age ranges and income ranges. There are also guidelines that industry bodies expect insight experts to use in order to promote inclusivity, something which ChatGPT doesn’t have access to and so didn’t use.
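To illustrate why grouped answer options are easier to analyse than open text, here is a minimal Python sketch of the clean-up work that free-text age answers create. The age bands and the `age_band` function are assumptions chosen for the example, not the groupings our researchers or any industry body prescribe.

```python
# Hypothetical age bands (inclusive bounds); real surveys follow the
# groupings agreed for the project.
AGE_BANDS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]


def age_band(raw: str) -> str:
    """Map a free-text age answer to a band, or flag it for manual review."""
    text = raw.strip()
    # Anything that is not a plain whole number cannot be banded automatically.
    if not (text.isascii() and text.isdigit()):
        return "Uncategorised"
    age = int(text)
    if age < 18:
        return "Under 18"
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"


answers = ["34", " 67 ", "mid thirties", "17"]
print([age_band(answer) for answer in answers])
```

Every answer that lands in "Uncategorised" has to be read and coded by hand, which is the time cost described above; offering the bands as answer options removes that step entirely.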

Matthew and Harriet both recognised a lack of context throughout this survey, which would explain why it reads as generic rather than tailored, as Charlotte’s does. With regard to one particular question, Matthew mentioned that “These answers also lack knowledge of the contemporary context (and ability to do desk research), making them feel standardised, rather than relevant to a survey specifically about the current situation in the UK.”

The question asked respondents to look at the following factors and determine whether they thought these factors contributed to the cause of the rise in inflation — the answer options it produced included: economic growth, increase in wages, the cost of raw materials and government spending. Matthew would have suggested that the answer options included current contexts such as: “COVID-19 pandemic”, “Brexit”, “War in Ukraine”, “UK Government Budget Decisions”, and “Rising Fuel Prices”.

Context appears in two forms in typical surveys. Firstly, as above, the context should influence the questions asked and the answers offered. But equally, it should be present in the form of text before new questions and sections, e.g. before demographics: “We would now like to know a bit more about you”. Harriet pointed out that there wasn’t any context provided anywhere in the ChatGPT survey, and that it would have benefitted from some to better engage respondents.

Open AI Self-Analysis

After ChatGPT had generated the requested survey, I asked it three questions. Firstly, how did it think it did when writing this survey? ChatGPT replied that it “provided [me] with a set of questions that are written in a clear and concise manner and that covers a range of question types, including multiple choice, likert scale, open-ended, ranking, true or false, demographic, checkbox, rating, drop-down, matrix, and a comment box. These questions are designed to gather information about the topic of your survey. However, the effectiveness of a survey ultimately depends on its implementation and how well the questions are worded and structured to gather meaningful data.”

Secondly, how would it improve the survey if given the chance? This question required some self-reflection, something the AI isn’t capable of. However, it did provide a few suggestions on how we could improve surveys in general, including:

Make sure that the questions are clear and easy to understand. Avoid using jargon or technical terms that may be confusing to respondents.
Ensure that the questions are unbiased and do not lead the respondent to a particular answer.
Make sure that the survey is not too long. Respondents may lose interest or become fatigued if the survey is too long.
Be mindful of the order of the questions, and make sure that they are structured in a logical manner.
In case you are collecting sensitive data, make sure to provide a clear statement about the confidentiality and use of the data.

It also noted that “these are general recommendations and may not be applicable to every survey. It’s always a good idea to consult with experts in survey research if you have any doubts.” The deferral to expert researchers was unexpected, but if this programme can recognise its flaws as much as the expert researchers did, it might not be too far into the future when it learns to correct itself.

Lastly, I asked ChatGPT if the surveys it created are at the same standard as a survey written by a professional researcher, and it replied:

“As a language model, I can generate a survey based on the information and examples provided to me, but the quality and effectiveness of the survey depend on the context, the research question and the target population. A survey written by a professional market researcher would take into consideration various aspects of survey research design, such as sampling, data collection methods, and data analysis, that can affect the results of the survey. They would also have knowledge and experience with the latest best practices and techniques in survey research, which can help ensure the validity and reliability of the survey results. Additionally, a professional researcher would have a good understanding of the target population and would be able to use that knowledge to create a survey that is tailored to the specific needs of that population.

While the surveys I generated can be a good starting point, it’s important to keep in mind that these surveys would need further refinement, testing and validation by a professional researcher to meet the standards of a professional survey.”

While the survey written by ChatGPT showed promise, there were some significant flaws that stakeholders need to be aware of before using it to write their surveys.

Challenges of Using Open AI

The last response from ChatGPT raises a pertinent point: the AI, while built on datasets from many areas of life, doesn’t have access to the live data feeds, contextual information or business objectives it would need to factor into the survey generation process. Good survey design takes into account the fundamentals of question generation and structuring, but it also makes sure that respondents will be engaged from start to finish and that the insights generated from the survey will be relevant and directly actionable.

Another challenge is one that our experts Matthew and Harriet picked up on: even though ChatGPT mentioned that it didn’t use leading questions, there was indeed a leading question in the survey, “Which of the following factors do you believe has contributed to the current rate of inflation in the UK?” That ChatGPT did not recognise this as a leading question leads us to believe that it is not yet ready for solo survey generation. It will need an expert to look over any survey it produces in order to fix the simple mistakes it made throughout the experiment.

In terms of practicality, those enthusiasts looking to use ChatGPT for survey generation should be aware that this function is only currently available through the website. This website has become tricky to access as many people are using it on a daily basis for different purposes, which means that the server is at times at capacity and this likely won’t change until the hype of open AI software dies down.

The last challenge I will mention is a result of this experiment. While ChatGPT is indeed capable of writing a survey, those surveys are currently marred by little mistakes and issues that stem from a lack of context: the AI has a concept of what the end result should look like, but no knowledge of the results the survey would actually generate. Any survey written by ChatGPT or similar AI software will need expert supervision. So, the challenge for future researchers is this: to develop an AI that is trained on expert-written survey datasets, and to find a way to feed it the relevant context with each new project.
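As a rough illustration of what feeding context into each request could look like today, the Python sketch below assembles a fuller survey request from a topic, a list of objectives and a list of contextual factors. It is an assumption-laden example rather than the method used in this experiment: the `build_survey_prompt` function and the objectives shown are hypothetical, and only the contextual factors are taken from Matthew’s suggestions above.

```python
from typing import List


def build_survey_prompt(
    topic: str,
    objectives: List[str],
    context: List[str],
    n_questions: int = 15,
) -> str:
    """Assemble a survey request that states objectives and context explicitly."""
    lines = [
        f"Please write a survey with {n_questions} questions and a range of "
        f"named question types on the topic of {topic}.",
        "",
        "Research objectives:",
    ]
    lines += [f"- {item}" for item in objectives]
    lines += ["", "Current context to reflect in the answer options:"]
    lines += [f"- {item}" for item in context]
    return "\n".join(lines)


prompt = build_survey_prompt(
    topic="inflation rise within the UK and its impact on consumers",
    # Hypothetical objectives, invented for this example.
    objectives=[
        "Understand how rising prices have changed household spending",
        "Identify which costs consumers are most worried about",
    ],
    # Contextual factors suggested by Matthew in his review.
    context=["COVID-19 pandemic", "Brexit", "War in Ukraine", "Rising Fuel Prices"],
)
print(prompt)
```

A template like this would not replace expert review; it simply gives the AI some of the information Charlotte had when she started writing.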

This article was originally published on the FlexMR Insights Blog and can be accessed here.