An Examination of the Hidden Judging Criteria in the Generative Design in Minecraft Competition

Game content has long been created using procedural generation. However, many of these systems are currently designed in an ad hoc manner, and there is a lack of knowledge around the design criteria that lead to generators producing the most successful results. In this study, we conduct a qualitative examination of the comments left by judges for the 2018–2020 Generative Design in Minecraft competition. Using abductive thematic analysis, we identify the core design criteria that contribute to a generator that creates “good” content, here defined as interesting or engaging. We find that the core design criteria that create an interesting settlement are the usability of the settlement environment, the thematic coherence within the settlement, and an anchoring in real-world simulacra.


Jean-Baptiste Hervé, Christoph Salge, Member, IEEE, and Henrik Warpefelt

I. INTRODUCTION
In this article, we perform a methodical and qualitative examination of the comments made by judges of the 2018, 2019, and 2020 Generative Design in Minecraft Competition (GDMC) [1], [2]. The GDMC is about creating a procedural generator that can generate a "good" or "interesting" settlement for any given map. It was designed to foster interest in procedural content generation (PCG), adaptive computational creativity, and cocreativity. From a computational creativity standpoint, the core challenges are the adaptation to unknown content (i.e., the input map) and the vaguely defined quality criteria. Consequently, the competition relies on human expert judges to evaluate the generated settlement quantitatively, through scoring based on given criteria, as well as qualitatively, through written comments.
The instructions given to judges ask them to use their best judgment and expertise to determine the quality of each settlement and to score the settlement in four given categories. Each category is introduced by a short description and a list of criteria that are explicitly illustrative, not exhaustive. This immediately raises the question: Do those criteria capture the overall quality assessment of the judges, or are there further elements that are currently not covered? In 2021, one of the judges mentioned on the GDMC Discord server (https://discord.gg/MtYJfsUnVN) that they had scored the settlements based on the scoring guide, yet their favorite generator was not the one with the most points. By looking at the written, optional comments over the years, we hope to identify what might be the missing or hidden criteria that cause this discrepancy.
Jean-Baptiste Hervé and Christoph Salge are with the Department of Computer Science, University of Hertfordshire, AL10 9AB Hatfield, U.K. (e-mail: jbaptiste.herve@gmail.com; C.Salge@herts.ac.uk).
Henrik Warpefelt is with the Department of Software Engineering and Game Development, Kennesaw State University, Marietta, GA 30060 USA (e-mail: research@warpefelt.se).
Digital Object Identifier 10.1109/TG.2023.3329763
Participants have also complained that some feedback from the judges reads as if they have little experience playing Minecraft. This raises another question: Do experience and actual interaction (rather than just observation) influence the judgment, and is this evident in the given feedback?
In this article, we examine the written feedback from the human judges that accompanies their numerical scores of the generated settlements in the competition. Both the scores and comments provided by the judges were made publicly available by the competition organizers (see footnote 2).
The primary aim of this article is to gain a better understanding of the criteria that judges of the competition actually use to ascertain the quality of a Minecraft settlement. In doing so, we will support work toward the development of formalized and automatic metrics for PCG. Although there exists a range of metrics for this purpose, few of them have been explicitly compared to human qualitative judgment or have been shown to be good predictors of human quality assessment, particularly in the domain of Minecraft settlements. Many existing metrics focus more on measuring what the generator can express rather than how this content is received and interpreted by players. Additionally, we currently know of no works that have studied human self-reported quality judgments for any kind of PCG in games. While this work is focused on Minecraft settlements, we still hope that this analysis can shed some light on how humans evaluate PCG in general.

II. BACKGROUND
In this section, we will start by introducing the game Minecraft and the GDMC competition. We will then present theory critical to understanding PCG, as well as how players interpret and evaluate PCG artifacts.

A. Minecraft and the GDMC Competition
Minecraft [3] is a voxel-based game developed by Mojang Studios, where the players progress in an open world made out of blocks. These blocks represent different materials, such as wood, rock, or coal. Players can destroy blocks, place them in any position within the world, or even combine them through crafting mechanics in order to create new types of blocks or items. Minecraft is mostly known for its open-ended nature and has been compared to LEGO on a computer. Even though the game offers a main objective, which includes visiting alternative dimensions and fighting a dragon, it is mostly used as a sandbox game. Many players use the block mechanic to terraform the game world, create structures such as houses, castles, or cities, and play the game according to self-imposed goals and challenges. Since the art style and setting of Minecraft are very generic, the game affords free creation of almost any kind of artifact, with only the player's imagination setting the limits.
The GDMC is a yearly competition in which teams submit a settlement generator [4] for Minecraft, that is, a computer program that can add or remove blocks from a given Minecraft map without human intervention. All the submitted generators are then run on three maps with a fixed size of 256 × 256 blocks, which are selected by the organizers [2]. An example of these settlements can be seen in Fig. 1. All the generated settlements are then sent to the jury. The jury includes experts in various fields, such as artificial intelligence (AI), game design, or urbanism. Each judge scores the settlements between 0 and 10 points in each of the following categories: Adaptability, Functionality, Narrative, and Aesthetic. Adaptability is how well the settlement is suited for its location, i.e., how well it adapts to the terrain, both on a large and small scale. Functionality is about what affordances the settlement provides, both to the Minecraft player and the simulated villagers. It covers various aspects, such as food, production, navigability, and security. Narrative reflects how well the settlement itself tells an evocative story about its own history and who its inhabitants are (there is a separate bonus challenge about also adding a written PCG text that tells the story of the settlement [5]). Finally, Aesthetic is a rating of the overall look of the settlements. The scale works in the following way: A grade of 5 means that the result looks human-made, a grade of 6-9 corresponds to what we would expect from an expert human, and finally, a 10 would be attributed to a "superhuman performance." After they have examined each generator's performance on each of the three maps included in the competition, the judges provide a score for each of the four categories. The rating of each category is then computed for each generator by averaging (mean) across all judges' scores.
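The per-category averaging described above can be illustrated with a minimal sketch. The function name, the data layout, and the example scores are assumptions made for illustration only; they are not taken from the competition's tooling or from its published results.

```python
from statistics import mean

# The four scoring categories named in the competition description.
CATEGORIES = ("Adaptability", "Functionality", "Narrative", "Aesthetic")


def category_ratings(judge_scores):
    """Return the mean score per category across all judges.

    judge_scores is a list with one dict per judge, mapping each
    category name to that judge's score (0 to 10) for one generator.
    This layout is a hypothetical choice for the sketch.
    """
    ratings = {}
    for category in CATEGORIES:
        scores = [scores_of_judge[category] for scores_of_judge in judge_scores]
        if any(not 0 <= score <= 10 for score in scores):
            raise ValueError(f"{category} score outside the 0-10 range")
        ratings[category] = mean(scores)
    return ratings


# Invented scores from three hypothetical judges for one generator.
example_scores = [
    {"Adaptability": 4, "Functionality": 5, "Narrative": 3, "Aesthetic": 6},
    {"Adaptability": 6, "Functionality": 5, "Narrative": 4, "Aesthetic": 7},
    {"Adaptability": 5, "Functionality": 2, "Narrative": 5, "Aesthetic": 5},
]
ratings = category_ratings(example_scores)
```

A plain mean treats every judge equally and is sensitive to a single outlying score, which is one reason the written comments can carry information that the averaged rating does not.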
In addition to these ratings, judges provide qualitative written feedback for each generated settlement. The comments left by judges are wide ranging and describe the feelings evoked by the settlement, the perceived quality of the generated artifacts, and the ways in which the generated content does or does not fulfill their expectations. The comments also serve as a connection between the PCG artifacts and the judges' real-world understanding of them. However, in order to provide context for our evaluation of these comments, we must first describe some theory of PCG.

B. Procedural Content Generation
PCG is the creation of content through algorithmic means. In computer science, it usually refers to software, i.e., a generator, which produces content. A single output from a generator is also commonly referred to as an artifact. PCG techniques are used in particular in the video games industry, where generators are used to produce gameplay elements (levels, items, etc.), aesthetic elements (trees, buildings, characters, etc.), or even narrative elements (quests, lore, dialogs, etc.). It has been used in a wide range of games, with different genres and ambitions. It can be used in many different ways, from a production tool to a core element of the game itself. The procedural generation of game content is a complex process, and as such there exists a need to describe some of the issues inherent in the nature of PCG.

C. Possibility Space
The possibility space [6] is the set of all artifacts we can conceive of for the type of artifact we intend to generate. If we take the GDMC as an example, the possibility space of an entry would be any combination of 256 × 256 × 256 Minecraft blocks, matching the dimensions of the space within which the competition occurs. At the same time, a generator might produce artifacts in only a portion of our possibility space, which we will refer to as our generative space [6]. The generative space is a space contained within our possibility space, and part of the work of creating a generator is to actually design its generative space. For example, a GDMC competitor may want to prune the possibility space by removing anything that is not a settlement. This process may also introduce a "style" to their generator by further constraining it to only contain medieval-looking villages or modern cities. It is also important to note that such a space is limited by technical elements. Translating one's vision into software is not a simple task, and therefore, the generative space can be impacted by the inability of the generator's designer to create certain aspects of the artifact, or even by bugs.
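To give a sense of scale, the size of this possibility space can be estimated with a short calculation. The sketch below assumes, purely for illustration, that each of the 256 × 256 × 256 positions independently holds one of a fixed number of block types; the number of block types is a free parameter here, not a figure from the competition.

```python
import math

SIDE = 256
POSITIONS = SIDE ** 3  # number of block positions in the volume


def possibility_space_digits(num_block_types):
    """Number of decimal digits of num_block_types ** POSITIONS.

    The count itself is far too large to print, so we only report
    how many digits it has, using logarithms.
    """
    if num_block_types < 2:
        raise ValueError("need at least two block types")
    return math.floor(POSITIONS * math.log10(num_block_types)) + 1


# Even with only two block types (e.g., air and stone), the number of
# possible arrangements has millions of decimal digits.
digits_two_types = possibility_space_digits(2)
```

Under this assumption, exhaustive search is out of the question, which is why the design of the much smaller generative space matters so much.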
A common concern in PCG is the repetitiveness of the artifacts, which leads to less interest in and engagement with the generated content. Cardona-Rivera [7] defined this as the kaleidoscope effect, which occurs as the player begins to visualize the generative space of a generator and its boundaries. Once the generative space is fully identified, players might guess the nature of the next artifacts, thereby lowering the amount of unexpected content they experience. By extension, this also lowers the perceived novelty of the generated content.
Although humans might be able to learn and predict the space of a generator, it is technically complex to represent this space. Consequently, a projection of this generative space to a human-selected dimension is used for both visualization and analysis, which is referred to as expressive range analysis (ERA) [8]. The analysis of the expressive range of a generator can be useful in order to understand its behavior and how the artifacts are spread among the dimensions. Therefore, the usefulness of an expressive range depends mostly on the relevance of the dimensions by which it is defined. Usually, the dimensions used are automatically computed metrics applied to the whole artifact. They do not necessarily need to capture something associated with quality, but there is often an underlying assumption that higher values in certain dimensions are preferred. More importantly, the metrics should capture meaningful differences, so artifacts lying in different areas of the expressive range should appear different to the players. Based on this idea of similarity and difference among generated artifacts, the concepts of perceptual differentiation and perceptual uniqueness were introduced by Compton [9]. Perceptual differentiation is the feeling that an artifact is different in some way from the previous one, while perceptual uniqueness occurs when a single artifact is distinguishable and has its own character. Those two definitions rely on the individual perception and feeling of a player and are therefore hard to capture through computational means. Although a generator can create a large number of technically different and unique artifacts, the difference between generated items might not be relevant for or even perceptible by a human. The PCG research community aims to develop tools to automatically compute these distinctions. The most established one is still the ERA [8], which is used in numerous scientific publications as an illustration of a generator's capabilities. Tools such as Danesh [10] even aim to make this approach accessible beyond the scientific community. ERA also has some limitations, and there have been different approaches to extend it [11]. Although ERA is not the only tool for visualization and analysis, many other approaches also rely on projections to a user-chosen dimension. Perceptual differentiation remains difficult to evaluate in a consistent and generic way, and the automated evaluation of perceptual uniqueness is still an open challenge.
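The projection step behind an ERA can be sketched as follows: sample many artifacts, compute a metric per chosen dimension, and count how many artifacts fall in each cell of the resulting grid. Everything in this sketch is an assumption made for illustration: the toy generator, the two metrics (density and mirror symmetry), and the bin count are not the metrics used in [8] or in any GDMC entry.

```python
import random
from collections import Counter


def density(grid):
    """Fraction of filled cells in a 0/1 grid."""
    cells = [cell for row in grid for cell in row]
    return sum(cells) / len(cells)


def mirror_symmetry(grid):
    """Fraction of cells that equal their left-right mirrored cell."""
    matches = sum(cell == row[-1 - i] for row in grid for i, cell in enumerate(row))
    total = sum(len(row) for row in grid)
    return matches / total


def toy_generator(rng, size=8):
    """Hypothetical generator: a random 0/1 grid with a random fill rate."""
    fill = rng.random()
    return [[1 if rng.random() < fill else 0 for _ in range(size)]
            for _ in range(size)]


def expressive_range(artifacts, metric_x, metric_y, bins=10):
    """2-D histogram of artifacts over two metrics with values in [0, 1]."""
    histogram = Counter()
    for artifact in artifacts:
        bin_x = min(int(metric_x(artifact) * bins), bins - 1)
        bin_y = min(int(metric_y(artifact) * bins), bins - 1)
        histogram[(bin_x, bin_y)] += 1
    return histogram


rng = random.Random(0)  # fixed seed so the sample is reproducible
artifacts = [toy_generator(rng) for _ in range(500)]
era = expressive_range(artifacts, density, mirror_symmetry)
```

Cells with high counts show where the generator concentrates its output, and empty cells show regions it never reaches. As the text notes, such a histogram is only as informative as the chosen metrics: two artifacts in distant cells may still look alike to a player.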
Furthermore, a truly qualitative automatic evaluation of PCG is still challenging [12]. Several experiments have already been conducted with the intent to critically examine some of the commonly used metrics and their relevance [13], [14]. While part of the toolset is pertinent for specific evaluation scenarios, it is clear that we are missing player-driven evaluation methods that could generalize. We believe that new metrics, built with the goal of capturing human perception, could be helpful for tackling the issues listed previously.

D. Player Interpretation
The following sections present a theoretical framework that aims to describe the nuances of player-driven evaluation of PCG artifacts. It should be noted that many of these theories refer to the "user." In this article, we replace the word "user" with "player," which for our purposes is functionally identical but thematically more correct. Although the term experiencer may technically be more apt, the use of "player" more closely matches our intended interpretation.
At its core, the player-driven evaluation of PCG artifacts is centered around the player's understanding of what they are being shown by the game. As described by Warpefelt [15], the player observes a collection of details presented by the game (what Warpefelt calls indicators), interprets them, and forms expectations of the game. However, the player's interpretation and forming of expectations is a complex process, which is influenced by how the player is situated (when and where they are playing), their previous experiences with this or other games, and what expectations have been set before the player begins their play session (for example, by advertising or reviews). These factors together work as a lens through which the player forms an interpretation of the gaming experience. From this interpretation arises the player's experience.
Within the context of this article, we are focusing on two of the main parts of the underlying theories that describe the player's situation: how the environmental storytelling [16] of the game sets expectations and provides indicators [15], and how the player's previous experience can be described in terms of the human-computer interaction concept of character [17].
1) Character: The concept of character was first introduced by Janlert and Stolterman [17] in 1997. Through the evaluation of character, we are able to discern the nature of objects that we observe and to understand how these objects are different from one another. More formally, character encapsulates how we can understand what objects are and discern the difference between subclasses of objects, say a sports car versus an SUV. Character is composed of a number of characteristics that help the viewer of an artifact understand its nature. Through the evaluation of character, we are able to understand how we can use an object (which affordances it provides) and what we can expect from an object.
2) Mechanisms of Player Interpretation: Warpefelt [15] has incorporated the concept of characteristics into the indicator theory, where each indicator not only feeds into the usability aspects (signifying of affordances) but also the narrative aspects (indices for storytelling [16], [18]) and the setting of expectations (through characteristics). Together, these factors allow us to deconstruct how the player interprets the game using a bottom-up approach focused on the examination of how the interpretation of details feeds into the player's holistic understanding of the game. This approach is especially useful for PCG artifacts since their creation is by nature detail-oriented.
3) Player Interpretation and the Game Environment: Players also interpret the environment in order to recreate the context and the narrative environment in which they evolve. The ruins of a castle and gigantic skyscrapers lead to different conclusions regarding the time period, the region, or even the series of events that occurred in that place. This storytelling strategy is commonly used in level design and has been defined as indexical storytelling [16]. Behind this term is the concept of stories told through traces or indices as defined by Peirce [18], which the player connects in order to recreate the context of a place or the past events. However, the player has to be able to interpret the indices and their connections [19], referencing their repertoire of characters [17]. Furthermore, the player themselves can contribute to the environmental storytelling, as their own actions can leave traces. Thus, the game's story is not only the one intended by the designers, but also one created and influenced by the player. In essence, this is the mechanic by which the player iteratively refines and evolves the materials used in the construction of their alterbiography [20].
Nitsche [19] shows how architectural features and their characteristics (in the parlance of Janlert and Stolterman [17]) have an evocative effect that impacts human interactions. These evocations are, however, defined through several parameters, such as the distance from the features or the surrounding environment. More importantly, they are defined by the past experience and the culture of the player. Thus, one of the key points of examination for the player's interpretation of a game would be to examine the composition of the game world.

III. METHODS AND MATERIALS
In this study, we used an abductive thematic analysis methodology (combining both inductive and deductive coding) to analyze free-text comments provided by judges for the 2018, 2019, and 2020 editions of GDMC. We initially performed inductive coding on the judge comments for 2018 and then used it as a base for the deductive coding of the comments in the 2019 and 2020 instances of the GDMC. For each year, the researchers reached a consensus around the codes, and this consensus-decided collection of codes was then carried over to the next year. Once we had coded all three years, we elicited themes from the compounded list of codes. All authors played an equal role in coding and theme elicitation.
This study also provides two unique perspectives of evaluation, in contrast to looking at qualitative evaluations of fully human-designed game content in general. First, the judges all knew that the content was algorithmically generated, and it is interesting to see how and if this influences their judgments.
Second, the diversity of different generators might help to illuminate quality criteria that only become apparent by contrasting the approaches of different generators, both within a given year and over the course of the three years. Together, these two methodological and data attributes should help provide a greater understanding of the problem described by this article.

A. Respondents and Data
In the following theme descriptions, the judging group is referred to as "the respondents." In total across the three editions of the GDMC, we had 17 unique respondents evaluating 21 different settlements over three years. Split across the different years, 2018 had 9 judges evaluating 4 settlements, 2019 had 11 judges evaluating 6 settlements, and 2020 had 9 judges evaluating 11 settlements.
It should be noted that we are working with the published, textual feedback of the competition, which did not include any demographic data, nor was any collected by the organizers. Furthermore, the small size of the respondent group makes between-group analysis based on demographic data largely meaningless. Based on published comments, it seems that experience with Minecraft itself was varied among the group, including both experienced players and people who had not played Minecraft before.

IV. THEMES
The data analysis resulted in a total of 15 themes, composed of 8 main-level themes and 7 subthemes. All themes, complete with overview descriptions, can be seen in Table I. The detailed specifications can be found in the following sections.

A. Navigation
This theme describes the ease of navigation within the settlement. Overall, the respondents quickly identified visual indications for navigation, for example, signposts. Visual landmarks, such as large buildings, also played a key role in how visitors oriented themselves within the settlement and game world, echoing theories introduced by Nitsche [19] and Fernández-Vara [16].
As can be expected, roads also played a key role in how the respondents reasoned about navigation within the settlement. However, the design of the roads within the settlement needs to be of sufficient quality if they are to contribute to the navigational ability of players. Roads need to match the terrain, connect to other roads, and generally be perceived as sensible. Overall, respondents transfer a large degree of real-world expectations onto roads found in the settlement. This indicates some degree of transfer between the game and the real world in terms of repertoire of character [17].
Finally, the difficulty of traversing the landscape affected to what degree the respondents considered the landscape to be navigable. Broken terrain, such as cliffs, and traversal aids, such as bridges, both contributed to the perceived level of difficulty of navigation. This is indicative of the fact that indices [15], [19] inform the evaluation of the game world in terms of affordances, which is in accordance with the theories described by McGrenere and Ho [21] and Warpefelt [15], [22].

B. Quality of Life
When evaluating a settlement, the respondents included notions of livability in their evaluation. This included the navigability of the settlement (see the Navigation theme in Section IV-A) but also to what extent the settlement provided food or services to the inhabitants. Furthermore, it considers the quality of the buildings and the level of safety afforded by protective measures, such as walls and lighting against monsters and natural hazards (see the Gameplay Elements theme in Section IV-H for a further discussion on this). As with the Navigation theme (see Section IV-A), the findings here echo the theory of repertoire of character described by Janlert and Stolterman [17].

C. Environmental Narrative
The theme Environmental Narrative encompasses a large number of codes, all describing various aspects of how narrative is conveyed through the game environment in the Minecraft world. In the data, respondents discussed how environmental narrative arises from many sources, including the architecture and layout of cities, the various functions provided by buildings (for example, farms), and how distinct building types and landmarks all work together to provide a narrative to the player, without any traditional storytelling, allowing them to suspend disbelief. This is essentially a manifestation of indexical storytelling [16], where the game world provides indices for the player to latch onto to build their alterbiography [20], i.e., the story of how the specific player experienced the play session. This process has previously been described by Warpefelt [15] and involves how narrative storytelling helps convey the affordances [21] of the game world to the player.
The respondents also reacted particularly strongly to incomplete environmental narratives, where the game world seemingly sets up the fundamentals for some kind of narrative but by and large fails to follow through. Essentially, there is an interesting hook but no resolution. This can be particularly problematic when the generator uses a single hook to make the space seem alive, as described by respondent 2020:1 in terms of a 2020 submission that made extensive use of a monorail running through the settlement:
"The whole approach of using the monorail of course fell down on the isolated island where there wasnt (sic) opportunity to build the actual settlement." -2020:1
When the core element of the environmental narrative takes up so much space that the settlement itself cannot be instantiated in the game world, this is obviously problematic in terms of creating a believable settlement, and not just a monorail stop. As such, that particular generator's overreliance on a single feature made it infeasible for constrained spaces. It should be noted that this theme operates on the gestalt of the artifact interpreted holistically, rather than on individual details, in contrast to how Warpefelt [15] describes this phenomenon. However, this phenomenon is inextricably linked to the details created by the generator. As seen in the monorail example above, the holistic interpretation of the artifact is still critically dependent on the details.
1) Environmental Narrative Dissonance: The subtheme Environmental Narrative Dissonance covers the parts of the experience where the generated artifact portrays an environmental narrative that is incongruent with what is expected by the respondent. This causes a mismatch of expectation and delivered content, thus breaking immersion and disrupting the suspension of disbelief. Like its parent theme, this operates on the gestalt of the artifact but is inextricably linked to the generated details of the artifact. The problems associated with these are often strongly correlated with the kaleidoscope effect described by Cardona-Rivera [7] and the concept of expressive range described by Smith and Whitehead [8], i.e., that the player is starting to see the limits of what the generator can express, and that the content is starting to be perceived as repetitive and predictable.

2) Cultural/Trope Evocation: The subtheme Cultural/Trope Evocation is related to the instances when the generator uses cultural or trope shorthand to create connections to existing bundles of expectations in the player. Examples of this are generators using torii gates to invoke a "Japanese" feeling, or the use of natural materials and medieval architecture to invoke a fantasy feeling.
3) Individual Exterior Quality: The subtheme Individual Exterior Quality describes the occurrences where the judges drew conclusions based on the exterior of the buildings. This phenomenon is related to several different types of details found on the exterior of a building, including the general aesthetic of the building, the evocation of certain tropes (as described by the subtheme Cultural/Trope Evocation in Section IV-C2), the materials from which the building is constructed, and the accessibility of the building.
Furthermore, the exterior of buildings needs to fulfill Compton's principle of perceptual differentiation [9]. Buildings need to be both different and alike: if they are too similar, they end up blending together, and if they are too different, the cohesive look of the settlement is lost. Thus, there exists a Goldilocks range where the perceptual differentiation of buildings is just right. We have not been able to elicit the exact parameters of this zone in our study.
It should also be noted that individual elements in the settlement's makeup can have a strong impact on the overall perception of the settlement. "Signature buildings," which have a unique look and are strongly evocative, tend to leave better impressions and more long-lasting memories. This is coherent with Compton's principle of perceptual uniqueness and is exemplified in many settlements from 2019, which prominently featured signature buildings such as windmills and a monorail.
However, the immersive effect of buildings is also fragile. Single discordant details can have a strong negative impact on the interpretation of, and favorable disposition toward, a settlement. If buildings are placed in ways that are seemingly unrealistic (for example, half hanging off a cliff with a giant foundation), this can have a deleterious effect on the acceptance of the settlement. To a large extent, this evaluation seems to be done based on the "realism" of the building, i.e., whether it is possible to build such a structure in the real world. This is indicative of the judges transferring and applying their repertoire of character [17] as a set of expectations on the settlement.
Overall, the external evaluation of the settlement seems to be largely holistic. Settlements that were favorably rated often provided a stock of buildings that were varied enough to be perceptually different, but without being so varied that the settlement seemed haphazardly constructed. Furthermore, they had a few perceptually unique signature buildings that provided anchoring points within the settlements. The character of buildings was also evaluated within the context of the other buildings present in the settlement.

4) Individual Interior Quality: The Individual Interior Quality subtheme describes the occurrences where judges interpreted a building's interior. It should be noted that very few buildings had developed interiors in the first years of the competition, and those comments mostly dealt with a lack of furniture inside the building. Once furnished interiors started to become prevalent in later years, judges started commenting on the quality of the furnishing.
A core factor in the evaluation was also incoherence in building interiors. Rooms that were improperly scaled, or floors that were unreachable, were rated particularly poorly by judges. This reinforces the idea that judges bring with them their repertoire of character into the game, as mentioned in the Individual Exterior Quality subtheme (see Section IV-C3).
One item of particular interest is that the interiors of buildings seem to have been evaluated more or less separately from the exteriors of buildings. Indoor environments were judged one by one and were seemingly not impacted by the overall, holistic evaluation of the settlement. This seems to suggest that there is a second layer of analysis done for the interior.

D. Settlement Composition
This theme covers how the settlement is fitted to the terrain and adapted to local features, laid out, and sized. Overall, what we are evaluating here are the functional components of the evoked character [17] of the settlement.
Terrain fitting is evaluated based on how well the settlement follows the features of the terrain, how it handles terrain features like rivers or small islands, and how the terrain has been altered in order to make the settlement fit in the desired space. The key success factor is finding a balance between a real-looking settlement and one where the world has not been entirely bulldozed to make space for the buildings. Success in terms of settlement composition is also related to how well the materials used to build the settlement match the materials found in the nearby areas and biomes, and how well these materials are integrated into the building designs. An example of good matching would be log houses in a heavily wooded colder area, or adobe-style housing in the desert. Although these are not necessarily the only types of building that fit with such terrain, they do evoke a certain related character that we as players would expect to see in such a climate. By extension, this also connects to the Cultural/Trope Evocation theme (see Section IV-C2). This is also echoed in the evaluations by the respondents, here from Respondent 2019:1:

[B]iome-variants of varied buildings are clustered around habitable areas, with a particularly thorough road system and farms with varied crops (sic) -Respondent 2019:1

As we can see in this quote, there is also a notion that the layout and positioning of the settlement are evaluated based on "soft" ideas of reasonableness, i.e., a notion of what looks like a "real" place. Ideally, buildings should be placed in a way that makes sense to the player as they bring their real-world-based understanding into the evaluation process, essentially fulfilling the evoked character [17] of the settlement. As explained by Respondent 2019:2:

Great design work here - the watchtowers are fantastic, as is their placement [...]. Lovely winding paths and farm arrangements, great sense of place. -Respondent 2019:2

Respondent 2019:2 picks out a core feature in one of the judged maps for 2019, which contained central markers in the form of watchtowers. However, they also pick up on the winding paths and farms, both features that are classically associated with pastoral landscapes, thus providing a strong sense of place. This highly interconnected presentation of tropes provides the respondent with a strong sense that this is a "real" place.
However, settlement composition can also have a negative effect on the player experience. Cardinal sins in this area include overlapping buildings or placing them in a way that is physically impossible in the real world (such as hanging off a ledge with no support structure). Sizing is also a concern among respondents, and it is critical that the buildings be sized in a way that is perceived as plausible; for example, a straw hut several hundred meters on a side is seen as less believable. Finally, the pathing of the settlement needs to be appropriate for the presented thematics, again connecting to the Cultural/Trope Evocation theme.
1) Interconnectedness: The Interconnectedness theme operates more on the mental model of the settlement than the settlement itself. It exists adjacent to the navigational affordances of the pathing in the generated settlement, but instead represents the player's mental model of how the pathing fits together, and how it evokes and/or reinforces the character of the settlement. In addition, the theme covers the understanding of how one settlement may be composed of multiple smaller components, for example, a city split into districts or a collection of villages.
A negative impact of poor interconnectedness can be exemplified by this quote from Respondent 2019:3:

The houses are spread out throughout the world, with no easy access paths between them, and some houses are too close together and hard to access, making it not very functional -Respondent 2019:3

Here, the respondent expresses that the lack of paths, and by extension connections, between elements of the settlement makes it difficult to navigate the space. Conversely, a level of interconnectedness between parts of the settlement can be beneficial, as seen in the following quotes:

Nice architecture and design choices - I also think the pathing was quite good. I like how understated some aspects of it were - the houses were small, nestled in between trees, with nice dirt paths. Really felt like a small forest village -Respondent 2020:2

The paths around the housing is - for me - one of the strongest aspects of this, given it helps me understand the larger structure of the settlement. Plus the lamps that help signpost the path -Respondent 2019:4

As we can see in the quotes from these three respondents, the pathing of settlements is a core part of how we understand how they are connected. Note, however, that the pathing is not the same as interconnectedness. Instead, it acts as a kind of facilitator for the formation of a mental map of the settlement.

E. Real-World Evocation
The real-world evocation theme is related to the ways in which the settlement evokes the real world. This is similar to but distinct from the cultural evocation mentioned in the theme Cultural/Trope Evocation and instead focuses on the details of the generator's output rather than the gestalt of the settlement as a whole.
Respondents showed interest not only in a real-world-like aesthetic, but also in the feeling it produces. Any clues that give the impression that actual people are or could be living in the settlements are perceived as positive additions. Examples include the presence of furniture within buildings, sources of food, places of work, and other evidence of human activity. Examples also included anything suggesting human and legal organization, such as boundaries around houses or farms, different neighborhoods, etc. These evocations can also be linked to cultural elements, both in the architecture and the settlement layout. However, while some elements contribute to the "real-world" feeling, others degrade it, in particular any element that is either unbelievable or unrealistic. This could be something impossible in our reality, such as a floating building, or a poor approximation of real-world characteristics, such as overly flattened terrain.
We can deduce that real-world evocation in the context of Minecraft settlements helps the player make sense of the settlement and understand how to interact with its surroundings.

F. Effect on the Player
In some cases, generated settlements presented especially interesting features that stand out from the rest of the generated content, for example, strong architectural attractions such as windmills or towers (see Fig. 2). In these cases, the respondents have described these features as being awe-striking and providing a visceral sense of wonder and a call to adventure. As exemplified by a response from Respondent 2019:5:

Simply put, these generated features act as a focal point for the player's attention, and provide an impactful gaming experience. In some cases, the features also act as a clarion call to adventure within the generated world. The respondents (who were judging the generated artifacts for a competition) sometimes described generated features as leading them off from their core mission of judging and making them explore the world a bit more than they otherwise would. As explained by Respondent 2019:6:

Loved the stone structures and mossy rock - almost like ruins in some places? They were so interesting I actually dug them up to see if there was anything hidden! -Respondent 2019:6

As described by the respondent, the ruinlike stone structures acted as a call to adventure, where they felt a need to explore more. These structures act as what Fernández-Vara calls wieners: features that draw the attention of the player and act as an attractant to the area in which they are located. As described by Fernández-Vara [16], this is a concept imported from theme park design.

G. Various Varieties
Several of the judges made comments related to the overall concept of variety. This is noteworthy since variety was not one of the judging criteria. In PCG, variety is often seen as synonymous with the expressive range [8] of the generator, i.e., the range of different artifacts the generator can produce. The comments made by the judges related to the GDMC context can help us decompose this concept further, thus aiding in our understanding of how variety impacts PCG artifacts in this specific case.
In general, variety was mentioned as something positive and desirable. A large portion of the comments related to variety or the lack thereof. Some comments were ambiguous and could refer to the differences between the overall artifacts (the three different settlements made for three different maps), but several comments from judges did speak specifically about the variety between houses or elements of the same settlement.
This illustrates that Minecraft settlements are composite artifacts: larger artifacts composed of several similar subunits. In technical terms, this means that there is an easy separation between hierarchies, and a generator could generate an overall composite structure while another generator then fills in the subunits. Alternatively, one of those two levels might be human made, such as the houses in this case, which are often templates that are then arranged by a generator. This idea of a composite artifact is not unique to the GDMC challenge. Other creative artifacts, such as books, pieces of music, or maps, can be seen and generated as composites of small units and face similar challenges. Again, the variety between the subunits is remarked on positively:

Very nice looking houses, various sizes and heights, some actual stuff in the interior too! And different colour schemes. -Respondent 2019:7

There is a competing quality criterion though, codified as cohesion. The subunits, usually houses in our case, should be similar in a way that makes them believable and belong to the same overall settlement. This is in contrast to the desire to have them be different. A good generator here seems to be able to strike a balance between those two drives.
1) Generator Style: Addressing this conflict between cohesion and variety raises the question of whether there are certain elements that should be varied while others should be kept constant. The concept of generator style is discussed extensively by respondents, and there are several codes that form a subtheme of variety around this topic.
This subtheme encapsulates the possible dimensions of the buildings into the following three categories, related to how they value variety in a specific dimension: 1) those that should always be varied between subunits to make for believable variety; 2) those that correspond to a positive quality, and should be set toward a target value (or maximized); 3) those that relate to a stylistic choice, i.e., should be chosen once, and then adhered to for all buildings in the settlement, to create a sense of cohesion. For individual respondents, several codes could be sorted into these categories. However, it was difficult to sort the various codes for stylistic dimensions in a way that is consistent across responses. What some might see as a quality criterion is seen as a stylistic choice by others. As such, we have introduced this theme to encapsulate the multifaceted nature of generator style and how it impacts a complex generated artifact like a Minecraft settlement.
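To make the three categories concrete, the following is a minimal sketch of how a generator might treat each kind of dimension differently. It is not taken from any GDMC entry; the dimension names (`footprint`, `height`, `interior_furnishing`, `roof_style`, `palette`) and their assignment to categories are hypothetical, and, as noted above, respondents did not agree on which dimension belongs in which category.

```python
import random

# All dimension names and their category assignments below are hypothetical
# illustrations; the study does not prescribe them.
VARIED = {  # category 1: re-sampled for every building
    "footprint": [(5, 5), (7, 5), (9, 7)],
    "height": [4, 5, 6, 8],
}
TARGETED = {  # category 2: a quality that is set toward a target value
    "interior_furnishing": 1.0,
}
STYLISTIC = {  # category 3: chosen once, then kept for the whole settlement
    "roof_style": ["gabled", "flat"],
    "palette": ["oak", "stone", "sandstone"],
}


def generate_settlement(n_buildings, seed=None):
    """Return a list of building descriptions (plain dictionaries)."""
    rng = random.Random(seed)
    # Stylistic dimensions are sampled once to create cohesion.
    style = {name: rng.choice(options) for name, options in STYLISTIC.items()}
    buildings = []
    for _ in range(n_buildings):
        building = dict(style)       # shared style
        building.update(TARGETED)    # quality dimensions pinned to a target
        for name, options in VARIED.items():
            building[name] = rng.choice(options)  # per-building variety
        buildings.append(building)
    return buildings
```

In this sketch, moving a dimension from one dictionary to another changes whether it contributes to variety or to cohesion, which is the design decision the respondents' comments disagree on.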
2) Situated Adaptivity: Another subtheme related to variety is adaptivity, something that was specifically promoted as a criterion by the judging guide. It is about the ability of the generator to react appropriately to different input maps and should be evident from the fit of the final settlement artifact into the provided map. This is an opportunity to introduce variety by building on the variety of the provided map prompts, but it is more technically challenging, as it requires more than just randomness: it requires directed adaptation.
As it is a criterion for the challenge, its presence is generally commented on in a positive way, and the codes related to this subtheme are closely related to the illustrative comments in the judging guide. The two biggest focuses are the adaptation to biome materials, and how well buildings are placed in the landscape, i.e., the adaptation to the height map, as described in the following two quotes:

The individual biomes are nice and it helps diversify the generated output -Respondent 2020:3
[B]iome-variants of varied buildings are clustered around habitable areas, with a particularly thorough road system and farms with varied crop (sic) -Respondent 2019:1

H. Gameplay Elements
Minecraft is not just an editor to generate maps and settlements with 3-D blocks, but also a game, and as such, artifacts within Minecraft can be, and are, seen as gameplay elements. We see this reflected in the comments of the judges, with several of them relating to the theme of gameplay elements.
Common and very Minecraft-specific comments talk about the presence of monsters and how the settlements fail to keep them from spawning, or how the settlements do not offer protection from monsters. This effect might have arisen from one of the judging criteria, a subject that we will discuss at the conclusion of this article.
There are also several comments talking about the navigability, or lack thereof, of the settlements. This is less Minecraft-specific but still evaluates the artifact as more than just an observable creative piece, and more as something to be interacted with.
Several respondents also talk about food, but it is unclear here if this is commented on from the perspective of a player trying to obtain food, or as a comment on the environmental narrative.
Another code that is related to this theme is light, which plays a surprising role in many different evaluation elements: it is relevant for gameplay, such as monster spawn prevention, but it also allows the player to see and contributes to the mood and aesthetic impression of the settlement.
However, we saw a large variance in how often gameplay-related codes were expressed by different respondents. We suspect that this may arise from a difference in how experienced the judges were with Minecraft, which may have informed expectations.
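The gameplay role of light mentioned in this section can be illustrated with a rough sketch of how a generator might estimate which parts of a settlement are left dark enough for monsters to spawn. This is a simplification under stated assumptions, not a description of the competition's evaluation: it considers only torches on a flat 2-D grid, ignores sky light and blocks that obstruct light, and treats the spawn threshold `spawn_max_light` as a parameter because the exact rule differs between Minecraft versions. The function names are hypothetical.

```python
TORCH_LIGHT = 14  # block-light level emitted by a standard Minecraft torch


def block_light(width, depth, torches):
    """Approximate block light on a flat grid.

    Light falls off by one level per block of taxicab distance from a torch.
    Obstructing blocks and sky light are ignored (simplifying assumption).
    """
    grid = [[0] * width for _ in range(depth)]
    for z in range(depth):
        for x in range(width):
            for tx, tz in torches:
                level = TORCH_LIGHT - (abs(x - tx) + abs(z - tz))
                if level > grid[z][x]:
                    grid[z][x] = level
    return grid


def unsafe_fraction(grid, spawn_max_light=0):
    """Fraction of cells whose light level is at or below the spawn threshold.

    spawn_max_light is an assumed, version-dependent parameter.
    """
    cells = [level for row in grid for level in row]
    return sum(level <= spawn_max_light for level in cells) / len(cells)
```

For example, `unsafe_fraction(block_light(5, 5, [(2, 2)]))` evaluates a small plaza lit by a single central torch; a generator could use such an estimate to decide where additional light sources are needed.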

V. CONCLUSIONS, DISCUSSION, AND FUTURE WORK
The core finding of this study is that the unifying evaluative theme for all judges seems to be the Suspension of Disbelief [23]. More precisely, the judgment of an individual human seems in large part determined by how the judges' expectations are fulfilled by the generator. Unfortunately, for any attempt to produce more computational metrics, those expectations are influenced by a large range of hard-to-quantify factors. For a start, there are cultural and biographical factors. Expectations also seem to shift from year to year. Where in the first year biome-based material replacement was seen as innovative, it was criticized as boring in year four. There was also some indication that judges took into account that this was made by an algorithm, with some reporting that the output was great for PCG, while others harshly judged entries against human standards. Comparison to real-world places was sometimes seen as positive, but also led to the identification of glaring flaws that yet other judges identified as clever conceits to game logic and gaming conventions. Overall, there was a lot of language indicating that judges had been both positively surprised and negatively disappointed, with some of the same elements being inconsistently labeled as both good and bad. The comments demonstrated that the players have a lot of different experiences, and hence expectations, that we would need to consider if we wanted to model their judgment. Furthermore, we also saw that they paid attention to, or at least reported on, very different aspects of the artifact. Taken together, the plethora of different factors makes the case for a very multifaceted analysis of generative artifacts. In turn, this makes it even harder to conceive of a computational metric that evaluates a complex PCG artifact, such as a Minecraft settlement, holistically in a similar way to a human observer.
Human judges seem to address multifaceted analysis problems in part by conceiving of Minecraft settlements as composite artifacts and judging both their parts, and their composition, on multiple levels, as evidenced by the distinction between interior and exterior environments, or the theme of interconnectedness, which mostly seems to operate on conceptual rather than concrete parts of the settlement. The nature of a Minecraft settlement as a composite artifact, thus, had some interesting consequences. Primarily, it allowed for some part of the artifact to set expectations for the rest. Within this scope, there then seems to be some tradeoff between the positive novelty of interesting variation and an expected consistency in style and tone, echoing Compton's theory of perceptual uniqueness and differentiation [9] as well as Janlert and Stolterman's concept of character [17]. It is unclear what exactly the sweet spot between novelty and consistency is here, though. Judges' comments indicate that there are some dimensions, such as style, where variation is less desired than in others. Given that many PCG artifacts such as longer text, music, and game levels can be seen as composites, we posit that one way toward the evaluation of complex, composite artifacts would be to automatically determine both the good and bad dimensions for expected variety and the right tradeoff between novelty and consistency. Warpefelt's indicator theory [15] is useful as a qualitative descriptor for these concepts but would need further development to be useful as a quantitative measure and, even so, will likely be highly situational. Compositionality also raises the question of how we deal with the order of experiences when we have a human judge evaluate an artifact that cannot be holistically perceived in one moment, yet offers no predefined perception order (unlike a book). One interesting methodological option here would be to actually measure order effects by selectively exposing participants or players to different parts of an artifact and taking existing order effects as a positive sign of the existence of expectation setting.
The inherent nonlinear nature of in-game narrative, especially when it comes to environmental narrative, makes the player experience of generated artifacts difficult to analyze using traditional tools for narrative inquiry, and we find that there is a need for a theory allowing for more quantifiable analysis of these nonlinear narratives. Some initial tools exist, for example, Warpefelt's indicator theory [15] or Fernández-Vara's indices [16]. However, as mentioned earlier, these tools will need specific adaptation to generative settlements in Minecraft.
Finally, we were intrigued by the judges reporting on the emotional effect they were experiencing based on the content presented. The idea that brilliant art is capable of moving us, both emotionally and to action, is not new. However, we now have empirical evidence that even the slightly flawed art produced by generators can cause these effects, which suggests that the threshold for when this effect is induced may be lower, and the induction of the effect may not be reserved simply for brilliant art.
Furthermore, the direct report by the players that a specific component of the generated settlement caused them to reconsider their goals for a given interaction shows the importance of actually interacting and experiencing the artifact, rather than just observing it. It also shows that singular stand-out features, again realizing Compton's theory, can have a large impact on the player's experience. This might also suggest a new way of evaluating PCG artifacts. We could, for example, measure how much a player will deviate from a chosen path or moderate their behavior when exposed to it. We posit that it could be a sign of quality for PCG artifacts if they could affect both a player's experience and their chosen actions only by using strong indicators or indices within the game environment, or simply by virtue of subverting or fulfilling the player's expectations.
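One possible way to operationalize the suggested deviation measure is sketched below. This is an illustrative assumption rather than a metric used in this study: the planned route is represented as a polyline of 2-D waypoints, the player's logged positions as a list of 2-D points, and deviation as the mean distance from each logged position to the nearest point on the planned route. The function names and the data representation are hypothetical.

```python
import math


def _distance_to_segment(point, start, end):
    """Euclidean distance from a point to the line segment start-end."""
    (px, pz), (ax, az), (bx, bz) = point, start, end
    dx, dz = bx - ax, bz - az
    length_sq = dx * dx + dz * dz
    if length_sq == 0:  # degenerate segment: both waypoints coincide
        return math.hypot(px - ax, pz - az)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (pz - az) * dz) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), pz - (az + t * dz))


def mean_deviation(planned, observed):
    """Mean distance of the observed positions from the planned polyline."""
    if len(planned) < 2 or not observed:
        raise ValueError("need at least two waypoints and one observation")
    segments = list(zip(planned, planned[1:]))
    total = 0.0
    for position in observed:
        total += min(_distance_to_segment(position, a, b) for a, b in segments)
    return total / len(observed)
```

Comparing this value between players who were and were not exposed to a given feature would be one way to quantify whether that feature draws players off their route, although a single number of this kind cannot distinguish curiosity from, for instance, navigational difficulty.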
In summary, analyzing the judges' comments from the GDMC has provided interesting insights into how players interpret generative artifacts. Although there are several theories that help us deconstruct and understand these experiences, there still exists a strong need for a more in-depth understanding of how we evaluate generative game artifacts, especially complex composite artifacts like settlements.

Fig. 1. Panoramic view of a Minecraft settlement generated for the competition.
of the settlement. As explained by Respondent 2019:2: Great design work here - the watchtowers are fantastic, as is their placement [...]. Lovely winding paths and farm arrangements, great sense of place. -Respondent 2019:2

Fig. 2. Example of strong architectural attractions in a generated town.
that they scored all settlements
Manuscript received 13 April 2023; revised 9 July 2023 and 14 August 2023; accepted 16 October 2023. Date of publication 7 November 2023; date of current version 17 September 2024. Recommended by Associate Editor A. Liapis. (Corresponding author: Henrik Warpefelt.)