An informational study of the evolution of codes and of emerging concepts in populations of agents

We consider the problem of the evolution of a code within a structured population of agents. The agents try to maximise their information about their environment by acquiring information from the outputs of other agents in the population. A naive use of information-theoretic methods would assume that every agent knows how to "interpret" the information offered by other agents. However, this assumes that one "knows" which other agents one observes, and thus which code they use. In our model, however, we wish to preclude that: it is not clear which other agents an agent is observing, and the resulting usable information is therefore influenced by the universality of the code used and by which agents an agent is "listening" to. We further investigate whether an agent who does not directly perceive the environment can distinguish states by observing other agents' outputs. For this purpose, we consider a population of different types of agents "talking" about different concepts, and try to extract new ones by considering their outputs only.


Introduction
If we consider organisms capable of processing information, then we can argue that they must be able to internally assign meaning to the symbols they perceive in a code-based manner [10].
For instance, bacteria perceive chemical molecules in their environment and interpret them in order to better estimate environmental conditions and (stochastically) decide their phenotype [24,1,23,27]. Plants detect airborne signals released by other plants and are able to interpret them as attacks by pathogens or herbivores [13,29]. Therefore, a correspondence between environmental conditions and chemical molecules must be established. It is in this way that Barbieri characterises codes, and he proposes three fundamental characteristics for them: they connect two independent worlds; they add meaning to information; and they are community rules [2].
Codes connect two independent worlds by establishing a correspondence or mapping between them. These worlds are independent and thus there are no material constraints for establishing arbitrary mappings. The meaning of information comes exclusively from the mapping: symbols by themselves are meaningless. Finally, the third property requires that the correspondence between the two worlds constitutes an integrated system.
For instance, human languages establish a correspondence between words and objects [2]; in bacteria it is between chemical molecules and environmental and social conditions [35,36].
Words (or chemical molecules) by themselves do not have any meaning, and each individual of a population can define, arbitrarily to some extent, its own set with its mapping. However, populations of individuals sharing the same code are ubiquitous in nature. How is it that codes come to be shared by many individuals when their constitution involves arbitrary choices for each individual? This is the question we investigate in the present paper.
For this work, we assume a simple scenario where organisms live in a fluctuating environment.
If they can perfectly predict the future environmental conditions, they can prepare themselves by adopting a proper phenotype and, therefore, survive. However, when uncertainty about the environment remains, organisms will follow a bet-hedging strategy [31,28], where they try to maximise their long-term growth rate by adopting the phenotype that matches the environment in proportions based on the information they have about it. For example, seeds of annual plants germinate stochastically in different periods in relation to the probability of rainfall, and their chances of survival are maximised when they match this probability [6].
The relation between information and long-term growth rate can be expressed elegantly in information-theoretic terms, where an increase in the environmental information of an organism is translated into an increase in its long-term growth rate [30,17,18,8,26]. Such models achieve the maximisation of the long-term growth rate by maximising an organism's information about the environment. If we assume this behaviour in organisms, then those obtaining additional environmental information from other individuals (other than that from their sensors, which we assume do not completely eliminate environmental uncertainty) will have an advantage over those that do not, since they will be able to better predict future conditions. However, for individuals to be able to communicate with each other, they must be able to translate symbols into environmental conditions, where the output of these symbols results from an individual's code. We consider the code of an individual as a stochastic mapping from its sensor states to a set of outputs.
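The link between information and growth rate described above can be illustrated with a minimal sketch. It assumes a deliberately simple bet-hedging model that is not taken from the paper: only offspring whose phenotype matches the environment survive, each leaving `k` descendants, and the cue has an error rate of 0.01. The names `growth_rate`, `k` and `p_cue_given_env` are hypothetical. Under these assumptions, the gain in long-term growth rate obtained from the cue equals the mutual information between cue and environment.

```python
import math

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def growth_rate(p_env, strategy, k=2.0):
    """Expected log2 growth per generation when a fraction strategy[e] of the
    offspring adopts phenotype e and only matching phenotypes survive,
    each leaving k descendants (k is a hypothetical payoff)."""
    return sum(p * math.log2(k * s) for p, s in zip(p_env, strategy) if p > 0)

# Two equiprobable environments and a noisy binary cue (assumed error rate).
p_env = [0.5, 0.5]
eps = 0.01
p_cue_given_env = [[1 - eps, eps], [eps, 1 - eps]]

# Without a cue, betting in proportion to p_env is optimal.
g_blind = growth_rate(p_env, p_env)

# With a cue y, bet according to the posterior p(env | y), averaged over y.
p_cue = [sum(p_env[e] * p_cue_given_env[e][y] for e in range(2)) for y in range(2)]
g_cue = 0.0
for y in range(2):
    posterior = [p_env[e] * p_cue_given_env[e][y] / p_cue[y] for e in range(2)]
    g_cue += p_cue[y] * growth_rate(posterior, posterior)

gain = g_cue - g_blind

# I(env; cue) = H(cue) - H(cue | env), computed independently of the gain.
mutual_info = entropy(p_cue) - sum(
    p_env[e] * entropy(p_cue_given_env[e]) for e in range(2)
)

print(f"gain in growth rate: {gain:.6f} bits, I(env; cue): {mutual_info:.6f} bits")
```

Any strategy that deviates from proportional betting, for example `[0.7, 0.3]` in the cue-free case, yields a lower growth rate in this model.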
For this study, we consider outputs (or messages) of individuals (or agents) as conventional signs. In semiotics, the science of all processes in which signs are originated, stored, communicated, and are effective [10], two types of signs are traditionally recognised: conventional signs and natural signs [7]. In conventional signs there is no physical constraint on the possible mappings; they are established by conventions. Although in physical systems there can be limitations to the possible mappings that can be implemented, in this work we assume complete freedom of choice.
On the other hand, in natural signs, there is always a physical link between the signifier and signified, such as smoke as a sign of fire, odours as signs of food, etc. [3].
In this work, we are not interested in the particular detailed mechanisms by which an agent implements its code, nor in how the agent decodes the outputs of other agents. Instead, we focus on the theoretical limits on the amount of environmental information an agent can possibly acquire under different scenarios of population structure and code distribution. The natural framework to analyse such quantities is information theory [30]. However, it does not take semantic aspects into account: it only deals with frequencies of symbols, not with what they symbolise.
Codes, on the other hand, add meaning to information, which makes the integration of sciences such as semiotics with information theory non-trivial [9,4]. In the following section, we present an information-theoretic model which incorporates the necessity of conventions by dropping from the model the usual implicit assumption of knowing the identity of the communicating units.

Model
To introduce the model in a progressive manner, let us first consider three agents, θ1, θ2 and θ3. Each of these agents depends on the same environmental conditions for survival, which are modelled by a random variable µ. Agents acquire information about the environment through their sensors, which are modelled by random variables Y_θ1, Y_θ2 and Y_θ3, all three conditioned on µ, for agents θ1, θ2 and θ3, respectively. We assume each agent acquires the same amount and aspects of environmental information from µ, i.e. p(Y_θ1 | µ) = p(Y_θ2 | µ) = p(Y_θ3 | µ). Let us further assume that the information each agent acquires about the environment does not eliminate its uncertainty, i.e. H(µ | Y_θi) > 0 for 1 ≤ i ≤ 3. The code of an agent is a stochastic mapping from its sensor states into a set of outputs, and is represented by the conditional probabilities p(X_θ1 | Y_θ1), p(X_θ2 | Y_θ2) and p(X_θ3 | Y_θ3) for agents θ1, θ2 and θ3, respectively (see Fig. 1).
Let us assume that agent θ1 perceives only the outputs of agents θ2 and θ3. One possible way of computing the information about the environment that agent θ1 has is to consider the mutual information between µ and the joint distribution of the sensor of θ1 and the outputs of θ2 and θ3: I(µ; Y_θ1, X_θ2, X_θ3). However, by writing down this quantity, we are implicitly assuming that agent θ1 "knows" which output corresponds to θ2 and which output corresponds to θ3. Therefore, in this consideration, an agent can theoretically translate the outputs according to some internal model of other agents and infer the mentioned amount of information about its environment.

Indistinguishable sources of messages
For this study, on the contrary, we consider an agent observing other agents' messages, but under the assumption that the originator of a message cannot be identified. In this way, the total amount of information an agent can infer from the outputs of other agents will depend on the extent to which it can either identify who the other agents are or rely on them using a coding scheme that does not depend too much on their particular identity. For instance, if agents θ2 and θ3 both agree on the output for each of the environmental conditions, then agent θ1 should be able to infer more environmental information than if they disagree on the output for each of the environmental conditions, given that agent θ1 does not know which of the agents it is observing.
To model this idea, let us assume a random variable Θ′ denoting the selected agent. This agent depends on the same environmental conditions for survival as θ1, which are modelled, as above, by a random variable µ. Agents acquire information about the environment through their sensors, which are modelled by a random variable Y_Θ′ conditioned on the index variable denoting the agent under consideration, Θ′, and µ. The amount of acquired sensory information of a specific agent θ about µ is given by I(µ; Y_θ). As above, the code of an agent is a stochastic mapping from its sensor states into a set of messages, and is represented by the conditional probability p(X_θ | Y_θ) for an agent θ (see Fig. 2). However, now we want to model the fact that we do not know which agent is observed. In the case with maximum uncertainty, Θ′ is uniformly distributed, and then this parametrisation of the codes considers the outputs of all agents in Θ′ altogether, such that if we are not observing Θ′, we cannot identify which agent's output we are observing. In Eq. 3 and Eq. 4 we show two examples of codes for agents θ2 and θ3, while their sensor states are defined by Eq. 2 (Eq. 1 defines the sensor states of agent θ1). We compute how much information about the environment there is when the outputs of both agents (θ2 and θ3) are considered together by agent θ1.
If we assume p(θ2) = p(θ3) = 1/2, p(µ1) = p(µ2) = 1/2 and ε = 0.01, then for the codes shown in Eq. 3 we have I(µ; Y_θ1, X_Θ′) = 0.97872 bits, where Θ′ consists of agents θ2 and θ3. However, had θ2 and θ3 "opposite" codes as shown in Eq. 4, then I(µ; Y_θ1, X_Θ′) = 0.9192 bits, which is exactly I(µ; Y_θ1), that is, I(µ; X_Θ′ | Y_θ1) = 0 bits (agent θ1 cannot acquire any side information from the outputs of agents θ2 and θ3). We should note here that the sensor states y1 and y2 of agents θ2 and θ3 in the conditional probabilities shown in Eqs. 1 and 2 refer almost deterministically to the same environmental condition, and therefore the loss of side information is entirely due to the incompatible codes. The conditional probabilities of sensor states given the environmental conditions defined in the rest of the paper are also assumed to be almost deterministic.
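The two values quoted above can be approximately reproduced with a short sketch. Since Eqs. 1 to 4 are not reproduced in this text, the sketch rests on assumptions: each sensor reports the correct one of two environmental states with probability 0.99 (error rate named `EPS` here), sensors are conditionally independent given µ, and the codes are deterministic, either identical or mirrored. The function names and the `source_known` switch are hypothetical and only serve to contrast identified with unidentified sources.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(A; B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

EPS = 0.01                                # assumed sensor error rate
P_MU = {0: 0.5, 1: 0.5}                   # two equiprobable environmental states
P_AGENT = {"theta2": 0.5, "theta3": 0.5}  # which agent is being observed

def p_sensor(y, mu):
    """Assumed sensor model: the correct state is reported with prob. 1 - EPS."""
    return 1 - EPS if y == mu else EPS

def info_about_mu(codes, source_known=False):
    """I(mu; Y_theta1, X), where X is the output of a randomly selected agent.
    codes maps agent name -> deterministic code {sensor state: output}.
    With source_known=True the agent's identity is observed as well."""
    joint = {}
    for mu, y1, agent, y in product(P_MU, (0, 1), P_AGENT, (0, 1)):
        p = P_MU[mu] * p_sensor(y1, mu) * P_AGENT[agent] * p_sensor(y, mu)
        x = codes[agent][y]
        obs = (y1, x, agent) if source_known else (y1, x)
        joint[(mu, obs)] = joint.get((mu, obs), 0.0) + p
    return mutual_information(joint)

same = {"theta2": {0: 0, 1: 1}, "theta3": {0: 0, 1: 1}}
opposite = {"theta2": {0: 0, 1: 1}, "theta3": {0: 1, 1: 0}}

i_same = info_about_mu(same)
i_opposite = info_about_mu(opposite)
i_opposite_known = info_about_mu(opposite, source_known=True)

print(f"identical codes, unknown source: {i_same:.5f} bits")
print(f"opposite codes, unknown source:  {i_opposite:.5f} bits")
print(f"opposite codes, known source:    {i_opposite_known:.5f} bits")
```

In this sketch, the side information lost with opposite codes is fully recovered once the identity of the source is observed, which is the implicit assumption the model sets out to drop.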

Environmental information of a population
The model shown in Fig. 2 considers the environmental information of agent θ1, ignoring its own output X_θ1. Nevertheless, agents ignoring their own outputs is contrary to our assumption that agents are unable to identify the sources of the outputs. On the other hand, we are assuming a specific type of communication, one which could be classified as persistent within the different classifications of stigmergy ([37,33,22]; see [14] for a summary). To incorporate this option in the model shown in Fig. 2, we could consider the state space of Θ′ as the set {θ1, θ2, θ3}. Then, to express not only the environmental information of agent θ1, but the average environmental information of the whole population, we can parametrise the agent by a random variable Θ (defined over the same state space, representing the same set of agents as Θ′). In this way, the average environmental information of a population of the agents selected by Θ is given by I(µ; Y_Θ, X_Θ′) (see Fig. 3). This measure can be considered as the objective function to maximise in our model. However, we would be making two important assumptions: first, this objective function assumes agents have access to the environmental conditions µ, which they indirectly do but only through their sensors; and second, every agent would perceive the output of every other agent, including itself. In this work, instead, we propose that agents behave so as to maximise the similarity of their outputs (via their codes) with those of the agents they perceive. A consequence of this behaviour is that the average information about µ is also maximised. In addition, we will introduce a potentially flexible "population structure", so that we can specify which agents interact with which.

Code similarity
First, we introduce a copy of the codes of the agents, such that when we instantiate the variables X_Θ and X_Θ′, the probabilities are the same. The structure of the population is then given by p(Θ, Θ′) = p(Θ)p(Θ′). However, the independence of Θ and Θ′ significantly restricts the diversity of the structures that can be represented. In such cases, the agents selected by Θ perceive the outputs of all the agents selected by Θ′ and vice versa. In order to model a general interaction structure between agents, we consider p(Θ, Θ′) not independent, as shown in the Bayesian network in Fig. 4, where we introduce a helper variable Ξ. This allows different agents selected by Θ to perceive outputs from exclusive agents selected by Θ′.
Figure 4: Bayesian network representing the relationship of the variables in the model of code evolution. Y_Θ′ is an i.i.d. copy of Y_Θ and X_Θ′ is an i.i.d. copy of X_Θ. Θ′ covers the same set of agents as Θ, but its probability distribution is not necessarily the same.
We define the objective function as I(X_Θ; X_Θ′), that is, the average code similarity of a population of agents according to the population structure p(Θ, Θ′). For instance, if the interaction probability of two agents is zero, then the similarity of the codes of these two agents is irrelevant for the objective function. On the other hand, if they interact with probability greater than zero (p(θ, θ′) > 0 for some agents θ and θ′), then how similar their codes are will influence I(X_Θ; X_Θ′).
If we consider our system as a process in time, then at each time-step two agents are chosen according to p(Θ, Θ′). Agent Θ reads the output of agent Θ′ (generated via its code, which is i.i.d. over time), and let us assume that it stores the pair (Y_Θ, X_Θ′), i.e. its current sensor state together with the perceived output. If this is repeated a large number of times, then the total amount of environmental information that can be inferred from the collected statistics by the population is bounded by I(µ; Y_Θ, X_Θ′). This is the theoretical limit to which we refer in the introduction, and for this study we are not interested in how the inference is computed. However, we implicitly assume that agents decode the perceived outputs according to their codes.
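As an illustration of the objective function defined above, the following sketch computes the code similarity for a given interaction structure. It assumes that all agents share one sensor model p(y | µ) and that the sensor readings of the two selected agents are drawn independently given µ. The function name `code_similarity`, its argument layout and the two-state toy example are hypothetical and not taken from the paper.

```python
import math
from itertools import product

def code_similarity(p_mu, p_sensor, codes, p_pair):
    """Mutual information (bits) between the outputs of two interacting agents.
    p_mu[m]        : probability of environmental state m
    p_sensor[m][y] : p(y | m), shared by all agents (assumption)
    codes[a][y][x] : p(x | y) for agent a
    p_pair[a][b]   : interaction probability of the ordered pair (a, b)"""
    n_agents, n_env = len(codes), len(p_mu)
    n_sens, n_out = len(p_sensor[0]), len(codes[0][0])
    # Output distribution of each agent per state: sum_y p(y | m) p(x | y, a).
    out = [[[sum(p_sensor[m][y] * codes[a][y][x] for y in range(n_sens))
             for x in range(n_out)]
            for m in range(n_env)]
           for a in range(n_agents)]
    # Joint distribution of the two outputs, mixing over agent pairs and states.
    joint = [[0.0] * n_out for _ in range(n_out)]
    for a, b, m in product(range(n_agents), range(n_agents), range(n_env)):
        w = p_pair[a][b] * p_mu[m]
        if w == 0.0:
            continue
        for x, x2 in product(range(n_out), repeat=2):
            joint[x][x2] += w * out[a][m][x] * out[b][m][x2]
    px = [sum(row) for row in joint]
    px2 = [sum(joint[x][x2] for x in range(n_out)) for x2 in range(n_out)]
    return sum(joint[x][x2] * math.log2(joint[x][x2] / (px[x] * px2[x2]))
               for x, x2 in product(range(n_out), repeat=2) if joint[x][x2] > 0)

# Toy example: two states, noiseless sensors, two agents.
p_mu = [0.5, 0.5]
p_sensor = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
well_mixed = [[0.25, 0.25], [0.25, 0.25]]
isolated = [[0.5, 0.0], [0.0, 0.5]]  # each agent only perceives itself

sim_universal = code_similarity(p_mu, p_sensor, [identity, identity], well_mixed)
sim_conflict = code_similarity(p_mu, p_sensor, [identity, swapped], well_mixed)
sim_isolated = code_similarity(p_mu, p_sensor, [identity, swapped], isolated)
print(sim_universal, sim_conflict, sim_isolated)
```

In this toy setting, conflicting codes reduce the similarity to zero only when the two agents actually interact; with zero interaction probability their disagreement does not affect the objective.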

Distance between two codes
In order to visualise the evolution of codes, we define the distance between the codes of two agents θi and θj as the square root of the Jensen-Shannon divergence [40,19] between them. This measure has the property that 0 ≤ JSD(θi, θj) ≤ 1 when the base-2 logarithm is used, and the square root yields a metric. Let us note that this distance requires the sensor states Y to be named identically (for the corresponding states of µ) among agents in order to be meaningful. As we stated above, this is (closely) the case in all our experiments. This requirement on the sensor states rules out the use of other measures such as mutual information.
To illustrate the behaviour of our model, we consider four different scenarios, which are described in Sec. 4. The common parameters for the first two experiments are the following: the population consists of 25 agents; the amount and quality of the acquired sensory information is the same for all agents. For the third scenario, the only difference is that we consider only 15 agents, since the number of dimensions to consider with a flexible structure grows quadratically with the number of agents.
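A minimal sketch of such a distance is given below. The text does not state how the divergence is aggregated over sensor states, so the sketch assumes that each code is combined with a shared sensor-state distribution `p_y` (uniform by default) into a joint distribution, which makes the divergence a `p_y`-weighted average of row-wise divergences. The function names are hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def code_distance(code_i, code_j, p_y=None):
    """Square root of the JSD between two codes given as rows p(x | y).
    Assumption: both codes share the sensor-state distribution p_y, so the
    JSD of the joint distributions is the p_y-weighted mean of row-wise JSDs."""
    n = len(code_i)
    if p_y is None:
        p_y = [1.0 / n] * n
    total = sum(w * jsd(ri, rj) for w, ri, rj in zip(p_y, code_i, code_j))
    return math.sqrt(total)

# Three deterministic codes over 4 sensor states and 4 outputs.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shifted = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
two_swapped = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

d_same = code_distance(identity, identity)
d_disjoint = code_distance(identity, shifted)
d_partial = code_distance(identity, two_swapped)
print(d_same, d_disjoint, d_partial)
```

Identical codes are at distance zero, codes that never agree are at the maximum distance of one, and codes that agree on half of the sensor states fall in between.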
The optimisation algorithm used in the following experiments is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which is a stochastic derivative-free method for non-linear optimisation problems [12]. We utilised the implementation provided by the Shark library v3.0.0 [15] with its default parameters, which implements the CMA-ES algorithm described in [11]. The evolutionary algorithm used for optimisation is not intended to represent the actual evolution of the codes. Instead, we are interested in the solutions of this optimisation process, which are representative of the possible outcomes of evolution.
To visualise the evolution of the codes of the agents, we use the method of multidimensional scaling provided by R version 2.14.1 (2011-12-22). This method takes as input the distance matrix between codes, and plots them in a two-dimensional space preserving the distances as well as possible. To visualise not only the distances between the resulting codes, but also how they relate to the distances between initial codes, we provide a distance matrix of both initial and resulting codes. The initial codes are randomly set by the evolutionary algorithm.

Results
In this section, we analyse the outcome of the four different scenarios where code similarity is maximised. While the outcomes are particular for one simulation, they are illustrative of the richness that the model is able to capture, which is described for each scenario. The outcomes are typical solutions, and we cannot perform statistics over simulations since the many solutions are qualitatively different. However, the outcome of each scenario is presented together with a description of alternative outcomes, giving indicators of achievement of local/global optimum.

Well-mixed population
In the first scenario, each agent θi perceives the output of every other possible agent θj with the same probability, that is, p(θi, θj) = 1/25² for every i, j ∈ [1, 25]. The maximum average code similarity is bounded by I(Y_Θ; Y_Θ′) = 1.71908 bits, which is achieved under two conditions: first, every code must be a one-to-one mapping; second, the code must be universal. This is indeed the outcome of the performed optimisation, as we show in Fig. 5: the optimised codes (blue points) converged to a universal code (the distance between any two of them is zero). Each red (diamond) point corresponds to an initial code.
[Figure 5: plot of initial codes and final codes; axis label "dimension 2".]
The resulting code adopted by the population is a one-to-one mapping between sensor states and outputs, and any of the 24 possible one-to-one mappings is a global maximum (there are 4 sensor states and 4 possible outputs). However, it is still interesting to briefly analyse the possible paths towards a universal and optimal code. In Fig. 6, we show the distribution of the codes adopted by the agents of the population in an iteration of the optimisation process where the average code similarity is I(X_Θ; X_Θ′) = 1.18276 bits. Here, the most popular code is the suboptimal code shown in Fig. 6 (a). This results from the particular initial codes, which drive the agents temporarily towards a suboptimal code. However, once any of the many-to-one codes becomes (nearly) universally adopted, any deviation of a code that improves the code similarity will eventually drive the convention towards optimality. The fact that this does not require simultaneous changes in several codes increases the likelihood of improving the code similarity.

Spatially-structured population
In another set-up, we assume the agents are structured in a 5 × 5 grid, where p(θ, θ′) = 1/105 if θ and θ′ are neighbours or when θ = θ′ (see Fig. 8 for a representation of the structure).
After randomly initialising the codes, the performed optimisation plateaued at an average code similarity of I(X_Θ; X_Θ′) = 1.13536 bits. As in the former scenario, here the optimal solution is also a universal code with a one-to-one mapping. However, in this case, the result is not a universal code, as can be appreciated in Fig. 7. Spatially structured populations are sensitive to the initial codes and to how codes are updated. The resulting code distribution among the population is shown in Fig. 9, with 8 different codes in the population. Where well-mixed populations evolved the use of common codes, agreement on codes only occurred among neighbours in spatially structured populations. As a consequence, many local conventions are established within neighbourhoods, and, once this situation is reached, the improvement of the total code similarity requires simultaneous changes to the agents' codes.
For instance, the code shown in Fig. 9 (e) could increase the average similarity of the population if p(x2 | y1) = 1, as it is in the rest of the codes. However, for this to happen (in this particular case), at least two agents need to change their code simultaneously (otherwise the average similarity decreases), which makes the deviation from the resulting code distribution unlikely.
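The normalisation constant 105 can be checked with a small sketch that builds this interaction structure. It assumes that "neighbours" means the four orthogonally adjacent cells and that the grid does not wrap around at its borders; under these assumptions there are 80 ordered neighbour pairs plus 25 self-pairs. The function names are hypothetical.

```python
def grid_structure(rows=5, cols=5):
    """Uniform interaction probabilities over ordered pairs of agents that are
    identical or orthogonal neighbours on a non-wrapping grid (assumption)."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            a = r * cols + c
            pairs.append((a, a))  # an agent also perceives its own output
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    pairs.append((a, rr * cols + cc))
    prob = 1.0 / len(pairs)
    return {pair: prob for pair in pairs}

def well_mixed_structure(n=25):
    """Every ordered pair of agents interacts with probability 1 / n**2."""
    return {(a, b): 1.0 / n ** 2 for a in range(n) for b in range(n)}

grid = grid_structure()
mixed = well_mixed_structure()
print(len(grid), len(mixed))
```

The same dictionaries can serve as the interaction structure when evaluating a similarity objective over agent pairs.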

Flexible population structure
For the third scenario, we let the structure co-evolve with the codes without any constraint (the probability distribution of the interaction between agents, p(Ξ), is optimised together with the codes). In this case, the resulting average code similarity is nearly optimal, but the code is not necessarily universal. This is because, when the structure is not fixed, agents form roughly disconnected clusters of related codes. In this process, the interaction probability of agents with unrelated codes vanishes. However, once the clusters are formed, the codes of agents are universal within each cluster, unless the cluster is a single isolated agent (such that no other agent perceives its output). This is exemplified by the code distribution and population structure we obtained (see Fig. 10). Here, we have two clusters with universal codes, one optimal (in red) and the other suboptimal (in yellow). Agents with codes dissimilar from those of every other agent they interact with will become isolated in the optimisation process, as the example shows for two agents (light and dark blue).
To summarise, the optimal code similarity equals I(Y_Θ; Y_Θ′), and is achieved, for instance, when all agents adopt the same one-to-one mapping. Nevertheless, the interaction probability allows agents to form disconnected clusters of related codes, where several one-to-one mappings could result while still achieving optimality. Theoretically, we could have as many one-to-one mappings as the minimum of the number of agents and the total number of one-to-one mappings (24 in this case).

Emerging concepts in a well-mixed heterogeneous population
So far, we have only considered populations of agents that acquired the same aspects of information from µ (i.e., p(Y_θi | µ) = p(Y_θj | µ) for any pair of agents θi, θj).
The assumption was that the information that was relevant for the survival of the agents was the same among the agents of the population, and this was represented by µ. Now, we consider a more general scenario, where different types of agents acquire different aspects from the environmental conditions µ. We investigate whether it is possible for an agent that does not directly perceive the environment at all (we call this type of agent "blind") to predict conditions based solely on the outputs of other agents. We consider a well-mixed population, such that different types of agents are forced to talk to each other. Considerations with a flexible population structure are not interesting for our purposes, since in these cases, each type of agent forms a cluster disconnected from clusters of other types. This was confirmed by simulations which are not shown here.
Let us illustrate the idea with a relatively simple scenario: we consider five types of agents (we denote the i-th type φi), where each type can only distinguish whether the current state of the environment belongs to its coloured region or not. The environment consists of 9 states, and the probability of each state is uniformly distributed. We illustrate this environment by a 3 × 3 grid, as shown in Fig. 11, although the square does not denote the physical structure of the environment.
Then, the outputs of each type of agent will be related to the regions they capture. For instance, for agents of type φ2 with the same deterministic code, if Pr(µ ∈ {1, 2, 4, 5} | X_θ = x) equals one (for all θ of type φ2), then x will signify that this agent is currently in the region coloured in red in Fig. 11. We say that a population of agents has a joint concept of the environment if, by considering its representation of the environmental information they capture, we can obtain information about the environment, i.e. we require that I(µ; X_Θ) > 0. For instance, the symbol x in the example above, assuming that it is only utilised by agents of the same type, can be understood as representing the concept "top-left" of the grid. The amount of environmental information that an agent θ of type φ1 (a blind agent) captures is I(µ; Y_θ) = 0 bits, while all agents θ of the other types capture I(µ; Y_θ) = 0.991076 bits (note that the total entropy in µ to be resolved is H(µ) = 3.16993 bits). Throughout this study, we considered that agents predict the environment by considering their perceptions together with the outputs of other agents. Since the blind agent is not able to capture any direct cue from µ, we instead consider it capable of perceiving the outputs of both of the agents selected by Θ and Θ′. With this relaxed consideration, we say a blind agent has a concept of the environment if I(µ; X_Θ, X_Θ′) > 0, i.e. we consider the maximum amount of information an agent can possibly infer from the joint outputs X_Θ and X_Θ′.
Let us recall that the structure of the population is well-mixed, and thus the distribution of outputs of all agents is considered, including the blind ones, which are not able to express (via their outputs) any particular concept by themselves (for a blind agent θ, I(µ; X_θ) ≤ I(µ; Y_θ) = 0, i.e. I(µ; X_Θ) vanishes). Therefore, whether a blind agent has some concept of the environment will depend, first, on the universality of the codes of each type of agent (agents representing the same information with different symbols may create ambiguities). Second, it will depend on the cardinality of the alphabet of X (i.e. |X|) utilised by the population. A small alphabet will force agents to represent different concepts of the environment with the same symbols, while a large alphabet is likely to result in exclusive representations of concepts for each type of agent.
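The entropy and information values quoted above can be checked with a short sketch. Only the region of type φ2 is given in the text; the regions assigned below to the other three sighted types are an assumption (the remaining 2 × 2 corners of the grid, with states numbered row by row), and the dictionary keys are hypothetical labels. Sensors are taken to be deterministic indicators of "inside the region or not".

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

N_STATES = 9  # states 1..9 of the 3 x 3 grid, assumed numbered row by row

REGIONS = {
    "phi2": {1, 2, 4, 5},  # given in the text ("top-left")
    "phi3": {2, 3, 5, 6},  # assumed: top-right corner
    "phi4": {4, 5, 7, 8},  # assumed: bottom-left corner
    "phi5": {5, 6, 8, 9},  # assumed: bottom-right corner
}

# Total uncertainty about a uniformly distributed environment.
h_mu = math.log2(N_STATES)

# For a deterministic inside/outside sensor, I(mu; Y) = H(Y).
sensor_info = {name: binary_entropy(len(region) / N_STATES)
               for name, region in REGIONS.items()}

# States in which every sighted type reports "inside its region".
shared = set.intersection(*REGIONS.values())

print(f"H(mu) = {h_mu:.5f} bits")
print(f"I(mu; Y) for phi2 = {sensor_info['phi2']:.6f} bits")
print(f"states common to all regions: {shared}")
```

Under the assumed regions, the only state shared by all four regions is state 5, which none of the individual types can single out on its own.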
Taking this into account, we ask, is it possible for a blind agent to identify concepts of the environment? If so, how are these concepts related to the concepts of the individual agents (other than the blind ones)? Is the size of the available alphabet related to the quality of the concepts?
To study these questions, we performed different experiments varying the size of the alphabet |X|, where the rest of the parameters remained the same. In these experiments, we optimised the similarity of codes for a population composed of 20 agents, with 4 agents of each of the five types.
In Table 1 we show that the cardinality of the alphabet of X affects the limit of the amount of information a blind agent can possibly infer about the environment. Now, if we measure the uncertainty of the environment for a blind agent for each combination of outputs X_Θ and X_Θ′, we find that for some of them, it is zero. For instance, with |X| = 7, we found that Pr(µ = 5 | X_Θ = 1, X_Θ′ = 2) = 1.0 (see Fig. 12, where only combinations with X_Θ ≤ X_Θ′ are shown). These distributions are also valid when swapping the values of X_Θ and X_Θ′, since in the well-mixed population the structure is symmetric. Looking at the example of the conditional probability in Fig. 12, we can find many other concepts, although none of them (apart from the one already discussed) can uniquely identify a state of the environment. For instance, we have that Pr(µ | X_Θ = 3, X_Θ′ = 6) = 0.33 when µ ∈ {3, 5, 7}, which is a concept for being on a particular diagonal of the environment.
In Fig. 13 […] symbols to represent different environmental conditions. By using a small alphabet for X, we force ambiguities in the population, but these will be chosen (by evolution) such that they are minimal. In this way, we maximise the amount of information we can infer from the outputs (although this can be a local optimum). For instance, in all the experiments the outputs of the blind agents (type φ1) never overlapped with those of other types (unless we use |X| = 2, where there is no choice). In other words, blind agents always choose one symbol, so that they minimise the number of symbols they use out of the whole population's alphabet. In all the performed experiments, we found that for values of |X| ≥ 6, the blind agent can perfectly predict the environmental state µ = 5 for at least one combination of outputs X_Θ and X_Θ′. Interestingly, this new concept, which in this particular experiment can be called the "centre" of the world or environment, cannot be obtained by looking at individual concepts only.

Discussion
We considered four different scenarios of code evolution. In the first one, every agent perceived the outputs of all agents, including itself. We argued that two main stages of evolution can be recognised: in the first stage, a universal code is established, which can be optimal or not. If it is not optimal, then a second stage will achieve optimality. The same result was obtained in [34], in a model of the evolution of the genetic code (represented as a probabilistic mapping between codons and amino acids), although there universality and optimality were achieved simultaneously.
In the mentioned work, which developed further the ideas of [38,39], the authors argue that the universality of the genetic code is a consequence of early communal evolution, mediated by horizontal gene transfer (HGT) between primitive cells. In this evolutionary process, they argue, larger communities will have access (through the exchange of genetic material) to more innovations, leading to faster evolution than smaller ones. Then, "it is not better genetic codes that give an advantage but more common ones" [34]. Although their model does not explicitly show this property, it is captured in our model. We show that a more common, but not optimal code is widely adopted within a population (see Fig. 6). However, in our model, a code imposes itself as universal not because it provides access to more innovations (in our model there is no "code exchange", only the outputs are shared), but because the population structure forces the adoption of the most popular code. After this stage, further changes in the code of the agents eventually lead to optimality.
In another related work, [21] explored the origins of language in a scenario consisting of artificial agents with a coupled perception and production of speech sounds. Although this work is focused on plausible mechanisms for the origin of language, it assumes the same similarity principle as we do (hearing a vocalisation increases the probability of producing similar vocalisations), arriving at the same outcome (a universal language, or code). Other works have considered similar principles in the evolution of languages: for instance, the naming game [32] and the imitation game [5].
However, these models assume some common conventions in order to evolve new ones. In this study, our main assumption was that the population of agents depended on common environmental conditions.
Our second scenario, where the structure of the population is a grid, showed how establishing local conventions in early stages of evolution constrains the outcome of the code distribution, since to reconcile different conventions, several simultaneous changes are needed. On the other hand, in our third scenario, where we let the structure of the population change simultaneously with the codes themselves, such situations are avoided by "disconnecting" clusters with dissimilar conventions. This property enhances evolution, and can potentially lead to the adoption of several different conventions within an increasingly fragmenting, or "speciating" population.
Our last scenario assumed perceptual constraints on the environmental information of each agent, and we looked at emerging concepts within a well-mixed population. This scenario was studied in [20], where, as in our study, new conceptualisations of the world emerged as a result of considering together the concepts of every agent. In both studies, the new concept was not representable individually by any agent. Unlike in that study, the new concepts obtained in our study were the result of a simple similarity maximisation principle, while in the work of [20], concepts were obtained through the modelling of an explicit fitness function.
The evolution of conventional codes could be interpreted, in the widest sense, as a form of cultural evolution. For instance, considering the definition of culture given by [25]: "Culture is information capable of affecting individuals' behavior that they acquire from other members of their species through teaching, imitation, and other forms of social transmission.", it could be argued that a form of cultural information is present in organisms, such as bacteria or plants. Although there is a dependence among the different dimensions on which information is transmitted in organisms (if we assume the dimensions to be, for instance, genetic, epigenetic, behavioural and symbol-based, as proposed by [16]), our model assumes freedom of choice in one dimension, without direct influence on the others.
Finally, communication between individuals of a population opens up the possibility of "signal cheaters", which could be either individuals that do not produce signals themselves but still perceive those of the others, or individuals who exploit other individuals' learned responses to symbols to their advantage. However, our model does not allow such behaviour, since the code producing the outputs functions, implicitly, as the interpreter of the perceived signals.

Conclusion
In the proposed model, we introduced a key assumption which allowed us to evolve, for some structures, universal and optimal codes. This assumption states that an agent cannot distinguish the sources of the outputs it perceives from other agents. Following from this, a universal code will necessarily introduce semantics by relating symbols to environmental conditions (via the internal states of the agent). Our model proposes an information-theoretic way of measuring the similarity within a population of codes.
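The effect of this assumption can be illustrated with a minimal sketch, which is not the similarity measure used in the model itself. It assumes that each code is a conditional distribution p(y|e) from environmental states directly to output symbols (the internal states of the agent are omitted), that the observed agent is drawn at random according to fixed weights, and that usable information is measured as the mutual information between the environment and the anonymous output. The function names and the two toy codes are hypothetical.

```python
import math

def mutual_information(p_env, channel):
    """Return I(E;Y) in bits for a prior p_env[e] and channel[e][y] = p(y|e)."""
    n_out = len(channel[0])
    # Marginal distribution of the output symbol.
    p_out = [sum(p_env[e] * channel[e][y] for e in range(len(p_env)))
             for y in range(n_out)]
    mi = 0.0
    for e, pe in enumerate(p_env):
        for y in range(n_out):
            joint = pe * channel[e][y]
            if joint > 0:
                mi += joint * math.log2(channel[e][y] / p_out[y])
    return mi

def anonymous_channel(codes, weights):
    """Effective p(y|e) when the observer cannot tell which agent produced y.

    Assumption of this sketch: the source agent is drawn with the given weights,
    so the observer sees the weighted mixture of the individual codes.
    """
    n_env, n_out = len(codes[0]), len(codes[0][0])
    return [[sum(w * c[e][y] for c, w in zip(codes, weights))
             for y in range(n_out)]
            for e in range(n_env)]

# Two equiprobable environmental states and two output symbols.
p_env = [0.5, 0.5]
code_a = [[1.0, 0.0], [0.0, 1.0]]  # state 0 -> symbol 0, state 1 -> symbol 1
code_b = [[0.0, 1.0], [1.0, 0.0]]  # the opposite convention

# Population sharing one code versus population split between two conventions.
universal = anonymous_channel([code_a, code_a], [0.5, 0.5])
conflicting = anonymous_channel([code_a, code_b], [0.5, 0.5])

info_universal = mutual_information(p_env, universal)      # 1 bit
info_conflicting = mutual_information(p_env, conflicting)  # 0 bits
```

In this toy case, both codes are individually optimal, yet a population split evenly between them conveys no information to an observer who cannot identify the source, whereas a population sharing either code conveys one full bit.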
In this work, we proposed, as an evolutionary principle, that agents try to maximise their side information about the environment indirectly by maximising their mutual code similarity. This behaviour produces several interesting outcomes in the code distribution of a structured population.
Depending on the population structure, this principle captures the evolution of a universal and optimal code (in a well-mixed population structure), as well as the evolution of different codes organised in clusters (in a freely evolving structure), which allows the establishment of optimal as well as suboptimal conventions.
Finally, we considered a well-mixed heterogeneous population with perceptual constraints on the agents about the environment, and showed how, just by looking at the outputs of agents, it is possible to extract concepts that relate to the environment, concepts that none of the agents of the population could individually represent.