The Hunt for Information
The hunt for information contained in a series of quantitative data.
Welcome to this winter edition! Are you sure you want to continue reading this blog? Be warned that you are at risk of losing your peace of mind!
This blog aims to go beyond variance, currently a key tool of Data Science and Scientific Research.
I have opted to tell you about my measure of variability, obtained through a metric as simple as an average of observed data, which I call simple deviation (γ), to be contrasted with standard deviation. To be fair, some may say that a measure of simple deviation already exists, the mean absolute deviation, but, like standard deviation, it measures dispersion, not variability. As an applied example, I put forward a way of constructing an appropriately structured summary table capable of comprehensively describing the five-dimensional quantitative phenomenon described by the European indicators in a given year.
Kendall and Stuart (1997, vol. 1, p. 42) absolve the variance while admitting that it may seem a little artificial. Here, however, it is considered completely, and unforgivably, artificial because it exaggerates large differences and shrinks small ones. As if that weren't enough, dispersion is often passed off as variability, when the latter is rather closer to the idea of evolution. With all due respect to Sir Ronald Fisher, it is time this point was properly put to the test.
A Gap Theory
“The growth of the series” is best read as improving growth. In what way does the series grow? In gaps. To obtain an appropriate metric, this blog contrasts the idea of deviations, which leads to variance and its derivatives, with the idea of the gap each unit has opened over the first unit that follows it (or, equivalently, of the hypothetical jump each unit would need to reach the one that precedes it). It would therefore be sufficient to summarize the gaps with an average, except that:
We must make sure that all the quantities are readable as gaps;
The quantities must be dimensionally comparable;
The quantities must have the same direction, i.e. all improving.
The three conditions are fulfilled by variables similar to a ranking, those that I call directly informative (the others can be made informative by using the distance from the preferable value). Let's see what a single quantitative series X, observed on a population of N units, involves.
The growth of X, taken without its context, is a purely numerical concept. To go further it is useful to move on to improving growth, which brings the purpose of the analysis into play. It all stems from a fundamental research question:
What happens as each quantity involved varies?
For now let's focus only on the second part of the question: as each quantity varies. Are we sure we understand what it means? Is it really that obvious? Some clarification is needed. Variability is called into question here, and it is usually understood as the ability to assume different values: a generic definition, to say the least. This is a crucial point, given that variability is part of the information that the quantity should bring to the analysis.
A quantity is to be considered informative when it sheds light on the ongoing analysis.
This happens when it is placed in context, that is, when it is interpreted in light of the situation. Varying, understood as moving through the series from a minimum to a maximum, has nothing to do with dancing around a point, that is, with being scattered around it, which is what certain instruments, variance included, report. So, rather than the variation of..., it is more appropriate to ask about the increase in... But this clarification is not fully satisfactory either: growth is still a purely numerical concept. What can be done?
The situation being analyzed is brought into play, given that each quantity should be informative with respect to the context.
By doing so, we will be able to talk about
Quantities with improving growth
Quantities with worsening growth
Quantities with undecidable growth
a distinction that is completely absent from the current scientific literature. The quantities that are readable as gaps show improving or worsening growth: here we call them directly informative. Those with undecidable growth can be called indirectly informative; for them it makes no sense to measure variability, and they should therefore be made directly informative or removed from the analysis.
A quantity is defined here as having improving growth when its non-preferable value coincides with the observed minimum; it has worsening growth when that value coincides with the maximum. Quantities with undecidable growth have their preferable value inside the range. Note that while growth remains a numerical concept, improving growth acquires meaning and value. In other words, given that every quantity goes from a minimum to a maximum, numerical growth alone is not enough to speak of information.
By applying the Preferable-Non-Preferable (PnP) transform, i.e. the relative distance from the non-preferable value (previously proposed in a reduced version), each quantity can be read as having improving growth and can therefore be treated as directly informative. In this way every quantity can be read as the start of an improving series. To measure improving growth, the preliminary steps for each series are the following:
Verify whether the situation can give the quantity the connotation of a ranking;
Calculate the relative distances from the non-preferable value (the PnP transform).
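The second step can be sketched in Python. This is a minimal sketch, not the author's own code: the function name `pnp_transform` and its `growth` argument are hypothetical. It assumes that the non-preferable value is the observed minimum (improving growth) or the observed maximum (worsening growth), and that "relative" means relative to the observed range, scaled to 0-100.

```python
def pnp_transform(values, growth="improving"):
    """Relative distance of each value from the non-preferable value (0-100).

    Assumption: the non-preferable value is the observed minimum for a
    quantity with improving growth and the observed maximum for one with
    worsening growth; distances are divided by the observed range.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("a constant series has no range to scale by")
    if growth == "improving":
        non_preferable = lowest
    elif growth == "worsening":
        non_preferable = highest
    else:
        raise ValueError("growth must be 'improving' or 'worsening'")
    span = highest - lowest
    return [100.0 * abs(x - non_preferable) / span for x in values]


# Made-up numbers, for illustration only.
print(pnp_transform([2, 4, 10], growth="improving"))  # [0.0, 25.0, 100.0]
print(pnp_transform([2, 4, 10], growth="worsening"))  # [100.0, 75.0, 0.0]
```

Either way the output reads as a series with improving growth: 0 marks the non-preferable end and 100 the preferable one. Quantities with undecidable growth are deliberately left out of this sketch.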
For now, things are as follows. Is there a measure of variability for a data series? It exists under certain conditions:
When the series is similar to a ranking;
When the ranking makes sense in the situation under scrutiny.
In fact, when this occurs, the series generates a series of gaps, that is, the distances of each datum from the first of its pursuers. Note that while it can be shown that the series of gaps is no less informative than the original series, nothing analogous appears to hold between the series of squared deviations and the initial series. This is the first keystone that closes the arch or, if you prefer, the reasoning.
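As a rough illustration of the series of gaps, here is a minimal Python sketch. The function names `gaps` and `rebuild_ranked` are hypothetical, and the numbers are made up. It assumes that "the first of its pursuers" is the next unit down once the series is ranked from the leader, and that the minimum is known (0 on a PnP scale), which is one way to see why the gaps lose nothing of the ranked values.

```python
from itertools import accumulate


def gaps(values):
    """Distance of each datum from the first of its pursuers.

    The series is ranked from the leader down; the last unit has no
    pursuer, so N values give N - 1 gaps.
    """
    ranked = sorted(values, reverse=True)
    return [ahead - behind for ahead, behind in zip(ranked, ranked[1:])]


def rebuild_ranked(gap_series, minimum=0.0):
    """Recover the ranked values (lowest first) from the gaps and the minimum."""
    return list(accumulate(reversed(gap_series), initial=minimum))


series = [25.0, 100.0, 0.0, 40.0]  # made-up values on a 0-100 scale
g = gaps(series)
print(g)                  # [60.0, 15.0, 25.0]
print(rebuild_ranked(g))  # [0.0, 25.0, 40.0, 100.0]
```

The sketch recovers the ranked values only; which unit holds which value would have to be carried along separately.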
This explains why an average of these gaps is a measure of the variability of the series. So at least one solution exists. In the multi-dimensional case, however, things work a little differently. The quantities have to be rescaled, preferably with PnP, so that they all have improving growth and lie on the same scale, from 0 to 100. On that scale, however, the averages of the gaps are equivalent, given that the gaps of every series add up to one hundred. The series of gaps in turn generates a new series of gaps, which can be called, for the sake of simplicity, irregularities. The average of the irregularities provides the measure (γ) we are looking for.
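The post does not spell out a formula for γ, so the following Python sketch is only one possible reading. It assumes that the irregularities are obtained by treating the gaps as a series in their own right and taking its gaps in the same way. The names `gaps` and `simple_deviation` are hypothetical and the two series are made up.

```python
def gaps(values):
    """Distances between consecutive units once the series is ranked."""
    ranked = sorted(values, reverse=True)
    return [ahead - behind for ahead, behind in zip(ranked, ranked[1:])]


def simple_deviation(values):
    """Average of the irregularities (assumed here to be the gaps of the gaps)."""
    if len(values) < 3:
        raise ValueError("at least three values are needed")
    irregularities = gaps(gaps(values))
    return sum(irregularities) / len(irregularities)


even = [0.0, 25.0, 50.0, 75.0, 100.0]   # regular steps
uneven = [0.0, 1.0, 2.0, 3.0, 100.0]    # one runaway leader

for name, series in (("even", even), ("uneven", uneven)):
    g = gaps(series)
    print(name, sum(g) / len(g), simple_deviation(series))
# even 25.0 0.0
# uneven 25.0 32.0
```

Both series have the same average gap (25), since their gaps add up to one hundred, while the average of the irregularities tells them apart.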
We will now show how to summarize the k available series (example: Table A) both vertically and horizontally. A special table, new regardless of the metric just described, allows both a vertical and a horizontal synthesis. It is new because it requires:
A transform, to be called preferable-non-preferable (PnP), which maps the data onto the 0-100 range and involves the situation being analyzed;
Quantities oriented in the improving direction;
Dimensionally comparable quantities.
Note that the minimax (min-max) transform, which rescales the data to the 0-1 or 0-100 range, is purely numerical: it does not involve the context and does not take the orientation of the quantities into account. The second keystone is this: some series are not immediately readable as gaps; which is to say, there are quantities that are directly informative and others that are only indirectly informative. This highlights the existence of two types of quantities not previously identified in the literature. If a series is not comparable to a ranking, the series of its distances from a preferable value becomes one.
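The contrast can be sketched in Python for a quantity whose preferable value lies inside the range. This is an illustrative sketch under stated assumptions: the function names `min_max` and `pnp_from_preferable`, the data, and the preferable value of 2.0 are all hypothetical. It assumes the distance from the preferable value is the absolute difference, and that the largest observed distance plays the role of the non-preferable value.

```python
def min_max(values):
    """Purely numerical rescaling to 0-100; ignores context and orientation."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("a constant series cannot be rescaled")
    return [100.0 * (x - lowest) / (highest - lowest) for x in values]


def pnp_from_preferable(values, preferable):
    """PnP position built on the distances from a preferable value.

    The distances have worsening growth (a larger distance is worse), so
    the largest distance is taken as the non-preferable value.
    """
    distances = [abs(x - preferable) for x in values]
    nearest, farthest = min(distances), max(distances)
    if farthest == nearest:
        raise ValueError("all units are equally distant from the preferable value")
    return [100.0 * (farthest - d) / (farthest - nearest) for d in distances]


rates = [0.0, 2.0, 4.0, 10.0]  # made-up values
print(min_max(rates))                              # [0.0, 20.0, 40.0, 100.0]
print(pnp_from_preferable(rates, preferable=2.0))  # [75.0, 100.0, 75.0, 0.0]
```

Min-max scaling rewards the largest value simply for being largest, whereas the context-aware version puts the unit sitting on the preferable value at 100.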
Table A covers the 27 European countries. The vertical summaries highlight the position of Inflation, followed by the Deficit, ... and the irregularity of the gaps in Debt and Employment, ... (whereas the standard deviation would point to Inflation and GDP). From them, the desirable and advisable interventions by Europe concern the positions of GDP, Debt, ... and the irregularity of the gaps in GDP, Inflation, ... (and not in Debt, Employment, ... as the standard deviation would say).
In the horizontal summaries, the position of the Netherlands, Ireland, Sweden, Denmark... and the irregularity of the gaps for Portugal, Finland, Estonia, France... stand out (and not the standard deviation seen in Spain and Ireland, Austria, Finland...). From them, Europe’s desirable and advisable interventions concern the positions of Hungary, Greece, Malta and Belgium..., as well as the irregularity of the gaps for the Netherlands and Hungary (and not of Latvia, Bulgaria, Malta, Italy, ... as the standard deviation would state).
The values in the table mark the relative position reached by each country for each quantity. For example, Italy is at 28% of the range. A desirable intervention order is obtained from the row for each country; for example, for Italy and Belgium it is appropriate to intervene mainly on Debt, Employment, GDP, ... in that order. The columns μ and γ suggest the intervention priorities for the respective countries, in terms of the position achieved on the quantities and of the importance of the same, respectively.
The promised information, γ, lies in the irregularities of the gaps (to be specific: the new gaps). It is called simple deviation and is the measure to be contrasted with the classical standard deviation (some supporters of variance might consider the standard deviation of the gaps as an alternative measure of dispersion).
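To make the layout of such a table concrete, here is a small Python sketch with three hypothetical countries and made-up PnP positions; it does not reproduce Table A. It assumes that γ is the average of the gaps of the gaps, and that the intervention order of a row runs from the lowest position up. All names (`TABLE`, `simple_deviation`, `intervention_order`) are illustrative.

```python
from statistics import mean

QUANTITIES = ["GDP", "Debt", "Deficit", "Inflation", "Employment"]

# Made-up PnP positions (0-100) for three hypothetical countries.
TABLE = {
    "Country A": [100.0, 0.0, 50.0, 100.0, 20.0],
    "Country B": [0.0, 100.0, 0.0, 30.0, 100.0],
    "Country C": [60.0, 40.0, 100.0, 0.0, 0.0],
}


def gaps(values):
    """Distances between consecutive units once the series is ranked."""
    ranked = sorted(values, reverse=True)
    return [ahead - behind for ahead, behind in zip(ranked, ranked[1:])]


def simple_deviation(values):
    """Average of the irregularities (assumed: the gaps of the gaps)."""
    return mean(gaps(gaps(values)))


def intervention_order(row):
    """Quantities ranked from the lowest position up (an assumed reading)."""
    return [name for _, name in sorted(zip(row, QUANTITIES), key=lambda p: p[0])]


# Horizontal summaries: one (mu, gamma) pair per country.
rows = {c: (mean(r), simple_deviation(r)) for c, r in TABLE.items()}

# Vertical summaries: one (mu, gamma) pair per quantity.
columns = {
    q: (mean(col), simple_deviation(col))
    for q, col in zip(QUANTITIES, zip(*TABLE.values()))
}

print(intervention_order(TABLE["Country A"]))
# ['Debt', 'Employment', 'Deficit', 'GDP', 'Inflation']
```

The dictionaries `rows` and `columns` play the role of the μ and γ margins of the table, one per country and one per quantity.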
I leave it to you to compare the other summaries and verify these calculations.
Table B
Finally, a detail of the calculation of the irregularities of the first quantity, tGDP
Alternative proposals or adjustments to the linear information suggested here are welcome. The metric proposed here is an irreverent insinuation for Data Science: a small breach opening in an old wall. It is just a start: I am looking for substitutes for covariance, correlation and the coefficient of determination, to name a few measures.
We will see. Hopefully soon.
Happy New Year, happy 2024 and beyond!