L'informazione raccontata a mia nipote

Copyright © 2024 Gabriele Stoppa | blog.gabriele.pro | All rights reserved.


The Hunt for Information

The hunt for information contained in a series of quantitative data.


Last updated 5 months ago


Welcome to this winter edition! Are you sure you want to continue reading this blog? Be warned that you are at risk of losing your peace of mind!

This blog aims to go beyond variance, currently a key tool of Data Science and Scientific Research.

I have chosen to tell you about my own measure of variability, obtained through a metric as simple as an average of the observed data. I call it simple deviation (γ), to be contrasted with standard deviation. To be fair, some may say that a simple measure of deviation already exists, the mean absolute deviation; but, like standard deviation, it measures dispersion, not variability. As an applied example, I show how to build a suitably structured summary table that comprehensively describes the five-dimensional quantitative phenomenon captured by the European indicators in a given year.

Kendall and Stuart (1997, vol. 1, p. 42) absolve the variance while admitting that it may seem a little artificial. Here, however, no such absolution is granted: it is considered completely artificial, because it exaggerates the large differences and shrinks the small ones. As if that weren't enough, dispersion is often passed off as variability, when the latter is rather closer to the idea of evolution. With all due respect to Sir Ronald Fisher, it is time this point was properly put to the test.

A Gap Theory

“The growth of the series” should be read as improving growth. In what way does the series grow? In gaps. To obtain an appropriate metric, this blog contrasts the idea of deviations, which lead to variance and its derivatives, with the idea of gaps: the lead each unit holds over the first unit that follows it (or, equivalently, the hypothetical steps each unit would need to reach the one that precedes it). It would then be sufficient to summarize the gaps with an average, except that:

  1. We must make sure that all the quantities are readable as gaps;

  2. The quantities must be dimensionally comparable;

  3. The quantities must have the same direction, i.e. all improving.

The three conditions are fulfilled by variables similar to a ranking, those that I call directly informative (otherwise they can be made informative by using the distance from the preferable value). Let's see what a single quantitative series X, observed on a population of N units, involves.

The growth of X (not supported by the context) is a purely numerical concept. To go further it is useful to move on to an improving growth that allows us to involve the purpose of the analysis. It all stems from a fundamental research request:

What happens as each quantity involved varies?

For now let's focus only on the second part of the statement: as each quantity varies. Are we sure we understand what it means? Is it really that obvious? Some clarification is needed. What is called into question is variability, usually understood as the ability to assume different values: a generic definition, to say the least. This is a crucial point, given that variability is part of the information that the quantity should bring to the analysis.

A quantity is to be considered informative when it sheds light on the ongoing analysis.

This happens when it is placed in context, that is, when it is interpreted in light of the situation. Varying, understood as moving through the series from a minimum to a maximum, has nothing to do with dancing around, with being scattered about a point, which is what certain instruments, variance included, report. Now, rather than asking about the variation of a quantity, it is more appropriate to ask about its increase. But this clarification is not fully satisfactory: growth is still a purely numerical concept. What can be done?

The situation being analyzed is brought into play, given that each quantity should be informative with respect to the context.

By doing so, we will be able to talk about

  • Quantities with improving growth

  • Quantities with worsening growth

  • Quantities with undecidable growth

a distinction which is entirely absent from the current scientific literature. The quantities that are readable as gaps show either improving or worsening growth: here we call them directly informative. Those with undecidable growth can be called indirectly informative; for these, measuring variability makes no sense, so they should either be made directly informative or be left out of the analysis.

A quantity is defined here as having improving growth when its non-preferable value coincides with the observed minimum, and worsening growth when it coincides with the observed maximum. Quantities with undecidable growth have their preferable value inside the range. Note that while growth remains a numerical concept, improving growth acquires meaning and value. In other words, since every quantity runs from a minimum to a maximum, numerical growth alone is not enough to speak of information.

By applying the Preferable Non-Preferable (PnP) transform, i.e. the relative distance from the non-preferable value (previously proposed in a reduced version), each quantity can be read as having improving growth and can therefore be treated as directly informative. In this way every quantity can be read as an improving series. To measure improving growth, the preliminary steps for each series are the following:

  1. Verify whether the situation can give the quantity the connotation of a ranking;

  2. Calculate the relative distances from the non-preferable value (PnP transform).
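The second step can be sketched in a few lines. This is only one possible reading of the PnP transform: the function names `pnp` and `pnp_undecidable`, the `improving` flag and the sample debt figures are illustrative assumptions, not part of the original proposal.

```python
from typing import Sequence


def pnp(values: Sequence[float], improving: bool = True) -> list[float]:
    """Relative distance (0-100) of each value from the non-preferable end of the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series has no range to rescale")
    span = hi - lo
    if improving:
        # Improving growth: the non-preferable value is the observed minimum.
        return [100.0 * (v - lo) / span for v in values]
    # Worsening growth: the non-preferable value is the observed maximum.
    return [100.0 * (hi - v) / span for v in values]


def pnp_undecidable(values: Sequence[float], preferable: float) -> list[float]:
    """Undecidable growth (assumed handling): take the distance from the
    preferable value, which has worsening growth, and then rescale it."""
    distances = [abs(v - preferable) for v in values]
    return pnp(distances, improving=False)


debt = [60.0, 150.0, 105.0]          # hypothetical debt ratios: lower is preferable
print(pnp(debt, improving=False))    # [100.0, 0.0, 50.0]
```

Unlike a plain rescaling, the `improving` flag forces the analyst to state which end of the range is non-preferable before any number is produced.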

The calculation is extended to all k series (Table A below is an example; we will see how to summarize both its columns, $\mu_c, \sigma_c, \gamma_c$, and its rows, $\mu'_r, \sigma'_r, \gamma'_r$). A table like A is recommended when the quantities show a weak or limited inter-correlation structure, and in any case as an intermediate stage of every analysis.

For now, things are as follows. Is there a measure of variability for a data series? It exists under certain conditions:

  1. When the series is similar to a ranking;

  2. When the ranking makes sense from the situation under scrutiny.

In fact, when this occurs, the series generates a series of gaps, that is, the distances of each datum from the first of its pursuers. It should be noted that, while it can be shown that the series of gaps is no less informative than the original series, no analogous relationship appears to hold between the series of squared deviations and the original series. This is the first keystone that closes the arch or, if you prefer, the reasoning process.
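In symbols, one possible reading of this definition is the following. It assumes that the series is sorted in increasing order, $x_{(1)} \le \dots \le x_{(N)}$, and that the lowest unit, which has no pursuer, is given a gap of zero, as the worked detail for tGDP further down suggests:

$$
d_{(1)} = 0, \qquad d_{(i)} = x_{(i)} - x_{(i-1)} \quad (i = 2, \dots, N), \qquad \sum_{i=1}^{N} d_{(i)} = x_{(N)} - x_{(1)}.
$$

Under this reading the sorted series can be rebuilt from its minimum and the running sum of the gaps, $x_{(i)} = x_{(1)} + \sum_{j \le i} d_{(j)}$, which is one way to see that the gaps lose nothing with respect to the sorted series.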

This explains why an average of these gaps is a measure of the variability of the series. So at least one solution exists. In the multi-dimensional case, though, things work a little differently. Multiple quantities have to be rescaled, preferably with PnP, so that they all show improving growth and sit on the same scale, from 0 to 100. Once this is done, however, the averages of the gaps are all equal, because in every series the gaps add up to one hundred. The series of gaps, in turn, generates a new series of gaps, which can be called, for simplicity, irregularities. The average of the irregularities provides the measure (γ) we are looking for.
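A minimal sketch of this construction follows. It assumes that gaps are taken between consecutive values of the sorted series, that the lowest unit gets a gap of zero, and that irregularities are obtained by applying the same operation to the gaps; the helper names and the four sample scores are hypothetical.

```python
def gaps(series):
    """Distance of each unit from the one just below it; the lowest unit gets 0."""
    ordered = sorted(series)
    return [0.0] + [b - a for a, b in zip(ordered, ordered[1:])]


def simple_deviation(series):
    """Mean of the irregularities, i.e. of the gaps of the gap series."""
    irregularities = gaps(gaps(series))
    return sum(irregularities) / len(irregularities)


scores = [0.0, 10.0, 30.0, 100.0]    # hypothetical PnP scores on the 0-100 scale
print(gaps(scores))                  # [0.0, 10.0, 20.0, 70.0]
print(gaps(gaps(scores)))            # [0.0, 10.0, 10.0, 50.0]
print(simple_deviation(scores))      # 17.5
```

Any other PnP series of four units would also have a mean gap of 25, so the mean gap cannot tell such series apart, while the mean irregularity can.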

We will now show how to summarize the k available series (example: Table A) both vertically and horizontally. A special table, new irrespective of the metric just mentioned, allows both a vertical and a horizontal synthesis. It is new because it requires:

  1. A transform that maps the data onto the 0-100 range, called preferable-non-preferable (PnP), which involves the situation being analyzed;

  2. Quantities oriented in the improving direction;

  3. Dimensionally comparable quantities.

Note that the min-max transform, which rescales the data onto the 0-1 or 0-100 range, is purely numerical: it does not involve the context and ignores the orientation of the quantities. The second keystone is this: some series are not immediately readable as gaps; that is, there are quantities that are directly informative and others that are only indirectly informative. This fact highlights the existence of two types of quantities not previously identified in the literature. If a series is not comparable to a ranking, the series of its distances from a preferable value becomes one.


Table A covers the 27 European countries. Its vertical summaries highlight the position of Inflation, followed by the Deficit, ..., and the irregularity of the gaps in Debt and Employment, ... (whereas the standard deviation would point to Inflation and GDP). The desirable and advisable interventions by Europe concern the positions of GDP, Debt, ... and the irregularity of the gaps in GDP, Inflation, ... (and not in Debt, Employment, ... as the standard deviation would say).

In the horizontal summaries, the position of the Netherlands, Ireland, Sweden, Denmark, ... and the irregularity of the gaps for Portugal, Finland, Estonia, France, ... stand out (and not the standard deviation seen in Spain and Ireland, Austria, Finland, ...). Here Europe's desirable and advisable interventions concern the positions of Hungary, Greece, Malta and Belgium, ..., as well as the irregularity of the gaps for the Netherlands and Hungary (and not for Latvia, Bulgaria, Malta, Italy, ... as the standard deviation would state).

The values in the table mark the relative position reached by each country on each quantity. For example, Italy is at 28% of the range for GDP. A desirable order of intervention is obtained from each country's row; for Italy and Belgium, for example, it is appropriate to intervene mainly on Debt, Employment, GDP, ... in that order. The columns μ and γ suggest each country's intervention priorities, with respect to the position achieved on the quantities and to their importance, respectively. The promised information, gamma, lies in the irregularities of the gaps (to be specific: new gaps); it is called simple deviation and is the measure to be contrasted with the classical standard deviation. (Some supporters of variance might consider the standard deviation of the gaps as an alternative measure of dispersion.)

The ranking by $\mu_r$ places Italy in 23rd place (XXIII, better than Belgium), while the irregularity of the gaps, $\gamma_r$, places Italy in 21st place (XXI, better than Belgium and Luxembourg). Table A:

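| Country\PnP% | tGDP | t\|DEF\| | tDEB | tINFL | tOCC | $\mu_r$ | Ranking | $\sigma_r$ | Ranking | $\gamma_r$ | Ranking | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AU | 37,79 | 94,23 | 44,0 | 84,04 | 74,67 | 61,6 | X | 19,9 | III | 10,08 | XVIII | AU |
| BE | 35,30 | 96,15 | 19,6 | 88,30 | 32,89 | 39,6 | XXIV | 29,3 | XII | 13,32 | XXV | BE |
| BU | 0 | 100 | 85,3 | 26,6 | 31,56 | 47,7 | X | 41,1 | XXVI | 9,71 | XVII | BU |
| CI | 23,29 | 34,62 | 44,0 | 84,04 | 72,89 | 44,3 | XX | 34,7 | XXIII | 7,52 | VI | CI |
| DA | 36,13 | 7,69 | 77,3 | 89,36 | 100 | 70 | V | 26,7 | VIII | 7,70 | VII | DA |
| ES | 13,37 | 50,00 | 100 | 36,17 | 65,78 | 55,4 | XII | 31,8 | XVIII | 7,40 | III | ES |
| FI | 34,29 | 0 | 68,4 | 90,43 | 69,78 | 62 | IX | 21,6 | IV | 7,36 | II | FI |
| FR | 31,37 | 50,00 | 39,6 | 90,43 | 69,78 | 47,1 | XVIII | 33,3 | XXI | 7,41 | IV | FR |
| GE | 33,81 | 98,08 | 38,4 | 82,98 | 65,78 | 63,7 | VII | 28 | IX | 8,74 | XIV | GE |
| GR | 25,12 | 34,62 | 8,7 | 75,53 | 30,22 | 38,1 | XXVI | 29,8 | XIV | 7,78 | VIII | GR |
| IR | 49,32 | 98,08 | 78,7 | 76,60 | 64,44 | 73,1 | II | 18 | II | 8,52 | XIII | IR |
| IT | 28,00 | 73,08 | 0 | 86,17 | 18,22 | 41,1 | XXIII | 36,9 | XXIV | 10,46 | XXI | IT |
| LE | 9,00 | 100 | 94,0 | 0 | 60,89 | 52,1 | XIV | 45,9 | XXVII | 10,38 | XX | LET |
| LI | 9,70 | 78,85 | 86,5 | 45,74 | 45,78 | 56,2 | XI | 34,2 | XXII | 9,35 | XVI | LIT |
| LU | 100 | 40,38 | 96,5 | 78,72 | 42,77 | 71,2 | III | 28,1 | X | 12,09 | XXIII | LU |
| MA | 17,65 | 67,31 | 41,3 | 100 | 0 | 38,4 | XXV | 37,8 | XXV | 12,02 | XXII | MA |
| PaBa | 40,89 | 96,15 | 57,8 | 90,45 | 95,11 | 73,3 | I | 28,3 | XI | 16,80 | XXVII | PaBa |
| POL | 6,99 | 63,46 | 58,6 | 79,79 | 10,67 | 49,8 | XXI | 24 | XIX | 12,95 | XXIV | PO |
| POR | 16,95 | 51,92 | 39,9 | 81,91 | 58,67 | 43,7 | XVI | 32,9 | VI | 7,16 | I | POR |
| ReUn | 35,74 | 48,08 | 59,3 | 82,98 | 75,11 | 63,1 | XV | 29,5 | XIII | 7,45 | V | ReCe |
| ReCe | 18,74 | 82,69 | 74,6 | 75,53 | 51,11 | 52 | VIII | 30,7 | XVII | 8,36 | XI | ReUn |
| RO | 2,10 | 51,92 | 90,6 | 55,32 | 18,67 | 42,9 | XXII | 33,1 | XX | 7,88 | IX | RO |
| SLOVA | 12,96 | 65,38 | 74,1 | 87,23 | 27,11 | 46,1 | XIX | 30,7 | XVI | 9,10 | XV | SLOVA |
| SLOVE | 22,67 | 92,31 | 80,1 | 67,02 | 58,67 | 63,9 | VI | 26,3 | VII | 10,34 | XIX | SLOVE |
| SP | 29,75 | 59,62 | 67,3 | 77,66 | 48,89 | 53,9 | XIII | 17,8 | I | 7,99 | X | SP |
| SV | 37,09 | 32,69 | 63,1 | 89,36 | 87,11 | 70,3 | IV | 22 | V | 8,50 | XII | SV |
| UN | 11,05 | 5,77 | 37,7 | 23,40 | 12,00 | 29,7 | XXVII | 30 | XV | 16,55 | XXVI | UN |
| $\mu'_c$ | 26,6 | 62,0 | 39,8 | 72,1 | 51,4 | | | | | | | |
| Ranking | V° | II° | IV° | I° | III° | | | | | | | |
| $\sigma'_c$ | 19,5 | 30,5 | 26,7 | 19,3 | 20,8 | | | | | | | |
| Ranking | II° | V° | IV° | I° | III° | | | | | | | |
| $\gamma'_c$ | 1,88 | 0,93 | 0,67 | 0,87 | 0,44 | | | | | | | |
| Ranking | V° | III° | I° | IV° | II° | | | | | | | |
| $\gamma'_c$ % | 39,2 | 19,4 | 14,0 | 18,2 | 9,2 | | | | | | | |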
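As a small check on how a row of Table A can be read, the sketch below takes Italy's five PnP values from the table and recomputes its row mean, its row standard deviation and its intervention order. Two assumptions are mine: the sample standard deviation (denominator N-1) is used because it is the version that reproduces the tabulated 36,9, and the intervention order is taken to be the ascending order of the PnP values.

```python
from statistics import mean, stdev

# Italy's row of Table A (PnP values, 0-100 scale).
italy = {"tGDP": 28.00, "t|DEF|": 73.08, "tDEB": 0.0, "tINFL": 86.17, "tOCC": 18.22}

values = list(italy.values())
mu_r = mean(values)        # row mean: the position reached by the country
sigma_r = stdev(values)    # sample standard deviation of the row (assumed N-1)

# Lowest PnP value first: the quantity closest to its non-preferable value.
priority = sorted(italy, key=italy.get)

print(round(mu_r, 1), round(sigma_r, 1))   # 41.1 36.9
print(priority[:3])                        # ['tDEB', 'tOCC', 'tGDP']
```

The first three entries match the order Debt, Employment, GDP given in the text for Italy.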

Note in Table B how different the weights to be attributed to the quantities are: for example, unlike the coefficient of variation on the initial data and the standard deviation $\sigma'_c\%$, the proposed simple deviation $\gamma'_c\%$, which represents the new contribution of each quantity to the analysis, makes GDP stand out.

I leave it to you to compare the other summaries and verify these calculations.

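Table B

| Summary | GDP | \|DEF\| | DEB | INFL | OCC |
|---|---|---|---|---|---|
| $\mu'_c(X)$ | 112,8 | 2,22 | 27,46 | 3,78 | 69,56 |
| $\sigma'_c(X)$ | 58,19 | 1,93 | 16,75 | 2,81 | 4,61 |
| $\mu'_c(tX)$ | 26,6 | 62,0 | 39,8 | 72,1 | 51,4 |
| $\sigma'_c(tX)$ | 19,5 | 30,5 | 26,7 | 19,3 | 20,8 |
| $\mu'_c(dtX)$ | 3,70 | 3,70 | 3,70 | 3,70 | 3,70 |
| $\sigma'_c(dtX)$ | 0,097 | 0,052 | 0,045 | 0,056 | 0,035 |

| Summary | GDP | \|DEF\| | DEB | INFL | OCC |
|---|---|---|---|---|---|
| $CV'_c(X)\%$ | 17,3 | 29,1 | 23,5 | 26,6 | 3,5 |
| $\sigma'_c(tX)\%$ | 15,2 | 23,8 | 20,8 | 19,3 | 20,8 |
| $\sigma'_c(dtX)\%$ | 34,0 | 18,3 | 15,8 | 19,7 | 12,3 |
| $\gamma'_c$ | 1,88 | 0,93 | 0,13 | 0,94 | 0,48 |
| $\gamma'_c\%$ | 43,1 | 21,3 | 3,0 | 21,6 | 11,0 |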
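The last row of Table B can be recovered from the row above it by expressing each simple deviation as a share of their total. The sketch below does this, reading the DEB entry as 0,13; the dictionary keys are just labels chosen for the example.

```python
# Simple deviations by quantity, as listed in Table B.
gamma_c = {"GDP": 1.88, "|DEF|": 0.93, "DEB": 0.13, "INFL": 0.94, "OCC": 0.48}

total = sum(gamma_c.values())

# Each quantity's weight is its share of the total, in percent.
weights = {name: round(100 * g / total, 1) for name, g in gamma_c.items()}

print(weights)
# {'GDP': 43.1, '|DEF|': 21.3, 'DEB': 3.0, 'INFL': 21.6, 'OCC': 11.0}
```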

Finally, a detail of the calculation of the irregularities of the first quantity, tGDP. The rows are listed in increasing order of gap, so that each irregularity is the difference between a gap and the one before it:

| Country | Gap | Irregularity |
|---|---|---|
| Bulgaria | 0 | 0 |
| Estonia | 0,3117 | 0,3117 |
| Denmark | 0,3932 | 0,081485 |
| United Kingdom | 0,4369 | 0,043687 |
| Finland | 0,4806 | 0,043687 |
| Cyprus | 0,6116 | 0,131062 |
| Malta | 0,699 | 0,087374 |
| Austria | 0,699 | 0 |
| Lithuania | 0,699 | 0 |
| Sweden | 0,9611 | 0,262123 |
| Belgium | 1,0048 | 0,043687 |
| Czech Republic | 1,0922 | 0,087374 |
| Hungary | 1,3543 | 0,262123 |
| France | 1,6164 | 0,262123 |
| Spain | 1,7475 | 0,131062 |
| Greece | 1,8349 | 0,087374 |
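The same steps can be run on the whole tGDP column of Table A. The sketch below assumes, as in the detail above, that the lowest unit has a gap of zero and that irregularities are the gaps of the sorted gaps; under these assumptions it returns the values 3,70 and 1,88 that appear for GDP in Table B. The helper name `gaps` is illustrative.

```python
# tGDP column of Table A (27 countries, PnP scale 0-100).
t_gdp = [37.79, 35.30, 0.0, 23.29, 36.13, 13.37, 34.29, 31.37, 33.81,
         25.12, 49.32, 28.00, 9.00, 9.70, 100.0, 17.65, 40.89, 6.99,
         16.95, 35.74, 18.74, 2.10, 12.96, 22.67, 29.75, 37.09, 11.05]


def gaps(series):
    """Sort the series and return consecutive differences; the first entry is 0."""
    ordered = sorted(series)
    return [0.0] + [b - a for a, b in zip(ordered, ordered[1:])]


d = gaps(t_gdp)                 # gaps of the PnP values
irregularities = gaps(d)        # gaps of the gaps

mean_gap = sum(d) / len(d)
gamma = sum(irregularities) / len(irregularities)

print(round(mean_gap, 2))       # 3.7
print(round(gamma, 2))          # 1.88
```

The mean gap is 100/27 for every PnP column, which is why Table B shows 3,70 five times; only the mean irregularity distinguishes the quantities.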

Alternative proposals or adjustments to the suggested linear information are welcome. The metric proposed here is an irreverent insinuation for Data Science: a small breach cracking open in an old wall. It's just a start: I'm looking for substitutes for covariance, correlation and the coefficient of determination, to name a few measures.

We will see. Hopefully soon.

Happy New Year, and a happy 2024 and beyond!

Legend for Tables A and B: μ is the mean, σ the standard deviation, γ the simple deviation and CV the coefficient of variation; the subscript r marks the summaries by row (one per country, each followed by its ranking) and c those by column (one per quantity, followed by its ranking or percentage share); X is the initial data, tX its PnP transform and dtX the gaps of the transformed data.