Network Analysis Literacy

Sunday, December 4, 2016

Note 18 - A quote by Barabási

Albert-Laszlo Barabási at the World Economic Forum 2012

CC BY-SA 2.0: By World Economic Forum from Cologny,

Switzerland - Mastering Complexity,

https://commons.wikimedia.org/w/index.php?curid=21275699

Note 18 is simply a quote from an early article by Albert-László Barabási on how complex network analysis helps to tame complexity from 2005. I believe that we are still at the beginning of the journey, while---of course---the field has made serious progress on the analysis of, e.g., dynamic and multiplex networks:

Yet, the road to a fundamental and comprehensive understanding of networks is still rather rocky. (Barabási, 2005)

Reference:

Albert-László Barabási: "Taming complexity", Nature Physics, 1:68–70, 2005

Note 17: Universal features vs. contextual interpretation

The last point already made the point that some edge weights representing real-world concepts such as probabilities or friendship do not allow a meaningful interpretation of graph theoretic distances. Such an interpretation is depending on the context, the meaning of the relationships and weights in the real-world. However, hip new "network science" was in part so very hot in contrast to lame, old "social network analysis", because it just applied any kind of measure to all kinds of complex networks to identify structures common to all of them. This was the case for Watts' and Strogatz' seminal paper on Small-Worlds (Watts, 1998) or for Barabási and Albert's paper on Scale-Free Networks (Barabási, 1999).

Note 16: On distances, edge weights, and other modeling decisions

(Graph theoretic) distance is certainly one of the best understood
concepts in network analysis. However, not every weighted graph
should be used to compute the distance between all pairs of nodes.
This figure is under CC:BY with a reference to
Prof. Dr. Katharina A. Zweig or to this blogpost.

The distance between any two nodes in an undirected, unweighted graph is defined as the minimal number of edges one has to traverse, to get from one node to the other. A natural generalization of this concept to weighted graphs defines the distance as the minimal sum of the weights on any sequence of edges between the two nodes.

For weights that represents the length of streets, this makes perfectly sense. But of course, weights in complex networks can represent almost anything. Let's consider the number of hours two people called each other in the last two weeks. Or the probability to surf from one webpage through another by a direct link from the first to the second page.

The distance between two nodes in a complex network is used for many things, foremost so-called centrality indices like the betweenness centrality or the closeness centrality. In general, for most centrality indices, a low distance to many other nodes will make a node more central. However, the length of calls between two persons is rather a measure of their closeness, not their distance. Thus, summing up these values will actually favor those pairs of nodes, who are linked by paths with people that do not call each other for a long time. It can help here, to invert the weights to make the meaning of distance more intuitive. However, even in this case: what does it actually mean if I am connected to another person by two other persons who talk to each other for two hours each? Then my distance to that guy is 1/2 + 1/2 = 1. Does that make me closer to that guy than being directly connected to another person, which I only call for 10 minutes, i.e., with a "distance" of 6?

With the probabilities, a summation obviously makes not much sense. Here, a multiplication of the weights might be most meaningful to yield interpretable results.

Note 16. If network representation and network ana-
lytic measure are well matched, the measure’s value can
be interpreted with respect to the functionality of the net-
work for the complex system of interest. (Zweig, 2016)

Read more on this on the general keyword of "trilemma of complex network analysis" and in Chapter 5 "Network representations of complex systems" of my book.

Reference:

(Zweig2016) Katharina A. Zweig: Network Analysis Literacy, ISBN 978-3-7091-0740-9, Springer Vienna, 2016

Note 14: Network analysis and statistics

Network Analysis has made heavy use of statistics - and it seems that statistics is not humankind's strength. That does not make network analysis any easier.
This figure is under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost.

A main part of network analysis is computing and interpreting statistical numbers. Unfortunately, statistics is not the most intuitive part of mathematics and it is well-known that even trained scientists have problem in correctly interpreting statistical results. Consider the following test:

A group of young students without any symptoms of sickness make a blood donation. Routinely, their blood is checked for an HIV infection. The test is very sensitive. For simplicity, assume that it detects 99.99% of all infected persons and that non-infected persons will get a negative test with 99.99%. If now a person's first test returns a positive result, what is her likelihood to actually be infected?

If you think it is 99.99%, you are in very good company (but wrong):

Note 14. Gigerenzer and his team showed in various
studies that almost none of the experts was able to give
the correct answer . Most answered that, as the test is
so speciﬁc and so sensitive, the probability that a person
is infected if the test says so is 99.99%. (Zweig, 2016)

Actually, the question cannot really be answered without knowing the chance that a person without any symptoms is infected. This probability can be approximated by the so-called incidence rate, the number of new infections per year. For Germany, this is around 3,000 in a nation with about 80 million inhabitants (for young people, it might actually be higher than for the general population, but as an approximation that is fine).

We now want to know the probability that a person is infected if her test turns out to be positive. There are two ways for a positive test: the person is infected and detected or the person is not-infected but falsely flagged. If we would test whole Germany, we would in essence find all of the 3,000 newly infected persons. However, from the remaining (still roughly) 80 million people, we would flag 0,01%, i.e., 1 person in 1 in 10,000. Thus, we additionally flag 8,000 people as positive. From all 11,000 people with a positive test, 8,000 would actually not be infected. I.e., the probability that a person with a positive test is infected is less than 50%, namely around 27%. Surprised?

This computation has a very important consequence. Let our 'null-hypothesis' be that any given person is not infected. Now, we know that the probability that a person is not infected and gets a positive test is very small - this value is called her p-value (probability to observe the data given the assumption in the null-hypothesis). Especially, it is smaller than p=0.05, the classic threshold value to 'reject the null-hypothesis'. However, as we have seen, we need to compute the probability that the person is infected given a positive test result. And this probability can differ strongly from the other one when the ratio of the two classes (infected vs not infected) is not around 0.5. Thus, rejecting a null-hypothesis, just because given the assumption ("not infected") the observation of the data ("positive test") is unlikely, is the wrong way.

Note 15. The only correct verbal descriptions of a p-value need to
contain the words given that the null-hypothesis is true as
the p-value conditions on that. As the p-value does not
say anything about the probability of the hypothesis to
be true, given the observed data, it cannot be used as a
basis for rejecting the null-hypothesis. (Zweig, 2016)

It is just the first step to update our probability of the assumption, given the observed data. This will be important, e.g., to identify network motifs.

If statistics is already hard, then this makes network analysis no way easier! Read more about statistical hypotheses testing on Wikipedia. Or join my Mendely group on "Good statistics papers for non-statisticians".

References:

(Gigerenzer, 2007) Gerd Gigerenzer, Wolfgang Gaissmaier, Elke Kurz-Milcke, Lisa M. Schwartz, and Steven Woloshin. Helping doctors and patients make sense of health statistics. Psychological science in the public interest, 8(2), 2007.

(Zweig2016) Katharina A. Zweig: Network Analysis Literacy, ISBN 978-3-7091-0740-9, Springer Vienna, 2016

Notes 12 and 13: Evolution of complex networks


Networks evolve under different constraints and forces, in which ; network science is especially interested. This figure is under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost.

The perspective of statistical physics is always on finding universal forces that form and determine a given phenomenon. From this perspective, network science originating in statistical physics was also always searching for the forces on either the entities or the whole system that govern the evolution of a network's structure.

"Note 12. Network science describes the topology of a network as the eﬀects of forces on either the entities or
the whole system." (Zweig, 2016)

For example, the "smallness" of small-worlds can be explained by minimizing cost (energy) and the diameter at the same time, under the assumption that an edge between more distance nodes is more expensive. Putting some energy in long-distance edges is the optimal way to reduce the diameter of the network with the smallest total cost involved.

Looking at a dynamic system, statistical physis is then most interested in two phenomena: phase transitions and equilibria. Phase transitions indicate the point, where a small shift in one parameter radically changes the way the system behaves. As such, this is not a phase in which it is easy to understand the forces and constraints under which the networks are built. It is more a state of 'criticality' of which scale-free distributions are a tell-tale sign. When the "scale-free" degree distribution was observed (which actually is in most cases not directly scale-free) it seemed as if complex networks were in a state of 'self-organized criticality', i.e., a system that keeps itself in this state. This would have been very interesting because it points to a certain equilibrium of forces. In any case, the other most interesting state is the equilibrium, which can be either a stable one or an instable one. In the first case, small perturbations will be quickly diminished and the system returns back to the equilibrium state. In an instable one, a small perturbation will lead to a totally different state. In equilibria it is often easier to better understand the forces and constraints that form the system's behavior.

"Note 13. Network science is interested in equilibrium
structures as they can be used to understand the forces,
constraints, or incentives under which a network is built." (Zweig, 2016)

Presumed constraints are, for example, Dunbar's number in social networks, i.e., the observation that people might not be able to manage more than 150 close acquaintances. Reasonable forces are energy/cost minimization for most networks, i.e., that the number and length of edges to be built is limited or should be minimized while - in most cases - building connected and maybe even locally dense networks.

You can find more on the perspective of statistical physics in the 2nd chapter "Universal structures vs. individual featuers" of my book. There I compare it with the perspective of social science and why I believe that Network Science is really something different than (Social) Network Analysis.

Reference:

(Zweig2016) Katharina A. Zweig: Network Analysis Literacy, ISBN 978-3-7091-0740-9, Springer Vienna, 2016

All figures are under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost, if not mentioned otherwise.

Tuesday, September 27, 2016

Drawing graphs for publications or presentations

The one software I would not like to miss for drawing and editing graphs is yEd, a free software provided by yWorks. It is an intuitive tool to draw small graphs and networks where - in contrast to most normal vector graphics programs - moving around the nodes will automatically include the edges as well. Furthermore it is possible to mark a subgraph and to turn it or to mirror its position - without mirroring or turning the node labels as well.

I've created almost all drawings for my publications with it, especially those for my book and this blog. My normal design pipeline is to draw the graph in yEd and then either to save it directly as eps or pdf for inclusion in LaTeX documents or to first save it as an svg. Then I would give some finishing touches to the drawing in inkscape, from which it can then again be exported as eps and pdf.

Despite the fact that I've been using this tool now for more than 12 years, I just learned that it is also the best way to create an image for Power Point presentations. Instead of exporting it to bmp which scales badly, try exporting it to emf. Scaling this format leads to much better results.

Monday, September 26, 2016

Foreword by Steve Borgatti

I was very happy to hear that Steve Borgatti agreed to write a foreword for my book. As I describe it in my introduction, his research was the starting point for my book. So, I'll hand over to him and copy his foreword here:

This is a delightful book. It’s so easy to read, you can almost accidentally learn quite a bit of network science without even noticing it. Written in a playful manner, it tends to enliven the brain rather than put it to sleep – quite a change from the usual pedantic tome. It’s a quirky book that does not try to be systematic. For example, it does not cover “community detection” (that’s cluster analysis to you social scientists). As a result, the book has a great deal of personality.
But what I really like about the book is the subtext. What it’s actually about, in my opinion, is how to think, and here, that means how to think with models. Most academics are very gullible when it comes to concepts outside their disciplines. Within their area, any new idea or phrasing is treated with withering skepticism, but outside their area, they adopt ideas with the speed of teenagers adopting slang or fashion. Thus, a management scholar hears about small worlds and clustering coefficients and immediately shoehorns them into their next study. A physicist learns about betweenness centrality and suddenly there are 500 papers that reference the idea. If the first paper associates betweenness with influential spreaders in the spread of a disease, all of the following papers do the same. If you internalize this book, you won’t make that mistake. You will realize that, although there is a sense in which network measures are tools like hammers, there is much more to them. Hammers work pretty much the way they work in any setting, but using a network measure implicitly entails fitting a model of how things work. And if the model doesn’t fit, the measure doesn’t either.
Curiously, although I associate model-based thinking with the physical sciences, my experience is that both physical and social scientists are equally likely to have this mindless, “pluginski” attitude about network concepts. Therefore, I think this book would be useful for both audiences. But since the content of the book is mostly drawn from what Katharina calls the “network science” field (as opposed to the “social network analysis” field), I’m guessing it will appeal mostly to budding physical scientists. Too bad, because if there was ever an introduction to network science that was especially suitable for social scientists, this is it.
I look forward to seeing this in print.
Lexington,KY
Steve Borgatti

Foreword by Steve Borgatti for Katharina A. Zweig's book: "Network Analysis Literacy", in print; (c) by Springer Wien, used with permission