HOME | ABOUT | RESUME | GITHUB | CONTACT


CapStats

The Wu-Tang Clan Network

December 10, 2018

All code can be found here

In honor of Wu-Tang Clan day, I dug out this old post I created for one of my grad school classes. We were learning about graph analysis and this is what I put together. All code is in Python, using the networkx and graphlab packages. Graphlab is pretty great.

For this project, we were tasked with exploring and analyzing a bi-modal network. It was important to me to choose a data set that I was familiar with. Familiarity with the data helped me make sure the calculations produce coherent results.

So why Wu-Tang Clan? The Wu-Tang Clan is a rap group that formed in the early 1990s in Staten Island, New York. There are 9 members of the group and they all went on to produce solo albums with varying degrees of success. A feature of almost all rap albums is that there are many collaborators. Using the Wu-Tang Clan and its members’ discography, it is possible to create an artist network that can be explored and analyzed.

Analyzing this network should allow us to answer some Wu-Tang related questions.

I was a teacher for 10 years, and I think this is a good introductory data set: it is small, the analysis should dovetail with the real-life Wu-Tang story, and it is original. It is also personal for me.

I moved to Staten Island in 1993. This was the same year that the Wu-Tang released their debut platinum selling album, Enter the Wu-Tang (36 Chambers). Before moving to Staten Island’s North Shore, I lived in a small town in upstate New York where most of the kids were listening to Alt-Rock. The music that the kids from Staten Island were listening to was radically different. I don’t really listen to rap music anymore but in the mid 90s on Staten Island the Wu-Tang was where it was at.

#read the file
import pandas as pd

#use the raw file URL (the github.com/.../tree/... page returns HTML, not CSV)
url = "https://raw.githubusercontent.com/capstat/postups/master/wutang/data/wutang.csv"
wutang = pd.read_csv(url, encoding='utf8')

I manually created a CSV file that lists every artist featured on every Wu-Tang album or Wu-Tang member’s album from 1993 to 2007. For each artist on every album, the number of appearances is listed along with some information about the album (Year, Number of Songs, and the RIAA certification - Gold, Platinum, Multi-Platinum, None). I decided not to scrape the internet for this information because the data set was small and I believed I would have spent just as much time coding and checking the results as I did just typing.

I stopped collecting data after 2007 because Wikipedia stopped listing the Wu-Tang member featured on each song after 2007. If I wanted to continue the data collection, I could get the information for their penultimate album from other websites. However, I decided that I had collected enough data for the aim of this project. Also, it would be pointless to try to collect the info from their most recent album. They only created one copy and the track info is still unknown.

The edge weights are the proportion of an album’s songs on which an artist appears. The artist of a solo album was assigned a weight of 1.0. This assumes that the solo artist appears on all of their own album’s tracks. This assumption is not true. For example, only Raekwon raps on the track “The Faster Blade” from Ghostface’s Ironman album. I was not about to listen to every song to verify that the solo artists were on all of their own tracks. I think if I did this again, I would scrape the lyrics and info from here.

#filter out any 0 appearances
#(copy so the new column below is added to its own frame, not a view)
wutang = wutang.loc[wutang['Appearances']!=0].copy()
#create edge weights
wutang['App_Prop'] = wutang['Appearances']/wutang['Songs_on_Album']
wutang[wutang['App_Prop'] != 1].sort_values('App_Prop', ascending=False).head()

Above are the artists that appeared on the highest proportion of songs on an album, excluding artists’ own solo albums.

From 1993 to 2007 members of the Wu-Tang Clan put out 33 albums either as a group or as solo artists. Astonishingly, from 1993 to 1998, the Wu-Tang Clan released 9 albums; all were certified Gold by the RIAA, and 6 were certified Platinum. The only rapper with more Platinum albums is Eminem, and it took him 20 years.

Let’s take a look at the Wu-Tang universe!

import graphlab as gl
gl.canvas.set_target('ipynb')

sf = gl.SFrame(wutang)
g = gl.SGraph()
g = g.add_edges(sf,
                src_field='Album', 
                dst_field='Artist')

g.show(vlabel='__id', vlabel_hover=True, highlight=sf['Artist'])

For all graphs, the artists are in Purple and the albums are in Green. The graph above includes all artists that have appeared on a Wu-Tang album or a Wu-Tang member’s album from 1993 to 2007. The artists in the center of the graph are the artists that have appeared on the most albums. The artists outside the Green ring of albums near the center of the graph represent artists that appeared only on one or two albums. Any album far from the center is an album that features artists that do not appear often on other albums.

#the network of Wu-Tang members only
gMembers = g.get_neighborhood(sf['Artist'][sf['Member']==1])
gMembers.show(vlabel='__id', vlabel_hover=True, highlight=sf['Artist'])

#use networkx for calculations
#based upon SNA textbook
import networkx as nx
from networkx.algorithms import bipartite as bi

#For weighted graphs the edge weights must be greater than zero. 
#Zero edge weights can produce an infinite number of equal length paths between pairs of nodes.
wutang['Weight'] = wutang['App_Prop'] * 10

g2 = nx.from_pandas_dataframe(wutang, source='Album', target='Artist', edge_attr='App_Prop')
artist_network = bi.weighted_projected_graph(g2, pd.unique(wutang['Artist']), ratio=False)

print len(artist_network.nodes()), "Artists"
170 Artists
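
To make the projection step concrete: with ratio=False, networkx's weighted_projected_graph links two artists when they share at least one album and sets the edge weight to the number of albums they share. Here is a minimal pure-Python sketch of that idea. The album and artist rows are made up for illustration, and project_artists is a hypothetical helper, not part of networkx.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (album, artist) rows, for illustration only
rows = [
    ("Album A", "Artist 1"), ("Album A", "Artist 2"), ("Album A", "Artist 3"),
    ("Album B", "Artist 1"), ("Album B", "Artist 2"),
    ("Album C", "Artist 3"),
]

def project_artists(rows):
    """Return {(artist_a, artist_b): number of albums both appear on}."""
    # Group the artists by album
    artists_on = defaultdict(set)
    for album, artist in rows:
        artists_on[album].add(artist)
    # Every pair of artists on the same album gains one unit of weight
    weights = defaultdict(int)
    for artists in artists_on.values():
        for a, b in combinations(sorted(artists), 2):
            weights[(a, b)] += 1
    return dict(weights)

shared = project_artists(rows)
for pair in sorted(shared):
    print("%s - %s: %d" % (pair[0], pair[1], shared[pair]))
```

Artist 1 and Artist 2 share two albums in this toy data, so their edge is the heaviest one in the projection.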

Who is the most important Wu-Tang network member?

The RZA founded the Wu-Tang Clan. He produces most of the music for Wu-Tang albums and he also produces many of the songs on the solo albums. This network does not include that information, so instead I believe the most popular member will be Method Man. His solo work is the most commercially successful compared to the other members. And since he is so popular he is often featured on other artists’ albums. I also imagine that Ghostface will feature prominently because he is the artist that has the most solo albums.

def sorted_map(map):
    ms = sorted(map.iteritems(), key=lambda (k,v): (-v,k))
    return ms

c = nx.closeness_centrality(artist_network)
cc = sorted_map(c)

print "Closeness Centrality"
for each in cc[:5]: print each[0], ": ", each[1]
Closeness Centrality
Method Man :  0.862244897959
Masta Killa :  0.828431372549
Ghostface Killah :  0.816425120773
Raekwon :  0.768181818182
RZA :  0.757847533632

b = nx.betweenness_centrality(artist_network)
bt = sorted_map(b)

print "Betweenness Centrality"
for each in bt[:5]: print each[0], ": ", each[1]
Betweenness Centrality
Method Man :  0.123439237825
Ghostface Killah :  0.122549641751
Masta Killa :  0.104495516439
U-God :  0.0850536306631
Raekwon :  0.0767985121772

p = nx.pagerank(artist_network)
pr = sorted_map(p)

print "PageRank"
for each in pr[:5]: print each[0], ": ", each[1]
PageRank
Method Man :  0.0471179948029
Masta Killa :  0.0441401858446
Ghostface Killah :  0.0423033364056
RZA :  0.0394787031923
Raekwon :  0.0394629498632

Method Man does come out on top in all of the centrality measures. Surprisingly, Masta Killa is ranked highly in many of these centrality measures (he is considered one of the least popular members).
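
For readers new to these measures, closeness centrality is the easiest one to work out by hand: in a connected graph it is (n - 1) divided by the sum of the shortest-path distances from a node to every other node. Below is a rough sketch on a made-up four-node graph, using breadth-first search. The graph and the closeness helper are assumptions for illustration, not the Wu-Tang data or the networkx implementation.

```python
from collections import deque

# Hypothetical undirected collaboration graph as an adjacency dict
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def closeness(graph, start):
    """(n - 1) / sum of shortest-path lengths from start.

    Assumes the graph is connected and unweighted.
    """
    dist = {start: 0}
    queue = deque([start])
    # Breadth-first search gives shortest hop counts
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return (len(graph) - 1) / float(sum(dist.values()))

scores = {node: closeness(graph, node) for node in graph}
for node in sorted(scores, key=lambda n: (-scores[n], n)):
    print("%s: %.3f" % (node, scores[node]))
```

Node A touches everyone directly, so it scores 1.0, while D can only reach the others through A and ends up lowest.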

In 2014, it was announced that Cappadonna is an official member of the Wu-Tang Clan. Can network analysis back this up?

#compare centrality measures
#is Cappadonna's betweenness centrality greater than Streetlife's?
b['Cappadonna'] > b['Streetlife']
False
#is Cappadonna's closeness centrality greater than Streetlife's?
c['Cappadonna'] > c['Streetlife']
False

gCompare = g.get_neighborhood(['Streetlife', 'Cappadonna'])
gCompare.show(vlabel='__id', highlight=sf['Artist'])

Based upon network analysis, it would seem that Streetlife should be considered the 10th member of the Wu-Tang Clan and not Cappadonna.

How did the Wu-Tang network change after Triumph?

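“Triumph” is the lead single from Wu-Tang Forever (1997), the Wu-Tang’s most successful album, which sold over 4 million copies. I use “Triumph” as shorthand for that album below. What did the Wu-Tang network look like before this album and after?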

#network up to Triumph
gTriumph = g.get_neighborhood(sf['Album'][sf['Year']<=1997])
gTriumph.show(vlabel='__id', vlabel_hover=True, highlight=sf['Artist'])

#the network for the 5 years after Triumph
g982002 = g.get_neighborhood(sf['Album'][(sf['Year']>=1998) & (sf['Year']<=2002)])
g982002.show(vlabel='__id', vlabel_hover=True, highlight=sf['Artist'])

#use networkx for quick calculations
gT = nx.from_pandas_dataframe(wutang.loc[wutang['Year']<=1997], 
                              source='Album', target='Artist', edge_attr='App_Prop')
gaT = nx.from_pandas_dataframe(wutang.loc[(wutang['Year']>=1998) & (wutang['Year']<=2002)],
                               source='Album', target='Artist', edge_attr='App_Prop')
print "Diameter before and after Triumph:", nx.diameter(gT), nx.diameter(gaT)
print "Density before and after Triumph:", nx.density(gT), nx.density(gaT)
Diameter before and after Triumph: 4 6
Density before and after Triumph: 0.12012012012 0.0336071695295

The network’s diameter grew by 50% (from 4 to 6) and the density fell from about 0.12 to about 0.03 after the Triumph album.
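
As a reminder of what these two numbers measure: density is the share of possible edges that actually exist, 2m / (n(n - 1)) for an undirected graph with n nodes and m edges, and diameter is the longest shortest path between any pair of nodes. The sketch below works both out on a made-up path graph. The edge list and helper names are assumptions for illustration only.

```python
from collections import deque

# Hypothetical undirected edge list: a simple path A - B - C - D
edges = [("A", "B"), ("B", "C"), ("C", "D")]

def build_adjacency(edges):
    """Turn an edge list into {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def density(adj):
    """2m / (n * (n - 1)) for an undirected graph."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2.0
    return 2.0 * m / (n * (n - 1))

def eccentricity(adj, start):
    """Largest shortest-path distance from start (connected graph assumed)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())

def diameter(adj):
    """Largest eccentricity over all nodes."""
    return max(eccentricity(adj, node) for node in adj)

adj = build_adjacency(edges)
print("density: %.3f" % density(adj))
print("diameter: %d" % diameter(adj))
```

Adding nodes that hang off the edge of a network, as one-off guest artists do, tends to push the diameter up and the density down, which is the pattern in the numbers above.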

Are Ghostface and Raekwon really best friends?

Ghostface and Raekwon are the two Wu-Tang members who have worked with each other the most in the past. Will we be able to detect this two-person clique using social network analysis (SNA)?

#using the jaccard coefficient to compare the similarity between nodes
preds = nx.jaccard_coefficient(artist_network, [('Ghostface Killah', 'Raekwon'),
                                                ('Ghostface Killah', 'Method Man'),
                                                ('Ghostface Killah', 'Masta Killa')])
for u, v, p in preds:
    print (u, v, p)
('Ghostface Killah', 'Raekwon', 0.6938775510204082)
('Ghostface Killah', 'Method Man', 0.7169811320754716)
('Ghostface Killah', 'Masta Killa', 0.6772151898734177)
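
The Jaccard coefficient of two nodes is the number of neighbors they share divided by the number of distinct neighbors either one has. A small sketch with made-up neighbor sets (the collaborator names are placeholders, not the real data):

```python
def jaccard(neighbors_u, neighbors_v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 when both sets are empty."""
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / float(len(union))

# Hypothetical collaborator sets for two artists
collab_x = {"P", "Q", "R", "S"}
collab_y = {"Q", "R", "S", "T"}

score = jaccard(collab_x, collab_y)
print("Jaccard: %.2f" % score)
```

A score near 1 means two artists work with almost the same set of people, so the measure rewards overlapping circles of collaborators rather than how often the two artists appear together.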
#see if Ghostface and Raekwon are in any cliques together
gC = trim_edges(artist_network, weight=10000)
cliques = list(nx.find_cliques(gC))
cliques
[[u'Cappadonna',
  u'Masta Killa',
  u'Method Man',
  u'Ghostface Killah',
  u'Raekwon'],
 [u'Streetlife', u'Inspectah Deck'],
 [u'RZA', u'ODB'],
 [u'RZA',
  u'Masta Killa',
  u'Raekwon',
  u'Inspectah Deck',
  u'GZA',
  u'U-God',
  u'Method Man',
  u'Ghostface Killah']]

Ghostface and Raekwon do not show up alone in any cliques. Notice that RZA and ODB do (they are cousins).
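
The trim_edges helper used above is not defined anywhere in this post; it presumably comes from the SNA textbook mentioned earlier. Here is a minimal sketch of what such a helper might look like, assuming it keeps only the edges whose 'weight' attribute is above a threshold. The toy graph and its weights are made up.

```python
import networkx as nx

def trim_edges(g, weight=1):
    """Return a new graph holding only the edges of g with weight > `weight`.

    Assumption: this mirrors the helper used in the post, whose real
    definition is not shown. Nodes left without edges are dropped.
    """
    trimmed = nx.Graph()
    for u, v, data in g.edges(data=True):
        if data.get("weight", 0) > weight:
            trimmed.add_edge(u, v, **data)
    return trimmed

# Hypothetical weighted graph: a strong triangle plus one weak tie
toy = nx.Graph()
toy.add_edge("A", "B", weight=5)
toy.add_edge("B", "C", weight=5)
toy.add_edge("A", "C", weight=5)
toy.add_edge("C", "D", weight=1)

strong = trim_edges(toy, weight=2)
cliques = sorted(sorted(c) for c in nx.find_cliques(strong))
print(cliques)
```

Trimming first means the clique search only sees strong ties, so the weak C to D link no longer counts as a two-person clique.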

Maybe if we remove Method Man from the data set the relationship between Ghostface and Raekwon will be more pronounced.

noMM = wutang.loc[wutang['Artist']!='Method Man']
gT = nx.from_pandas_dataframe(noMM, source='Album', target='Artist', edge_attr='App_Prop')
no_MM = bi.weighted_projected_graph(gT, pd.unique(noMM['Artist']), ratio=False)

gC = trim_edges(no_MM, weight=15000)
cliques = list(nx.find_cliques(gC))
cliques
[[u'Raekwon', u'Inspectah Deck'],
 [u'Raekwon', u'RZA', u'Ghostface Killah', u'Masta Killa']]

Conclusion

The Wu-Tang Clan data set is a great sample data set to see how network analysis can be used to describe real world networks. These techniques can be used on much larger data sets to reveal insights into the nature of the relationships in the network.

