Visualizing the Shape of SoundCloud Communities with Web Scraping and Machine Learning

Jameson Orvis
10 min read · Feb 5, 2021


SoundCloud is a seemingly infinite mosaic of overlapping subgenres and communities. Browsing it can be overwhelming, and it’s almost impossible to stay tapped in with every scene on the platform. This incredible map of underground producers made by Paulon Records is the closest thing I’ve seen to a comprehensive documentation of the SoundCloud abyss, and even then it only focuses on producers up to the year 2015.

Seeing this inspired me to consider ways a map of SoundCloud could be generated algorithmically. One of my go-to ways to find new music on SoundCloud is to comb through an artist’s likes. Looking at an artist’s likes is a fascinating glimpse into their own music taste and the community that surrounds them. Presumably an artist that makes music you enjoy would be liking music that you would also enjoy.

But does the fact that two artists frequently like each other’s tracks mean they are musically similar? Not necessarily. Scenes on SoundCloud are defined much more by the people making the music than by the music itself, which is often some nebulously defined amalgamation of styles. Failing to recognize this leads to situations like A.G. Cook’s iteration of the Spotify hyperpop playlist, which faced significant backlash for disingenuously lumping together the SoundCloud community with PC Music artists, among others. It is therefore probably more informative to analyze interactions on SoundCloud rather than the actual music itself to understand the landscape of the platform.

There are four main ways you can interact with someone else on SoundCloud: Likes, comments, reposts, and follows. I needed to find some way to collect this data algorithmically. Ideally, there would be a function in the SoundCloud API along the lines of “get_likes(‘ericdoa’)” and it would spit back every song they’ve ever liked. However, the form for requesting SoundCloud API keys seems to have been down for years. There are some inconvenient ways to reverse engineer an API key by pulling it from SoundCloud’s calls to its own API, but the type of function I’m looking for doesn’t even seem to exist in the SoundCloud API’s documentation. Therefore I turned to scraping, i.e. algorithmically browsing a website and pulling data directly from the HTML.

Web scraping is almost always inconvenient, but especially so for SoundCloud. For example, an artist’s SoundCloud likes are presented in the form of an infinitely scrolling list. This means that to see all of an artist’s likes, you would potentially have to scroll for several minutes before reaching the bottom. So I found a script that repeatedly jumps to the bottom of the page, with a half-second pause between jumps to give the page time to load. I limited the script to 50 jumps, which captures only an artist’s 500 most recent likes but saves a lot of scraping time. Once all the likes were loaded, I downloaded the HTML directly and essentially Ctrl-F’d for artist names using Regex functions. My scraper would then record each like in a spreadsheet, a sample of which is shown below:

Each row represents the number of likes given by the artist labeling that row to the artists labeling each column. Within their last 500 likes, you can see kuru liked one of their own tracks, no Axxturel or cargoboym tracks, five tracks from endoh, and two tracks from orchid. (“kggn” is SoundCloud’s internal marker for kuru. A discrepancy like this can happen if an artist has changed their name or uses non-standard characters.) If an artist liked a track from an account not already in the spreadsheet, the scraper would add a column to the spreadsheet with that account’s name and record the like. I also collected comments, reposts, and follows data in similar spreadsheets.
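As a rough illustration of the "Ctrl-F with Regex" step, here is a minimal Python sketch that counts liked tracks per artist in a saved HTML page. The markup and the `track-artist` class name are hypothetical, made up for this example, since the real page structure isn't shown here. The function name `count_likes_by_artist` is also just an illustrative choice.

```python
import re
from collections import Counter

# Hypothetical markup: the real SoundCloud page structure is not shown in the
# article, so this pattern is an assumption made for illustration only.
ARTIST_LINK = re.compile(r'<a class="track-artist" href="/([^"/]+)"')


def count_likes_by_artist(html: str) -> Counter:
    """Count how many liked tracks in the saved HTML belong to each artist slug."""
    return Counter(ARTIST_LINK.findall(html))


# Toy page: two liked tracks by endoh and one by kuru (whose URL slug is "kggn").
sample_html = """
<a class="track-artist" href="/endoh">endoh</a>
<a class="track-artist" href="/endoh">endoh</a>
<a class="track-artist" href="/kggn">kuru</a>
"""

row = count_likes_by_artist(sample_html)
print(row)  # Counter({'endoh': 2, 'kggn': 1})
```

Matching on the URL slug rather than the display name is one way the "kggn" vs. kuru discrepancy could arise. A `Counter` returns 0 for accounts that were never liked, which matches the zeros in a spreadsheet row.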

But if you’re intending to develop a comprehensive map of communities on SoundCloud, how do you know which accounts to visit? It’s not like there is a readily available list of every important artist in the SoundCloud underground. Visiting the most popular accounts on SoundCloud would simply give you a map of mainstream hip-hop, and even then probably not a particularly good one since most major artists don’t use SoundCloud much. And as much as I would love to get data for every account on SoundCloud, that’s not at all practical given my scraper has to tediously scroll through the likes, follows, comments, and reposts of every account it scrapes.

Therefore I chose three “seed” accounts to act as a starting point for data collection. Almost every account seen when scraping these original seed accounts would be unique, and therefore added as a column to my spreadsheets. I also designed my scraper to scrape these unique accounts after scraping the original seeds, continually finding new artists along the way and saving them to be scraped later. So theoretically, given infinite time, the scraper would eventually reach every account on SoundCloud, and the choice of seeds wouldn’t matter much. But barring infinite time, the choice of seeds would have a significant impact on what accounts the scraper would reach.

Obviously the seeds I chose were heavily biased towards my own personal taste, but I tried to pick three representatives of what I thought were the most interesting scenes on SoundCloud. To represent the Digicore/hyperpop scene, I chose kuru. Many artists in the scene could have worked here, but I specifically chose kuru because they are particularly active on SoundCloud with over 7,000 likes, in addition to being a member of 3 different significant collectives in the scene (Bloodhounds, Novagang, & Graveem1nd). Next, I chose Axxturel to represent the post-SpaceGhostPurrp “dark trap” lane. This was an obvious choice, since Axxturel is the founder of Jewelxxet and highly active on SoundCloud. Finally, I chose cargoboym, aka the creator of Rare RCB HexD.mp3, to represent the HexD scene. In hindsight, the SoundCloud community surrounding cargoboym is not that large, and I might’ve been better served by choosing a representative from a more expansive scene, e.g. Plugg.

Starting from these three seeds, I allowed my scraper to run until it visited 850 accounts. I did not obtain data for some of these accounts: if an account was found to have fewer than 250 followers, my scraper would skip it and move on to the next one, both to save time and to distinguish actual artists from random SoundCloud users. By the end, the list of artists my scraper had recorded but not yet scraped had ballooned to nearly 50,000. There was no way I was going to get through 50,000 artists in a reasonable amount of time.
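The seed-and-expand loop can be sketched roughly as follows. This assumes a simple first-in, first-out queue, which is a guess on my part about the visiting order. The two `fetch_*` callables are hypothetical stand-ins for the real scraper, and the follower counts below are made-up toy numbers.

```python
from collections import deque


def crawl(seeds, fetch_followers, fetch_interactions,
          max_visits=850, min_followers=250):
    """Visit accounts starting from the seeds, queueing newly seen accounts.

    fetch_followers(account) -> int and fetch_interactions(account) -> list of
    account names are hypothetical stand-ins for the real scraping code.
    Returns (scraped, remaining): the data collected, and the accounts that
    were recorded but not yet visited.
    """
    queue = deque(seeds)
    seen = set(seeds)
    scraped = {}
    visits = 0
    while queue and visits < max_visits:
        account = queue.popleft()
        visits += 1
        if fetch_followers(account) < min_followers:
            continue  # counts as a visit, but no data is collected
        targets = fetch_interactions(account)
        scraped[account] = targets
        for target in targets:
            if target not in seen:  # a new account: save it to scrape later
                seen.add(target)
                queue.append(target)
    return scraped, list(queue)


# Toy data, invented for this example.
toy_followers = {"kuru": 9000, "endoh": 4000, "fan123": 12, "orchid": 800}
toy_interactions = {"kuru": ["endoh", "fan123"], "endoh": ["orchid", "kuru"]}

scraped, remaining = crawl(
    ["kuru"],
    toy_followers.__getitem__,
    lambda account: toy_interactions.get(account, []),
    max_visits=3,
)
print(sorted(scraped), remaining)  # ['endoh', 'kuru'] ['orchid']
```

With a visit budget of three, "fan123" is visited but skipped for having too few followers, and "orchid" is left on the list of accounts still waiting to be scraped.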

Therefore, I decided to sort the list of remaining artists by the total number of interactions the first 850 artists gave them, and continue scraping artists in this order. Doing this arguably defeats the point of finding artists with a sort of “depth first” approach, since the most liked artists, even by underground SoundCloud rappers, will skew a bit mainstream. And sure enough, the top artist that appears here is Lil Uzi Vert. However, I noticed many essential underground figures rise to the top of this list that got skipped in the first 850, so I think this approach works well.
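A minimal sketch of that re-ordering step, assuming the collected data is held as a nested dictionary of interaction counts (the data layout, the function name `rank_remaining`, and the toy accounts are all assumptions for illustration):

```python
from collections import Counter


def rank_remaining(scraped, remaining):
    """Order unscraped accounts by total interactions received so far.

    scraped: {scraped_account: {target_account: interaction_count}}
    remaining: accounts that have been seen but not yet scraped
    """
    received = Counter()
    for targets in scraped.values():
        for target, count in targets.items():
            received[target] += count
    # Most-interacted-with accounts first; sorted() is stable, so ties keep
    # their original relative order.
    return sorted(remaining, key=lambda account: received[account], reverse=True)


# Toy data: "y" received 1 + 5 = 6 interactions, "x" received 3, "z" received 1.
scraped = {"a": {"x": 3, "y": 1}, "b": {"y": 5, "z": 1}}
order = rank_remaining(scraped, ["x", "y", "z"])
print(order)  # ['y', 'x', 'z']
```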

My scraper ended up visiting 2,250 accounts, recording each artist’s 500 most recent likes, comments, reposts, and follows into four 2,250 x 121,827 spreadsheets. I set out to combine the data from each spreadsheet into one, calculating a net “interaction score” for each artist pairing. I added the corresponding values from each spreadsheet together, multiplying the number of follows, comments, and reposts by 3 while leaving the likes values unchanged. I feel this captures the relative significance of each interaction on SoundCloud, with reposts, follows, and comments carrying about equal weight and likes being less significant than the rest. I then filtered out accounts whose interaction score sum was less than 50, getting rid of most major artist accounts and many less active users.
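In other words, the score for a pair is likes + 3 × (comments + reposts + follows). Here is a small sketch of that weighted sum, assuming the counts are stored per (source, target) pair in dictionaries rather than in full spreadsheets (the layout and names are illustrative assumptions):

```python
# Weights as described above: likes count once, everything else three times.
WEIGHTS = {"likes": 1, "comments": 3, "reposts": 3, "follows": 3}


def interaction_scores(counts_by_type):
    """Combine per-type counts into one weighted score per (source, target).

    counts_by_type: {"likes": {(source, target): n}, "comments": {...}, ...}
    """
    scores = {}
    for kind, weight in WEIGHTS.items():
        for pair, n in counts_by_type.get(kind, {}).items():
            scores[pair] = scores.get(pair, 0) + weight * n
    return scores


# Toy example: 5 likes, 2 reposts, and 1 follow from kuru to endoh.
toy_counts = {
    "likes": {("kuru", "endoh"): 5},
    "reposts": {("kuru", "endoh"): 2},
    "follows": {("kuru", "endoh"): 1},
}
scores = interaction_scores(toy_counts)
print(scores[("kuru", "endoh")])  # 5*1 + 2*3 + 1*3 = 14
```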

To prevent highly active SoundCloud accounts from skewing the data, I normalized every artist’s set of interaction scores by dividing by the sum of their interaction scores. This way, if one account gave 10 likes to an artist and had 100 likes total, while another account liked the same artist once and had only 10 likes total, they would be given the same normalized interaction score. Finally, I filtered out accounts whose maximum normalized interaction score was greater than 0.4, meaning accounts where over 40% of all their SoundCloud activity was with a single account. This got rid of numerous accounts that did nothing but like their own tracks. These filters whittled down the number of rows left in my spreadsheet to 1,637.
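The normalization and the 40% filter can be sketched like this, again assuming a nested-dictionary layout instead of a spreadsheet (function names and toy accounts are made up for the example):

```python
def normalize_rows(rows):
    """Divide each account's scores by that account's total score.

    rows: {source_account: {target_account: interaction_score}}
    Accounts with a total of zero are dropped to avoid dividing by zero.
    """
    normalized = {}
    for source, targets in rows.items():
        total = sum(targets.values())
        if total > 0:
            normalized[source] = {t: s / total for t, s in targets.items()}
    return normalized


def drop_single_target_accounts(normalized, max_share=0.4):
    """Keep only accounts whose largest normalized score is at most max_share."""
    return {
        source: targets
        for source, targets in normalized.items()
        if max(targets.values()) <= max_share
    }


# Toy rows: "big" gives x 10 of 100, "small" gives x 1 of 10, and
# "selfliker" spends 90% of its activity on its own tracks.
rows = {
    "big": {"x": 10, "y": 30, "z": 30, "w": 30},
    "small": {"x": 1, "y": 3, "z": 3, "w": 3},
    "selfliker": {"selfliker": 9, "x": 1},
}
normalized = normalize_rows(rows)
kept = drop_single_target_accounts(normalized)
print(normalized["big"]["x"], normalized["small"]["x"])  # 0.1 0.1
print(sorted(kept))  # ['big', 'small']
```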

Conceptually, my data was a collection of 1,637 points in 121,827-dimensional space, which is obviously somewhat problematic to plot. Luckily, dimensionality reduction techniques exist to handle this very problem. The most common of these, Principal Component Analysis (PCA), uses eigenvectors of the data’s covariance matrix to linearly project the data down to a lower-dimensional space while retaining as much of the variance as possible. I used PCA on my dataset to reduce it down to two dimensions and plotted the results, labeling a random handful of artists:

PCA Dimensional Reduction

From a data visualization perspective, this graph is quite terrible. The vast majority of artists seem to exist somewhere on that thick horizontal line along the bottom, and there doesn’t seem to be any obvious meaning behind an artist’s position along this line. In general, there is no real meaning behind each dimension of a PCA reduction, and it can only be interpreted relatively. However, it is fascinating to me that PCA appears to have decided that the y-axis should be a measure of how closely an artist aligns with the Jewelxxet collective, with Axxturel and the Jewelxxet account itself appearing high above the rest.

For data visualization purposes, other dimensionality reduction techniques should perform significantly better. I decided to use t-distributed Stochastic Neighbor Embedding (tSNE), a technique frequently used to visualize high-dimensional data. To prevent tSNE from taking an eternity to run, I first used PCA to reduce the original data set down to 50 dimensions, then reduced that data to two dimensions with tSNE. You can see a picture of the graph below, but you should really visit the interactive version here.
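For readers who want to try the PCA-then-tSNE pipeline themselves, here is a minimal sketch assuming scikit-learn and NumPy are installed. The random matrix is a small stand-in for the real normalized interaction data, and `embed_2d` is a name invented for this example. The perplexity of 30 matches the value I used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_2d(X, n_pca=50, perplexity=30, seed=0):
    """Reduce X to n_pca dimensions with PCA, then to 2 dimensions with tSNE."""
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(reduced)


# Stand-in data: 120 "accounts" by 200 "targets", each row normalized to sum
# to 1. The real data set is far larger (1,637 x 121,827).
rng = np.random.default_rng(0)
X = rng.random((120, 200))
X /= X.sum(axis=1, keepdims=True)

coords = embed_2d(X)
print(coords.shape)  # (120, 2)
```

Each row of `coords` is one account’s (x, y) position in the plot. Note that tSNE needs more samples than the perplexity value, and PCA to 50 dimensions needs at least 50 samples and 50 features.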

tSNE dimensional reduction

Looking at this graph, it is important to keep in mind that tSNE is bad at preserving the global structure of data. This means that while it might be very good at putting closely related artists next to one another, the relative positions of clusters may not necessarily matter. For example, I do think it’s significant that the Jewelxxet and Flexxcult clusters are close to one another, but I don’t think the proximity of the Jewelxxet cluster to the Anti-World/Spider Gang cluster means anything significant.

The graph is certainly not perfect. I think many of the clusters, especially towards the center, appear to consist of larger accounts tSNE didn’t really know what else to do with (I, personally, would not choose to put Skrillex, Gucci Mane, and JPEGMAFIA together in the same cluster). However, as far as I can tell, it does seem to capture almost every significant scene on the platform. I annotated the plot lightly with the scenes that jumped out at me, but there are many more scenes evident in the diagram.

Jewelxxet and Flexxcult seem to form two of the most well-defined clusters on the whole graph. Contrast this with the Digicore region, where the whole scene is an overlapping mishmash of collectives. Many Digicore collectives even have several members in common, and the fact that you can’t readily identify Goonncity or Novagang clusters reflects this. I’m not sure what, if any, significance there is to the fact that the scene looks like it’s split in half horizontally (this might be an indication that the tSNE perplexity value of 30 that I used is a bit low).

I also think the graph does a nice job capturing the Plugg and Pluggnb scenes. The Plugg region includes almost every essential Plugg artist in 2021, including but not limited to 10kDunkin and Tony Shhnow (who cluster together with 645AR as members of the Atlanta-based SOS crew), Cashcache!, 1600J, and BoofPaxkMooky. The lower cluster in the Plugg region nicely captures the Clayton County-based Rich Slime Gang, including Slimesito and Fluhkunxhkos, and their affiliates, e.g. Duwap Kaine.

Nearby is the cluster I labeled as Slayworld/Pluggnb. Pluggnb is a somewhat vague genre descriptor, but the sound is largely defined by members of the Slayworld collective, e.g. Summrs, Autumn!, Izaya Tiji, Kankan, etc. The graph correctly places it near the Plugg scene, but clearly as its own distinct region.

I briefly considered using a clustering algorithm to pick out further clusters, but I don’t think that would be hugely informative. There’s still a good bit of noise in the data, so I think many clusters would be mostly meaningless. And I think algorithmically clustering the data undermines the point that SoundCloud genres are not well defined and the boundaries between scenes are blurry.

I could probably go on quite a while talking about things I notice in the graph. There are probably many scenes showing up that I don’t even recognize. I may let my scraper continue to collect data, but I would probably have to figure out a more efficient way to organize data, since by the end of it each loaded spreadsheet took up about two gigs of RAM and Python routinely gave me memory errors. But I think this graph, although incomplete, presents a fascinating look at the current shape of SoundCloud.
