We all love to relax and chat with friends. Who doesn’t love a cold brew? Drinking beer with others is an important social ritual, often linked to many memories (or the absence of them), especially from our younger days. EPFL is a highly international institution with people from all around the world. They leave their home countries behind to join EPFL, but what if we could bring a bit of home back to them? We decided to match the beers from SAT, our beloved bar, to the countries where they would be most appreciated. Our objective is to recommend them to people missing home, in the hope of curing a bit of their homesickness! It is also a tool for asking SAT to stock new beers, since it would be sad to leave people without a beer similar to the ones from their country, don’t you think?
A quick sip through our data
| Dataset | Beers 🍺 | Brew styles 🍶 | Beer lovers 👥 | Beer cost 💸 |
|---|---|---|---|---|
| RateBeer | 396690 beers from 265 locations* | 93 beer styles | 70120 beer lovers from 222 locations* | No data |
| BeerAdvocate | 247982 beers from 264 locations* | 104 beer styles | 153704 beer lovers from 194 locations* | No data |
| Satellite Bar | 66 beers from 8 countries | 21 beer styles | No data | Beer prices from 3.- to 16.- CHF |
*Locations can be either countries or regions of countries with many active users (e.g. individual states of the United States or regions of the United Kingdom).
Our assumptions and decisions when processing these datasets were:
- Users without a defined country in the dataset were considered to have “Unknown” location. Their data is not considered in the beer preference world map we show, but they are considered in the SAT t-SNE clustering plots.
- Beers without ratings and breweries without beers were not considered.
- After preliminary observations from an exploratory step of our analysis (available in the accompanying Jupyter Notebook), we concluded that BeerAdvocate and RateBeer have considerably different communities. Although a dataset of matched users and beers was available for our study, we analysed each dataset separately.
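The first two rules can be sketched in a few lines of Python. This is only an illustration, not the notebook's actual code: the record layout and the field names (`location`, `n_ratings`, `n_beers`) are assumptions made for the example.

```python
def clean_users(users):
    """Label users with no location as "Unknown" (hypothetical field names)."""
    return [{**u, "location": u.get("location") or "Unknown"} for u in users]


def drop_empty(beers, breweries):
    """Keep only beers with at least one rating and breweries with at least one beer."""
    rated = [b for b in beers if b.get("n_ratings", 0) > 0]
    active = [b for b in breweries if b.get("n_beers", 0) > 0]
    return rated, active


# Toy records, invented for the example.
users = clean_users([{"id": 1, "location": "Switzerland"}, {"id": 2, "location": None}])
rated, active = drop_empty(
    [{"name": "A", "n_ratings": 3}, {"name": "B", "n_ratings": 0}],
    [{"name": "X", "n_beers": 2}, {"name": "Y", "n_beers": 0}],
)
```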
An iconic duo: beer opinions and biases
One of the main challenges when analysing data from rating and review systems comes from the fact that humans are prone to bias. One’s personal bias is not only hard to quantify but may be correlated with other personal features (place of birth, age and personal experiences, to give a few examples).
Inspired by this paper on the modelling and correction of bias of NeurIPS papers, we aim to decrease the effect of systematic reviewer bias by applying a ‘mean-field’ correction to our datasets’ ratings. The pipeline to correct the beer ratings is described in the flowchart below and its accompanying text.
We computed a bias for each user and attenuated it with a coefficient between 0 and 1 that grows with the number of ratings the user has given on the platform. In this way, we damp the bias of one-time reviewers and give more weight to the bias of seasoned reviewers, whose bias estimate is more trustworthy. We then subtracted each user’s attenuated bias from all of their ratings and recomputed the average rating of every beer from these debiased ratings. Ratings were kept between 0 and 5. Our aim was to obtain fairer and more representative ratings of beer quality.
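To make the correction concrete, here is a minimal sketch in plain Python. It is not the project's actual pipeline, and two choices are assumptions made for illustration: a user's bias is taken to be the mean deviation of their ratings from the raw beer averages, and the attenuation coefficient is taken to be the user's rating count divided by the largest rating count of any user, so that prolific reviewers keep most of their bias estimate. The function name `debias` and the `(user, beer, score)` layout are hypothetical.

```python
from collections import defaultdict


def debias(ratings, lo=0.0, hi=5.0):
    """ratings: list of (user, beer, score). Returns (corrected ratings, new beer means)."""
    # Raw average rating of each beer.
    by_beer = defaultdict(list)
    for _, beer, score in ratings:
        by_beer[beer].append(score)
    beer_mean = {b: sum(s) / len(s) for b, s in by_beer.items()}

    # Assumed bias definition: mean deviation of a user's scores from the beer averages.
    deviations = defaultdict(list)
    for user, beer, score in ratings:
        deviations[user].append(score - beer_mean[beer])
    bias = {u: sum(d) / len(d) for u, d in deviations.items()}

    # Assumed attenuation coefficient in (0, 1]: rating count / largest rating count.
    max_n = max(len(d) for d in deviations.values())
    coeff = {u: len(d) / max_n for u, d in deviations.items()}

    # Subtract the attenuated bias and keep ratings within [lo, hi].
    corrected = [
        (user, beer, min(hi, max(lo, score - coeff[user] * bias[user])))
        for user, beer, score in ratings
    ]

    # Recompute beer averages from the debiased ratings.
    new_by_beer = defaultdict(list)
    for _, beer, score in corrected:
        new_by_beer[beer].append(score)
    new_mean = {b: sum(s) / len(s) for b, s in new_by_beer.items()}
    return corrected, new_mean


# Toy data: "ann" rates below the beer averages, "bob" above.
toy = [("ann", "ipa", 3.0), ("ann", "stout", 2.0), ("bob", "ipa", 5.0)]
corrected, new_mean = debias(toy)
```

With this normalisation, a single extremely active user shrinks everyone else's coefficient, so their ratings are barely corrected.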
Applying this correction changed the distribution of ratings:
The distribution of RateBeer ratings clearly changes after the correction: it shifts to the right, giving a larger share of high ratings, which suggests that users were rating harshly before the correction. This effect is not visible for BeerAdvocate. One possible explanation is an outlier user who reviewed far more beers than anyone else: the attenuation coefficient would then become very small for most users, erasing their biases.
National treasures: best brew by country and the happiness they bring
After some digging, we found the beer with the most ratings in each country and assumed it to be the most widely drunk there. It would seem that the Swiss enjoy BFM La Torpille or Feldschlösschen Original Lager! RateBeer users might be fancier than their BeerAdvocate counterparts.
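Finding the favourite beer of each location boils down to a grouped count. The sketch below assumes ratings are available as `(user_location, beer_name)` pairs, a layout invented for the example, and skips users whose location is "Unknown"; the function name `most_rated_beer` is hypothetical.

```python
from collections import Counter, defaultdict


def most_rated_beer(ratings):
    """ratings: iterable of (user_location, beer_name). Returns {location: beer}."""
    counts = defaultdict(Counter)
    for location, beer in ratings:
        if location != "Unknown":  # users without a location are left off the map
            counts[location][beer] += 1
    # For each location, keep the beer that collected the most ratings.
    return {loc: c.most_common(1)[0][0] for loc, c in counts.items()}


# Toy data, invented for the example.
favourites = most_rated_beer([
    ("Switzerland", "BFM La Torpille"),
    ("Switzerland", "BFM La Torpille"),
    ("Switzerland", "Some Lager"),
    ("Belgium", "Some Tripel"),
    ("Unknown", "Some Lager"),
])
```

Note that the number of ratings is used as a proxy for how much a beer is drunk, which is itself an assumption.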
As we can see, the datasets mainly represent the USA, since each state is treated as a separate location; to a lesser degree, Europe is also better represented than the rest of the world. It seems that RateBeer and BeerAdvocate aren’t as popular elsewhere!
We also computed the mean number of positive and negative words used in reviews from each country, to see whether some countries are more prone to praise. It would seem that southern Europe is harsher than northern Europe, which is interesting.
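As an illustration of this kind of count, the sketch below averages the number of positive and negative words per review for each location. The two tiny word lists are placeholders invented for the example, not the lexicon used in the analysis, and the `(location, text)` layout and the name `mean_sentiment_words` are also assumptions.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicons; a real analysis would use a full sentiment lexicon.
POSITIVE = {"great", "delicious", "excellent"}
NEGATIVE = {"bad", "bland", "awful"}


def mean_sentiment_words(reviews):
    """reviews: iterable of (location, text). Returns {location: (mean_pos, mean_neg)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # positive words, negative words, reviews
    for location, text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        entry = totals[location]
        entry[0] += sum(w in POSITIVE for w in words)
        entry[1] += sum(w in NEGATIVE for w in words)
        entry[2] += 1
    return {loc: (p / n, m / n) for loc, (p, m, n) in totals.items()}


# Toy reviews, invented for the example.
means = mean_sentiment_words([
    ("Italy", "Bland and awful."),
    ("Italy", "Great beer"),
    ("Norway", "Delicious, great!"),
])
```

Averaging per review rather than summing keeps locations with many reviewers from dominating the comparison.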
On the world map, we represent the USA with the state of California’s data, since it is the most populous state. We also zoomed in on the USA to show the whole dataset.