Open Food Facts is a free, collaborative database of food products from around the world. It is something like the Wikipedia of food, providing open data on products, their ingredients, nutritional information, and more. One key attribute in the database is the Nutri-Score, a nutrition label system that grades foods from A to E to simplify nutritional information for consumers.
In this article, we'll explore how to use the Open Food Facts dataset to check whether the Nutri-Score grades are consistent across all products. We'll use a machine learning technique called Random Cut Forest (RCF) to identify outlier products where the Nutri-Score may not align with the actual nutritional content.
RCF is an unsupervised anomaly detection algorithm that is effective at finding outliers in high-dimensional data. It works by building an ensemble of trees from random cuts of the data and computing an anomaly score based on the "collusive displacement" (CoDisp) required to isolate a data point. Outliers will have a higher average CoDisp across the trees.
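To build some intuition for why random cuts separate outliers quickly, here is a simplified sketch. It does not compute CoDisp; it only counts how many random cuts are needed to isolate a point, choosing the cut dimension in proportion to its range. The helper `isolation_depth` and the synthetic data are assumptions made for illustration, not part of any library.

```python
import numpy as np

def isolation_depth(points, target, rng):
    """Count the random cuts needed to separate points[target] from all others."""
    idx = np.arange(len(points))
    depth = 0
    while len(idx) > 1:
        sub = points[idx]
        lo, hi = sub.min(axis=0), sub.max(axis=0)
        span = hi - lo
        if span.sum() == 0:  # only exact duplicates remain
            break
        # Pick a dimension with probability proportional to its range,
        # then cut uniformly inside that range.
        dim = rng.choice(points.shape[1], p=span / span.sum())
        cut = rng.uniform(lo[dim], hi[dim])
        # Keep only the points on the same side of the cut as the target.
        keep = (sub[:, dim] <= cut) == (points[target, dim] <= cut)
        idx = idx[keep]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 2))       # synthetic "normal" points
points = np.vstack([cluster, [[8.0, 8.0]]])          # one obvious outlier at index 200

trials = 200
outlier_depth = np.mean([isolation_depth(points, 200, rng) for _ in range(trials)])
inlier_depth = np.mean([isolation_depth(points, 0, rng) for _ in range(trials)])
print(outlier_depth, inlier_depth)
```

The outlier is typically isolated after far fewer cuts than a point inside the cluster, which is the property that tree-based anomaly scores such as CoDisp build on.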
This makes RCF well-suited for our task of finding products where the Nutri-Score is inconsistent with key nutritional features such as energy, fat, sugars, and so on. These outliers can help flag potential issues with Nutri-Score assignment.
To get started with exploring the Open Food Facts dataset, you'll first need to download the data. Fortunately, Open Food Facts makes this simple by providing exports of the entire database in various formats on their dedicated data page: https://en.openfoodfacts.org/data.
For our analysis, we'll be using the data in CSV format. Here's how to obtain the file:
- Navigate to https://en.openfoodfacts.org/data in your web browser
- Look for the CSV export link, which is currently labeled "Télécharger la base en CSV" (Download the database in CSV)
- Click this link to download the CSV export. It will be a large TAB-separated text file, typically named something like "fr.openfoodfacts.org.products.csv"
- Rename the downloaded file to have a .tsv extension instead of .csv, to make clear that it is a TAB-separated file rather than a comma-separated one
- Now you can load this .tsv file into a Python pandas DataFrame using pd.read_csv() with the sep='\t' argument to specify the TAB separator
For example:
opf_data = pd.read_csv('path/to/your/en.openfoodfacts.org.products.tsv', sep='\t', encoding='utf-8')
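Because the full export is large, it can help to read only the columns you need. The sketch below shows the TAB separator and `usecols` on a tiny in-memory sample; the sample rows are invented, and the `code` and `product_name` column names are assumptions about the export rather than something confirmed here.

```python
import io
import pandas as pd

# Hypothetical three-line TSV standing in for the real export.
sample_tsv = (
    "code\tproduct_name\tnutriscore_grade\tenergy_100g\n"
    "0001\tExample cereal\tb\t1500\n"
    "0002\tExample soda\te\t180\n"
)

df = pd.read_csv(
    io.StringIO(sample_tsv),   # replace with the path to your .tsv file
    sep="\t",                  # TAB separator, written with a backslash
    usecols=["product_name", "nutriscore_grade", "energy_100g"],
    dtype={"nutriscore_grade": "string"},
)
print(df.shape)
```

Restricting the columns at read time keeps memory use lower than loading every field and dropping most of them afterwards.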
And with that, you'll have the entire Open Food Facts database loaded and ready to explore! The dataset contains a wealth of information on food products from around the world, including ingredient lists, nutritional data, product categories, and of course, the Nutri-Score grades. In the next section, we'll start digging into this data to see what insights we can uncover.
Let's walk through the Python code to see how we process the Open Food Facts data:
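1. Import the dataset:
import pandas as pd
# `sample` is the number of rows to read and must be set beforehand (nrows=None reads the whole file)
opf_data = pd.read_csv('../data/en.openfoodfacts.org.products.csv', sep='\t', encoding='utf-8', on_bad_lines="skip", nrows=sample)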
2. Filter for products with a valid Nutri-Score grade:
opf_data = opf_data[opf_data['nutriscore_grade'].isin(['a','b','c','d','e'])]
3. Select the nutritional features of interest and fill any missing values with zero:
important_nutrients = ['nutriscore_score', 'energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g']
opf_num_features = opf_data.filter(regex='_100g|score')[important_nutrients]
opf_num_features = opf_num_features.fillna(0)
4. One-hot encode the Nutri-Score grades:
data_target_one_hot = pd.get_dummies(opf_data['nutriscore_grade'], prefix='nutriscore_grade')
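The steps above can be sketched end to end on a toy table. The rows and the two-column feature set below are invented for illustration, and the names `X_nutrients` and `X_with_grades` are hypothetical labels for the two possible feature matrices.

```python
import numpy as np
import pandas as pd

# Invented rows; the index is deliberately non-contiguous, as after filtering.
toy = pd.DataFrame(
    {
        "nutriscore_grade": ["a", "c", "unknown", "e"],
        "energy_100g": [250.0, 1200.0, 900.0, np.nan],
        "sugars_100g": [2.0, np.nan, 10.0, 55.0],
    },
    index=[10, 11, 12, 13],
)

# Step 2: keep only rows with a valid grade.
valid = toy[toy["nutriscore_grade"].isin(["a", "b", "c", "d", "e"])]

# Step 3: select numeric features and replace missing values with zero.
features = valid[["energy_100g", "sugars_100g"]].fillna(0)

# Step 4: one-hot encode the grades.
grades = pd.get_dummies(valid["nutriscore_grade"], prefix="nutriscore_grade")

# Two candidate feature matrices: nutrients only, or nutrients plus grades.
X_nutrients = features.to_numpy(dtype=float)
X_with_grades = np.hstack([X_nutrients, grades.to_numpy(dtype=float)])
print(X_nutrients.shape, X_with_grades.shape)
```

Note that filling with zero treats a missing measurement as a true zero, which can itself make a product look unusual; dropping incomplete rows is an alternative worth considering.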
Now we have our feature matrix X ready, built either from the nutritional data alone or from the nutritional data concatenated with the one-hot encoded Nutri-Score grades. We can now run RCF:
import numpy as np
import rrcf

# Feature matrix: nutrients only here; the one-hot grades can be appended with np.hstack
X = opf_num_features.to_numpy()

num_trees = 1000
n = opf_num_features.shape[0]
tree_size = 64

# Build the forest from disjoint random samples of tree_size points
forest = []
while len(forest) < num_trees:
    ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

# Average the CoDisp of each point over the trees that contain it
avg_codisp = pd.Series(0.0, index=np.arange(n))
index = np.zeros(n)
for tree in forest:
    codisp = pd.Series({leaf: tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
This builds a forest of at least 1000 trees, each built from a random subset of 64 data points. For each point, we compute the average CoDisp across all the trees it appears in. Points with a higher avg_codisp are more anomalous.
To find the products with the most inconsistent Nutri-Scores, we can look at those with the maximum avg_codisp:
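The sampling step is worth looking at in isolation. The sketch below uses a hypothetical dataset size of 1000 rows and no rrcf at all; it only shows how one call to `choice` with `replace=False` yields disjoint index sets, and how `np.add.at` counts how often each row was sampled.

```python
import numpy as np

rng = np.random.RandomState(42)   # seeded only to make the sketch repeatable
n, tree_size = 1000, 64           # hypothetical dataset size

# One pass: n // tree_size disjoint subsets of tree_size row indices each.
ixs = rng.choice(n, size=(n // tree_size, tree_size), replace=False)

# Count how many subsets each row landed in during this pass.
counts = np.zeros(n)
for ix in ixs:
    np.add.at(counts, ix, 1)

print(ixs.shape)           # (15, 64)
print(int(counts.max()))   # 1
print(int(counts.sum()))   # 960
```

Within a single pass each row lands in at most one tree and a few rows are left out, so coverage of the whole dataset comes from repeating the pass until the forest is large enough. Rows that were never sampled would have a count of zero, which is something to check before dividing by the counts.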
# avg_codisp is indexed by row position (0..n-1), so assign its values positionally
opf_data['avg_codisp'] = avg_codisp.to_numpy()
outliers = opf_data[opf_data['avg_codisp'] == opf_data['avg_codisp'].max()]
We can then inspect these outliers to see which products have Nutri-Scores that don't match their nutritional profile according to the RCF results.
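Two practical details are easy to miss here: pandas aligns a Series on its index when it is assigned as a column, and looking only at the single maximum can be too narrow. The toy example below, with invented product names and scores, shows both the alignment pitfall and a top-k selection with `nlargest`.

```python
import pandas as pd

# Invented products with a non-contiguous index, as left behind by filtering.
products = pd.DataFrame(
    {"product_name": ["item A", "item B", "item C"]},
    index=[2, 5, 9],
)
scores = pd.Series([1.0, 7.0, 3.0])   # indexed 0, 1, 2 (row positions)

# Assigning the Series directly aligns on index labels and loses values.
products["misaligned"] = scores

# Assigning the underlying array keeps the positional order.
products["avg_codisp"] = scores.to_numpy()

# Take the k highest-scoring rows instead of only the maximum.
top = products.nlargest(2, "avg_codisp")
print(top[["product_name", "avg_codisp"]])
```

Resetting the DataFrame index right after filtering would be another way to avoid the misalignment.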
Interestingly, when we look at the products identified as outliers by the Random Cut Forest algorithm, a clear pattern emerges: the majority are various types of nuts and nut butters. Some examples include:
- Noix décortiqués (shelled walnuts)
- 100% pindakaas met stukjes pinda (100% peanut butter with peanut pieces)
- Crema de cacahuate (peanut butter)
- Pecan Halves
- Pure raw walnut halves & pieces
At first glance, it may seem that the Nutri-Score is incorrectly assessing these products. After all, nuts are high in fat, which is usually associated with a lower Nutri-Score grade. However, this actually highlights a key aspect of how the Nutri-Score algorithm treats unprocessed and minimally processed foods.
Nuts, while high in fat, contain mostly unsaturated fats that are considered beneficial for health. An anomaly detection algorithm looking simply at total fat content would likely flag these products as unusual. But the Nutri-Score system is designed to account for the type of fat, not just the total amount. It gives a more favorable rating to foods like plain nuts that are minimally processed and contain healthy fats.
So rather than being a flaw, the fact that nuts are often detected as outliers by an unsupervised model actually reflects the Nutri-Score methodology working as intended. It demonstrates that Nutri-Score provides a nuanced assessment that goes beyond simplistic measures of individual nutrients. This underscores the importance of considering the Nutri-Score in the context of a food's overall degree of processing and the quality of its ingredients, not just the raw nutritional numbers.
As a further step, we can also examine which nutritional features contributed most to the outlier detection by computing the average CoDisp per dimension:
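d = X.shape[1]  # number of feature columns
dim_codisp = np.zeros([n, d], dtype=float)
for tree in forest:
    for leaf in tree.leaves:
        codisp, cutdim = tree.codisp_with_cut_dimension(leaf)
        dim_codisp[leaf, cutdim] += codisp
# Average the per-dimension CoDisp over the points whose avg_codisp exceeds 50
feature_importance_anomaly = np.mean(dim_codisp[(avg_codisp > 50).to_numpy(), :], axis=0)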
This tells us which nutrients are typically most inconsistent with the assigned Nutri-Scores for anomalous products.
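The resulting vector is easier to read once it is paired with the feature names. The sketch below uses a shortened feature list and placeholder numbers, not results from the analysis, purely to show the labeling and ranking step.

```python
import numpy as np
import pandas as pd

# Shortened feature list and placeholder values, for illustration only.
feature_names = ["energy_100g", "fat_100g", "sugars_100g"]
feature_importance_anomaly = np.array([4.0, 12.5, 2.5])

# Label each dimension, convert to a share of the total, and rank.
importance = pd.Series(feature_importance_anomaly, index=feature_names)
share = (importance / importance.sum()).sort_values(ascending=False)
print(share)
```

In the real analysis, the names should be the columns of the feature matrix in the same order they were passed to the trees.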
By combining the Open Food Facts database with unsupervised anomaly detection using Random Cut Forest, we can identify outlier products where the Nutri-Score grade may not accurately reflect the nutritional contents. This analysis can help validate the Nutri-Score system and surface potential issues in how the scores are assigned. The code shared here provides a template for conducting this kind of consistency check on open nutritional datasets.