Open Food Facts is a free, collaborative database of food products from around the world. It is like the Wikipedia for food, offering open data on products, their ingredients, nutritional information, and more. One key attribute in the database is the Nutri-Score, a nutrition label system that grades foods from A to E to simplify nutritional information for consumers.
In this article, we'll explore how we can use the Open Food Facts dataset to check whether the Nutri-Score grades are consistent across all products. We'll use a machine learning technique called Random Cut Forest (RCF) to identify outlier products where the Nutri-Score may not align with the actual nutritional content.
RCF is an unsupervised anomaly detection algorithm that is effective at finding outliers in high-dimensional data. It works by building an ensemble of random cut trees and computing an anomaly score based on the "collusive displacement" (CoDisp) of each data point, which reflects how much the trees change when that point is removed. Outliers will have a higher average CoDisp across the trees.
This makes RCF well suited to our task of finding products where the Nutri-Score is inconsistent with key nutritional features like energy, fat, sugars, and so on. These outliers can help flag potential issues with Nutri-Score assignment.
To get started with exploring the Open Food Facts dataset, you'll first need to download the data. Fortunately, Open Food Facts makes this easy by providing exports of the entire database in various formats on its dedicated data page: https://en.openfoodfacts.org/data.
For our analysis, we'll be using the data in CSV format. Here's how to obtain the file:
- Navigate to https://en.openfoodfacts.org/data in your web browser
- Look for the CSV export link, which is currently labeled "Télécharger la base en CSV" (Download the database in CSV)
- Click this link to download the CSV export. It will be a large TAB-separated text file, typically named something like "fr.openfoodfacts.org.products.csv"
- Rename the downloaded file to have a .tsv extension instead of .csv, to clearly indicate that it is a TAB-separated file rather than a comma-separated one
- You can now load this .tsv file into a Python pandas DataFrame using pd.read_csv() with the sep='\t' argument to specify the TAB separator
For instance:
opf_data = pd.read_csv('path/to/your/en.openfoodfacts.org.products.tsv', sep='\t', encoding='utf-8')
And with that, you'll have the entire Open Food Facts database loaded and ready to explore. The dataset contains a wealth of information on food products from around the world, including ingredient lists, nutritional data, product categories, and of course, the Nutri-Score grades. In the next section, we'll start digging into this data to see what insights we can uncover.
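Before loading the full export, the effect of the separator can be checked on a tiny in-memory sample. The sketch below is only an illustration: the three rows and their values are made up, and only the column names follow the dataset.

```python
import io
import pandas as pd

# Made-up, TAB-separated sample standing in for the real export.
sample_tsv = (
    "code\tproduct_name\tnutriscore_grade\tfat_100g\n"
    "001\tPlain walnuts\ta\t65.0\n"
    "002\tChocolate bar\te\t30.0\n"
    "003\tSparkling water\t\t0.0\n"
)

# sep="\t" splits on TAB characters; reading the barcode as text
# keeps its leading zeros.
toy = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype={"code": str})

# With a separator that never occurs, every row collapses into one column.
wrong = pd.read_csv(io.StringIO(sample_tsv), sep=";")

print(toy.shape)    # (3, 4)
print(wrong.shape)  # (3, 1)
```

A single-column result like the second one is a quick sign that the separator argument does not match the file.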
Let's walk through the Python code to see how we process the Open Food Facts data:
1. Import the dataset:
import pandas as pd
# Number of rows to load. This value is an assumption: None reads the whole file, an integer loads a quick sample.
sample = None
opf_data = pd.read_csv('../data/en.openfoodfacts.org.products.csv', sep='\t', encoding='utf-8', on_bad_lines="skip", nrows=sample)
2. Filter for products with a valid Nutri-Score grade:
opf_data = opf_data[opf_data['nutriscore_grade'].isin(['a','b','c','d','e'])]
3. Select the nutritional features of interest and fill any missing values with zero:
important_nutrients = ['nutriscore_score', 'energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g']
opf_num_features = opf_data.filter(regex='_100g|score')[important_nutrients]
# Reassign instead of using inplace=True to avoid modifying a slice of opf_data
opf_num_features = opf_num_features.fillna(0)
4. One-hot encode the Nutri-Score grades:
data_target_one_hot = pd.get_dummies(opf_data['nutriscore_grade'], prefix='nutriscore_grade')
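Steps 2 to 4 can be tried on a small in-memory table before running them on the full export. The rows below are made up for illustration and are not real Open Food Facts products; the names `X_nutrients` and `X_with_grades` are hypothetical and stand for the two variants of the feature matrix.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; these are not real products.
toy = pd.DataFrame({
    "nutriscore_grade": ["a", "e", None, "unknown"],
    "nutriscore_score": [-2.0, 25.0, np.nan, np.nan],
    "fat_100g": [0.5, 30.0, 10.0, np.nan],
    "sugars_100g": [np.nan, 50.0, 5.0, 1.0],
})

# Step 2: keep only rows whose grade is one of a to e.
valid = toy[toy["nutriscore_grade"].isin(["a", "b", "c", "d", "e"])]

# Step 3: select the numeric features and replace missing values with 0.
features = valid[["nutriscore_score", "fat_100g", "sugars_100g"]].fillna(0)

# Step 4: one-hot encode the grade (one indicator column per grade present).
one_hot = pd.get_dummies(valid["nutriscore_grade"], prefix="nutriscore_grade")

# Two candidate feature matrices: nutrients only, or nutrients plus grades.
X_nutrients = features.to_numpy(dtype=float)
X_with_grades = pd.concat([features, one_hot], axis=1).to_numpy(dtype=float)

print(X_nutrients.shape)    # (2, 3)
print(X_with_grades.shape)  # (2, 5)
```

Note that filling with zero treats a missing measurement as a true zero, which is a simplification worth keeping in mind when reading the anomaly scores.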
Now we have our feature matrix X ready, either using just the nutritional data or concatenated with the one-hot encoded Nutri-Scores. We can run RCF:
import numpy as np
import rrcf
# Feature matrix: nutritional features only (concatenate data_target_one_hot to include the grades)
X = opf_num_features.to_numpy(dtype=float)
num_trees = 1000
n = opf_num_features.shape[0]
tree_size = 64
# Build the forest: each pass adds n // tree_size trees on disjoint random samples
forest = []
while len(forest) < num_trees:
    ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
# Average CoDisp per point over the trees that contain it
avg_codisp = pd.Series(0.0, index=np.arange(n))
index = np.zeros(n)
for tree in forest:
    codisp = pd.Series({leaf: tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
This builds a forest of at least 1000 trees (each pass of the loop adds n // tree_size trees), each built on a random subset of 64 data points. For each point, we compute the average CoDisp across all the trees it appears in. Points with a higher avg_codisp are more anomalous.
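The sampling and averaging bookkeeping can be followed on a toy example without building any trees. The sketch below replaces the tree score with a hypothetical stand-in function (`fake_score`), so it illustrates only the index handling, not RCF itself; the sizes are chosen to keep the arithmetic small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tree_size, num_trees = 10, 4, 6

# Hypothetical stand-in for tree.codisp: 5.0 for point 9, 1.0 otherwise.
def fake_score(point_id):
    return 5.0 if point_id == 9 else 1.0

total = np.zeros(n)  # sum of scores per point
count = np.zeros(n)  # number of "trees" each point appeared in
trees_built = 0
while trees_built < num_trees:
    # One pass draws n // tree_size disjoint batches of tree_size indices.
    batches = rng.choice(n, size=(n // tree_size, tree_size), replace=False)
    for batch in batches:
        scores = np.array([fake_score(i) for i in batch])
        np.add.at(total, batch, scores)
        np.add.at(count, batch, 1)
        trees_built += 1

# Average only over the trees a point was actually part of;
# points that were never sampled keep NaN.
seen = count > 0
avg = np.full(n, np.nan)
avg[seen] = total[seen] / count[seen]

print(trees_built)       # 6
print(int(count.sum()))  # 24
```

Because n is not a multiple of tree_size here, two points are left out in every pass, which is why the average divides by a per-point count and not by the number of trees.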
To find the products with the most inconsistent Nutri-Scores, we can look at those with the maximum avg_codisp:
# avg_codisp is indexed by position (0..n-1), so assign its values positionally
opf_data['avg_codisp'] = avg_codisp.to_numpy()
outliers = opf_data[opf_data['avg_codisp'] == opf_data['avg_codisp'].max()]
We can then examine these outliers to see which products have Nutri-Scores that don't match their nutritional profile based on the RCF results.
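Filtering on the exact maximum usually returns a single product. As a possible variation, not part of the analysis above, one could look at the k highest scores or at everything above a high quantile. The sketch below uses made-up scores and product names purely for illustration.

```python
import pandas as pd

# Made-up scores for illustration only.
scored = pd.DataFrame({
    "product_name": ["walnuts", "peanut butter", "cola", "yogurt", "bread"],
    "nutriscore_grade": ["a", "b", "e", "b", "c"],
    "avg_codisp": [61.2, 58.9, 12.4, 8.1, 9.7],
})

# The k highest-scoring products (k = 2 here) instead of only the maximum.
top_k = scored.nlargest(2, "avg_codisp")

# Alternative: everything above a high quantile of the score distribution.
threshold = scored["avg_codisp"].quantile(0.8)
above = scored[scored["avg_codisp"] > threshold]

print(top_k["product_name"].tolist())  # ['walnuts', 'peanut butter']
print(above["product_name"].tolist())  # ['walnuts']
```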
Interestingly, when we examine the products identified as outliers by the Random Cut Forest algorithm, a clear pattern emerges: the majority are various types of nuts and nut butters. Some examples include:
- Noix décortiqués (shelled walnuts)
- 100% pindakaas met stukjes pinda (100% peanut butter with peanut pieces)
- Crema de cacahuate (peanut butter)
- Pecan Halves
- Organic raw walnut halves & pieces
At first glance, this might seem like the Nutri-Score is incorrectly assessing these products. After all, nuts are high in fat, which is usually associated with a lower Nutri-Score grade. However, this actually highlights a key aspect of how the Nutri-Score algorithm treats unprocessed and minimally processed foods.
Nuts, while high in fat, contain mostly unsaturated fats, which are considered beneficial for health. An anomaly detection algorithm simply looking at total fat content would likely flag these products as unusual. But the Nutri-Score system is designed to account for the type of fat, not just the total amount. It gives a more favorable rating to foods like plain nuts that are minimally processed and contain healthy fats.
So rather than being a flaw, the fact that nuts are commonly detected as outliers by an unsupervised model actually reflects the Nutri-Score methodology working as intended. It demonstrates that Nutri-Score provides a nuanced assessment that goes beyond simplistic measures of individual nutrients. This underscores the importance of considering the Nutri-Score in the context of a food's overall degree of processing and the quality of its ingredients, not just the raw nutritional numbers.
As a further step, we can also look at which nutritional features contributed most to the outlier detection by computing the average CoDisp per dimension:
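# Number of feature dimensions in the matrix the trees were built on
d = X.shape[1]
dim_codisp = np.zeros([n, d], dtype=float)
for tree in forest:
    for leaf in tree.leaves:
        # CoDisp of the leaf together with the dimension of the cut that produced it
        codisp, cutdim = tree.codisp_with_cut_dimension(leaf)
        dim_codisp[leaf, cutdim] += codisp
# Average per-dimension contribution over the points with avg_codisp above 50
feature_importance_anomaly = np.mean(dim_codisp[(avg_codisp > 50).to_numpy(), :], axis=0)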
This tells us which nutrients tend to be most inconsistent with the assigned Nutri-Scores for anomalous products.
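To make the per-dimension result easier to read, the array can be paired with the feature names and turned into shares. The sketch below is a minimal illustration with made-up numbers and a shortened, hypothetical feature list; only the threshold of 50 is taken from the code above.

```python
import numpy as np
import pandas as pd

feature_names = ["energy_100g", "fat_100g", "sugars_100g"]

# Made-up per-point, per-feature CoDisp sums (rows: points, columns: features).
dim_codisp = np.array([
    [10.0, 80.0, 5.0],
    [2.0, 3.0, 1.0],
    [20.0, 60.0, 10.0],
])
avg_codisp = np.array([70.0, 4.0, 55.0])

# Average the per-feature contributions over points above the threshold.
anomalous = avg_codisp > 50
importance = pd.Series(dim_codisp[anomalous].mean(axis=0), index=feature_names)

# Normalise to shares so features can be compared at a glance.
shares = (importance / importance.sum()).sort_values(ascending=False)

print(importance.tolist())  # [15.0, 70.0, 7.5]
print(shares.index[0])      # fat_100g
```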
By combining the Open Food Facts database with unsupervised anomaly detection using Random Cut Forest, we can identify outlier products where the Nutri-Score grade may not accurately reflect the nutritional contents. This analysis can help validate the Nutri-Score system and surface potential issues in how the scores are assigned. The code shared here provides a template for conducting this type of consistency check on open nutritional datasets.