From 9696184cbcbc3906272bb8324fa8189c0dbccb14 Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Mon, 28 Oct 2024 09:34:33 +0100 Subject: [PATCH 1/9] Update data page --- lang/aa/texts/data.html | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/lang/aa/texts/data.html b/lang/aa/texts/data.html index d96d38820cc0..8310047a5dc0 100644 --- a/lang/aa/texts/data.html +++ b/lang/aa/texts/data.html @@ -65,6 +65,27 @@

CSV Data Export

The file encoding is Unicode UTF-8. The character that separates fields is <tab> (tabulation).

+

Parquet Data Export on Hugging Face

+ +

A cleaner version of the JSONL dump is also available in the Parquet format. This data format is optimized for columnar queries, which is particularly convenient for data analysis.

+ +The dataset is available on Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets. +
+
Link
+
https://huggingface.co/datasets/openfoodfacts/product-database/resolve/main/products.parquet +
+
+ +

CSV Data Export

+

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.

+ +
+
Links
+
https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz (compressed CSV in GZIP format: ~0.9 GB, uncompressed: ~9 GB)
+
+ +

The file encoding is Unicode UTF-8. The character that separates fields is <tab> (tabulation).

+

RDF Data Export

The database is also available in the RDF format. You can read the announcement in French.

From 3f51ae91656fb4c6ffe9e083f9649a715abead7f Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Mon, 28 Oct 2024 09:35:55 +0100 Subject: [PATCH 2/9] Update data page --- lang/aa/texts/data.html | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/lang/aa/texts/data.html b/lang/aa/texts/data.html index 8310047a5dc0..cffab9722605 100644 --- a/lang/aa/texts/data.html +++ b/lang/aa/texts/data.html @@ -55,16 +55,6 @@

JSONL data export

A convenient way to explore the database is to use DuckDB, an in-process analytical tool designed to process large amounts of data in a fraction of a second. You can read our blog post, where we walk you through exploring and processing the Open Food Facts database with DuckDB.

-

CSV Data Export

-

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.

- -
-
Links
-
https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz (compressed CSV in GZIP format: ~0.9 GB, uncompressed: ~9 GB)
-
- -

The file encoding is Unicode UTF-8. The character that separates fields is <tab> (tabulation).

-

Parquet Data Export on Hugging Face

A cleaner version of the JSONL dump is also available in the Parquet format. This data format is optimized for columnar queries, which is particularly convenient for data analysis.

From 35a080541305dcd6452677ecf3b8c2bfd39d8ebc Mon Sep 17 00:00:00 2001 From: Jeremy Arancio <97704986+jeremyarancio@users.noreply.github.com> Date: Tue, 29 Oct 2024 17:41:18 +0100 Subject: [PATCH 3/9] Pierre feedback Co-authored-by: Pierre Slamich --- lang/aa/texts/data.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lang/aa/texts/data.html b/lang/aa/texts/data.html index cffab9722605..8ea905ecd631 100644 --- a/lang/aa/texts/data.html +++ b/lang/aa/texts/data.html @@ -57,7 +57,7 @@

JSONL data export

Parquet Data Export on Hugging Face

-

A cleaner version of the JSONL dump is also available in the Parquet format. This data format is optimized for columnar queries, which is particularly convenient for data analysis.

+

A simplified version of the JSONL dump is also available in the Parquet format. This data format is optimized for column-oriented queries, which is particularly convenient for data analysis: you can select just the columns you care about, reducing the amount of data you have to handle. You don't have to download all columns, which simplifies data processing on entry-level computers.

The dataset is available on Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets.
From 5c8fb76cb72ad32cb3d958463046b1346c81de6f Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Tue, 29 Oct 2024 18:08:15 +0100 Subject: [PATCH 4/9] docs: :memo: Update description based on feedbacks --- lang/aa/texts/data.html | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/lang/aa/texts/data.html b/lang/aa/texts/data.html index 8ea905ecd631..9b537c3dbfb4 100644 --- a/lang/aa/texts/data.html +++ b/lang/aa/texts/data.html @@ -57,7 +57,15 @@

JSONL data export

Parquet Data Export on Hugging Face

-

A simplified version of the JSONL dump is also available in the Parquet format. This data format is optimized for column-oriented queries, which is particularly convenient for data analysis: you can select just the columns you care about, reducing the amount of data you have to handle. You don't have to download all columns, which simplifies data processing on entry-level computers.

+

A simplified version of the JSONL dump is also available in the Parquet format. During the conversion, we filtered out columns that contain duplicated information, are used for internal debugging, or are simply irrelevant to users. + +The Parquet format has proved to be handy: + +

    +
  • Data are organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • +
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big data of any kind.
  • +
  • Supports complex data types and advanced nested data structures.
  • +
Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets.
@@ -66,6 +74,8 @@

Parquet Data Export on Hugging Face

+Find more information in the Wiki, including guidelines for data reuse and example queries to get started. +

CSV Data Export

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.

From 49a3bcd7ecdec7322d4248d63478975fc75ff230 Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Tue, 5 Nov 2024 13:42:14 +0100 Subject: [PATCH 5/9] fix: :art: Wrong lang: aa -> en --- lang/aa/texts/data.html | 21 --------------------- lang/en/texts/data.html | 22 ++++++++++++++++++++++ 2 files changed, 22 insertions(+), 21 deletions(-) diff --git a/lang/aa/texts/data.html b/lang/aa/texts/data.html index 9b537c3dbfb4..d96d38820cc0 100644 --- a/lang/aa/texts/data.html +++ b/lang/aa/texts/data.html @@ -55,27 +55,6 @@

JSONL data export

A convenient way to explore the database is to use DuckDB, an in-process analytical tool designed to process large amounts of data in a fraction of a second. You can read our blog post, where we walk you through exploring and processing the Open Food Facts database with DuckDB.

-

Parquet Data Export on Hugging Face

- -

A simplified version of the JSONL dump is also available in the Parquet format. During the conversion, we filtered out columns that contain duplicated information, are used for internal debugging, or are simply irrelevant to users. - -The Parquet format has proved to be handy: - -

    -
  • Data are organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • -
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big data of any kind.
  • -
  • Supports complex data types and advanced nested data structures.
  • -
Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets. -
-
Link
-
https://huggingface.co/datasets/openfoodfacts/product-database/resolve/main/products.parquet -
-
- -Find more information in the Wiki, including guidelines for data reuse and example queries to get started. -

CSV Data Export

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.

diff --git a/lang/en/texts/data.html b/lang/en/texts/data.html index 6d45e7e14be9..9252b71cd2fb 100644 --- a/lang/en/texts/data.html +++ b/lang/en/texts/data.html @@ -55,6 +55,28 @@

JSONL data export

A convenient way to explore the database is to use DuckDB, an in-process analytical tool designed to process large amounts of data in a fraction of a second. You can read our blog post, where we walk you through exploring and processing the Open Food Facts database with DuckDB.

+

Parquet Data Export on Hugging Face

+ +

A simplified version of the JSONL dump is also available in the Parquet format. During the conversion, we filtered out columns that contain duplicated information, are used for internal debugging, or are simply irrelevant to users. + +The Parquet format has proved to be handy: + +

    +
  • Data are organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • +
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big data of any kind.
  • +
  • Supports complex data types and advanced nested data structures.
  • +
+ +The dataset is available on Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets. + +
+
Link
+
https://huggingface.co/datasets/openfoodfacts/product-database/resolve/main/products.parquet +
+
+ +Find more information in the Wiki, including guidelines for data reuse and example queries to get started. +

CSV Data Export

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.

From 53a42060041cf510f1bdd6d2fbffe4c7fcca2482 Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Tue, 5 Nov 2024 15:17:20 +0100 Subject: [PATCH 6/9] fix: :art: Fix

--- lang/en/texts/data.html | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lang/en/texts/data.html b/lang/en/texts/data.html index 9252b71cd2fb..fc9541ca41a0 100644 --- a/lang/en/texts/data.html +++ b/lang/en/texts/data.html @@ -57,9 +57,9 @@

JSONL data export

Parquet Data Export on Hugging Face

-

A simplified version of the JSONL dump is also available in the Parquet format. During the conversion, we filtered out columns that contain duplicated information, are used for internal debugging, or are simply irrelevant to users. +

A simplified version of the JSONL dump is also available in the Parquet format. During the conversion, we filtered out columns that contain duplicated information, are used for internal debugging, or are simply irrelevant to users.

-The Parquet format has proved to be handy: +

The Parquet format has proved to be handy:

  • Data are organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • @@ -67,7 +67,7 @@

    Parquet Data Export on Hugging Face

  • Supports complex data types and advanced nested data structures.
-The dataset is available on Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets. +

The dataset is available on Hugging Face, a collaborative Machine Learning ecosystem where developers and researchers can share models and datasets.

Link
@@ -75,7 +75,7 @@

Parquet Data Export on Hugging Face

-Find more information in the Wiki, including guidelines for data reuse and example queries to get started. +

Find more information in the Wiki, including guidelines for data reuse and example queries to get started.

CSV Data Export

Data for all products, or some of the products, can be downloaded in the CSV format (readable with LibreOffice, Excel and many other spreadsheet applications) through the advanced search form.
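Reviewer note, not part of the patch: the CSV export documented on this page is a gzip-compressed, UTF-8, tab-separated file, so it can be streamed with the Python standard library alone. The sketch below is one possible reader, not the project's own tooling. The function name `iter_products` and the sample columns `code` and `product_name` are illustrative assumptions, and `csv.QUOTE_NONE` assumes that quote characters in the export are plain data.

```python
import csv
import gzip
import io
from typing import BinaryIO, Dict, Iterator


def iter_products(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row from a gzip-compressed, tab-separated export.

    `stream` is any binary file object, for example open("export.csv.gz", "rb").
    Rows are decoded lazily, so the ~9 GB uncompressed file never has to
    fit in memory.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        # Fields are separated by tabs; QUOTE_NONE keeps quote characters
        # as ordinary data (an assumption about the export, see lead-in).
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real download.
    sample = "code\tproduct_name\n123\tCrème brûlée\n".encode("utf-8")
    for row in iter_products(io.BytesIO(gzip.compress(sample))):
        print(row["code"], row["product_name"])
```

Because the reader is a generator, it can be combined with a filter or `itertools.islice` to inspect a few rows without decompressing the whole file.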

From f2a7d31235a044773763ab8e37ac7451db66e4f4 Mon Sep 17 00:00:00 2001 From: Jeremy Arancio <97704986+jeremyarancio@users.noreply.github.com> Date: Sun, 17 Nov 2024 14:48:28 +0100 Subject: [PATCH 7/9] Update lang/en/texts/data.html Co-authored-by: Pierre Slamich --- lang/en/texts/data.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lang/en/texts/data.html b/lang/en/texts/data.html index fc9541ca41a0..35b5137c507d 100644 --- a/lang/en/texts/data.html +++ b/lang/en/texts/data.html @@ -62,7 +62,7 @@

Parquet Data Export on Hugging Face

The Parquet format has proved to be handy:

    -
  • Data are organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • +
  • Data is organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big data of any kind.
  • Supports complex data types and advanced nested data structures.
From f833a66c96d9fc2f6ae826defae0db0616bcbd2f Mon Sep 17 00:00:00 2001 From: Jeremy Arancio <97704986+jeremyarancio@users.noreply.github.com> Date: Sun, 17 Nov 2024 14:48:36 +0100 Subject: [PATCH 8/9] Update lang/en/texts/data.html Co-authored-by: Pierre Slamich --- lang/en/texts/data.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lang/en/texts/data.html b/lang/en/texts/data.html index 35b5137c507d..ba74b4e80e2d 100644 --- a/lang/en/texts/data.html +++ b/lang/en/texts/data.html @@ -63,7 +63,7 @@

Parquet Data Export on Hugging Face

  • Data is organized by column rather than by row, which saves storage space and speeds up analytical queries: you can select just the columns you care about, which improves query performance, even on entry-level computers.
  • -
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big data of any kind.
  • +
  • Highly efficient data compression and decompression, making it well suited to storing and sharing big datasets of any kind.
  • Supports complex data types and advanced nested data structures.
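Reviewer note, not part of the patch: the column-selection benefit described in the list above can be illustrated with a query that names only the columns it needs. The sketch below builds such a query for DuckDB's `read_parquet` table function against the Parquet URL given on this page. The helper names `build_column_query` and `run_query` are hypothetical, the column names `code` and `product_name` are assumptions about the dataset schema, and reading directly over HTTPS assumes a DuckDB build that can load its `httpfs` extension.

```python
from typing import Sequence

# URL taken from the data page edited in this patch series.
PARQUET_URL = (
    "https://huggingface.co/datasets/openfoodfacts/product-database"
    "/resolve/main/products.parquet"
)


def build_column_query(
    columns: Sequence[str], source: str = PARQUET_URL, limit: int = 10
) -> str:
    """Build a SQL query that reads only the requested columns."""
    if not columns:
        raise ValueError("select at least one column")
    # Double-quote identifiers so unusual column names stay valid SQL.
    quoted = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    escaped_source = source.replace("'", "''")
    return f"SELECT {quoted} FROM read_parquet('{escaped_source}') LIMIT {int(limit)}"


def run_query(sql: str):
    """Execute the query with DuckDB (third-party: pip install duckdb)."""
    import duckdb  # imported lazily so the query builder works without it

    return duckdb.sql(sql).fetchall()


if __name__ == "__main__":
    # Column names are assumptions; adjust them to the actual schema.
    print(build_column_query(["code", "product_name"]))
```

Since only the listed columns appear in the query, a columnar reader can skip the data for every other column in the file.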
From cb04b073f13af93787501142cd34d919bbda7319 Mon Sep 17 00:00:00 2001 From: jeremyarancio Date: Mon, 18 Nov 2024 08:18:41 +0100 Subject: [PATCH 9/9] Trigger workflow