Update for this project will be hold until until the voting period of SolidHack is over.
Indonesia’s geography makes it tricky to ensure schools are accessible across all regions. By mapping where people live alongside school locations, this study shows how spatial data can reveal patterns of educational access across the country. This approach helps create a clearer picture of how schools are distributed, making it easier to understand and address educational reach for communities throughout Indonesia.
Because of Mercator projections, it’s easy to get distorted view of how big some countries are. If you’re curious about true size of Indonesia, here’s how it compares to Europe:
This project is my experiment with interactive big data visualizations for some upcoming projects. My goal is to see how less reources I can use to manage, process, and parse big data visualizations but the interactivity high. Here’s a breakdown of the choices I made and the reasons behind them.
In this section, I’ll explain my stacks choices for this project, the focus is on finding the right balance between performance, usability, and future readiness.
You can check out the data section for more details, but here’s how I manage that data. I’m working with two datasets, 475.272 rows with 21 columns covering schools across Indonesia and 573.841 rows with 6 columns center of 1km² population grid. Not big, but still big enough when transferring it over a network. So, I needed a way to store this data, query parts of it, and minimize the data transfer and storage costs.
Instead of setting up a traditional server-side database like PostgreSQL or MongoDB, I decided against using a DBMS here. Running a full DBMS just to store static data seemed overkill since this data doesn’t change. Managing a server to handle heavy data loads, while keeping response times low, can be complex and costly, especially when I’m not even updating the data regularly.
So, I thought: why not just use static files? It's like good idea right, but formats like JSON and CSV didn’t fit in because:
- They’re too big to transfer efficiently.
- I can dynamic query to the data.
JSON.parse
on a massive file is a performance nightmare- They require loading the whole file at "once".*
So I need another static format that more effective and smaller, and I choose parquet. Parquet lets me compress the data effectively so the size will be smaller. Also we can query the data partially with SQL with duckdb-wasm (we will talk about it later) and we don't have to parse, because the data is in binary format and can be used easily with duckdb
and apache arrow
. duckdb-wasm enables more advanced features like partially consuming the data. as a result, we may not need to download the entire file for a query but only required bytes, then same request will not request again. Here comparison files size, in json, csv, and parquet for data used in this project:
With Parquet, I can get file sizes 8-20 times smaller than JSON or CSV. Storing it as a static asset is also easier—just put it in S3 or similar storage.
If you use cloudflare, Because DuckDB query rely on range header request, storing static files in cloudflare pages does not work because cloudflare pages / worker assets does not work with range request, but you can store it in R2, and add accept range headers.
I want this app to run with as few resources as possible, meaning no heavy servers if I can avoid it. I went with Solid Start as the framework because Solid Start offer more isomorphic way to create web apps, from what rendering method, server actions / functions, api handler, deployment preset and more. I started using only Client Side Rendering, and if I need to add new features that require server instead just SPA, I can do that easily, without much changes.
When this project is started, SolidHack event was starting up too, so I thought—why not join in?
Since the main goal is interactivity, I want users to be able to customize anything, including the data. With a big dataset like this, I needed a way to query it straight from the client side. DuckDB was the perfect fit because it lets me query remote files in chunks, handles data parsing, and manages network requests smoothly.
Duckdb offers extensions like SPATIAL extension that we can use to add more features to our apps, such as query data in specific location, radius-based searches, etc.
If you wonder why not just run DuckDB on a server and send the binary results to the client?. this approach is also possible and one of my options, but every query send to server, the data must be transfered back to client even if it is same query or logically same. If we use duckdb wasm in client, every request to duckdb does not alwasy trigger network request if the data or partial data is already available in memory. So the more query you made, the faster query result is.
Maplibre GL is fork of MapBox when mapbox use propertiary license. This fork offer features almost same in mapbox. We use maplibre and deck.gl to visualize hundreds thousand of Point Of Interest without lag and more performants.
After the data’s queried, we need to process the result, like calculating for legend values, generating colors, and so on. I use web workers for this, which keeps the main thread free and prevents blocking. Also utilizing binary data to minimize in data transfer between worker thread to main thread or vice versa. duckdb return query result in apache arrow, then this data transfered to worker as buffer. Transfering buffer is prefered because we can use Transferable Object to speed up the data handoff between main thread to worker thread or vice versa. Offloading process to web worker also prevent blocking in main thread.
geometry column in duckdb need to return as WKB in the query, then we can transform it to FLoat64Array. because, geometry column used in duckdb is not common format, so we need another binary format for geometry, CMIIW. (source)
To get better perfomance when visualizing data in maplibre, we utilizing binary format or TypedArray to store coordinates and colors data, like Uint8Array, Float64Array, etc, as stated here, it has more performance benefits. This operations happens in this web worker, check schools-worker and grid-worker.
Using web worker in solid is easy, just store worker object in solid store then you can listen worker message from multiple components.
This approach has some caveats, most of them in duckdb:
- DuckDB WASM file is big, compared to sqlite-wasm or pglite
- DuckDB WASM still does not manage request waterfall, and still cannot persist the data to local browser storage, like indexedDB or OPFS.
- This project works best if you just want to create visualizations with static data, for dynamic data where you need to update the data may be another approach is preferable.
I have plan to try something like this, may be using another database engine like sqlite-wasm or pglite, and experiment with dynamic data handling in both client and server contexts.
This website is built as playground, wtwo query handlers: one for querying school locations (Points of Interest) and the other for the population grid. and each has only one data source sekolah.parquet
and popgrid.parquet
. All query results will be processed before being displayed on the maps. You can query whatever you want, like filtering only specific conditions or display all of data, As long as you follow the rules, your query and the data will be display correctly on maps.
Sekolah is Indonesian words means 'schools' in english
We use duckdb as query engine to query our data directly from your browser, You can use a SQL-like dialect to create your queries, and we also utilize the SPATIAL EXTENSION. If you're unsure how to use it, you can open the DuckDB console to experiment with any queries or to run queries that you do not want to display on the maps.
You can return any columns from your query, but to display the data on the maps, your query results must include the following columns:
columns | required | default | comments |
---|---|---|---|
ST_AsWKB(location) as location |
yes | - |
Your query must return the location column using ST_AsWKB(location). ST_AsWKB is a special function from DuckDB that transforms geometry into Well-Known Binary data for later parsing. |
color |
no | - |
We will generate a legend based on this column. After the query is completed, we will compare it to generate colors based on the values and the total colors used. You can use any column, but return the column name as color |
radius |
no | 50 |
Similar to the color, you can customize the radius (in meters) of each point by returning a radius column. |
Additional columns from you query, such as nama
, pd
or others, etc (refers to this) and this column will be shown in tooltip only when you hover the data.
You can run any query similar to the one for school locations, but the query for population density will be displayed as a grid instead of points. The values for the legend will be transformed based on the total colors, so each cell corresponds to the maximum values. Here are the columns you can use to display the data:
columns | required | default | comments |
---|---|---|---|
ST_AsWKB(location) as location | yes | - |
Your query must return the location column using ST_AsWKB(location) . ST_AsWKB is a special function from DuckDB that transforms geometry into Well-Known Binary data for later parsing. |
value | yes | - |
The value column represents the population density data for each location. We will calculate the results and total colors to generate the color for each cell based on the value. |
Here is sample query if you want to use
- All Senior High Schools In Jakarta
SELECT 75 as radius, jenjang as color, ST_AsWKB(location) as location FROM sekolah.parquet WHERE kode_provinsi = '31' AND jenjang IN ['SMA', 'SMK', 'MA'] AND location_status = 'valid'
- Public Schools in Central Java
SELECT 500 as radius, jenjang as color, ST_AsWKB(location) as location FROM sekolah.parquet WHERE kode_provinsi = '33' AND status = 'negeri' AND location_status = 'valid'
- Population Grid In Bali
SELECT value, ST_AsWKB(location) as location FROM popgrid.parquet WHERE provinsi = 'Bali'
For this purpose, we use two data sources: Registered Indonesia Schools and a Population Density Map. These datasets are collected and processed before being presented on a web-based platform. For easy of use, we use parquet for final data
Indonesia has two primary ministries that handle education: the Ministry of Education and Culture (Kementerian Pendidikan dan Kebudayaan, or Kemdikbud) and the Ministry of Religious Affairs (Kementerian Agama, or Kemenag). Both ministries oversee schools across similar levels, from pre-elementary (TK) to elementary, junior, and senior high schools, with most schools following the curriculum set by Kemdikbud. The main difference lies in school management, as schools under Kemenag focus primarily on religious education. This includes Madrasahs (Arabic for "schools"), which provide additional religious subjects, as well as Boarding Schools (Pondok Pesantren), which may follow unique curricula tailored to their religious focus.
We collected data from both ministries websites and processed it to ensure consistency in the final dataset. This involved standardizing metadata, including the total number of students, teachers, and employees, along with the crucial coordinate locations of each school. Additionally, we verified that each school's Point of Interest (POI) matched the intended region by comparing coordinates with regional boundaries. Using this method, we were able to confirm that 380,144 out of 475,252 entries were valid.
Table Name : sekolah.parquet
Column Name | Data Type | Comments |
---|---|---|
nama | VARCHAR | Name of the school |
npsn | VARCHAR | National school registration number |
ptk | BIGINT | Total number of teachers |
ptk_laki | BIGINT | Number of male teachers |
ptk_perempuan | BIGINT | Number of female teachers |
pegawai | BIGINT | Total number of staff/employees |
pegawai_laki | BIGINT | Number of male staff/employees |
pegawai_perempuan | BIGINT | Number of female staff/employees |
pd | BIGINT | Total number of students |
pd_laki | BIGINT | Number of male students |
pd_perempuan | BIGINT | Number of female students |
rombel | BIGINT | Number of classes |
kode_provinsi | VARCHAR | Province code (Kemendagri) |
kode_kabupaten | VARCHAR | District/city code (Kemendagri) |
provinsi | VARCHAR | Province name (Kemendagri) |
kabupaten | VARCHAR | District/city name (Kemendagri) |
alamat | VARCHAR | School address |
status | VARCHAR | Status of the school, the value is either negeri (public) or swasta (Private) |
jenjang | VARCHAR | Educational level of the school, (e.g SMA, MA, SMP, etc) |
akreditasi | VARCHAR | Accreditation level of the school |
source | VARCHAR | Source of the data collected, kemenag or kemdikbud |
location | GEOMETRY | Geographic coordinates (lat, long) in geometry format |
location_status | VARCHAR | Verification status of the school's location |
Here information of possible values for location_status
after comparing with region geometry, we use 514 district / city boundaries to check, if coordinates is inside the reported region or outside.
location_status |
information |
---|---|
valid | Reported region and coordinates are valid for either kabupaten (district) or provinsi (province). |
potential-mistmatch | kabupaten (district) is incorrect, but the location is within the same provinsi (province). |
invalid | Neither kabupaten nor provinsi matches the coordinates, coordinates outside of region |
maybe | for Indonesia Schools Abroad (Luar Negeri), insufficient data for validation. |
not-mapped | Incorrect coordinate data, such as (0,0). |
We understand that no actual data for population density maps, all available data is only estimation based on it's location and statistics data. For population density maps, we used GHS-POP data from Global Human Settlement Layer.
We convert the data from raster to xyz values then combine them with regencies geometry to add more additional geographic information for each point.
Table Name : popgrid.parquet
Column Name | Data Type | Comments |
---|---|---|
value | DOUBLE | Population density estimation |
location | GEOMETRY | Center of 1km*1km grid (30 arc seconds) |
kode_provinsi | VARCHAR | Province code (Kemendagri) |
kode_kabupaten | VARCHAR | District/city code (Kemendagri) |
provinsi | VARCHAR | Province name (Kemendagri) |
kabupaten | VARCHAR | District/city name (Kemendagri) |
For this project, we use Datawan Maps, You can explore more about it here.