Thoughts on data SEO, curation and more from researching a contemporary data question (military support for Ukraine) #1182
rufuspollock asked this question in General
@davidgasquez let me know if you have any thoughts 🙂
Wanted to see which countries have been providing the most military support for Ukraine.
Did a search for:
Here's the journey i went on and some reflections from that. Quick tl;dr
First, reflections on data SEO (and how Statista have mastered this)
First results are specifically tagged as datasets and come from Statista:
Statista seem to have kind of mastered the art of "content-spamming" (? data spamming) search results very effectively. Essentially huge numbers of results, each of which is a small (micro) slice of a larger dataset (? or do they really collect and manage data at this level).
Aside: something to look into, perhaps. And something we were already doing in a way with the github.com/datasets project (but that aspect of micro-slicing is another level, and something we thought about in 2016-2017, e.g. for a dataset like GDP per year and country we would have not just the main page but sub pages for each country and each year and even each country in each year).
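To make the micro-slicing idea concrete, here is a rough sketch of how one country/year dataset could fan out into sub pages. Everything in it is an assumption for illustration: the `micro_slices` function, the `gdp/...` page paths and the placeholder rows are made up, and this is not how Statista or DataHub actually do it.

```python
from collections import defaultdict

# Hypothetical rows for illustration only: (country, year, value).
# The values are placeholders, not real GDP figures.
ROWS = [
    ("AAA", 2015, 1.0),
    ("AAA", 2016, 2.0),
    ("BBB", 2015, 3.0),
    ("BBB", 2016, 4.0),
]


def micro_slices(rows):
    """Map each (hypothetical) page path to the rows that page would show."""
    pages = defaultdict(list)
    for country, year, value in rows:
        row = (country, year, value)
        pages["gdp"].append(row)                                  # main dataset page
        pages[f"gdp/country/{country}"].append(row)               # one page per country
        pages[f"gdp/year/{year}"].append(row)                     # one page per year
        pages[f"gdp/country/{country}/year/{year}"].append(row)   # one page per cell
    return dict(pages)


pages = micro_slices(ROWS)
for path in sorted(pages):
    print(path, len(pages[path]))
```

With 2 countries and 2 years that is already 9 pages from 4 rows, which gives a feel for how quickly the number of indexable pages grows.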
Statista has a nice simple page - just the data i want in visual form
Aside: another thing to look at is this creation of specific small datasets and graphs. This simplicity of data (and presentation) is a direction that DataHub has been going in (to an extent) for a decade now.
The limitations of Statista (IMO)
Once you get to the Statista site from Google, though, everything is locked down. Clicking on almost anything, certainly downloading anything, and even just checking the source of the data leads to this ...
And the irony is that Statista's data is coming from a Kiel academic research institute
It looks like Statista data is actually getting sourced from Kiel (i can't tell for sure as the Statista sources tab is only available with a paid account)
Which was actually the 4th result on that page (and the first non-dataset one)
First page is not the data, it's the analysis
As is usual for a think-tank / research institute.
And a couple of clicks onwards we do get ... yay 🎉
And here's the data
https://www.ifw-kiel.de/fileadmin/Dateiverwaltung/IfW-Publications/fis-import/f34881d0-26f2-4a47-885e-542fe168f9ad-Ukraine_Support_Tracker_Release_17.xlsx
Here's a cached version f34881d0-26f2-4a47-885e-542fe168f9ad-Ukraine_Support_Tracker_Release_17.xlsx
But ... in a "nice" xlsx, disaggregated and large, and not machine readable ...
So it's a 6.7 MB xlsx. Even opening this is a hassle for me: it takes seconds and opens an app i never use (Numbers on Mac), and i don't have a very good app for this (default Numbers is so-so).
Plus it's basically a "dataset in xlsx", replete with a "README" sheet, "updates and corrections" and a data pipeline: there's a raw data sheet and then various computed subsheets derived from that, with additional integrated data (e.g. GDP) so that we can do aid as a percentage of GDP.
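The "aid as a percentage of GDP" step is simple enough to sketch outside the spreadsheet. This is a minimal illustration under stated assumptions: the function names are mine, the country codes and numbers are placeholders rather than Kiel figures, and the workbook's own formulas may well differ.

```python
def aid_share_of_gdp(aid, gdp):
    """Return aid as a percentage of GDP for each country present in both inputs."""
    return {
        country: 100.0 * aid[country] / gdp[country]
        for country in aid
        if country in gdp and gdp[country] > 0
    }


def rank(shares):
    """Sort (country, share) pairs from the highest share to the lowest."""
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder inputs in the same (arbitrary) currency unit, not real data.
aid = {"AAA": 2.0, "BBB": 30.0}
gdp = {"AAA": 100.0, "BBB": 3000.0}

shares = aid_share_of_gdp(aid, gdp)
ranking = rank(shares)
print(ranking)
```

Note how the placeholder country with the smaller absolute contribution ranks first once you divide by GDP, which is exactly why the choice of denominator matters for the headline answer.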
And finally a bunch of tables are messed up with human readable metadata ...
Dataset notes ... in Excel 😉
README .... in Excel 😉
And of course, because this is prepped by analysts, not data folks (engineers / wranglers / scientists), we have a nice chunk of human-readable metadata in our table: a nicely merged cell spanning two rows and four columns!
Good for humans to read .... not so good for machines. (cf https://rufuspollock.com/2013/11/19/bad-data-real-world-examples-of-how-not-to-do-data/)
And buried in their xlsx there is the figure we wanted
It's sheet 10 afaict. Note how Statista have literally reproduced this on a single page that is SEO-able with better UX (e.g. you can hover on the graph, it's just a graph ...)
Human-readable again ...
Note again how we have human-readable not machine-readable with mixed table and graph. Plus table is offset with title etc ...
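For a sense of what the curation step involves, here is a rough sketch of pulling an offset table out of a sheet like this and emitting plain CSV. It assumes the sheet has already been read into a list of rows by some xlsx library (that part is not shown); the `SHEET` layout, the header names and the values are all invented, and the real Kiel sheet will need its own handling.

```python
import csv
import io

# Hypothetical sheet content as a list of rows. Layout and values are invented:
# a title row, a blank row, a table offset by one column, then a footnote.
SHEET = [
    ["Figure title spanning several merged cells", None, None],
    [None, None, None],
    [None, "Country", "Share of GDP (%)"],
    [None, "AAA", 2.0],
    [None, "BBB", 1.0],
    [None, None, None],
    ["Note: human-readable footnote", None, None],
]


def extract_table(rows, header):
    """Find the row holding every header cell and return the rows below it as dicts."""
    for i, row in enumerate(rows):
        if all(h in row for h in header):
            cols = [row.index(h) for h in header]
            break
    else:
        raise ValueError("header row not found")

    records = []
    for row in rows[i + 1:]:
        cells = [row[c] for c in cols]
        if all(cell is None for cell in cells):
            break  # the first fully blank row ends the table
        records.append(dict(zip(header, cells)))
    return records


def to_csv(records, header):
    """Serialise the records as CSV text with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


HEADER = ["Country", "Share of GDP (%)"]
records = extract_table(SHEET, HEADER)
csv_text = to_csv(records, HEADER)
print(csv_text)
```

Locating the table by its header cells rather than by a fixed cell range is a deliberate choice here: titles and notes tend to move around between releases, while column names are usually more stable.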
Why we need data curators
For me, this is a classic demonstration of the value of data intermediaries who curate / refine / prepare -- like DataHub / Data Collective (or Statista).
Kiel are a research institute. Their job is to do research. Their main output there is shiny PDFs (or shiny HTML if we're lucky). Internally they'll use xlsx or similar and if we're lucky in this "open science" / "open knowledge" day and age they'll dump out their xlsx as they do here. Any graphs will be in the PDF etc.
Their job is not to publish data. It's to publish research.
So it's the job of someone else to take that excellent raw data and make it accessible (e.g. graphs) and consumable (e.g. a nice simple CSV).
Of course, in the long-run i hope we get more "data literate" research -- and more research-literate data. But for now, this division of labor makes sense.
And it means there is a big role for DataHub and associated Data Collective.
And in conclusion ... here's a nice dataset with chart on DataHub ...
TODO 😉
And in conclusion ... what about our original question?
Answer: based on 2022-2024, it's Estonia at 1.6% of GDP, followed by Denmark.
And the US is actually lower than Germany (and Canada). Overall the Nordics and EU countries in general are the largest contributors based on GDP.