In one of your data investigations, you talk about how, due to a change in methodology, three million poor people simply dropped out of Rosstat’s statistics. What methods do state statistics use to overestimate or underestimate indicators?
This case shows the multidirectional trends within the Russian bureaucracy. Before the full-scale war started, there was a desire to make the methodology for calculating poverty more modern and similar to Western ones, i.e., to take as a basis the median salary in the country, rather than a fixed poverty threshold.
Rosstat even enacted these changes, but it turned out that with this approach, poverty in Russia is rising, not falling. I do not know what happened next, but the methodology was changed again and the concept of a “poverty line” was reintroduced. The new formula is hard to justify, but it produces better numbers, which pleases the government: it has reported record-low poverty for two years in a row.
Since Rosstat publishes the methodology, we know exactly how these figures were obtained and we can calculate poverty ourselves using different methodologies and compare the results, which is what we did. It turned out that at least three million poor people were excluded from the official data. And if we use more multidimensional approaches, the figure doubles or even triples.
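As a rough illustration of how the choice of methodology alone can move the headcount, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the incomes are invented, the fixed threshold of 13 is arbitrary, and the cutoff of 60% of the median is just one common convention for a relative poverty line. It is not Rosstat's formula, nor the calculation used in the investigation.

```python
from statistics import median


def poverty_rate_absolute(incomes, poverty_line):
    """Share of people whose income is below a fixed threshold."""
    return sum(1 for x in incomes if x < poverty_line) / len(incomes)


def poverty_rate_relative(incomes, share_of_median=0.6):
    """Share of people whose income is below a fraction of the median.

    share_of_median is an illustrative parameter, not an official value.
    """
    threshold = share_of_median * median(incomes)
    return sum(1 for x in incomes if x < threshold) / len(incomes)


# Invented monthly incomes (thousand rubles), purely illustrative.
incomes = [9, 12, 14, 18, 22, 25, 30, 34, 40, 55, 70, 120]

absolute = poverty_rate_absolute(incomes, poverty_line=13)
relative = poverty_rate_relative(incomes, share_of_median=0.6)

print(f"fixed poverty line:  {absolute:.1%}")
print(f"relative to median:  {relative:.1%}")
print(f"difference:          {relative - absolute:.1%}")
```

On the same set of incomes, the two definitions count different people as poor; the difference between the two rates corresponds to the people who drop out of the statistics when the definition is switched.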
There are a lot of ways to get a desired figure – it all depends on the ingenuity of those doing the counting. Another example: the government is trying to reduce the number of schoolchildren who attend the second (late) session, because an order to that effect has come from above. But in reality there are not enough seats, so the term “staggered schedule” has appeared. According to the statistics, classes on a “staggered schedule” count as part of the first (early) session, but in reality they begin at 1:00 in the afternoon, which is when the second session starts.
Sometimes data is distorted in a more “natural” way. For example, in Moscow there is “maternity tourism,” when women come from other regions to give birth, which inflates the birth rate in the capital. Another story is the “republics” in the North Caucasus, where there are also quite a few anomalies in the statistics. Yet there are also innocent mistakes when firms input data.
You launched an open data portal, Cedar, a nongovernmental digital archive of data about Russia. How is it different from other existing platforms?
Our project is aimed at academic researchers, experts and analysts who are engaged in Russian studies in the broad sense of the word. After 2022, many of them lost the ability to conduct field research in Russia, lost their sources and ran into other problems associated with physical inaccessibility. All this negatively affects the level of expertise and reduces the quality of decision-making.
Cedar’s goal is to show that even in the current climate, there are ways to study Russia, and many of them lie in the realm of digital methods.
Sure, data and computational methods cannot replace traditional sociology and ethnography, but they can productively complement them and provide new insights. We use a variety of approaches, from OSINT techniques and critical work with official statistics to scraping social media data and natural language processing.
For example, we have a complete database of court decisions from the beginning of the 2000s, which we have scraped from the websites of a couple of thousand Russian courts. We also have data on the results of federal elections since 2000, broken down by precinct election commission, with a calculated share of anomalous votes.
You collect a thematically wide array of datasets. What data have you found to be the most difficult to find or not available at all?
I would identify four main stages in working with data: searching, downloading, processing and analyzing. Then “combinatorics” begins. There is data that is difficult to find but relatively easy to download. A big chunk of official data falls into this category, because it is not always easy to find the indicator you need on government websites, [since] they are not very user-friendly, and the form you need can be buried very deeply. Or it is spread across dozens of Word files, inside which scans of tables are inserted... And this becomes a processing problem.
Sometimes the data is not that hard to find but you need to get creative to download it. This was the case with pollution statistics, which we obtained using a hidden API, or with the latest data from the e-budget, which could be downloaded in a roundabout way through a graphic widget on the website. In the case of court data, for example, the difficulty is that it is located on many separate sites and is protected by captchas.
From the social networks that are popular in Russia – VK and Telegram – it is very easy to download data but not so easy to analyze it. We strive to make the work of researchers and journalists easier at all four stages.
You probably monitor views and citations in the media. What is the Russian-speaking reader following closely now?
If we look at readers inside Russia, I would highlight two trends. First, there is war fatigue. Second, there is a feeling of the media in exile “losing touch” with readers who remain in the country. Obviously, the war in Ukraine remains the number one topic for Russian-language independent media, but it is important to respond to the trends that I mentioned.
Many newsrooms are trying to water down political content with social issues, add elements of solutions journalism and find “good news” as an alternative to doomscrolling. Everyone has their own answers, and I believe that data journalism can also be one of the “antidotes.” Numbers are often more credible, more objective and more accepted than opinions. Therefore, working with data can be a way to attract a less politicized reader.