Okay, it looks like an update from earlier this week broke table/log uploading (though the backend was still updated). I fixed it about halfway through its current run ("Kentucky"), so tonight's run will be incomplete, but it should work fine tomorrow night. Thanks for everybody's patience, and please do @mention me in the future with any issues. I've tried to set up watchlisting/emailing for this page but have never gotten it to work! audiodude (talk) 02:39, 9 April 2025 (UTC)[reply]
The API is only contacted in the case that an article that was previously rated can't be found, to check for article move metadata. So yes, maybe the API was broken, but why did the tool think the article was deleted/moved in the first place?
This is the current page in the tool, which would be uploaded to en wiki, after a manual run of Austria:
The only problem now is that the next time the bot runs, it will see all of these 32k articles as "new" and write that into the Wikipedia:Version_1.0_Editorial_Team/Austria_articles_by_quality_log, which will probably be too large to upload (like the current version, which is likely full of "everything got deleted" messages). So I think what we want to do is purge the logs after the next successful run, and they will start being created again the following day.
Audiodude, if you think the process should be ok now, do you think the bot can be unblocked now, rather than just waiting for the block to expire? Nurg (talk) 01:00, 15 May 2025 (UTC)[reply]
No, I don't think it's a good idea to remove the block early. I honestly don't think it's a good idea to remove it at all until we have a better idea of what caused this and how to prevent it in the future. audiodude (talk) 03:17, 15 May 2025 (UTC)[reply]
I've merged and deployed https://github.com/openzim/wp1/pull/869, which addresses/fixes the issue I raised earlier, to hopefully prevent this from happening in the future. However, I still have no idea what caused it, and I don't have any idea where or what to look for in that regard. The bot logs aren't helpful here. I guess I'm hoping to get an email along the lines of "FYI tool maintainers: the database was returning garbage briefly a day ago".
To be clear, as mentioned in that PR, it is plausible that if the replica database returned an empty list for "FooBar articles by quality", this could happen (which is what my PR aims to address). However, it's not clear why it would return an empty list and not an out-of-band database error. It's not an issue with any of the tables or our query, because those remain unchanged and as I pointed out, manual updates are working.
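To illustrate the kind of guard the PR describes, here is a minimal sketch (function and variable names are hypothetical, not the actual wp1 code): if the replica query for a project's rated articles comes back empty while our local copy is non-empty, treat the result as suspect and abort the update rather than marking every article as deleted.

```python
# Hypothetical sketch of the "empty result" guard; names are
# illustrative and do not correspond to the real wp1 codebase.

def safe_update(project, replica_articles, local_articles):
    """Return the article list to use for `project`, or None to abort.

    If the replica suddenly returns an empty list while we previously
    had thousands of rated articles, assume the query silently failed
    rather than concluding that every article was deleted.
    """
    if not replica_articles and local_articles:
        # Suspicious: an entire project does not empty out overnight.
        # Skip this update cycle instead of cascading mass deletions.
        return None
    return replica_articles


# An unexpectedly empty replica result is rejected...
print(safe_update("Austria", [], ["A", "B", "C"]))  # None -> abort
# ...while a genuinely empty project (both sides empty) passes through.
print(safe_update("Vital", [], []))  # []
```

Note that a project that is legitimately empty on both sides (like the Vital case mentioned later in this thread) still passes through, so the guard only fires on the suspicious asymmetric case.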
So really, I think I've narrowed down the problem to "enwiki_p returned an empty list" but I have no idea still what could have caused that. audiodude (talk) 04:57, 15 May 2025 (UTC)[reply]
I strongly recommend opening an upstream bug at Phabricator, as the current fix in the bot looks more like a workaround than a fix. Kelson (talk) 05:57, 15 May 2025 (UTC)[reply]
Thanks for the ping. @Firsfron can you unblock the bot now? I don't expect this to happen again, and we haven't gotten any real response on the phabricator ticket. audiodude (talk) 12:59, 19 May 2025 (UTC)[reply]
Thank you @Firsfron and @Nurg for the quick actions. Apologies that I wasn't able to have my finger on the absolute pulse of the bot as soon as it started this evening. We're all volunteers here. I will continue to pursue a permanent resolution to this problem. Thanks again for your patience. audiodude (talk) 00:07, 20 May 2025 (UTC)[reply]
Is it possible to block the bot in the User namespace while leaving it unblocked in the Wikipedia namespace, since the problematic edits only occur in the User namespace? Or would a partial block cause the bot to get stuck or prevent it from functioning properly? I'm asking because its edits to the 'Articles by Quality log' pages in the Wikipedia namespace, such as Special:Diff/1291205742, appear fine. Those pages are helpful to many, so it would be great if the bot could continue updating them normally, as it seems capable of doing so. 87.95.243.221 (talk) 16:23, 21 May 2025 (UTC)[reply]
Unfortunately this wouldn't work. The only reason those pages aren't broken is because the bot was stopped/banned before it could get to them. With the current bug, I anticipate that the bot would effectively destroy all pages. audiodude (talk) 16:51, 21 May 2025 (UTC)[reply]
Note that the table is still visible, and manually updatable at [1]. Doesn't help with a changelog, but it does at least let you click on each number to get the list of articles in each cell in the table matrix. Could the "last updated" timestamp be added as a footnote to the table on openzim? The-Pope (talk) 00:13, 22 May 2025 (UTC)[reply]
Right, the tables are being generated correctly, which makes it all the more confusing that the on-wiki updates aren't working.
This morning, I moved the old logs directory out of the way and created a fresh new one. Then I queued and ran the entire update job in the same way that the cron job does.
It seems to have run successfully.
The update jobs all completed successfully, and no project had more than around 100 articles deleted. I spot-checked a few of the tables on the website and they looked good. I was also logging the generated data for the on-wiki tables, and they all had totals except for the project Vital, which had 0. But that seems right, because this category is empty: https://en.wikipedia.org/wiki/Category:Vital_articles_by_quality.
I have no idea what caused the bot to blank out those tables, and it seems to me like it wouldn't happen again since I'm running it exactly as it gets run nightly. But I can't be sure because I haven't identified a root cause.
Currently, my idea is to turn off the automatic run of the bot (cron job), unblock it, then run the bot manually and closely monitor it. audiodude (talk) 17:53, 24 May 2025 (UTC)[reply]
Agreed, but it's very worrying that we are not able to understand the root cause. I would recommend opening an issue to improve the logging/tracing capabilities (if not already done). Kelson (talk) 15:51, 25 May 2025 (UTC)[reply]
Note that there were a couple of stray projects in the queue when the block was lifted, which accounts for the other changes in contribs.
These were run with the queueing system, not from the command line, so as close as possible to the actual mechanism used by the bot. The next step is to do a full, manually triggered run (and monitor it!). If that works, I will restore the database back to the May 11th backup (because all of the "original rating" dates for all ratings have now been lost), and then re-enable scheduled updating. audiodude (talk) 01:40, 26 May 2025 (UTC)[reply]
To be clear, I still have no idea of the "root cause" of this problem. I also believe that when it appeared that the problem was still happening (after the bot was unblocked the first time), the actual issue was a backlog of "upload" jobs in the queue that were uploading empty tables before the "update" jobs of the bot could properly run and refresh the data. The bot runs in two phases: update, then upload. audiodude (talk) 01:45, 26 May 2025 (UTC)[reply]
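The two-phase ordering problem described above can be sketched roughly as follows (hypothetical names, not the actual wp1 queue code): if backlogged upload jobs run before their corresponding update jobs, they push stale or empty data on-wiki. One defensive pattern is to refuse uploads for any project that has not been refreshed during the current run.

```python
# Illustrative sketch of the update-then-upload ordering problem;
# names are hypothetical and not taken from the wp1 codebase.

def drain(jobs, refreshed):
    """Process (phase, project) jobs in order.

    `refreshed` tracks which projects an update job has touched during
    this run. Uploads of un-refreshed projects are skipped, so a
    backlog of old upload jobs cannot push empty/stale tables on-wiki.
    """
    uploaded = []
    for phase, project in jobs:
        if phase == "update":
            refreshed.add(project)  # data for this project is now fresh
        elif phase == "upload":
            if project in refreshed:
                uploaded.append(project)
            # else: stale backlogged upload; silently skip it
    return uploaded


# A backlogged upload that arrives before its update is skipped;
# the one queued after the update goes through.
backlog = [("upload", "Austria"), ("update", "Austria"), ("upload", "Austria")]
print(drain(backlog, set()))  # ['Austria']
```

This is only a sketch of the failure mode under the stated assumption; the real queue presumably has richer job metadata to work with.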
The manual full run seemed to be successful. I'm going to take down the tool and website for a couple of hours right now while I try to restore the data from a backup. Then I'll do another manual run. audiodude (talk) 14:59, 27 May 2025 (UTC)[reply]
Okay, I've restored the database (as mentioned above) and done a full manually kicked-off run. Everything looks good to me. I'm going to re-enable automatic updates and consider this issue closed. Please continue to keep an eye on things, everyone, and feel free to speak up if anything seems off again. Thanks! audiodude (talk) 00:31, 28 May 2025 (UTC)[reply]
Short summary for people who aren't familiar with GitHub: yesterday there was an update to something in the "Event" namespace for that project, and the code didn't understand that namespace and so gave up. Now it understands! --PresN 23:39, 31 May 2025 (UTC)[reply]
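For the curious, the failure mode PresN describes can be sketched like this (hypothetical names; the real fix in the linked repository simply taught the code about the new namespace). The fragile version aborts the whole run on one unrecognized namespace; a tolerant version sets the odd page aside and carries on.

```python
# Hypothetical illustration of tolerating an unknown namespace; this
# is not the actual wp1 code, just the general defensive pattern.
KNOWN_NAMESPACES = {"Article", "Talk", "Category", "Template"}

def filter_pages(pages):
    """Split (namespace, title) pairs into known and skipped lists.

    Previously, one page in an unrecognized namespace like "Event"
    made the code give up entirely; here it is merely set aside.
    """
    kept, skipped = [], []
    for namespace, title in pages:
        if namespace in KNOWN_NAMESPACES:
            kept.append(title)
        else:
            skipped.append((namespace, title))
    return kept, skipped


print(filter_pages([("Article", "Austria"), ("Event", "Some event page")]))
```

Skipping (with a log line) rather than crashing keeps one odd page from blanking thousands of tables, at the cost of possibly hiding a real configuration gap, which is why the upstream fix added the namespace properly instead.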