User:WhatamIdoing has expressed a desire for data regarding popular WikiProjects, and those data interest me also. User talk:Mr.Z-man/Archive 12#Popular WikiProjects? has a record of the request. I would like to see a monthly report (of page views or edits or watchers; preferably one report for each of those three variables). I have added this talk page to my watchlist, and I will watch for a reply (or replies) here.
—Wavelength (talk) 19:25, 27 October 2010 (UTC)[reply]
Earlier this year the community introduced a new form of deletion for new unreferenced BLPs, but there's a concern that we aren't tagging as many as we could. Could we please have the following regular report:
Red-linked categories with significant incoming links
In light of this discussion, would it be possible to generate a report listing red-linked categories (i.e., deleted or never created) which have significant incoming links (i.e., from any namespace except User: and Wikipedia:)? Thank you, -- Black Falcon(talk)17:31, 14 November 2010 (UTC)[reply]
I think Special:Wantedcategories generally updates regularly, which is why this report was never created, although that report only lists the thousand most populated categories. If someone were to go through and either bluelink or de-link the first couple hundred categories depending on their specifics, however, you would likely get everything with 2 category members or more. VegaDark (talk) 01:35, 15 November 2010 (UTC)[reply]
Ah, incoming links, for some reason I was thinking category members. Nope, guess there isn't a report for that. Although I bet there will be a ton of redlinked crap categories we've used in examples in deletion discussions over the years. VegaDark (talk) 10:06, 15 November 2010 (UTC)[reply]
I hadn't thought of those, but you're right: there will be a lot of those, as well as links from CfD nominations and notifications to users. I don't know if it is possible to generate a report that would ignore incoming links from the User: and Wikipedia: namespaces but, if it is possible, it would help to focus the list. -- Black Falcon(talk)17:06, 15 November 2010 (UTC)[reply]
So you want red-linked categories (with or without members) with incoming links from namespaces other than Wikipedia (4) or User (2)? What should the report title be? How often should it update? The more detail you can provide, the less thinking I have to do (and thus it's much easier for me to knock this out in a few minutes). :-) --MZMcBride (talk) 04:58, 23 February 2012 (UTC)[reply]
The more I think about it, the more I think that you actually want the inverse here regarding incoming links. That is, you don't want cases where there are incoming links from all but certain namespaces (page_namespace NOT IN (1,2,3,4,5,7,9,11,13,15,17,19,101,103,105)), you want cases where there are incoming links from certain namespaces (page_namespace IN (0, 6, 10, 14)), right? If I created a list of red-linked categories and put it somewhere in the Wikipedia namespace, it would screw up any subsequent reports, as they'd all have links from a page in the Wikipedia namespace, using your logic. But that's not what you're looking for, is it? You're looking for cases where there are red-linked categories from articles, files, templates, or other categories... I think? --MZMcBride (talk) 23:25, 24 February 2012 (UTC)[reply]
P.S. I kind of feel terrible now after realizing that this request has been sitting here for over a year and it took like ten minutes to complete. I'm the worst.
Your work and assistance are greatly appreciated, no matter the time it takes. :) The report looks perfect; thank you!
You're right about the logic of inclusion/exclusion, since my interest is in links from certain namespaces rather than all links except those from certain namespaces; ultimately, we're excluding more namespaces than we're including. The namespaces of interest would be the main namespace (0) and the file (6), template (10), help (12), category (14) and portal (100) namespaces. I suppose that links from the MediaWiki (8) and book (108) namespaces should be updated, too, but that's a theoretical consideration since pages in those namespaces tend not to link to categories. -- Black Falcon(talk)19:24, 5 March 2012 (UTC)[reply]
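For reference only, a minimal sketch of what such a query might look like (standard enwiki pagelinks/page tables assumed; not the query actually used for the report):
SELECT pl_title, COUNT(*) AS incoming_links
FROM pagelinks
JOIN page AS source ON source.page_id = pl_from
LEFT JOIN page AS target ON target.page_namespace = 14 AND target.page_title = pl_title
WHERE pl_namespace = 14
  AND target.page_id IS NULL  -- the category page is red-linked
  AND source.page_namespace IN (0, 6, 10, 12, 14, 100)  -- article, file, template, help, category, portal
GROUP BY pl_title
ORDER BY incoming_links DESC;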
By the way, is it possible also to include category pages that contain no members – e.g., ones that have been deleted and emptied? -- Black Falcon(talk)19:27, 5 March 2012 (UTC)[reply]
Dunno. This report's logic has twisted my mind pretty badly.
It's my fault since, looking at the report now, I realize that I had not taken into account two key factors.
The report that I had in mind originally was of category pages which are red-linked and empty and have incoming links from namespaces 0, 6, 10, 12, 14, or 100. That report would have been useful in removing links to categories which should not have any incoming links.
This report is of category pages which are red-linked and populated and have incoming links from the above namespaces. It is arguably more useful since it identifies categories that need to be created (and properly subcategorized) or emptied – a task that is more important than link removal. I will work on the report over the next few days and provide feedback here.
(unindent) Perhaps having a "Members" column would be good? That way you could easily sort the report by ones that are empty (0 members) and ones that aren't. Thoughts? --MZMcBride (talk) 22:00, 17 March 2012 (UTC)[reply]
P.S. Just for my reference, this report takes 45 minutes to run currently on willow.
List of longest 'DISCUSSION' content.
Hi,
Is it possible to get a list of (Wikipedia-) articles which have the largest 'Discussion' content?
On these 'Talk:' pages you also see the number of archives; for example the article about Barack Obama has 71 archives, and will probably end up high in the list.
Thanks in advance. Kyra —Preceding unsigned comment added by 82.171.75.225 (talk) 10:55, 17 November 2010 (UTC)[reply]
Looks pretty good. Might be possible to limit to pages with "rchive" in them and then add the root page for a more precise count of actual discussion, but your method is probably about as accurate, relatively.
One thing. Perhaps it might be better to put the column in megabytes? For example, 40 230 kilobytes = 39.2871094 megabytes (according to Google); I think the megabytes figure is much easier to understand and appreciate.
Okay, let's ignore Titleblacklist then. Can we at least have a database report of all redirects from the file namespace with one or fewer incoming links stored somewhere (e.g. here)? :| TelCoNaSpVe :|22:00, 7 December 2010 (UTC)[reply]
On this version of the report, there is an entry for Head of the River (Victoria), with the entry noting 8 files in use. Yesterday, I removed [1] more than 400 uses of non-free files from this page. The report is showing the number of files, not the number of uses. This particular article has gone largely unnoticed for three years because "8" uses is not a high number. Yet this article is certainly one of the highest, if not the highest, abusers of non-free content we've ever had on this project.
I'd like to see this report modified, or a new report created, to show how many non-free file uses there are, rather than the number of non-free files used. --Hammersoft (talk) 14:18, 20 October 2010 (UTC)[reply]
The database only tracks binary usage data. This is also true of templates, categories, page links, and external links. It might be possible to add a column to this report, but it will not be possible to get number of uses for all pages. That might result in some false negatives: if a page is using only one non-free file, but uses it a few hundred times, it won't be listed.
In order to count uses, a regex will have to be run on the page text, I think. This will need to account for templates and other such things. I have an idea of how to do this in my head, but it might be a while before I get around to it. --MZMcBride (talk) 18:52, 20 October 2010 (UTC)[reply]
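As a rough illustration of that regex approach (a sketch only, covering direct [[File:]]/[[Image:]] syntax; it does not handle templates, so it is not the eventual implementation):
import re

def count_file_uses(wikitext, filename):
    # Counts direct [[File:...]] / [[Image:...]] links in raw wikitext,
    # treating spaces and underscores as equivalent. Uses that arrive via
    # template transclusion are missed.
    parts = filename.replace('_', ' ').split(' ')
    name = r'[ _]+'.join(re.escape(part) for part in parts)
    pattern = re.compile(r'\[\[\s*(?:File|Image)\s*:\s*' + name, re.IGNORECASE)
    return len(pattern.findall(wikitext))

# count_file_uses(page_text, 'Example.jpg')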
The Talk pages by size MfD
The Talk pages by size MfD raised a lot of good issues and ideas. A main problem is that the database reports have very little information about them on the page, which has led to confusion (as brought out in the MfD). Most of the database reports pages are not categorized. In response to the MfD, I modified the Wikipedia:Database reports/Talk pages by size page text using posts from the MfD. I also created Category:Wikipedia database reports as a place to categorize all the pages listed at Wikipedia:Database reports and other database reports pages wherever they may be. Initially, I thought of using {{infopage}} on the top of the database pages, but that template didn't fit because the database report pages do not describe communal consensus on some aspect of Wikipedia norms and practices. I added a generic one to Talk pages by size instead. In view of the talk page tagging with the admonishment "Perhaps this will motivate greater efficiency in the use of kilobytes," you may want to create a top template to distinguish a utility report from a statistical report. I added text to the top of Wikipedia:Database reports to address this. You may further want to expand the text of Wikipedia:Database reports to provide some Database reports consensus control over the use of utility reports and statistical reports. -- Uzma Gamal (talk) 16:54, 25 November 2010 (UTC)[reply]
The MFD was a complete waste of time and offered (almost) nothing of value, except to point out once again that some people posting here (on Wikipedia) are clueless. The category additions (and other additions) are going to be overwritten the next time the reports update. A better approach to the intro text of the reports is needed, but there are a lot of other projects I'd rather work on. --MZMcBride (talk) 22:49, 25 November 2010 (UTC)[reply]
Would it be possible to generate a report of "/doc" subpages in the Template namespace for which the main template page does not exist or is a redirect? Thanks, -- Black Falcon(talk)05:13, 29 November 2010 (UTC)[reply]
I've tried it for other types of subpages; it isn't useful. There are navboxes with titles like 1984/85_X_game that are still transcluded, and <noinclude> used to not work with the preload function so those templates were left uncategorized. We could improve Wikipedia:Database reports/Unused templates so that it's more useful... — Dispenser18:31, 5 December 2010 (UTC)[reply]
I was wondering if Wikipedia:Database reports/Unused templates could possibly be changed so that templates which transclude {{substituted}} within them or within their /doc are either not included, or are separated into a separate section entitled "Substituted Templates" or something similar? This would make it much easier to see which templates intentionally have no transclusions (because they are supposed to be substituted) and those which are just unused. Thanks! Mhiji (talk) 18:23, 12 December 2010 (UTC)[reply]
Templates are more flexible than categories. They allow content to be included or not, they allow for easier redirects and renames, etc. Use a template. It doesn't matter if you use {{transclusionless}} or {{substituted}}. Either will work. --MZMcBride (talk) 02:15, 13 December 2010 (UTC)[reply]
Request: List of articles found in every language but English
I started an article recently and then found there were already 20+ other versions on other language Wikipedias. This got me wondering what other articles exist in multiple other languages but don't appear in English. Has anyone tried to make a list of topics which have not been translated into English, sorted by the number of other language Wikipedias they appear in? Or is anyone willing to try making such a list? Thanks in advance. —Pengo01:41, 25 December 2010 (UTC) (Crossposted from Wikipedia:Village_pump_(idea_lab)#Every_language_but_English.3F). And if you need more specifics, I'd be happy to flesh out the idea.[reply]
I worked on this for 2 hours and I don't have much to show for it. There are a lot of Wikipedias (269), some of which are really small (the Cheyenne Wikipedia has 57 articles). Also, on the Toolserver, the Wikipedia databases are served across 3 separate servers, which makes querying difficult. If you (MZMcBride) want to see my work, I've included the query below. Good luck! --05:04, 28 December 2010 (UTC)
Err, aren't you assuming that each wiki would have the article with the same title? Surely other language Wikipedias will have localized titles like "World War II".... You need to focus on the langlinks table. --MZMcBride (talk) 05:05, 28 December 2010 (UTC)[reply]
I had already posted the query at the village pump and concluded that it would not work since 1) the interwikis aren't kept in good condition and 2) other languages use different page structure. On DE Quantum physics and Quantum mechanics are separate while on EN they are covered in the same article. — Dispenser05:45, 28 December 2010 (UTC)[reply]
I thought of approaching this as finding the “biggest interwiki networks that don't include enwiki”, but I think that would be difficult or impossible to do with just SQL. Svick (talk) 11:12, 28 December 2010 (UTC)[reply]
I created Spanish pond turtle yesterday for just that reason (having 7 other language pages). I think you need to look at the bigger other-language Wikipedias (.es, .de) and check whether an article has no .en link but has many to others. Regards, SunCreator(talk)14:36, 28 December 2010 (UTC)[reply]
Re interwiki links not being in good condition. My experience with The death anomalies project has been quite the opposite - interwiki link errors do exist, but are rare. I think this sort of report would generate a useful list of articles that we potentially should have, and where instead it identifies that we are missing an interwiki link, then that is also useful info. So yes it would be worth doing this by checking for articles on DE, FR etc that have large numbers of interwiki links but not an EN one. Whether the result is an extra article or an added interwiki link to an existing article doesn't matter as both are positive outcomes. ϢereSpielChequers17:25, 28 December 2010 (UTC)[reply]
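As an illustration of that per-wiki approach (a sketch against one large wiki such as dewiki, using the standard page/langlinks tables; not an existing report's query):
SELECT page_title, COUNT(*) AS language_links
FROM page
JOIN langlinks ON ll_from = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND NOT EXISTS (SELECT 1 FROM langlinks AS en WHERE en.ll_from = page_id AND en.ll_lang = 'en')
GROUP BY page_id, page_title
ORDER BY language_links DESC
LIMIT 500;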
Is it possible to have a report that searches for media tagged {{mtc}} (and related)
but which don't have a recognisable license, or one that's incompatible with Commons (such as fair-use)?
Report to run on a daily or weekly basis depending on the volume of media concerned?
I don't know what "recognisable license" means. I also don't know which licenses are incompatible with Commons. Is there a category for such a thing? --MZMcBride (talk) 02:00, 16 January 2011 (UTC)[reply]
I have been running into bots on the list from the jira:DBQ-87 query. I don't think bots are allowed to create articles except in special cases, but I've been seeing them. Thanks again. - Hydroxonium (H3O+) 23:50, 18 January 2011 (UTC)[reply]
I maintain a database that contains page_id, page_creator pairs. So rather than querying each page, you can just look up the stored info. It's at u_mzmcbride_enwiki_page_creators_p. This query uses that database, but it got interrupted today. This really is too obnoxious to put into one query. It needs to be split out into multiple parts using some sort of sane logic. I may do this tomorrow if I get a chance. --MZMcBride (talk) 07:22, 19 January 2011 (UTC)[reply]
I actually saw that when I was writing the query but didn't use it because the table is indexed so that it is optimized for pageid->creator, instead of the other way around. This means that MySQL has to do a table scan in order to find each page that the user created. Maybe you could create an index on top of the table to help with this? CREATE INDEX user_creations ON u_mzmcbride_enwiki_page_creators_p.page (page_creator); or something would do the trick. Tim1357talk04:01, 20 January 2011 (UTC)[reply]
I don't know what you're talking about. The schema looks like this:
I don't really have the time to do this right now, but I think the best method is to do a query to get all the page creators with over fifty article creations (should take about half an hour), then do a Pythonic set comparison with the bots/autopatrollers/sysops and then check last edit/first edit for each user using functions/individual queries. That's how I mapped it out in my head, at least. I probably won't get around to this for a while, though. --MZMcBride (talk) 04:09, 20 January 2011 (UTC)[reply]
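A very rough sketch of that multi-step approach (assuming Toolserver-style replica access and the page creators helper table mentioned above; the connection details and the 'autoreviewer' group name are assumptions, and this is not the script that was eventually used):
import os
import MySQLdb

conn = MySQLdb.connect(db='enwiki_p',
                       read_default_file=os.path.expanduser('~/.my.cnf'))
cursor = conn.cursor()

# Step 1: page creators with more than fifty article creations.
cursor.execute('''
SELECT page_creator, COUNT(*)
FROM u_mzmcbride_enwiki_page_creators_p.page
GROUP BY page_creator
HAVING COUNT(*) > 50
''')
creators = set(row[0] for row in cursor.fetchall())

# Step 2: set comparison -- drop accounts that are already bots, sysops, or autopatrolled.
cursor.execute('''
SELECT user_name
FROM user
JOIN user_groups ON ug_user = user_id
WHERE ug_group IN ('bot', 'sysop', 'autoreviewer')
''')
flagged = set(row[0] for row in cursor.fetchall())
candidates = creators - flagged

# Step 3 (not shown): per-user queries for first/last edit timestamps, keeping
# only accounts that are old enough and still recently active.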
Thanks very much for the help, MZMcBride. Yes, please split up the query if that will ease the task. If it ends up being too obnoxious, I would be fine just running the original jira:DBQ-87 query again. Then I can manually go through the list and check for the other things (6 months old, activity in the last 30 days, etc.). That's what I have been doing with the old list. It's just extremely labor intensive. I've spent over 20 hours manually sifting through that old list and I'm starting to go bonkers. Ughhh. Anyway, thanks again. I appreciate the help.- Hydroxonium (H3O+) 08:27, 19 January 2011 (UTC)[reply]
I don't know SQL or anything about the Wikipedia database, but I have been reading up on stuff. So this is likely to be an annoying question that can be ignored, but I'll ask it anyway. Is user_touched a field in the database, and could that be used to check for recent activity? Thanks again for the help. - Hydroxonium (H3O+) 08:58, 19 January 2011 (UTC)[reply]
HJ Mitchel, that won't slow it down much. We could just add AND MAX(rev_timestamp)>DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 90 DAY),'%Y%m%d%H%i%s') to the HAVING statement. Tim1357talk04:17, 20 January 2011 (UTC)[reply]
(unindent)
Hi guys. Thanks very much for working on this report. Could this be run monthly? It would help the guys assigning permissions if the workload came in smaller batches more frequently rather than one big batch. Thanks again. - Hydroxonium (H3O+) 18:11, 20 January 2011 (UTC)[reply]
One more. This is probably difficult, but would it be possible to search for and remove users that were declined in the previous month? This might be more appropriate for a bot to do, maybe. Thanks much. - Hydroxonium (H3O+) 18:20, 20 January 2011 (UTC)[reply]
Sorry, I wanted to tweak it a bit before posting here, but I didn't find the time until now. Specifically, I was trying to remove those that didn't create any article recently, but that caused the report to run too long. But I removed those that were declined this or last month.
MZMcBride, this is exactly the reason I don't want to take over – I don't always have the time.
The idea behind a multi-maintainer project is that hopefully the load could be distributed between people, allowing for a faster response time for new reports/queries. I think of it as a long-term goal, though. Nothing needs to be done this instant. --MZMcBride (talk) 23:03, 26 January 2011 (UTC)[reply]
use {{fullurl:}} instead of [http://en.wikipedia.org... ] for portability to other wikis and better links on the secure site (this also removes the need for Python's urllib module)
you want ~~~~~, not ~~~~ (it looks like this was already fixed)
if you're limiting the number of output results, say so in the report description text
if you're doing other checks (such as file protection status), say so in the report description text
"annother" is misspelled in the report description text currently
don't hardcode database names and host names in the configuration
don't hardcode namespace names (even if they are canonical, use toolserver.namespacename)
Other than that, looks good. It'd be nice to get a multi-maintainer project for these reports going at some point. I'll look into that at some point. --MZMcBride (talk) 05:33, 24 January 2011 (UTC)[reply]
Awesome. :D Thanks for the quick response. Last minor nitpick is that you're still importing urllib (when it isn't being used any longer), but that's no big deal. Thanks for helping out! I really appreciate it. --MZMcBride (talk) 06:00, 24 January 2011 (UTC)[reply]
Although the updated report is broken: You need to have a newline after table row delimiter |- and you need to add ns_id=6 to the SQL, so that it actually uses the File namespace. Svick (talk) 19:26, 26 January 2011 (UTC)[reply]
One more thing: I think you want to test separately that each of the images is not protected (i.e. using two left joins). This way, each duplicate pair will show up only once. Svick (talk) 13:52, 28 January 2011 (UTC)[reply]
User subpages for users indef blocked for spamming
There are many users that have been indefinitely blocked for spamming, for using Wikipedia for promotional purposes, or for having a promotional username. Quite often these users have subpages in their userspace that contain promotional material that hasn't been deleted. This material can show up in Google searches - which is probably why spammers put it there. Would it be technically possible to have a database report that lists these pages? If so, would others find this useful enough to be worth the effort? Peacock (talk) 17:46, 23 June 2010 (UTC)[reply]
Probably worth the effort. Do you have an example subpage (that hasn't been deleted) so that I can verify the results of any queries? --MZMcBride (talk) 17:53, 23 June 2010 (UTC)[reply]
That user was not blocked for spam, so it's not really a helpful example. I wrote the query below, but for some reason it does not work. Perhaps MZMcBride will be clever enough to write a better query. Tim1357talk01:18, 26 June 2010 (UTC)[reply]
I assumed PCock wanted current blocks, not all block actions, so I checked against the ipblocks table. It's an insanely slow query, though. Might be better to do it in sets in a programming language. I imagine you'd want to check the ipb_reason field for a few strings like "usernamehardblocked", "spam", etc. --MZMcBride (talk) 02:10, 26 June 2010 (UTC)[reply]
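A sketch along those lines (illustrative only, not the query used here; the ipb_reason strings are guesses): user subpages whose root user is currently indefinitely blocked with a spam-related reason.
SELECT page_title
FROM page
JOIN ipblocks ON ipb_address = REPLACE(SUBSTRING_INDEX(page_title, '/', 1), '_', ' ')
WHERE page_namespace = 2
  AND page_title LIKE '%/%'
  AND ipb_expiry = 'infinity'
  AND (ipb_reason LIKE '%spam%'
       OR ipb_reason LIKE '%promotional%'
       OR ipb_reason LIKE '%usernamehardblocked%');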
Hmm. I think this is one query that is not going to be done gracefully. I ran a new query that is set to save here when it finishes. (Note each page title lacks the 'User:' prefix). I'll check back tomorrow to see if the query had any success, otherwise I am out of ideas. Tim1357talk02:36, 26 June 2010 (UTC)[reply]
I really hope you were using logging_ts_alternative table instead of logging when you were running those queries on the Toolserver. — Dispenser14:56, 6 July 2010 (UTC)[reply]
Anyways, I bypassed the logging table altogether and just used ipblocks. I limited the query to the first 200 results, which will be automatically saved here when the query completes. --Tim1357talk02:35, 7 August 2010 (UTC)[reply]
Quite. I can't do all the work, you see. [Though expanding the report to MFD discussions might be worthwhile as well] –xenotalk22:34, 3 August 2010 (UTC)[reply]
"User pages of non-contributing users" or something like that. That is, users who come here, register an account, create a user page (that's often spammy, vandalism, or a test page), and never edit anywhere else but on their user page. There are some added complications now (a lot of people land here from other Wikimedia wikis, especially with unified login now, e.g.) and there still isn't a very clear title, but I think the overall idea has merit. --MZMcBride (talk) 00:30, 30 September 2010 (UTC)[reply]
A biweekly (or weekly if it seems useful) report I'd like to see is a list of most-linked redirects. For example, the bot would count the number of times John Blain (football player) is linked, showing the top 500 or so redirects. While some would be useful and should be left alone, others could be found that were the result of page moves that should have been fixed a long time ago. WizardmanOperation Big Bear21:23, 9 October 2010 (UTC)[reply]
Some don't need to be, but linking to an outdated link could cause created articles to follow that nonstandard format, leading to more work for the rest of us. That and while some redirects don't need to be cleaned up, many do. WizardmanOperation Big Bear04:15, 11 October 2010 (UTC)[reply]
Huh, you've actually asked about this before. And I've already given you the pagelinks table lecture, so now all I need to do is remember to actually write this report. :-) I'll try to get to it tomorrow. For what it's worth, I doubt it will be a weekly report. It'll likely be kind of "expensive" to generate/update. --MZMcBride (talk) 07:21, 11 October 2010 (UTC)[reply]
To Wos: that could work. To MZM: If I did ask earlier then I forgot about that. I understand that it would definitely be a tough list to create, and limiting it to monthly given the update expense is fine. WizardmanOperation Big Bear18:38, 11 October 2010 (UTC)[reply]
I just noticed my request several sections up. To clarify, this would just be for mainspace articles; no reason to bother with others. If you want to trim it more and remove other redirects which shouldn't be touched, then we can remove any that include the category "redirects with possibilities" which should knock time off running it. WizardmanOperation Big Bear18:42, 11 October 2010 (UTC)[reply]
Take a look at Most-linked redirects (configuration). The initial limit was 30 incoming links. I've now raised it to 200 incoming links and it's running currently. We'll see what it outputs, how useful it is, and what (if any) adjustments are needed once it finishes.
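Roughly, the query involved has a shape like the following (a sketch only; see the linked configuration for the real one):
SELECT rd.page_title, COUNT(*) AS incoming_links
FROM page AS rd
JOIN pagelinks ON pl_namespace = 0 AND pl_title = rd.page_title
JOIN page AS src ON src.page_id = pl_from AND src.page_namespace = 0
WHERE rd.page_namespace = 0
  AND rd.page_is_redirect = 1
GROUP BY rd.page_id, rd.page_title
HAVING COUNT(*) > 200
ORDER BY incoming_links DESC;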
Thanks. Admittedly I was not expecting that many redirects to have such a large number of links. So far it's helped find a few issues that I've handled, even though most redirects there would be left alone. WizardmanOperation Big Bear02:57, 12 October 2010 (UTC)[reply]
It might be helpful to see if there are ways to refine the scope of the report. This can usually be done by looking at specific examples to see if there are distinguishable characteristics about the page that can be broadly applied to other similar pages. For the example you provided (John Blain (football player)), we can say a few things:
the page title contains parentheses (which usually signify disambiguation of some kind)
the page has 7 incoming links from other articles
the page title signifies that the person is a living person (using a common first name)
Some of these observations are completely useless. Some aren't. If you can try to give me a better idea of what kind of pages you're looking for (perhaps with some other specific titles), I can try to make a better report for you. This might mean that the report will ultimately be moved to a different title, but that's no big deal. --MZMcBride (talk) 03:03, 12 October 2010 (UTC)[reply]
Out of those, the main one that jumps out at me would be the move log; that is something I was hoping to work on, alongside the parentheses modifier as well. The many football player disambigs, for example, stem from a pretty big move war from 07-08, so there's a lot of articles that go to the wrong places still. WizardmanOperation Big Bear03:20, 12 October 2010 (UTC)[reply]
Long pages changes
The page has two edits ([5], [6]) in the history that probably have to be implemented with the bot. At the same time, the 140K threshold could also be lowered to 130K. --Kslotte (talk) 11:14, 5 December 2010 (UTC)[reply]
All pages containing colons
I'd like to see a report that has information about:
The number of pseudo-namespace redirects out there.
Pages that can have possible conflicts with interlanguage/iso-code prefixes.
Maybe? You need to be much more specific, though. I assume you're talking about pages such as CAT:CSD? It's easy enough to list all page titles that contain a colon (":"), but that probably isn't what you want (even if your section header says otherwise). Pages whose titles begin with language codes in use (such as "en") are inherently forbidden. (Try making the page "En:foo" on this wiki.) So it's rather unclear what you're after and for what purpose. If you can clarify, that would help things along. --MZMcBride (talk) 03:04, 26 January 2011 (UTC)[reply]
I have been tackling some of the entries in this report but it is a bit difficult to see the wood for the trees sometimes as there are a lot of such categories which only contain pages in user space (e.g. Category:Capitalist Wikipedians and the like). Would it be possible to split the report (there are already 3 subpages but these seem to be just for size reasons) into say
Red-linked categories containing at least one page in article space.
Red-linked categories containing only pages in user space.
I don't have an objection to this per se, but it may be easier just to empty those user categories which do not start with User or Wikipedia. I've started working on this. -- Black Falcon(talk)22:13, 28 January 2011 (UTC)[reply]
Long stubs
Would be nice to be able to keep this at 500-1,000 items. Currently it has fewer than 100 entries that could still be stubs after removing over 500 !stubs. Vegaswikian (talk) 07:30, 3 February 2011 (UTC)[reply]
For the Cats listed at Wikipedia:Database reports/Deleted red-linked categories, would it be possible to change the members column (showing how many pages are in the deleted cat) to use PAGESINCAT? Or, if the number is logged for the database report as how many pages were members at the time of the scan, perhaps add a new column with a "current" heading? This would probably increase the page count, as I believe you can only call 500 parser functions at a time (as opposed to the current 800), but I think it would help to process and clean up the entries there. Avicennasis @ 11:08, 30 Shevat 5771 / 4 February 2011 (UTC)
And secondly, a new report that checks all the categories that transclude {{Template category}} to see if any of them contain article pages would be nice.
I've done some SQL below but I'm not certain whether it will work or needs some tweaking.
I've selectively archived this talk page again. If you want a new report or a tweak to an existing report, post (or re-post) here and someone will take care of it, eventually. --MZMcBride (talk) 17:30, 6 February 2011 (UTC)[reply]
I am not sure that I agree and WP:PROJCATS implies that this should not be done. The reason being that the article categories are for readers whereas templates are usually only of interest to editors. Boissière (talk) 22:30, 17 February 2011 (UTC)[reply]
I would argue that it is helpful to readers when a navigation template (intended for readers, not editors) for a particular topic is placed in a category for the same topic (this is, in my opinion, the only circumstance when such "template pollution" is useful). In general, it is quite common for a limited amount of project content to be categorized directly into a content category when the topic of the content and the category is the same. For instance, stub categories always appear in content categories, sorted after the main alphabetical list under the character "µ". The situation is similar with Wikipedia books, which are sorted under the character "β", and templates, sorted under the character "τ". -- Black Falcon(talk)17:58, 18 February 2011 (UTC)[reply]
Neat. :-) There are already a number of other Toolserver tools listed in the index on the subject-space page. Feel free to add a row for your tool/report if you'd like. I'm sure people would find it helpful. Cheers. --MZMcBride (talk) 23:24, 21 February 2011 (UTC)[reply]
I'm not sure how useful this would be, but a report like "talk pages that are redirects when parent article is not" may be helpful, particularly in finding poor moves or copy-paste articles. Avicennasis @ 02:49, 20 Adar I 5771 / 24 February 2011 (UTC)
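A sketch of what such a query might look like (standard page table assumed; not an existing report):
SELECT talk.page_title
FROM page AS talk
JOIN page AS subject ON subject.page_namespace = 0 AND subject.page_title = talk.page_title
WHERE talk.page_namespace = 1
  AND talk.page_is_redirect = 1
  AND subject.page_is_redirect = 0;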
Hah, it's configured for the 29th day of the month, too, so it missed February altogether. It should probably be changed to weekly. --MZMcBride (talk) 16:28, 2 March 2011 (UTC)[reply]
I tried that when I was creating the report, but failed, because the resulting query was too slow. I'm trying an alternative approach now: creating my own modified copy of MZMcBride's page creators table. Svick (talk) 22:10, 2 March 2011 (UTC)[reply]
When someone moves an article to a new name, they may not notice that sometimes there was a /to do subpage off the talk page, which then ends up being orphaned since it is not moved with the page and talk page. So just wondering if a query which lists pages that are redirects where there is also a /to do subpage that exists would be a useful report? -- WOSlinker (talk) 20:54, 2 April 2011 (UTC)[reply]
Thanks. Just wondering, maybe it would be better to only list those without any links to them. So adding the following to the where clause might help. -- WOSlinker (talk) 11:17, 24 April 2011 (UTC)[reply]
and not exists (
select *
from pagelinks
where pl_from=sub.page_id
and pl_title<>'Database_reports/Talk_subpages_with_redirect_parent'
)
You're not specifying page_namespace here. That must make it much slower, no? (Actually, just tested, it's 6 seconds vs. 0, but still would be much better to specify here. It would also be nice to make it abstract and use the value from report_title if possible.)
Yeah, you're right. I was thinking that there's no chance that there will be a page with the same name in another namespace, but forgot about indexes. User<Svick>.Talk();19:00, 24 April 2011 (UTC)[reply]
Welcome to Wikipedia. ;-) Stupid situations will happen a lot; it's good to be able to identify where things have gone wrong and fix them, automatically if possible. Mike Peel (talk) 07:50, 4 April 2011 (UTC)[reply]
I'm not sure I'd call needing an extra click "horribly annoying," but yeah, these should be fixed. At least a non-existent section still generally gets you to the appropriate page; it's not as though it returns an error page or something silly like that.
I'm kind of surprised WikiProject Check Wikipedia doesn't catch this already, but the extra requests needed to get the parsed section headers probably explain that. I'll try to get to this over the weekend. I have a thought or two about how to do it. --MZMcBride (talk) 06:30, 29 April 2011 (UTC)[reply]
I wrote this at Broken section anchors (configuration). There were about 700 results from the first 2500 redirects containing an anchor that were checked. That's about 28% of redirects pointing to a section that are affected, assuming the rate of false positives is low (which it seems to be). I'll try to get this report set up to update a bit more regularly in the coming days/weeks. For now, there's plenty of work to be done.
Briefly, I'll note that in a lot of these cases, the answer isn't to simply remove the anchor or find an alternate target header. Sometimes the answer is to point the redirect to a different page altogether or add an {{anchor}} to the target page. --MZMcBride (talk) 01:03, 2 May 2011 (UTC)[reply]
^\s*#REDIRECT:?.*?\s*\[\[([^|[\]]*?)(?:\|[^[\]]+)?\]\] is the regex I used for parsing redirects, I suggest you use it. — Dispenser22:39, 5 May 2011 (UTC)[reply]
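A usage sketch of that regex (the re.IGNORECASE flag is an assumption, added so lower-case "#redirect" also matches):
import re

REDIRECT_RE = re.compile(
    r'^\s*#REDIRECT:?.*?\s*\[\[([^|[\]]*?)(?:\|[^[\]]+)?\]\]',
    re.IGNORECASE)

def redirect_target(wikitext):
    # Returns the redirect target, including any #Section anchor, or None.
    match = REDIRECT_RE.search(wikitext)
    return match.group(1) if match else None

# redirect_target('#REDIRECT [[Foo#Bar]]') -> 'Foo#Bar'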
In the current dump there are 203808 redirects pointing to sections. A limit of 100 hits is reached after checking 813 of these, so maybe there are 25000 bad section redirects in total. (This ratio corresponds to the one seen on dewiki.)
Currently it's not, simply because it grew from a first try to a working solution without emphasis on style and readability. The basic approach for section parsing is to parse the wikicode for known section markers (section headlines, Template:Anchor, etc.) and additionally check the html output for id tags if the first step doesn't match. (Two-step approach because step one hits 90% and the results can be held in a local cache more easily than html output.) The code should be quite stable unless the redirecting page contains a crazy mix of different encodings of special characters. --Krd (talk) 07:07, 30 April 2011 (UTC)[reply]
tools:~dispenser/cgi-bin/rdcheck.py (linked from What links here) It also supports generators so categories can be checked, or the thing I added to WP:FAC. It's now part of the checklist for FAC, though coaching may be needed. Human review is needed: 1) the section was spun off into its own article, 2) a bot "fixed" a double redirect, 3) similarly, redirects that were never pointed to any section. There's some good code in dabfix used for comparing similar links. — Dispenser05:53, 30 April 2011 (UTC)[reply]
Redesigning this project
As this project approaches 100 reports, there are some structural/design/maintability issues that should be addressed at some point. I put some notes at Wikipedia:Database reports/Redesign. If anyone has any comments/suggestions/etc., throw them on that page or this talk page or wherever. --MZMcBride (talk) 21:42, 30 April 2011 (UTC)[reply]
A caching system allowing user-initiated report regeneration along with automatic regeneration. Might allow for simple queries (e.g. contribs tool) if we can work in user-specified parameters. Run time histogram and statistics, so expensive reports aren't regenerated when no one is reading them. History of the past 5 reports and their size/row count. Support for multiple wikis and mirroring reports on wiki, useful for Page count by namespace. Include a dependence chain, e.g. report A requires an updated project_banner table done by report B. Custom CSS/JS, table filter, and sorting. Support for dump reports. Points system for games??? — Dispenser21:41, 5 May 2011 (UTC)
Notification options: Email, RSS, or a bot edit (think {{LSR}}) for when reports complete. A refresher script (similar to books) for < 30 minute reports with a graph of previous completion times. — Dispenser21:00, 3 June 2011 (UTC)[reply]
Some from an earlier design document: configurable .ini-like python file (i.e. Framework approach), statistics query run time, JS highlighter (they're faster), Form fields have auto-complete and regex validation is applied client and server side.
Controls programs (themselves reports): report data file viewer, report directory list, activity stream, and administration. — Dispenser22:16, 2 September 2011 (UTC)[reply]
It keeps getting killed for running so long, I think. It's a very expensive query. The report probably needs to be re-implemented using smarter logic.
One idea I had was to only check pages that have been edited in the past 30 days in the article namespace (using the recentchanges table). But even that is about a million queries. I'll have to give more thought about how to get this information efficiently. Dispenser, Svick, et al.: Do you have any ideas? --MZMcBride (talk) 04:32, 30 May 2011 (UTC)[reply]
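The narrowing step described above, as a sketch (standard recentchanges table): distinct articles edited in the last 30 days, to use as the candidate set for the per-page checks.
SELECT DISTINCT rc_cur_id
FROM recentchanges
WHERE rc_namespace = 0
  AND rc_cur_id != 0
  AND rc_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%s');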
You could try to split the query into two: one for User and the other for User talk namespace. But the most likely outcome would be that it's going to be even slower. Another option would be to do the distinct part in memory. (I'm assuming the pagelinks query is the slow one, not the templatelinks one.) User<Svick>.Talk();20:12, 30 May 2011 (UTC)[reply]
Gadget statistics
/* Gadget statistics - run time 1.5 hrs SLOW_OK */
SELECT up_property, COUNT(*)
FROM user_properties_anonym
WHERE up_property LIKE "gadget-%"
AND up_value = 1
GROUP BY up_property
With the same record keeping purpose as Page count by namespace, I propose the creation of an installed gadgets report (see previous: WT:Gadget#Usage-Stats). Other interesting properties the table includes are: gender, imagesize, diffonly, blank edit summary warning, and the recent change/watchlist flags for bot and minor edits. We may also want to add derived columns such as user_active and a binned edit count. — Dispenser00:02, 26 May 2011 (UTC)
Skin notes: 0=default, 1=nostalgia, 2=cologneblue; amethyst was introduced in 1.4 and removed in 1.5; the majority of users don't have a skin set and are not counted, thus they account for the difference between the skin total and the current number of users. — Dispenser05:32, 11 July 2011 (UTC)
Hmm, you actually want a dump report. In order to do something like this, you need to scan the wikitext of every article (from a database dump) and look for that particular string using regular expressions. Of course then you also have to account for redirects and other bizarre shit that can appear within the templates themselves (like sub-templates and magic words and parser functions).
I was going to do a similar database report (for repeating newlines, actually). I can probably just do this one when I'm doing the other one. If you're bored, it'd be helpful to have a list of strings to test a regular expression against. I'll put some examples below so you can see what I'm talking about. --MZMcBride (talk) 00:03, 2 June 2011 (UTC)[reply]
Just so I don't forget, what's really needed here is a generic function that can take wikitext input and output pairs of "grabbed templates" and their respective positions within the wikitext. It needs to be able to catch odd cases like {{foo|bar={{baz}}}}, but once it reasonably can, it should be trivial to take each position, subtract it from the subsequent one, and then output only templates that are really close to each other (matching on template name too, of course). A generic template_grabber function would resolve this report and be very helpful for other projects (including the next step, which is taking the grabbed templates and then parsing their parameters). This isn't very complicated, but it will require a few hours of dedicated development time to do properly. --MZMcBride (talk) 01:33, 2 June 2011 (UTC)[reply]
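A hedged sketch of such a template grabber (an approximation only: it counts brace pairs to handle nesting like {{foo|bar={{baz}}}}, but ignores <nowiki>, comments, and {{{parameters}}}):
def grab_templates(wikitext):
    # Returns (template_text, start_offset) pairs for top-level {{...}} blocks.
    results = []
    depth = 0
    start = None
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == '{{':
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == '}}' and depth > 0:
            depth -= 1
            if depth == 0:
                results.append((wikitext[start:i + 2], start))
            i += 2
        else:
            i += 1
    return results

# grab_templates('a {{foo|bar={{baz}}}} b {{quux}}')
# -> [('{{foo|bar={{baz}}}}', 2), ('{{quux}}', 24)]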
AWB uses the regex {{(?>[^\{\}]+|\{(?<DEPTH>)|\}(?<-DEPTH>))*(?(DEPTH)(?!))}} to match top-level templates, that can have other templates nested inside them. Although I never understood how it works, and it's probably .Net-specific. User<Svick>.Talk();22:07, 2 June 2011 (UTC)[reply]
Hey MZMcBride. I'm not sure if this is exactly what you were looking for, but I wrote this template parser a while back to parse templates. It can handle an infinite number of nested templates (ect:{{foo|bar={{baz|foo={{bar}}}}}}). It can also handle embedded images and wikitables. The code is below. Hope it helps at all. Tim1357talk22:49, 20 June 2011 (UTC)[reply]
Template Parser
spaces = ' \t\n\r'
chars = ['a','b','c','d','e','f','g','_','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9']

class Template():
    def __init__(self, text=None):
        '''The parameter "text" is a string that starts with "{{". The parser will automatically find the end of the template'''
        self.start = '{{'
        self.end = '\n}}'
        self.wrapper = lambda g, p=None: '\n | %s = %s' % (p, g) if p != None else '\n | %s' % g
        if text != None:
            self.text = text
        try:
            self.text = self.text.strip()
        except:
            raise TypeError('The Class: "Template" requires text be passed to it.')
        self.r = None
        self.cursor = 0
        cursor = 0
        m = len(self.text) - 1
        while 1:
            if '|' == self.text[cursor]:
                self.standard()
                if self.r != None:
                    self.r = iter(self.r)
                    te = []
                    for p, g in self.r:
                        if p != None:
                            p = p.strip()
                        te.append((p, g.strip()))
                    self.r = te
                    del te
                break
            elif self.text[cursor:cursor+2] == '}}':
                break
            cursor += 1
            if cursor >= m:
                raise TemplateSytnaxError('Could Not find Template Start')
        self.name = self.text[:cursor+1].strip('{}|')

    def standard(self):
        max = len(self.text)
        cursor = 0
        r = []
        try:
            while self.text[cursor] != '|':
                cursor += 1
        except IndexError:
            self.r = None
            return
        cursor += 1
        last = cursor
        while cursor < max:
            n, cursor = self.checkname(cursor=cursor)
            last = cursor
            while True:
                if self.text[cursor:cursor+2] == '{{':
                    cursor = self.untill(cursor=cursor+1)
                elif self.text[cursor:cursor+2] == '[[':
                    cursor = self.untill_img(cursor=cursor+1)
                elif self.text[cursor:cursor+2] == '}}':
                    r.append((n, self.text[last:cursor]))
                    self.r = r
                    self.text = self.text[:cursor+2]
                    return
                elif self.text[cursor] == '|':
                    r.append((n, self.text[last:cursor]))
                    cursor = cursor + 1
                    last = cursor
                    break
                cursor += 1
                if cursor >= max:
                    break
            self.cursor = cursor
        raise 'No ending brackets!'

    def checkname(self, text=None, cursor=0):
        if text == None:
            text = self.text
        n = cursor
        l = cursor
        while text[cursor] in spaces:
            cursor += 1
        while text[cursor] in chars:
            cursor += 1
        n = text[n:cursor]
        while text[cursor] in spaces:
            cursor += 1
        if text[cursor] == '=':
            cursor += 1
            return (n, cursor)
        return (None, l)

    def untill_table(self, text=None, cursor=0):
        if text == None:
            text = self.text
        s = cursor
        st = 1
        while True:
            if text[cursor:cursor+2] == '|}':
                st -= 1
            elif text[cursor:cursor+2] == '{|':
                st += 1
            elif text[cursor:cursor+2] == '{{':
                cursor = self.untill(cursor=cursor+1)
            elif text[cursor:cursor+2] == '[[':
                cursor = self.untill_img(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def untill_img(self, text=None, cursor=0):
        if text == None:
            text = self.text
        s = cursor
        st = 1
        while True:
            if text[cursor:cursor+2] == ']]':
                st -= 1
            elif text[cursor:cursor+2] == '[[':
                st += 1
            elif text[cursor:cursor+2] == '{{':
                cursor = self.untill(cursor=cursor+1)
            elif text[cursor:cursor+2] == '{|':
                cursor = self.untill_table(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def untill(self, text=None, cursor=0):
        if text == None:
            text = self.text
        st = 1
        while True:
            if text[cursor:cursor+2] == '}}':
                st -= 1
            elif text[cursor:cursor+2] == '{{':
                st += 1
            elif text[cursor:cursor+2] == '[[':
                cursor = self.untill_img(cursor=cursor+1)
            elif text[cursor:cursor+2] == '{|':
                cursor = self.untill_table(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def old(self):
        return self.text[:self.cursor+2]

    def __dict__(self):
        return dict(self.r)

    def __getitem__(self, key):
        a = dict([(p.lower() if p is not None else None, g) for p, g in self.r])
        assert a.has_key(key.lower()), KeyError('\'%s\'' % key)
        return a[key.lower()]

    def __delitem__(self, key):
        '''Usage: del TemplateObject['key']'''
        for k, v in self.r:
            if k is None:
                continue
            if k.lower() == key.lower():
                self.r.remove((k, v))
                return
        raise KeyError('\'%s\'' % key)

    def __iter__(self):
        return iter(self.r)

    def __str__(self):
        if self.r == None:
            return self.text
        i = self.start + self.name
        for p, g in self.r:
            i += self.wrapper(g, p)
        i += self.wrapper(self.image, 'image')
        i += self.end
        return i
"Someone," heh. cron errors are usually caught at insert/update. I have an e-mail from cron that says that the query was interrupted. This was likely due some kind of hiccup or (more likely) the query ran too long and got killed. The e-mail output is below. Honestly, this report shouldn't be a database report at all. It's using a rather hackish system that results in a lot of false negatives, as I recall. It would make a lot more sense as a dump report. Maybe I'll get around to rewriting it at some point. "Dump reports" are currently halted due to some kind of horrible speed issue in the current rewritten script, though. Until the speed problems can be addressed (in a demonstrable way), there won't be any new dump reports. Fun fun fun fun (lookin' forward to the weekend). --MZMcBride (talk) 15:16, 16 June 2011 (UTC)[reply]
Your "cron" job on willow
PYTHONPATH=$HOME/scripts python $HOME/scripts/database-reports/directstubs.py > /dev/null
produced the following output:
Traceback (most recent call last):
File "/home/mzmcbride/scripts/database-reports/directstubs.py", line 72, in <module>
''')
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1317, 'Query execution was interrupted')
Hi Guys,
I made a SQLite3 database that indexes all the infoboxes used in the article space. I was finding that I was doing quite a bit of work with the API just to find a parameter in an infobox, so I decided to create a database to save time and bandwidth. I was wondering if you think the database (about 60/90 MB with/without indexes, uncompressed) would be helpful here. If you think it might come in handy, I'll upload it to the MySQL database. Cheers. Tim1357talk18:44, 25 June 2011 (UTC)[reply]
Oh, here's the schema (if it matters):
sqlite> .schem
CREATE TABLE InfoboxParameters (Infobox INT, Arg Varbinary, Value Varbinary, Article Varbinary);
CREATE TABLE Infoboxes (Name Varbinary,ID int unsigned, PRIMARY KEY(ID));
CREATE INDEX Infobox_Arg_Index on InfoboxParameters (Infobox,Arg);
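An example of how the schema above could be queried (the infobox and parameter names here are purely illustrative):
SELECT Article, Value
FROM InfoboxParameters
JOIN Infoboxes ON Infoboxes.ID = InfoboxParameters.Infobox
WHERE Infoboxes.Name = 'Infobox settlement'
  AND Arg = 'population_total';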
Could we add a report on the largest WikiProjects by the number of articles tagged with a project's banner? The data for each project is available on the toolserver when "All" is clicked but there is no option to sort the projects by number of articles. Once each month or two would probably be frequent enough for this task. This report would be useful for size comparisons and descriptions in the Signpost's WikiProject Report. -Mabeenot (talk) 23:56, 26 June 2011 (UTC)[reply]
Whoops, it looks like I was using the wrong (out-dated) database. I think the one in use now is u_enwp10. I don't have read-access to that database right now, but I asked CBM (the maintainer of the project) for access. Tim1357talk14:46, 14 July 2011 (UTC)[reply]
The Wikipedia:Database reports/Redirects containing red links report is flooded with talk pages that contain templates based on the {{WPBannerMeta}} meta-template, which automatically create links to "/Comment" subpages whether the subpage exists or not. In other words, Talk:TOPIC ends up in the report because it "contains" a red link to Talk:TOPIC/Comments, although the redlink actually only exists in the template code and wouldn't be visible to readers even if the talk page were not a redirect. Suggested tweak: exclude any link to a page in namespace 1 whose title ends in "/Comments" from the report. --R'n'B (call me Russ) 14:06, 29 June 2011 (UTC)[reply]
User talk pages which redirect to another namespace
Just what it sounds like. Maybe monthly? Some new users move their user pages (and associated talk pages) to the mainspace when they are done drafting an article in the userspace. However, that means that notes left by automated tools like Twinkle at the user's talk page actually end up redirected to the article talk page. These user pages need to be un-redirected and having a list of them should make that easy. –Drilnoth (T/C) 21:48, 20 July 2011 (UTC)[reply]
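A sketch of the suggested report (standard page/redirect tables assumed): user talk pages that redirect outside the User talk namespace.
SELECT page_title, rd_namespace, rd_title
FROM page
JOIN redirect ON rd_from = page_id
WHERE page_namespace = 3
  AND page_is_redirect = 1
  AND rd_namespace != 3;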
Inactive users report
Is there any reason why some usergroups are listed twice? For example:
PC has ended, but it has not been turned off and there is no guarantee that it will not be used again. I think this is probably a good report to keep until the future of PC is clear. That said, we could probably move the updating to once or twice a month until that time. --Izno (talk) 05:23, 27 July 2011 (UTC)[reply]
Re-used image names
I would like a list of filenames -- both on en.wikipedia and on Commons - that have more than one image attached to them due to accidental overwriting, so as to figure out which ones need splitting into separate images.
This will probably generate a huge amount of false positives; it can be reduced by eliminating images which are uploaded by authorized image-modification bots like RotateBot, and by eliminating later images uploaded by the original uploader. It will still, however, require human attention to tag them for splitting. DS (talk) 01:08, 30 July 2011 (UTC)[reply]
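A possible starting point, hedged and run per wiki (oldimage holds superseded file versions; known maintenance bots would still need to be filtered out, and false positives will remain):
SELECT img_name, COUNT(*) + 1 AS versions
FROM image
JOIN oldimage ON oi_name = img_name
WHERE oi_user_text != img_user_text
GROUP BY img_name;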
If so, then where is it? Also, hashes won't allow us to detect if (image 2) is a cropped version of (image 1), will they? DS (talk) 18:27, 30 July 2011 (UTC)[reply]
Pages which have been semi-protected the longest
I'm not sure if this is possible, but what would be helpful would be a list of semi-protected pages by how long they have been semi-protected. It might be useful to see if those which have been semi-protected the longest could afford to be unprotected. –MuZemike16:31, 2 August 2011 (UTC)[reply]
We have this for articles (also here) and talk pages; if you sort by timestamp, you can view them chronologically. Avicennasis @ 22:02, 4 Av 5771 / 4 August 2011 (UTC)
Images identified as lacking NFUR
Would it be possible to add {{Has-NFUR}} to the list of templates on an image which the report skips over?
Couldn't most of those be converted to using templates? I'm not opposed to the new template, I'm just wondering if it is redundant. –Drilnoth (T/C) 16:28, 24 August 2011 (UTC)[reply]
Inactive bots
Further to Wikipedia talk:Bot Approvals Group/Archive 7#Bots & long term inactivity, could we have a report updated once monthly or quarterly that lists accounts with the 'bot' flag that have no edits or logged actions in over 1 year, sorted by activity with longest inactive at the top? (Am trying to think if there is an easy way to identify read-only bots that should not have any edits or logged actions, will let you know if I come up with anything.)
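A sketch of the request (logged actions would need a separate check against the logging table; this only looks at edits):
SELECT user_name, MAX(rev_timestamp) AS last_edit
FROM user
JOIN user_groups ON ug_user = user_id AND ug_group = 'bot'
LEFT JOIN revision ON rev_user = user_id
GROUP BY user_id, user_name
HAVING last_edit IS NULL
    OR last_edit < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 YEAR), '%Y%m%d%H%i%s')
ORDER BY last_edit;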
It got killed by the query killer, I imagine. Or there was a hiccup somewhere. Does it matter? I looked through a few of the pages in the list. There's plenty of work to be done (lots of cases where the coords overlap exactly). --MZMcBride (talk) 18:53, 17 September 2011 (UTC)[reply]
Oh, no. The Toolserver has been more overloaded lately, so the Toolserver administrators have implemented (or re-implemented, I guess) a "query killer." I haven't written much about it anywhere, but it's impacting half a dozen or so reports. If the replication lag on the server increases, the query killer gets more aggressive. Some very slow queries in some reports are now being killed before completing (emptycats.py, userlinksinarticles.py, templatedisambigs.py, etc.). Some of these reports will probably have to be rewritten. A few may have to be disabled. I'm not sure how many people realize how many reports there are nowadays (over 100 at this point, I think). It's a bit annoying to maintain all of them by myself. :-) --MZMcBride (talk) 21:52, 17 September 2011 (UTC)[reply]
For what it's worth, getting a Toolserver account was pretty much how I learned SQL. There's no better place, really. You have access to small and large constantly updating replicated databases that are set to read-only, so you can't break them even if you wanted to. It's a great place to play around with syntax, learn conventions, etc. I wrote some docs at tswiki:Queries and tswiki:Database access at some point that are helpful for new users. (In general, I think there's a somewhat glamorized view of Toolserver users. Most of them have no idea what they're doing.)
All of this is largely unrelated to database reports, though. They're run from the Toolserver, but the master copy of the reports code is on GitHub. A few people have access to the repo right now (Svick and maybe bjweeks?), but it'd be nice to have more people on it, assuming they won't break anything. :-) --MZMcBride (talk) 22:05, 17 September 2011 (UTC)[reply]
The status of this backlog has changed radically in the past few weeks, thanks to my colleague Redrose64. I'd like this report to become more frequent now. Perhaps twice a week? —Stepheng3 (talk) 00:41, 9 November 2011 (UTC)[reply]
Stepheng3: With the number of rows decreasing per report, I'd actually be a bit more inclined to make the updates fortnightly or something, to increase the value of an update. But for now, I think weekly is fine. The whole point is kind of moot if it's possible to get a dynamic list, which, given that this is just the intersection of two queries, is really not very difficult. catscan seems a bit slow, but there's no rush. :-) --MZMcBride (talk) 03:59, 10 November 2011 (UTC)[reply]
Having such a report would help speed up what I am attempting to do from the toolserver list by not including media description pages which do need to be local.
I noticed that some monthly reports show the last updates in April and July. Looks like something is slipping through the cracks. Also long stubs shows as having been updated a few days ago, but all I get on the wiki is the October 21 version. Vegaswikian (talk) 19:20, 31 October 2011 (UTC)[reply]
SPA Account edits
Can a list of articles which are more than 4 months old for which editors that have edits only to that article make up more than 1/3 of the total editors of the article be made? Smallman12q (talk) 01:53, 12 December 2011 (UTC)[reply]
You could make the scan more general to find two identically (or almost identically) named sections on image files. It would probably require a dump scan...but it seems like the script for the scan should be very short. Smallman12q (talk) 04:00, 24 December 2011 (UTC)
I have recently learned about WP:DBR. I saw Wikipedia:List of Wikipedians by article count, which is not currently included at DBR, and wondered if it would be worth producing similar reports for template space, image space, category space and project space. Also, since the report ignores redirects (and there are more redirects than non-redirects in mainspace), maybe a redirect list could be produced too. I think these reports could be produced weekly and listed at DBR. I have tried asking about this at WP:VPT and WP:HELPDESK.--TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 15:09, 23 December 2011 (UTC)