User:WhatamIdoing has expressed a desire for data regarding popular WikiProjects, and those data interest me also. User talk:Mr.Z-man/Archive 12#Popular WikiProjects? has a record of the request. I would like to see a monthly report (of page views or edits or watchers; preferably one report for each of those three variables). I have added this talk page to my watchlist, and I will watch for a reply (or replies) here.
—Wavelength (talk) 19:25, 27 October 2010 (UTC)[reply]
Earlier this year the community introduced a new form of deletion for new unreferenced BLPs, but there's a concern that we aren't tagging as many as we could. Could we please have the following regular report:
Red-linked categories with significant incoming links
In light of this discussion, would it be possible to generate a report listing red-linked categories (i.e., deleted or never created) which have significant incoming links (i.e., from any namespace except User: and Wikipedia:)? Thank you, -- Black Falcon(talk)17:31, 14 November 2010 (UTC)[reply]
I think Special:Wantedcategories generally updates regularly, which is why this report was never created, although that report only lists the thousand most populated categories. If someone were to go through and either bluelink or de-link the first couple hundred categories depending on their specifics, however, you would likely get everything with 2 category members or more. VegaDark (talk) 01:35, 15 November 2010 (UTC)[reply]
Ah, incoming links, for some reason I was thinking category members. Nope, guess there isn't a report for that. Although I bet there will be a ton of redlinked crap categories we've used in examples in deletion discussions over the years. VegaDark (talk) 10:06, 15 November 2010 (UTC)[reply]
I hadn't thought of those, but you're right: there will be a lot of those, as well as links from CfD nominations and notifications to users. I don't know if it is possible to generate a report that would ignore incoming links from the User: and Wikipedia: namespaces but, if it is possible, it would help to focus the list. -- Black Falcon(talk)17:06, 15 November 2010 (UTC)[reply]
So you want red-linked categories (with or without members) with incoming links from namespaces other than Wikipedia (4) or User (2)? What should the report title be? How often should it update? The more detail you can provide, the less thinking I have to do (and thus it's much easier for me to knock this out in a few minutes). :-) --MZMcBride (talk) 04:58, 23 February 2012 (UTC)[reply]
The more I think about it, the more I think that you actually want the inverse here regarding incoming links. That is, you don't want cases where there are incoming links from all but certain namespaces (page_namespace NOT IN (1,2,3,4,5,7,9,11,13,15,17,19,101,103,105)), you want cases where there are incoming links from certain namespaces (page_namespace IN (0, 6, 10, 14)), right? If I created a list of red-linked categories and put it somewhere in the Wikipedia namespace, it would screw up any subsequent reports, as they'd all have links from a page in the Wikipedia namespace, using your logic. But that's not what you're looking for, is it? You're looking for cases where there are red-linked categories from articles, files, templates, or other categories... I think? --MZMcBride (talk) 23:25, 24 February 2012 (UTC)[reply]
P.S. I kind of feel terrible now after realizing that this request has been sitting here for over a year and it took like ten minutes to complete. I'm the worst.
Your work and assistance are greatly appreciated, no matter the time it takes. :) The report looks perfect; thank you!
You're right about the logic of inclusion/exclusion, since my interest is in links from certain namespaces rather than all links except those from certain namespaces; ultimately, we're excluding more namespaces than we're including. The namespaces of interest would be the main namespace (0) and the file (6), template (10), help (12), category (14) and portal (100) namespaces. I suppose that links from the MediaWiki (8) and book (108) namespaces should be updated, too, but that's a theoretical consideration since pages in those namespaces tend not to link to categories. -- Black Falcon(talk)19:24, 5 March 2012 (UTC)[reply]
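For reference only, a minimal sketch of what such a query might look like (standard enwiki pagelinks/page tables assumed; not the query actually used for the report):
SELECT pl_title, COUNT(*) AS incoming_links
FROM pagelinks
JOIN page AS source ON source.page_id = pl_from
LEFT JOIN page AS target ON target.page_namespace = 14 AND target.page_title = pl_title
WHERE pl_namespace = 14
  AND target.page_id IS NULL  -- the category page is red-linked
  AND source.page_namespace IN (0, 6, 10, 12, 14, 100)  -- article, file, template, help, category, portal
GROUP BY pl_title
ORDER BY incoming_links DESC;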
By the way, is it possible also to include category pages that contain no members – e.g., ones that have been deleted and emptied? -- Black Falcon(talk)19:27, 5 March 2012 (UTC)[reply]
Dunno. This report's logic has twisted my mind pretty badly.
It's my fault since, looking at the report now, I realize that I had not taken into account two key factors.
The report that I had in mind originally was of category pages which are red-linked and empty and have incoming links from namespaces 0, 6, 10, 12, 14, or 100. That report would have been useful in removing links to categories which should not have any incoming links.
This report is of category pages which are red-linked and populated and have incoming links from the above namespaces. It is arguably more useful since it identifies categories that need to be created (and properly subcategorized) or emptied – a task that is more important than link removal. I will work on the report over the next few days and provide feedback here.
(unindent) Perhaps having a "Members" column would be good? That way you could easily sort the report by ones that are empty (0 members) and ones that aren't. Thoughts? --MZMcBride (talk) 22:00, 17 March 2012 (UTC)[reply]
P.S. Just for my reference, this report takes 45 minutes to run currently on willow.
List of longest 'DISCUSSION' content.
Hi,
Is it possible to get a list of (Wikipedia-) articles which have the largest 'Discussion' content?
On these 'Talk:' pages you also see the number of archives; for example the article about Barack Obama has 71 archives, and will probably end up high in the list.
Thanks in advance. Kyra —Preceding unsigned comment added by 82.171.75.225 (talk) 10:55, 17 November 2010 (UTC)[reply]
Looks pretty good. Might be possible to limit to pages with "rchive" in them and then add the root page for a more precise count of actual discussion, but your method is probably about as accurate, relatively.
One thing. Perhaps it might be better to put the column in megabytes? For example, 40 230 kilobytes = 39.2871094 megabytes (according to Google); I think the megabytes figure is much easier to understand and appreciate.
Okay, let's ignore Titleblacklist then. Can we at least have a database report of all redirects from the file namespace with one or fewer incoming links stored somewhere (e.g. here)? :| TelCoNaSpVe :|22:00, 7 December 2010 (UTC)[reply]
On this version of the report, there is an entry for Head of the River (Victoria), with the entry noting 8 files in use. Yesterday, I removed [1] more than 400 uses of non-free files from this page. The report is showing the number of files, not the number of uses. This particular article has gone largely unnoticed for three years because "8" uses is not a high number. Yet this article is certainly one of the highest, if not the highest, abusers of non-free content we've ever had on this project.
I'd like to see this report modified, or a new report created, to show how many non-free file uses there are, rather than the number of non-free files used. --Hammersoft (talk) 14:18, 20 October 2010 (UTC)[reply]
The database only tracks binary usage data. This is also true of templates, categories, page links, and external links. It might be possible to add a column to this report, but it will not be possible to get number of uses for all pages. That might result in some false negatives: if a page is using only one non-free file, but uses it a few hundred times, it won't be listed.
In order to count uses, a regex will have to be run on the page text, I think. This will need to account for templates and other such things. I have an idea of how to do this in my head, but it might be a while before I get around to it. --MZMcBride (talk) 18:52, 20 October 2010 (UTC)[reply]
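As a rough illustration of that regex approach (a sketch only, covering direct [[File:]]/[[Image:]] syntax; it does not handle templates, so it is not the eventual implementation):
import re

def count_file_uses(wikitext, filename):
    # Counts direct [[File:...]] / [[Image:...]] links in raw wikitext,
    # treating spaces and underscores as equivalent. Uses that arrive via
    # template transclusion are missed.
    parts = filename.replace('_', ' ').split(' ')
    name = r'[ _]+'.join(re.escape(part) for part in parts)
    pattern = re.compile(r'\[\[\s*(?:File|Image)\s*:\s*' + name, re.IGNORECASE)
    return len(pattern.findall(wikitext))

# count_file_uses(page_text, 'Example.jpg')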
The Talk pages by size MfD
The Talk pages by size MfD raised a lot of good issues and ideas. A main problem is that the database reports have very little information about them on the page, which has led to confusion (as brought out in the MfD). Most of the database reports pages are not categorized. In response to the MfD, I modified the Wikipedia:Database reports/Talk pages by size page text using posts from the MfD. I also created Category:Wikipedia database reports as a place to categorize all the pages listed at Wikipedia:Database reports and other database reports pages wherever they may be. Initially, I thought of using {{infopage}} on the top of the database pages, but that template didn't fit because the database report pages do not describe communal consensus on some aspect of Wikipedia norms and practices. I added a generic one to Talk pages by size instead. In view of the talk page tagging with the admonishment "Perhaps this will motivate greater efficiency in the use of kilobytes," you may want to create a top template to distinguish a utility report from a statistical report. I added text to the top of Wikipedia:Database reports to address this. You may further want to expand the text of Wikipedia:Database reports to provide some Database reports consensus control over the use of utility reports and statistical reports. -- Uzma Gamal (talk) 16:54, 25 November 2010 (UTC)[reply]
The MFD was a complete waste of time and offered (almost) nothing of value, except to point out once again that some people posting here (on Wikipedia) are clueless. The category additions (and other additions) are going to be overwritten the next time the reports update. A better approach to the intro text of the reports is needed, but there are a lot of other projects I'd rather work on. --MZMcBride (talk) 22:49, 25 November 2010 (UTC)[reply]
Would it be possible to generate a report of "/doc" subpages in the Template namespace for which the main template page does not exist or is a redirect? Thanks, -- Black Falcon(talk)05:13, 29 November 2010 (UTC)[reply]
I've tried it for other types of subpages; it isn't useful. There are navboxes with titles like 1984/85_X_game that are still transcluded, and <noinclude> used to not work with the preload function so those templates were left uncategorized. We could improve Wikipedia:Database reports/Unused templates so that it's more useful... — Dispenser18:31, 5 December 2010 (UTC)[reply]
I was wondering if Wikipedia:Database reports/Unused templates could possibly be changed so that templates which transclude {{substituted}} within them or within their /doc are either not included, or are separated into a separate section entitled "Substituted Templates" or something similar? This would make it much easier to see which templates intentionally have no transclusions (because they are supposed to be substituted) and those which are just unused. Thanks! Mhiji (talk) 18:23, 12 December 2010 (UTC)[reply]
Templates are more flexible than categories. They allow content to be included or not, they allow for easier redirects and renames, etc. Use a template. It doesn't matter if you use {{transclusionless}} or {{substituted}}. Either will work. --MZMcBride (talk) 02:15, 13 December 2010 (UTC)[reply]
Request: List of articles found in every language but English
I started an article recently and then found there were already 20+ other versions on other language Wikipedias. This got me wondering what other articles exist in multiple other languages but don't appear in English. Has anyone tried to make a list of topics which have not been translated into English, sorted by the number of other language Wikipedias they appear in? Or is anyone willing to try making such a list? Thanks in advance. —Pengo01:41, 25 December 2010 (UTC) (Crossposted from Wikipedia:Village_pump_(idea_lab)#Every_language_but_English.3F). And if you need more specifics, I'd be happy to flesh out the idea.[reply]
I worked on this for 2 hours and I don't have much to show for it. There are a lot of Wikipedias (269), some of which are really small (the Cheyenne Wikipedia has 57 articles). Also, on the Toolserver, the Wikipedia databases are served across 3 separate servers, which makes querying difficult. If you (MZMcBride) want to see my work, I've included the query below. Good luck! --05:04, 28 December 2010 (UTC)
Err, aren't you assuming that each wiki would have the article with the same title? Surely other language Wikipedias will have localized titles like "World War II".... You need to focus on the langlinks table. --MZMcBride (talk) 05:05, 28 December 2010 (UTC)[reply]
I had already posted the query at the village pump and concluded that it would not work since 1) the interwikis aren't kept in good condition and 2) other languages use different page structure. On DE Quantum physics and Quantum mechanics are separate while on EN they are covered in the same article. — Dispenser05:45, 28 December 2010 (UTC)[reply]
I thought of approaching this as finding the “biggest interwiki networks that don't include enwiki”, but I think that would be difficult or impossible to do with just SQL. Svick (talk) 11:12, 28 December 2010 (UTC)[reply]
I created Spanish pond turtle yesterday for just that reason (having 7 other language pages). I think you need to look at the bigger other-language Wikipedias (.es, .de) and check whether an article has no .en link but has many to others. Regards, SunCreator(talk)14:36, 28 December 2010 (UTC)[reply]
Re interwiki links not being in good condition. My experience with The death anomalies project has been quite the opposite - interwiki link errors do exist, but are rare. I think this sort of report would generate a useful list of articles that we potentially should have, and where instead it identifies that we are missing an interwiki link, then that is also useful info. So yes it would be worth doing this by checking for articles on DE, FR etc that have large numbers of interwiki links but not an EN one. Whether the result is an extra article or an added interwiki link to an existing article doesn't matter as both are positive outcomes. ϢereSpielChequers17:25, 28 December 2010 (UTC)[reply]
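As an illustration of that per-wiki approach (a sketch against one large wiki such as dewiki, using the standard page/langlinks tables; not an existing report's query):
SELECT page_title, COUNT(*) AS language_links
FROM page
JOIN langlinks ON ll_from = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND NOT EXISTS (SELECT 1 FROM langlinks AS en WHERE en.ll_from = page_id AND en.ll_lang = 'en')
GROUP BY page_id, page_title
ORDER BY language_links DESC
LIMIT 500;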
Is it possible to have a report that searches for media tagged {{mtc}} (and related)
but which don't have a recognisable license, or one that's incompatible with Commons (such as fair-use)?
Report to run on a daily or weekly basis depending on the volume of media concerned?
I don't know what "recognisable license" means. I also don't know which licenses are incompatible with Commons. Is there a category for such a thing? --MZMcBride (talk) 02:00, 16 January 2011 (UTC)[reply]
I have been running into bots on the list from the jira:DBQ-87 query. I don't think bots are allowed to create articles except in special cases, but I've been seeing them. Thanks again. - Hydroxonium (H3O+) 23:50, 18 January 2011 (UTC)[reply]
I maintain a database that contains page_id, page_creator pairs. So rather than querying each page, you can just look up the stored info. It's at u_mzmcbride_enwiki_page_creators_p. This query uses that database, but it got interrupted today. This really is too obnoxious to put into one query. It needs to be split out into multiple parts using some sort of sane logic. I may do this tomorrow if I get a chance. --MZMcBride (talk) 07:22, 19 January 2011 (UTC)[reply]
I actually saw that when I was writing the query but didn't use it because the table is indexed so that it is optimized for pageid->creator, instead of the other way around. This means that MySQL has to do a table scan in order to find each page that the user created. Maybe you could create an index on top of the table to help with this? CREATE INDEX user_creations ON u_mzmcbride_enwiki_page_creators_p.page (page_creator); or something would do the trick. Tim1357talk04:01, 20 January 2011 (UTC)[reply]
I don't know what you're talking about. The schema looks like this:
I don't really have the time to do this right now, but I think the best method is to do a query to get all the page creators with over fifty article creations (should take about half an hour), then do a Pythonic set comparison with the bots/autopatrollers/sysops and then check last edit/first edit for each user using functions/individual queries. That's how I mapped it out in my head, at least. I probably won't get around to this for a while, though. --MZMcBride (talk) 04:09, 20 January 2011 (UTC)[reply]
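A very rough sketch of that multi-step approach (assuming Toolserver-style replica access and the page creators helper table mentioned above; the connection details and the 'autoreviewer' group name are assumptions, and this is not the script that was eventually used):
import os
import MySQLdb

conn = MySQLdb.connect(db='enwiki_p',
                       read_default_file=os.path.expanduser('~/.my.cnf'))
cursor = conn.cursor()

# Step 1: page creators with more than fifty article creations.
cursor.execute('''
SELECT page_creator, COUNT(*)
FROM u_mzmcbride_enwiki_page_creators_p.page
GROUP BY page_creator
HAVING COUNT(*) > 50
''')
creators = set(row[0] for row in cursor.fetchall())

# Step 2: set comparison -- drop accounts that are already bots, sysops, or autopatrolled.
cursor.execute('''
SELECT user_name
FROM user
JOIN user_groups ON ug_user = user_id
WHERE ug_group IN ('bot', 'sysop', 'autoreviewer')
''')
flagged = set(row[0] for row in cursor.fetchall())
candidates = creators - flagged

# Step 3 (not shown): per-user queries for first/last edit timestamps, keeping
# only accounts that are old enough and still recently active.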
Thanks very much for the help, MZMcBride. Yes, please split up the query if that will ease the task. If it ends up being too obnoxious, I would be fine just running the original jira:DBQ-87 query again. Then I can manually go through the list and check for the other things (6 months old, activity in the last 30 days, etc.). That's what I have been doing with the old list. It's just extremely labor intensive. I've spent over 20 hours manually sifting through that old list and I'm starting to go bonkers. Ughhh. Anyway, thanks again. I appreciate the help.- Hydroxonium (H3O+) 08:27, 19 January 2011 (UTC)[reply]
I don't know SQL or anything about the Wikipedia database, but I have been reading up on stuff. So this is likely to be an annoying question that can be ignored, but I'll ask it anyway. Is user_touched a field in the database, and could that be used to check for recent activity? Thanks again for the help. - Hydroxonium (H3O+) 08:58, 19 January 2011 (UTC)[reply]
HJ Mitchel, that won't slow it down much. We could just add AND MAX(rev_timestamp)>DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 90 DAY),'%Y%m%d%H%i%s') to the HAVING statement. Tim1357talk04:17, 20 January 2011 (UTC)[reply]
(unindent)
Hi guys. Thanks very much for working on this report. Could this be run monthly? It would help the guys assigning permissions if the workload came in smaller batches more frequently rather than one big batch. Thanks again. - Hydroxonium (H3O+) 18:11, 20 January 2011 (UTC)[reply]
One more. This is probably difficult, but would it be possible to search for and remove users that were declined in the previous month? This might be more appropriate for a bot to do, maybe. Thanks much. - Hydroxonium (H3O+) 18:20, 20 January 2011 (UTC)[reply]
Sorry, I wanted to tweak it a bit before posting here, but I didn't find the time until now. Specifically, I was trying to remove those that didn't create any article recently, but that caused the report to run too long. But I removed those that were declined this or last month.
MZMcBride, this is exactly the reason I don't want to take over – I don't always have the time.
The idea behind a multi-maintainer project is that hopefully the load could be distributed between people, allowing for a faster response time for new reports/queries. I think of it as a long-term goal, though. Nothing needs to be done this instant. --MZMcBride (talk) 23:03, 26 January 2011 (UTC)[reply]
use {{fullurl:}} instead of [http://en.wikipedia.org... ] for portability to other wikis and better links on the secure site (this also removes the need for Python's urllib module)
you want ~~~~~, not ~~~~ (it looks like this was already fixed)
if you're limiting the number of output results, say so in the report description text
if you're doing other checks (such as file protection status), say so in the report description text
"annother" is misspelled in the report description text currently
don't hardcode database names and host names in the configuration
don't hardcode namespace names (even if they are canonical, use toolserver.namespacename)
Other than that, looks good. It'd be nice to get a multi-maintainer project for these reports going at some point. I'll look into that at some point. --MZMcBride (talk) 05:33, 24 January 2011 (UTC)[reply]
Awesome. :D Thanks for the quick response. Last minor nitpick is that you're still importing urllib (when it isn't being used any longer), but that's no big deal. Thanks for helping out! I really appreciate it. --MZMcBride (talk) 06:00, 24 January 2011 (UTC)[reply]
Although the updated report is broken: You need to have a newline after table row delimiter |- and you need to add ns_id=6 to the SQL, so that it actually uses the File namespace. Svick (talk) 19:26, 26 January 2011 (UTC)[reply]
One more thing: I think you want to test separately that each of the images is not protected (i.e. using two left joins). This way, each duplicate pair will show up only once. Svick (talk) 13:52, 28 January 2011 (UTC)[reply]
User subpages for users indef blocked for spamming
There are many users that have been indefinitely blocked for spamming, for using Wikipedia for promotional purposes, or for having a promotional username. Quite often these users have subpages in their userspace that contain promotional material that hasn't been deleted. This material can show up in Google searches - which is probably why spammers put it there. Would it be technically possible to have a database report that lists these pages? If so, would others find this useful enough to be worth the effort? Peacock (talk) 17:46, 23 June 2010 (UTC)[reply]
Probably worth the effort. Do you have an example subpage (that hasn't been deleted) so that I can verify the results of any queries? --MZMcBride (talk) 17:53, 23 June 2010 (UTC)[reply]
That user was not blocked for spam, so it's not really a helpful example. I wrote the query below, but for some reason it does not work. Perhaps MZMcBride will be clever enough to write a better query. Tim1357talk01:18, 26 June 2010 (UTC)[reply]
I assumed PCock wanted current blocks, not all block actions, so I checked against the ipblocks table. It's an insanely slow query, though. Might be better to do it in sets in a programming language. I imagine you'd want to check the ipb_reason field for a few strings like "usernamehardblocked", "spam", etc. --MZMcBride (talk) 02:10, 26 June 2010 (UTC)[reply]
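A sketch along those lines (illustrative only, not the query used here; the ipb_reason strings are guesses): user subpages whose root user is currently indefinitely blocked with a spam-related reason.
SELECT page_title
FROM page
JOIN ipblocks ON ipb_address = REPLACE(SUBSTRING_INDEX(page_title, '/', 1), '_', ' ')
WHERE page_namespace = 2
  AND page_title LIKE '%/%'
  AND ipb_expiry = 'infinity'
  AND (ipb_reason LIKE '%spam%'
       OR ipb_reason LIKE '%promotional%'
       OR ipb_reason LIKE '%usernamehardblocked%');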
Hmm. I think this is one query that is not going to be done gracefully. I ran a new query that is set to save here when it finishes. (Note each page title lacks the 'User:' prefix). I'll check back tomorrow to see if the query had any success, otherwise I am out of ideas. Tim1357talk02:36, 26 June 2010 (UTC)[reply]
I really hope you were using logging_ts_alternative table instead of logging when you were running those queries on the Toolserver. — Dispenser14:56, 6 July 2010 (UTC)[reply]
Anyways, I bypassed the logging table altogether and just used ipblocks. I limited the query to the first 200 results, which will be automatically saved here when the query completes. --Tim1357talk02:35, 7 August 2010 (UTC)[reply]
Quite. I can't do all the work, you see. [Though expanding the report to MFD discussions might be worthwhile as well] –xenotalk22:34, 3 August 2010 (UTC)[reply]
"User pages of non-contributing users" or something like that. That is, users who come here, register an account, create a user page (that's often spammy, vandalism, or a test page), and never edit anywhere else but on their user page. There are some added complications now (a lot of people land here from other Wikimedia wikis, especially with unified login now, e.g.) and there still isn't a very clear title, but I think the overall idea has merit. --MZMcBride (talk) 00:30, 30 September 2010 (UTC)[reply]
A biweekly (or weekly if it seems useful) report I'd like to see is a list of most-linked redirects. For example, the bot would count the number of times John Blain (football player) is linked, showing the top 500 or so redirects. While some would be useful and should be left alone, others could be found that were the result of page moves that should have been fixed a long time ago. WizardmanOperation Big Bear21:23, 9 October 2010 (UTC)[reply]
Some don't need to be, but linking to an outdated link could cause created articles to follow that nonstandard format, leading to more work for the rest of us. That and while some redirects don't need to be cleaned up, many do. WizardmanOperation Big Bear04:15, 11 October 2010 (UTC)[reply]
Huh, you've actually asked about this before. And I've already given you the pagelinks table lecture, so now all I need to do is remember to actually write this report. :-) I'll try to get to it tomorrow. For what it's worth, I doubt it will be a weekly report. It'll likely be kind of "expensive" to generate/update. --MZMcBride (talk) 07:21, 11 October 2010 (UTC)[reply]
To Wos: that could work. To MZM: If I did ask earlier then I forgot about that. I understand that it would definitely be a tough list to create, and limiting it to monthly given the update expense is fine. WizardmanOperation Big Bear18:38, 11 October 2010 (UTC)[reply]
I just noticed my request several sections up. To clarify, this would just be for mainspace articles; no reason to bother with others. If you want to trim it more and remove other redirects which shouldn't be touched, then we can remove any that include the category "redirects with possibilities" which should knock time off running it. WizardmanOperation Big Bear18:42, 11 October 2010 (UTC)[reply]
Take a look at Most-linked redirects (configuration). The initial limit was 30 incoming links. I've now raised it to 200 incoming links and it's running currently. We'll see what it outputs, how useful it is, and what (if any) adjustments are needed once it finishes.
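Roughly, the query involved has a shape like the following (a sketch only; see the linked configuration for the real one):
SELECT rd.page_title, COUNT(*) AS incoming_links
FROM page AS rd
JOIN pagelinks ON pl_namespace = 0 AND pl_title = rd.page_title
JOIN page AS src ON src.page_id = pl_from AND src.page_namespace = 0
WHERE rd.page_namespace = 0
  AND rd.page_is_redirect = 1
GROUP BY rd.page_id, rd.page_title
HAVING COUNT(*) > 200
ORDER BY incoming_links DESC;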
Thanks. Admittedly I was not expecting that many redirects to have such a large number of links. So far it's helped find a few issues that I've handled, even though most redirects there would be left alone. WizardmanOperation Big Bear02:57, 12 October 2010 (UTC)[reply]
It might be helpful to see if there are ways to refine the scope of the report. This can usually be done by looking at specific examples to see if there are distinguishable characteristics about the page that can be broadly applied to other similar pages. For the example you provided (John Blain (football player)), we can say a few things:
the page title contains parentheses (which usually signify disambiguation of some kind)
the page has 7 incoming links from other articles
the page title signifies that the person is a living person (using a common first name)
Some of these observations are completely useless. Some aren't. If you can try to give me a better idea of what kind of pages you're looking for (perhaps with some other specific titles), I can try to make a better report for you. This might mean that the report will ultimately be moved to a different title, but that's no big deal. --MZMcBride (talk) 03:03, 12 October 2010 (UTC)[reply]
Out of those, the main one that jumps out at me would be the move log; that is something I was hoping to work on, alongside the parentheses modifier as well. The many football player disambigs, for example, stem from a pretty big move war from 07-08, so there's a lot of articles that go to the wrong places still. WizardmanOperation Big Bear03:20, 12 October 2010 (UTC)[reply]
Long pages changes
The page has two edits ([5], [6]) in the history that probably have to be implemented with the bot. At the same time, the 140K threshold could also be lowered to 130K. --Kslotte (talk) 11:14, 5 December 2010 (UTC)[reply]
All pages containing colons
I'd like to see a report that has information about:
The number of pseudo-namespace redirects out there.
Pages that can have possible conflicts with interlanguage/iso-code prefixes.
Maybe? You need to be much more specific, though. I assume you're talking about pages such as CAT:CSD? It's easy enough to list all page titles that contain a colon (":"), but that probably isn't what you want (even if your section header says otherwise). Pages whose titles begin with language codes in use (such as "en") are inherently forbidden. (Try making the page "En:foo" on this wiki.) So it's rather unclear what you're after and for what purpose. If you can clarify, that would help things along. --MZMcBride (talk) 03:04, 26 January 2011 (UTC)[reply]
I have been tackling some of the entries in this report but it is a bit difficult to see the wood for the trees sometimes as there are a lot of such categories which only contain pages in user space (e.g. Category:Capitalist Wikipedians and the like). Would it be possible to split the report (there are already 3 subpages but these seem to be just for size reasons) into say
Red-linked categories containing at least one page in article space.
Red-linked categories containing only pages in user space.
I don't have an objection to this per se, but it may be easier just to empty those user categories which do not start with User or Wikipedia. I've started working on this. -- Black Falcon(talk)22:13, 28 January 2011 (UTC)[reply]
Long stubs
Would be nice to be able to keep this at 500-1,000 items. Currently it has fewer than 100 entries that could still be stubs after removing over 500 !stubs. Vegaswikian (talk) 07:30, 3 February 2011 (UTC)[reply]
For the Cats listed at Wikipedia:Database reports/Deleted red-linked categories, would it be possible to change the members column (showing how many pages are in the deleted cat) to use PAGESINCAT? Or, if the number is logged for the database report as how many pages were members at the time of the scan, perhaps add a new column with a "current" heading? This would probably increase the page count, as I believe you can only call 500 parser functions at a time (as opposed to the current 800), but I think it would help to process and clean up the entries there. Avicennasis @ 11:08, 30 Shevat 5771 / 4 February 2011 (UTC)
And secondly, a new report that checks all the categories that transclude {{Template category}} to see if any of them contain article pages would be nice.
I've done some SQL below but I'm not certain whether it will work or needs some tweaking.
I've selectively archived this talk page again. If you want a new report or a tweak to an existing report, post (or re-post) here and someone will take care of it, eventually. --MZMcBride (talk) 17:30, 6 February 2011 (UTC)[reply]
I am not sure that I agree and WP:PROJCATS implies that this should not be done. The reason being that the article categories are for readers whereas templates are usually only of interest to editors. Boissière (talk) 22:30, 17 February 2011 (UTC)[reply]
I would argue that it is helpful to readers when a navigation template (intended for readers, not editors) for a particular topic is placed in a category for the same topic (this is, in my opinion, the only circumstance when such "template pollution" is useful). In general, it is quite common for a limited amount of project content to be categorized directly into a content category when the topic of the content and the category is the same. For instance, stub categories always appear in content categories, sorted after the main alphabetical list under the character "µ". The situation is similar with Wikipedia books, which are sorted under the character "β", and templates, sorted under the character "τ". -- Black Falcon(talk)17:58, 18 February 2011 (UTC)[reply]
Neat. :-) There are already a number of other Toolserver tools listed in the index on the subject-space page. Feel free to add a row for your tool/report if you'd like. I'm sure people would find it helpful. Cheers. --MZMcBride (talk) 23:24, 21 February 2011 (UTC)[reply]
I'm not sure how useful this would be, but a report like "talk pages that are redirects when parent article is not" may be helpful, particularly in finding poor moves or copy-paste articles. Avicennasis @ 02:49, 20 Adar I 5771 / 24 February 2011 (UTC)
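A sketch of what such a query might look like (standard page table assumed; not an existing report):
SELECT talk.page_title
FROM page AS talk
JOIN page AS subject ON subject.page_namespace = 0 AND subject.page_title = talk.page_title
WHERE talk.page_namespace = 1
  AND talk.page_is_redirect = 1
  AND subject.page_is_redirect = 0;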
Hah, it's configured for the 29th day of the month, too, so it missed February altogether. It should probably be changed to weekly. --MZMcBride (talk) 16:28, 2 March 2011 (UTC)[reply]
I tried that when I was creating the report, but failed, because the resulting query was too slow. I'm trying an alternative approach now: creating my own modified copy of MZMcBride's page creators table. Svick (talk) 22:10, 2 March 2011 (UTC)[reply]
When someone moves an article to a new name, they may not notice that sometimes there was a /to do subpage off the talk page, which then ends up being orphaned since it is not moved with the page and talk page. So just wondering if a query which lists pages that are redirects where there is also a /to do subpage that exists would be a useful report? -- WOSlinker (talk) 20:54, 2 April 2011 (UTC)[reply]
Thanks. Just wondering, maybe it would be better to only list those without any links to them. So adding the following to the where clause might help. -- WOSlinker (talk) 11:17, 24 April 2011 (UTC)[reply]
and not exists (
select *
from pagelinks
where pl_from=sub.page_id
and pl_title<>'Database_reports/Talk_subpages_with_redirect_parent'
)
You're not specifying page_namespace here. That must make it much slower, no? (Actually, just tested, it's 6 seconds vs. 0, but still would be much better to specify here. It would also be nice to make it abstract and use the value from report_title if possible.)
Yeah, you're right. I was thinking that there's no chance that there will be a page with the same name in another namespace, but forgot about indexes. User<Svick>.Talk();19:00, 24 April 2011 (UTC)[reply]
Welcome to Wikipedia. ;-) Stupid situations will happen a lot; it's good to be able to identify where things have gone wrong and fix them, automatically if possible. Mike Peel (talk) 07:50, 4 April 2011 (UTC)[reply]
I'm not sure I'd call needing an extra click "horribly annoying," but yeah, these should be fixed. At least a non-existent section still generally gets you to the appropriate page; it's not as though it returns an error page or something silly like that.
I'm kind of surprised WikiProject Check Wikipedia doesn't catch this already, but the extra requests needed to get the parsed section headers probably explain that. I'll try to get to this over the weekend. I have a thought or two about how to do it. --MZMcBride (talk) 06:30, 29 April 2011 (UTC)[reply]
I wrote this at Broken section anchors (configuration). There were about 700 results from the first 2500 redirects containing an anchor that were checked. That's about 28% of redirects pointing to a section that are affected, assuming the rate of false positives is low (which it seems to be). I'll try to get this report set up to update a bit more regularly in the coming days/weeks. For now, there's plenty of work to be done.
Briefly, I'll note that in a lot of these cases, the answer isn't to simply remove the anchor or find an alternate target header. Sometimes the answer is to point the redirect to a different page altogether or add an {{anchor}} to the target page. --MZMcBride (talk) 01:03, 2 May 2011 (UTC)[reply]
^\s*#REDIRECT:?.*?\s*\[\[([^|[\]]*?)(?:\|[^[\]]+)?\]\] is the regex I used for parsing redirects, I suggest you use it. — Dispenser22:39, 5 May 2011 (UTC)[reply]
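A usage sketch of that regex (the re.IGNORECASE flag is an assumption, added so lower-case "#redirect" also matches):
import re

REDIRECT_RE = re.compile(
    r'^\s*#REDIRECT:?.*?\s*\[\[([^|[\]]*?)(?:\|[^[\]]+)?\]\]',
    re.IGNORECASE)

def redirect_target(wikitext):
    # Returns the redirect target, including any #Section anchor, or None.
    match = REDIRECT_RE.search(wikitext)
    return match.group(1) if match else None

# redirect_target('#REDIRECT [[Foo#Bar]]') -> 'Foo#Bar'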
In the current dump there are 203808 redirects pointing to sections. A limit of 100 hits is reached after checking 813 of these, so maybe there are 25000 bad section redirects in total. (This ratio corresponds to the one seen on dewiki.)
Currently it's not, simply because it grew from a first try to a working solution without emphasis on style and readability. The basic approach for section parsing is to parse the wikicode for known section markers (section headlines, Template:Anchor, etc.) and additionally check the html output for id tags if the first step doesn't match. (Two-step approach because step one hits 90% and the results can be held in a local cache more easily than html output.) The code should be quite stable unless the redirecting page contains a crazy mix of different encodings of special characters. --Krd (talk) 07:07, 30 April 2011 (UTC)[reply]
tools:~dispenser/cgi-bin/rdcheck.py (linked from What links here) It also supports generators so categories can be checked, or the thing I added to WP:FAC. It's now part of the checklist for FAC, though coaching may be needed. Human review is needed: 1) the section was spun off into its own article, 2) a bot "fixed" a double redirect, 3) similarly, redirects that were never pointed to any section. There's some good code in dabfix used for comparing similar links. — Dispenser05:53, 30 April 2011 (UTC)[reply]
Redesigning this project
As this project approaches 100 reports, there are some structural/design/maintability issues that should be addressed at some point. I put some notes at Wikipedia:Database reports/Redesign. If anyone has any comments/suggestions/etc., throw them on that page or this talk page or wherever. --MZMcBride (talk) 21:42, 30 April 2011 (UTC)[reply]
A caching system allowing user-initiated report regeneration along with automatic regeneration. Might allow for simple queries (e.g. contribs tool) if we can work in user-specified parameters. Run time histogram and statistics, so expensive reports aren't regenerated when no one is reading them. History of the past 5 reports and their size/row count. Support for multiple wikis and mirroring reports on wiki, useful for Page count by namespace. Include a dependence chain, e.g. report A requires an updated project_banner table done by report B. Custom CSS/JS, table filter, and sorting. Support for dump reports. Points system for games??? — Dispenser21:41, 5 May 2011 (UTC)
Notification options: Email, RSS, or a bot edit (think {{LSR}}) for when reports complete. A refresher script (similar to books) for < 30 minute reports with a graph of previous completion times. — Dispenser21:00, 3 June 2011 (UTC)[reply]
Some from an earlier design document: configurable .ini-like python file (i.e. Framework approach), statistics query run time, JS highlighter (they're faster), Form fields have auto-complete and regex validation is applied client and server side.
Controls programs (themselves reports): report data file viewer, report directory list, activity stream, and administration. — Dispenser22:16, 2 September 2011 (UTC)[reply]
It keeps getting killed for running so long, I think. It's a very expensive query. The report probably needs to be re-implemented using smarter logic.
One idea I had was to only check pages that have been edited in the past 30 days in the article namespace (using the recentchanges table). But even that is about a million queries. I'll have to give more thought about how to get this information efficiently. Dispenser, Svick, et al.: Do you have any ideas? --MZMcBride (talk) 04:32, 30 May 2011 (UTC)[reply]
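The narrowing step described above, as a sketch (standard recentchanges table): distinct articles edited in the last 30 days, to use as the candidate set for the per-page checks.
SELECT DISTINCT rc_cur_id
FROM recentchanges
WHERE rc_namespace = 0
  AND rc_cur_id != 0
  AND rc_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%s');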
You could try to split the query into two: one for User and the other for User talk namespace. But the most likely outcome would be that it's going to be even slower. Another option would be to do the distinct part in memory. (I'm assuming the pagelinks query is the slow one, not the templatelinks one.) User<Svick>.Talk();20:12, 30 May 2011 (UTC)[reply]
Gadget statistics
/* Gadget statistics - run time 1.5 hrs SLOW_OK */
SELECT up_property, COUNT(*)
FROM user_properties_anonym
WHERE up_property LIKE "gadget-%"
AND up_value = 1
GROUP BY up_property
With the same record keeping purpose as Page count by namespace, I propose the creation of an installed gadgets report (see previous: WT:Gadget#Usage-Stats). Other interesting properties the table includes are: gender, imagesize, diffonly, blank edit summary warning, and the recent change/watchlist flags for bot and minor edits. We may also want to add derived columns such as user_active and a binned edit count. — Dispenser00:02, 26 May 2011 (UTC)
Skin notes: 0=default, 1=nostalgia, 2=cologneblue; amethyst was introduced in 1.4 and removed in 1.5; the majority of users don't have a skin set and are not counted, thus they account for the difference between the skin total and the current number of users. — Dispenser05:32, 11 July 2011 (UTC)
Hmm, you actually want a dump report. In order to do something like this, you need to scan the wikitext of every article (from a database dump) and look for that particular string using regular expressions. Of course then you also have to account for redirects and other bizarre shit that can appear within the templates themselves (like sub-templates and magic words and parser functions).
I was going to do a similar database report (for repeating newlines, actually). I can probably just do this one when I'm doing the other one. If you're bored, it'd be helpful to have a list of strings to test a regular expression against. I'll put some examples below so you can see what I'm talking about. --MZMcBride (talk) 00:03, 2 June 2011 (UTC)[reply]
Just so I don't forget, what's really needed here is a generic function that can take wikitext input and output pairs of "grabbed templates" and their respective positions within the wikitext. It needs to be able to catch odd cases like {{foo|bar={{baz}}}}, but once it reasonably can, it should be trivial to take each position, subtract it from the subsequent one, and then output only templates that are really close to each other (matching on template name too, of course). A generic template_grabber function would resolve this report and be very helpful for other projects (including the next step, which is taking the grabbed templates and then parsing their parameters). This isn't very complicated, but it will require a few hours of dedicated development time to do properly. --MZMcBride (talk) 01:33, 2 June 2011 (UTC)[reply]
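A hedged sketch of such a template grabber (an approximation only: it counts brace pairs to handle nesting like {{foo|bar={{baz}}}}, but ignores <nowiki>, comments, and {{{parameters}}}):
def grab_templates(wikitext):
    # Returns (template_text, start_offset) pairs for top-level {{...}} blocks.
    results = []
    depth = 0
    start = None
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == '{{':
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == '}}' and depth > 0:
            depth -= 1
            if depth == 0:
                results.append((wikitext[start:i + 2], start))
            i += 2
        else:
            i += 1
    return results

# grab_templates('a {{foo|bar={{baz}}}} b {{quux}}')
# -> [('{{foo|bar={{baz}}}}', 2), ('{{quux}}', 24)]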
AWB uses the regex {{(?>[^\{\}]+|\{(?<DEPTH>)|\}(?<-DEPTH>))*(?(DEPTH)(?!))}} to match top-level templates, that can have other templates nested inside them. Although I never understood how it works, and it's probably .Net-specific. User<Svick>.Talk();22:07, 2 June 2011 (UTC)[reply]
Hey MZMcBride. I'm not sure if this is exactly what you were looking for, but I wrote this template parser a while back to parse templates. It can handle an infinite number of nested templates (ect:{{foo|bar={{baz|foo={{bar}}}}}}). It can also handle embedded images and wikitables. The code is below. Hope it helps at all. Tim1357talk22:49, 20 June 2011 (UTC)[reply]
Template Parser
spaces = ' \t\n\r'
chars = ['a','b','c','d','e','f','g','_','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9']

class Template():
    def __init__(self, text=None):
        '''The parameter "text" is a string that starts with "{{". The parser will automatically find the end of the template'''
        self.start = '{{'
        self.end = '\n}}'
        self.wrapper = lambda g, p=None: '\n | %s = %s' % (p, g) if p != None else '\n | %s' % g
        if text != None:
            self.text = text
        try:
            self.text = self.text.strip()
        except:
            raise TypeError('The Class: "Template" requires text be passed to it.')
        self.r = None
        self.cursor = 0
        cursor = 0
        m = len(self.text) - 1
        while 1:
            if '|' == self.text[cursor]:
                self.standard()
                if self.r != None:
                    self.r = iter(self.r)
                    te = []
                    for p, g in self.r:
                        if p != None:
                            p = p.strip()
                        te.append((p, g.strip()))
                    self.r = te
                    del te
                break
            elif self.text[cursor:cursor+2] == '}}':
                break
            cursor += 1
            if cursor >= m:
                raise TemplateSytnaxError('Could Not find Template Start')
        self.name = self.text[:cursor+1].strip('{}|')

    def standard(self):
        max = len(self.text)
        cursor = 0
        r = []
        try:
            while self.text[cursor] != '|':
                cursor += 1
        except IndexError:
            self.r = None
            return
        cursor += 1
        last = cursor
        while cursor < max:
            n, cursor = self.checkname(cursor=cursor)
            last = cursor
            while True:
                if self.text[cursor:cursor+2] == '{{':
                    cursor = self.untill(cursor=cursor+1)
                elif self.text[cursor:cursor+2] == '[[':
                    cursor = self.untill_img(cursor=cursor+1)
                elif self.text[cursor:cursor+2] == '}}':
                    r.append((n, self.text[last:cursor]))
                    self.r = r
                    self.text = self.text[:cursor+2]
                    return
                elif self.text[cursor] == '|':
                    r.append((n, self.text[last:cursor]))
                    cursor = cursor + 1
                    last = cursor
                    break
                cursor += 1
                if cursor >= max:
                    break
            self.cursor = cursor
        raise 'No ending brackets!'

    def checkname(self, text=None, cursor=0):
        if text == None:
            text = self.text
        n = cursor
        l = cursor
        while text[cursor] in spaces:
            cursor += 1
        while text[cursor] in chars:
            cursor += 1
        n = text[n:cursor]
        while text[cursor] in spaces:
            cursor += 1
        if text[cursor] == '=':
            cursor += 1
            return (n, cursor)
        return (None, l)

    def untill_table(self, text=None, cursor=0):
        if text == None:
            text = self.text
        s = cursor
        st = 1
        while True:
            if text[cursor:cursor+2] == '|}':
                st -= 1
            elif text[cursor:cursor+2] == '{|':
                st += 1
            elif text[cursor:cursor+2] == '{{':
                cursor = self.untill(cursor=cursor+1)
            elif text[cursor:cursor+2] == '[[':
                cursor = self.untill_img(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def untill_img(self, text=None, cursor=0):
        if text == None:
            text = self.text
        s = cursor
        st = 1
        while True:
            if text[cursor:cursor+2] == ']]':
                st -= 1
            elif text[cursor:cursor+2] == '[[':
                st += 1
            elif text[cursor:cursor+2] == '{{':
                cursor = self.untill(cursor=cursor+1)
            elif text[cursor:cursor+2] == '{|':
                cursor = self.untill_table(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def untill(self, text=None, cursor=0):
        if text == None:
            text = self.text
        st = 1
        while True:
            if text[cursor:cursor+2] == '}}':
                st -= 1
            elif text[cursor:cursor+2] == '{{':
                st += 1
            elif text[cursor:cursor+2] == '[[':
                cursor = self.untill_img(cursor=cursor+1)
            elif text[cursor:cursor+2] == '{|':
                cursor = self.untill_table(cursor=cursor+1)
            cursor += 1
            if st == 0:
                return cursor
            if cursor >= len(text):
                return cursor

    def old(self):
        return self.text[:self.cursor+2]

    def __dict__(self):
        return dict(self.r)

    def __getitem__(self, key):
        a = dict([(p.lower() if p is not None else None, g) for p, g in self.r])
        assert a.has_key(key.lower()), KeyError('\'%s\'' % key)
        return a[key.lower()]

    def __delitem__(self, key):
        '''Usage: del TemplateObject['key']'''
        for k, v in self.r:
            if k is None:
                continue
            if k.lower() == key.lower():
                self.r.remove((k, v))
                return
        raise KeyError('\'%s\'' % key)

    def __iter__(self):
        return iter(self.r)

    def __str__(self):
        if self.r == None:
            return self.text
        i = self.start + self.name
        for p, g in self.r:
            i += self.wrapper(g, p)
        i += self.wrapper(self.image, 'image')
        i += self.end
        return i
"Someone," heh. cron errors are usually caught at insert/update. I have an e-mail from cron that says that the query was interrupted. This was likely due some kind of hiccup or (more likely) the query ran too long and got killed. The e-mail output is below. Honestly, this report shouldn't be a database report at all. It's using a rather hackish system that results in a lot of false negatives, as I recall. It would make a lot more sense as a dump report. Maybe I'll get around to rewriting it at some point. "Dump reports" are currently halted due to some kind of horrible speed issue in the current rewritten script, though. Until the speed problems can be addressed (in a demonstrable way), there won't be any new dump reports. Fun fun fun fun (lookin' forward to the weekend). --MZMcBride (talk) 15:16, 16 June 2011 (UTC)[reply]
Your "cron" job on willow
PYTHONPATH=$HOME/scripts python $HOME/scripts/database-reports/directstubs.py > /dev/null
produced the following output:
Traceback (most recent call last):
File "/home/mzmcbride/scripts/database-reports/directstubs.py", line 72, in <module>
''')
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 173, in execute
self.errorhandler(self, exc, value)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1317, 'Query execution was interrupted')
Hi Guys,
I made a SQLite3 database that indexes all the infoboxes used in the article space. I was finding that I was doing quite a bit of work with the API just to find a parameter in an infobox, so I decided to create a database to save time and bandwidth. I was wondering if you think the database (about 60/90 MB with/without indexes, uncompressed) would be helpful here. If you think it might come in handy, I'll upload it to the MySQL database. Cheers. Tim1357talk18:44, 25 June 2011 (UTC)[reply]
Oh, here's the schema (if it matters):
sqlite> .schem
CREATE TABLE InfoboxParameters (Infobox INT, Arg Varbinary, Value Varbinary, Article Varbinary);
CREATE TABLE Infoboxes (Name Varbinary,ID int unsigned, PRIMARY KEY(ID));
CREATE INDEX Infobox_Arg_Index on InfoboxParameters (Infobox,Arg);
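An example of how the schema above could be queried (the infobox and parameter names here are purely illustrative):
SELECT Article, Value
FROM InfoboxParameters
JOIN Infoboxes ON Infoboxes.ID = InfoboxParameters.Infobox
WHERE Infoboxes.Name = 'Infobox settlement'
  AND Arg = 'population_total';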
Could we add a report on the largest WikiProjects by the number of articles tagged with a project's banner? The data for each project is available on the toolserver when "All" is clicked but there is no option to sort the projects by number of articles. Once each month or two would probably be frequent enough for this task. This report would be useful for size comparisons and descriptions in the Signpost's WikiProject Report. -Mabeenot (talk) 23:56, 26 June 2011 (UTC)[reply]
Whoops, it looks like I was using the wrong (out-dated) database. I think the one in use now is u_enwp10. I don't have read-access to that database right now, but I asked CBM (the maintainer of the project) for access. Tim1357talk14:46, 14 July 2011 (UTC)[reply]
The Wikipedia:Database reports/Redirects containing red links report is flooded with talk pages that contain templates based on the {{WPBannerMeta}} meta-template, which automatically create links to "/Comment" subpages whether the subpage exists or not. In other words, Talk:TOPIC ends up in the report because it "contains" a red link to Talk:TOPIC/Comments, although the redlink actually only exists in the template code and wouldn't be visible to readers even if the talk page were not a redirect. Suggested tweak: exclude any link to a page in namespace 1 whose title ends in "/Comments" from the report. --R'n'B (call me Russ) 14:06, 29 June 2011 (UTC)[reply]
User talk pages which redirect to another namespace
Just what it sounds like. Maybe monthly? Some new users move their user pages (and associated talk pages) to the mainspace when they are done drafting an article in the userspace. However, that means that notes left by automated tools like Twinkle at the user's talk page actually end up redirected to the article talk page. These user pages need to be un-redirected and having a list of them should make that easy. –Drilnoth (T/C) 21:48, 20 July 2011 (UTC)[reply]
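A sketch of the suggested report (standard page/redirect tables assumed): user talk pages that redirect outside the User talk namespace.
SELECT page_title, rd_namespace, rd_title
FROM page
JOIN redirect ON rd_from = page_id
WHERE page_namespace = 3
  AND page_is_redirect = 1
  AND rd_namespace != 3;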
Inactive users report
Is there any reason why some usergroups are listed twice? For example:
PC has ended, but it has not been turned off and there is no guarantee that it will not be used again. I think this is probably a good report to keep until the future of PC is clear. That said, we could probably move the updating to once or twice a month until that time. --Izno (talk) 05:23, 27 July 2011 (UTC)[reply]
Re-used image names
I would like a list of filenames -- both on en.wikipedia and on Commons - that have more than one image attached to them due to accidental overwriting, so as to figure out which ones need splitting into separate images.
This will probably generate a huge amount of false positives; it can be reduced by eliminating images which are uploaded by authorized image-modification bots like RotateBot, and by eliminating later images uploaded by the original uploader. It will still, however, require human attention to tag them for splitting. DS (talk) 01:08, 30 July 2011 (UTC)[reply]
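A possible starting point, hedged and run per wiki (oldimage holds superseded file versions; known maintenance bots would still need to be filtered out, and false positives will remain):
SELECT img_name, COUNT(*) + 1 AS versions
FROM image
JOIN oldimage ON oi_name = img_name
WHERE oi_user_text != img_user_text
GROUP BY img_name;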
If so, then where is it? Also, hashes won't allow us to detect if (image 2) is a cropped version of (image 1), will they? DS (talk) 18:27, 30 July 2011 (UTC)[reply]
Pages which have been semi-protected the longest
I'm not sure if this is possible, but what would be helpful would be a list of semi-protected pages by how long they have been semi-protected. It might be useful to see if those which have been semi-protected the longest could afford to be unprotected. –MuZemike16:31, 2 August 2011 (UTC)[reply]
We have this for articles (also here) and talk pages; if you sort by timestamp, you can view them chronologically. Avicennasis @ 22:02, 4 Av 5771 / 4 August 2011 (UTC)
Images identified as lacking NFUR
Would it be possible to add {{Has-NFUR}} to the list of templates on an image which the report skips over?
Couldn't most of those be converted to using templates? I'm not opposed to the new template, I'm just wondering if it is redundant. –Drilnoth (T/C) 16:28, 24 August 2011 (UTC)[reply]
Inactive bots
Further to Wikipedia talk:Bot Approvals Group/Archive 7#Bots & long term inactivity, could we have a report updated once monthly or quarterly that lists accounts with the 'bot' flag that have no edits or logged actions in over 1 year, sorted by activity with longest inactive at the top? (Am trying to think if there is an easy way to identify read-only bots that should not have any edits or logged actions, will let you know if I come up with anything.)
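A sketch of the request (logged actions would need a separate check against the logging table; this only looks at edits):
SELECT user_name, MAX(rev_timestamp) AS last_edit
FROM user
JOIN user_groups ON ug_user = user_id AND ug_group = 'bot'
LEFT JOIN revision ON rev_user = user_id
GROUP BY user_id, user_name
HAVING last_edit IS NULL
    OR last_edit < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 YEAR), '%Y%m%d%H%i%s')
ORDER BY last_edit;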
It got killed by the query killer, I imagine. Or there was a hiccup somewhere. Does it matter? I looked through a few of the pages in the list. There's plenty of work to be done (lots of cases where the coords overlap exactly). --MZMcBride (talk) 18:53, 17 September 2011 (UTC)[reply]
Oh, no. The Toolserver has been more overloaded lately, so the Toolserver administrators have implemented (or re-implemented, I guess) a "query killer." I haven't written much about it anywhere, but it's impacting half a dozen or so reports. If the replication lag on the server increases, the query killer gets more aggressive. Some very slow queries in some reports are now being killed before completing (emptycats.py, userlinksinarticles.py, templatedisambigs.py, etc.). Some of these reports will probably have to be rewritten. A few may have to be disabled. I'm not sure how many people realize how many reports there are nowadays (over 100 at this point, I think). It's a bit annoying to maintain all of them by myself. :-) --MZMcBride (talk) 21:52, 17 September 2011 (UTC)[reply]
For what it's worth, getting a Toolserver account was pretty much how I learned SQL. There's no better place, really. You have access to small and large constantly updating replicated databases that are set to read-only, so you can't break them even if you wanted to. It's a great place to play around with syntax, learn conventions, etc. I wrote some docs at tswiki:Queries and tswiki:Database access at some point that are helpful for new users. (In general, I think there's a somewhat glamorized view of Toolserver users. Most of them have no idea what they're doing.)
All of this is largely unrelated to database reports, though. They're run from the Toolserver, but the master copy of the reports code is on GitHub. A few people have access to the repo right now (Svick and maybe bjweeks?), but it'd be nice to have more people on it, assuming they won't break anything. :-) --MZMcBride (talk) 22:05, 17 September 2011 (UTC)[reply]
The status of this backlog has changed radically in the past few weeks, thanks to my colleague Redrose64. I'd like this report to become more frequent now. Perhaps twice a week? —Stepheng3 (talk) 00:41, 9 November 2011 (UTC)[reply]
Stepheng3: With the number of rows decreasing per report, I'd actually be a bit more inclined to make the updates fortnightly or something, to increase the value of an update. But for now, I think weekly is fine. The whole point is kind of moot if it's possible to get a dynamic list, which, given that this is just the intersection of two queries, is really not very difficult. catscan seems a bit slow, but there's no rush. :-) --MZMcBride (talk) 03:59, 10 November 2011 (UTC)[reply]
Having such a report would help speed up what I am attempting to do from the toolserver list by not including media description pages which do need to be local.
I noticed that some monthly reports show the last updates in April and July. Looks like something is slipping through the cracks. Also long stubs shows as having been updated a few days ago, but all I get on the wiki is the October 21 version. Vegaswikian (talk) 19:20, 31 October 2011 (UTC)[reply]
SPA Account edits
Can a list of articles which are more than 4 months old for which editors that have edits only to that article make up more than 1/3 of the total editors of the article be made? Smallman12q (talk) 01:53, 12 December 2011 (UTC)[reply]
You could make the scan more general to find two identically (or almost identically) named sections on image files. It would probably require a dump scan...but it seems like the script for the scan should be very short. Smallman12q (talk) 04:00, 24 December 2011 (UTC)
I have recently learned about WP:DBR. I saw Wikipedia:List of Wikipedians by article count, which is not currently included at DBR, and wondered if it would be worth producing similar reports for template space, image space, category space and project space. Also, since the report ignores redirects (and there are more redirects than non-redirects in mainspace), maybe a redirect list could be produced too. I think these reports could be produced weekly and listed at DBR. I have tried asking about this at WP:VPT and WP:HELPDESK.--TonyTheTiger (T/C/BIO/WP:CHICAGO/WP:FOUR) 15:09, 23 December 2011 (UTC)