This is an archive of past discussions on Wikipedia:Database reports. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Would there be any way to locate instances where both (X Y) and (user:X Y) exist, and (X Y) is a living person? DS (talk) 14:21, 7 June 2010 (UTC)
I did fuzzy matching between article creators and article titles at some point. The data is available here. It'd be neat if something similar could be done with usernames and BLP titles, though you're dealing with much larger sets, so I'm not sure how technically feasible it is.
In general, there are two important considerations to make with a task like this: false positives and false negatives. The false positives come from common names (most are), causing coincidental collisions that aren't of any importance. The false negatives come from two sources that I can think of off-hand: the addition of middle names or inclusion of full names in page titles and page titles that disambiguate (e.g. "John Smith (minister)"). The middle name issue is offset by fuzzy matching; the disambiguation issue is offset by stripping. --MZMcBride (talk) 18:25, 7 June 2010 (UTC)
Link to the userrights screen would be keen. Filling in the reason as 'autoconfirmed' even moreso. –xenotalk01:28, 14 June 2010 (UTC)
I've changed the report to use {{dbr link}}, allowing any user to customize the output. Other reports will hopefully eventually use this template as well. --MZMcBride (talk) 15:28, 16 June 2010 (UTC)
This is not a request, it is a suggestion concerning all applicable reports. The suggestion is to add a column which marks when an item was first added to the report.
The rationale: If an item has been reported in the last 10 updates of a daily report, chances are that more than one user has checked the item.
I can't see any simple or clean way of doing this right now. I'll have to give this more thought. In general, I'd say that better reports are preferred to just adding a timestamp column. That is, if certain results keep showing up and are getting in the way, it'd be good to find a way to make that not happen rather than simply point out to the user how long these items have been getting in the way. Is there a particular report or set of reports where you think this feature would be useful? --MZMcBride (talk) 04:45, 13 July 2010 (UTC)
I'm not sure I understand the question. The MediaWiki database doesn't track when items are added to a page. It would have to be stored in a separate table somewhere, I imagine. --MZMcBride (talk) 05:52, 13 July 2010 (UTC)
When I made my suggestion I presumed that it would be done by programming BernsteinBot to compare the last update of a report with the new update, and I just asked to know if my understanding was correct or not. Sole Soul (talk) 06:05, 13 July 2010 (UTC)
Yes, it would do something like that. BernsteinBot would probably maintain a database in which it stored database report title, a key of some sort (maybe a cryptographic hash of the table row]]), and a timestamp. Then the bot would compare any updates to the database table and update the database table or wiki table as necessary. Or something like that. It's not impossible to code, but it's not a trivial task and I'm not sure I see the benefit (yet). --MZMcBride (talk) 06:08, 13 July 2010 (UTC)
Users go through a large report either from the top down or the bottom up or randomly. With time, this will make the bulk of the top and bottom items stabilize (i.e. they will be repeated because many users have checked them). This means that small percent of items are checked many times while the rest of items checked a few times or never. Sole Soul (talk) 06:37, 13 July 2010 (UTC)
More long pages
Is it possible to get more long talk pages for Wikipedia:Database reports/Long pages in section "Specified talk pages"? It would be preciated, since the list is good to check for pages that needs to be archived and reduced in size. What about change the size limit from 175,000 bytes to 150,000 bytes? --Kslotte (talk) 16:06, 12 July 2010 (UTC)
I don't see any point why to exclude that. If it is about getting some Wikipedia talk "permanent archives" excluded, then excluding historical template alone isn't enough. There exist also many other templates. --Kslotte (talk) 10:52, 16 July 2010 (UTC)
Sounds a bit like "perfect is the enemy of done." If we can say definitively that pages marked with {{historical}} should be excluded, then there's no reason to list them, even if we may miss other pages that aren't properly marked or are marked using a different template. --MZMcBride (talk) 01:31, 18 July 2010 (UTC)
Perhaps exclude all noindexed pages. Most of the permenant archive templates dump the pages into the noindex category. Tim1357talk01:01, 18 July 2010 (UTC)
I did revert it. Seems like there is too many transculations on one page. Links at end of page won't work. Possible some wiki function restrictions. --Kslotte (talk) 18:31, 20 September 2010 (UTC)
The pagelinks table is one of the largest. As a basis for comparison, all non-deleted revisions equal roughly 328,876,236 rows. The pagelinks table is currently roughly 507,145,981 rows. So any queries on it generally don't work so well. I'm testing the query to get this data now, but it's been running for a few hours and I'm not sure if it will finish before it gets killed. We'll see. --MZMcBride (talk) 20:58, 17 July 2010 (UTC)
Diddo MZMcBride. Finding things like most linked require a function that takes friggin forever to complete. This is because it needs to find how many links exist for every single redirect before it can spit out a list of the most linked. I suggest a list of redirects with over a certian ammount of links (say 500 links), limited to the first 100 found. Tim1357talk01:00, 18 July 2010 (UTC)
So, I ran this query, but it took about 20 hours. And the result set wasn't ordered correctly, so I have to run it again. I'm not sure I see the need for this to be a proper database report. I may just throw it in a sandbox. Dunno. --MZMcBride (talk) 17:11, 18 July 2010 (UTC)
20 hours seems like a bit much for such a low priority type of report. Maybe just have this be a "as requested" report. Tim1357talk18:46, 18 July 2010 (UTC)
3 hours? Can you head or tail the file and post results somewhere? I'm also not sure what "filter the set" means. Maybe paste the query you used? :P --MZMcBride (talk) 01:12, 22 July 2010 (UTC)
I looked at the query you used. I think you missed the "in articles" part. Or maybe I mis-read the request. It seems like he only wants to count links from other articles, though. --MZMcBride (talk) 11:10, 22 July 2010 (UTC)
It takes 5.2 hours searching only the article namespace, but only 15 minutes when both the page and redirect ends with ')'. However, the requester hasn't come back to clarify what he's going to do with it. — Dispenser03:10, 29 July 2010 (UTC)
Images without categories
Would it be possible to have a request run to find images with no categories whatsoever, including categories from templates? In theory, every image should be categorised based on its licensing; if it's not, it'll fail NFCC#10b, as all the copyright tags include licensing categories (or should). This should help us find problem images (or perhaps even problem templates) that the bots and other checks are missing. I don't know the scale of this problem, but I know that it exists. I guess having it checked weekly would be good. J Milburn (talk) 02:42, 27 July 2010 (UTC)
In case it wasn't clear, I was thinking about images hosted on enwp, not images hosted on Commons that happen to have some details on pages here. Can't see any harm in including the latter, but just thought I'd note that. J Milburn (talk) 02:45, 27 July 2010 (UTC)
Interestingly, it didn't seem to catch the images I was after, which is odd. Very useful though, thanks again. J Milburn (talk) 12:14, 29 July 2010 (UTC)
Would it be worth producing a report on problematic interwiki links? I have recently come across the following errors
Asymmetric links. i.e Article A points to article B but the en interwiki on the latter doesn't point back to article A.
Interwiki links to deleted articles. (These don't appear red like internal links do).
Interwiki links to redirects.
Now the third one may not really be a problem if we obey the rule that we don't fix redirects but the other two are definitely worth fixing. Boissière (talk) 20:47, 28 July 2010 (UTC)
There is lots of bots that fix the first error. The second one might be worth looking into, if no bot does this already. Also, I think that bot that immediately fixes such errors would be better than just a report. Svick (talk) 21:23, 28 July 2010 (UTC)
Bot requests are a bit different from database reports. Though obviously there is some overlap, as some reports are fed into bots and scripts to automatically or semi-automatically fix issues in pages.
Off-hand, I'm inclined to say that listing interwikis to redirects isn't going to be helpful and would create the illusion that they should be fixed. I don't see any reason for the redirects to bypassed unless the target page isn't accurate (the same title being re-purposed for a disambiguation page or whatever), obviously. I don't think there's really a way for a bot to know that, though.
I agree with Svick regarding points 1 and 2. Bots either exist (such as pywikipedia's interwiki.py, Google it for more info) or could exist to do some of this work. I imagine most of these bots would generate their own lists, though I'd be happy to help if I can. --MZMcBride (talk) 23:31, 28 July 2010 (UTC)
OK, I would agree that number 3 is not necessary and number 2 is fairly easy for bots to handle (though it would be interesting to know how good a job they are doing). What I can't see in the case of number 1 is how a bot could work out which link was faulty. I would have thought it would require some examination of the articles in question to decide where the links should actually go. How easy would it be to knock up a query for number 1? If it is not too hard then we could see what the current situation is. Boissière (talk) 20:07, 29 July 2010 (UTC)
I have a tool tools:~merl/reverselanglinks which is used by a gadget on dewiki to search interwikis linking to a single article. Because of these experiences i think the output of this query would be very long.
If you create this report you also have to keep in mind that jbowiki is not first-letter-upper-case like all other wikipedia. Merlissimo 20:46, 29 July 2010 (UTC)
Your comment about the output being long does imply that it is something that a bot cannot easily fix. It has occurred to me that a database report might not be the best approach as presumably you have to attach to each language wiki's own database. I am now wondering if using the API would be easier and therefore a reporting (but not fixing) bot would be better. Boissière (talk) 15:31, 2 August 2010 (UTC)
Is there any reason why that report hasn't been updated for nearly three months? Does it have anything to do with server load? I for one find that report interesting. I wonder if pending changes has had much of an effect on the data. Graham8709:16, 29 July 2010 (UTC)
It's supposed to run on the 28th day of each month. It keeps getting killed because it runs for so long, I think. I'm not really sure if there's anything to be done about this right now. --MZMcBride (talk) 02:44, 31 July 2010 (UTC)
I've (selectively) archived this talk page. I was going to write a script to do it, but it's not worth the development time when it only takes a few minutes to do it manually.
Some of the threads I removed were either regarding reports I didn't really feel like generating or the original poster failed to respond in a while.
If you actually want a report, re-post here and someone will take care of it, eventually. You may have to nag, though. --MZMcBride (talk) 02:41, 31 July 2010 (UTC)
births and births
Been finding a lot of biographies of dead people mistakenly put in two births categories (par exemple), could we get a report of people in two separate births categories? --Closedmouth (talk) 13:30, 14 August 2010 (UTC)
If there isn't one already, you should create a (maintenance or talk page) category for biographies that cover more than one individual. It would allow for much easier filtering in the future, for this list and other lists. Once that's in place, I'll write up a report that regularly updates. I think this is a reasonable compromise. --MZMcBride (talk) 20:27, 14 August 2010 (UTC)
If you change the WHERE from WHERE c1.cl_to != c2.cl_to to WHERE c1.cl_to < c2.cl_to it would stop the double rows. -- WOSlinker (talk) 20:48, 14 August 2010 (UTC)
Been mulling this over for a while; similar to Wikipedia:Database reports/File description pages containing no templates, I'm looking for "File description pages containing no license" - but the actual finding of them...? Theoretically each image description page would include more than one template - i.e. Logo & FUR - which would probably lead to tons of false positives. Another option would be to scan the image page to see if any of the templates from Category:Wikipedia image copyright templates is present, but it's a huge category. Another possibility - checking for "very short" page length instead of "no" template at all... Suggestions would be greatly appreciated :) Skier Dude (talk07:21, 19 August 2010 (UTC)
Changed my mind - it's picking up only the short pages - that means that if there's one template (say, the basic info) it isn't included by the tool. So, my request for "very short pages" isn't the right answer for this :( Skier Dude (talk04:17, 21 August 2010 (UTC)
If that report isn't really what you want, you should find some specific example files that are problematic. I can take a look at them and see if some patterns are evident. --MZMcBride (talk) 04:31, 21 August 2010 (UTC)
This was actually one of the many reasons for creating {{Asbox}} - in that the List Comparer from WP:AWB can be used to compare Category:Stub message boxes with links from Wikipedia:WikiProject Stub sorting/Stub types. Unfortunately there are only about 3400 on the page and there are about 16000 in the category. It is necessary to pick up some subpages too namely
I'm gonna let the WP:BOTREQ have first stab at this. If no one bites, I'll take a look. At a quick glance, it doesn't look like a bot was updating Wikipedia:Featured articles/Cleanup listing. And I'm not really sure what you define as a "cleanup category." Is there a master category somewhere of cleanup categories?
Well, the it was updated by a bot until March, and when the problems were dealt with, they were struck off manually between the month. I guess if there is a parent cat, the usual ones needed are unsourced, dead links, tagged to POV/weasel, undue weight and all that YellowMonkey (new photo poll) 04:42, 24 August 2010 (UTC)
Report request- transclusions of deleted templates
I would like to request that a DBR be created: Pages transcluding deleted or nonexistant templates, run Sunday and Wednesday. Would be nice to catch stuff that slips through TFD/H. Note: all pages, regardless of namespace, should be listed. —Train2104 (talk·contribs·count·email)00:25, 26 August 2010 (UTC)
I just quickly peeked at this yesterday but I was having an issue mapping this out from my mind to the database, so I left it for another day. There are a few points I wanted to make while I'm here, though:
Generally I favor "red-linked" instead of "deleted or nonexistant" [sic!]; this is mostly because a template can be red-linked due to never having been created or due to having been deleted (and this can be an important distinction in the context of some reports); the generic term covers both cases fairly well, in my view;
There are some oddities in some (complex or "esoteric") templates that cause them to transclude non-existent templates; this is mostly due to poor coding from what I've seen, but it doesn't change the fact that there are glaring red links in reports like Templates transcluded on the most pages (configuration) currently (e.g., Template:Pubchemcite is listed as having over 4,000 transclusions but has never existed, as far as I can see);
When you talk about "templates", the reality is that a template can exist in nearly any namespace (with a few exceptions); {{User:AbusiveAdmin/Racist userbox}} is technically a template just like {{!}}; I'm going to assume you're only talking about transclusions of pages of in the Template namespace; if that's not the case, please let me know;
[Related to my second point] I'm not sure you realize the volume of such a report; it will likely be thousands upon thousands of results, which leaves two options: (1) limit output to 1,000 results (truncate), or (2) paginate; I'm generally hesitant to paginate if it will be more than a few pages as I don't see that providing much value and it just adds needless clutter; Articles containing red-linked files (configuration) is a good example of this phenomenon.
Some of the data is just bad. The templatelinks table is referencing non-existent page IDs. I imagine some old bug didn't delete the rows in the templatelinks table appropriately when a page was deleted via the user interface. That seems like a reasonable explanation for the high count for a template like Template:Pending deletion. It may be possible to account for these bad rows in future queries.
Transclusions outside template namespace should be fairly few, excepting userboxen, and there's no reason they shouldn't be cleaned up (redirect to {{Tl:User box deleted}} maybe?). It would reduce the report size if you could ignore transclusions from TfD. RichFarmbrough, 15:20, 31 August 2010 (UTC).
Medical articles frequently edited by non-autoconfirmed users
Hello. At Wikipedia_talk:Invitation_to_edit we're looking for the medical articles that are most frequently edited by non-autoconfirmed users (to trial a "how to edit a medical article" tutorial). Would it be possible to generate such a list? The top 40 should be sufficient. Anthony (talk) 17:57, 29 August 2010 (UTC)
Well, this is fairly complicated:
What constitutes a "medical article"? Pages whose talk pages are tagged a particular way? Pages in a particular category? Pages from a particular list?
Auto-confirmed status is implicit, not explicit, and it has changed over time. It used to be only four days; now it's four days + 10 edits. You could roughly approximate what the auto-confirmed standards of today are when looking at the database (at this specific moment in time), but you can't be exact. That is, you can say "if these people tried to edit today, we could say with a high degree of confidence that they wouldn't be able to on a semi-protected article due to their current edit count and registration date." But when you look at page histories, the stored edit count will remain current and the registration date will remain constant. So when you evaluate older edits, you won't know if the user was auto-confirmable then without manually calculating it for each edit. Other odd extensions like TorBlock completely change the auto-confirmed settings under certain conditions.
If you wanted to limit the report to something more easily definable (like all pages in Category:X sorted by number of IP edits), that would be much simpler. --MZMcBride (talk) 18:07, 29 August 2010 (UTC)
I'm back! Would it be possible to find the 100 articles with the greatest number of unique IP edits in the last year from the C, B, GA and FA class med articles here? Anthony (talk) 19:19, 29 August 2010 (UTC)
#! /usr/bin/env pythonimportMySQLdbcategories=['FA-Class_medicine_articles','FL-Class_medicine_articles','GA-Class_medicine_articles','B-Class_medicine_articles','C-Class_medicine_articles']conn=MySQLdb.connect(host='sql-s1',db='enwiki_p',read_default_file='~/.my.cnf')cursor1=conn.cursor()cursor2=conn.cursor()forcategoryincategories:cursor1.execute('''SELECT page_title FROM page JOIN categorylinks ON cl_from = page_id WHERE page_namespace = 1 AND cl_to = %s;''',category)print'\n== [[:Category:%s]] =='%categoryprint'{| class="wikitable sortable"\n! Page\n! Unique IP edits\n|-'forrowincursor1.fetchall():cursor2.execute('''SELECT page_title, COUNT(DISTINCT rev_user_text) FROM page JOIN revision ON page_id = rev_page WHERE page_namespace = 0 AND page_title = %s AND rev_user = 0 AND rev_timestamp > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1 YEAR),'%%Y%%m%%d%%H%%i%%s') GROUP BY rev_page;''',row[0])forresultincursor2.fetchall():print'| [['+result[0]+']]\n'+'| '+str(result[1])+'\n|-'print'|}'cursor1.close()cursor2.close()conn.close()
Non-free images with multiple files in the history?
Hey, would it be possible to get a report of all the files tagged as non-free which have multiple files in the history? For an example of what I mean, see File:Manchester Mark2.jpg- the image was reduced, but the old version is still visible (and will be until deleted). The old versions fail the NFCC (as they are not in use) and so some cleanup is required in every case. J Milburn (talk) 00:02, 19 September 2010 (UTC)
Well, it says "0 12 * * 4" on the configuration page. That'd be 0 minutes past hour 12 on the fourth day of the week. So... maybe on Thursday? I'm not really sure why it's been over a week since it updated. I'll try to remember to peek at this later today. --MZMcBride (talk) 16:23, 28 September 2010 (UTC)
I poked at this a bit. I verified that the crontab entry is "0 12 * * 4 python ~/scripts/database-reports/coordoverlap.py > /dev/null" and that the script works properly. (I just ran it manually, so the report is now updated.) I think the report should update again this week. Please let me know if you continue to have issues. It's comparing two large sets ({{Coord/display/inline,title}} has 202,431 transclusions while {{Coord/display/title}} has 349,287 transclusions), but it shouldn't take more than a few minutes to execute and finish. I don't have any e-mail from cron indicating an issue from the script, so I guess we'll just have to wait and see if this continues to be a problem. --MZMcBride (talk) 02:39, 29 September 2010 (UTC)
Would it be possible to have a (one-off) report of all the uncategorised templates of the form "Template:Location map xxx"? Every time the uncategorised templates report is updated, a few are listed in the report, but I suspect there are others that aren't listed (noting that the uncategorized templates report is not a complete list), so I would like to have a complete list of all the uncategorised location map templates, so I can address them all in one go. If there's an easy way I can obtain this information myself, please advise. Thanks. DH85868993 (talk) 04:07, 9 August 2012 (UTC)
I would like a list of discussion pages of WikiProjects (so I'm talking about pages called "Wikipedia talk:WikiProject Whatever") ranked by the frequency with which they are edited---in effect the most active WikiProjects listed first. Can that be done? Michael Hardy (talk) 03:46, 24 October 2012 (UTC)
There are a couple of relevant reports already being produced. Do either of:
I've spoken with Michael Hardy previously at the WikiProject Council page and recommended he come here. In trying to measure a project's activity or "liveliness", I can see the limits of the two reports you've listed. The current report on how many people are watching gives an impression that some projects are busy when in reality there has been little discussion in years (the dead WikiProject Contents tops the list). The other list provides the number of edits for articles under a project's scope, not necessarily edits made by the project's members. I think Michael Hardy is looking to see how many times the project's talk page is edited in a particular timeframe (maybe six months or a year) to gauge which projects are active places for discussion. –Mabeenot (talk) 15:44, 26 October 2012 (UTC)
Okay, initial report at User:Topbanana/WPtalk. It seems to be a good measure of activity for some projects, and an awful one for others. I suspect the parameters need tweaked to make it more useful before we make it a regular thing. - TB (talk) 17:25, 26 October 2012 (UTC)
I'm trying to improve Wikipedia by removing article categories from user pages per WP:USERNOCAT, and have found the Polluted categories report to be very helpful in identifying categories to fix. Although the Database reports list states that this report is updated weekly, it appears that it hasn't been updated since August. Could some kind developer please look to see how this report can be put back on schedule? Thanks! GoingBatty (talk) 02:17, 4 November 2012 (UTC)
All fixed up now. I rewrote the report, which fixed a few outstanding issues: the output limit is now 1000 instead of 250; the pipe trick has been killed; there are now some helper links for users to see which pages are polluting the category; other various fixes. Thanks for a clear and discrete problem to solve! ;-) --MZMcBride (talk) 01:19, 5 November 2012 (UTC)