This is an archive of past discussions on Wikipedia:Database reports. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Well, fuck. Seems you're right, if only barely. The rules surrounding hyphenation seem to be entirely arbitrary and made-up for the most part, but there seems to be (somewhat) wide agreement that adverbs ending in -ly are a special case. I'll fix the page titles in the next few days (it requires updating the scripts, moving some subpages, and adjusting some redirects). Cheers. --MZMcBride (talk) 04:21, 24 September 2009 (UTC)
This is for project categories that should not be deleted even when empty, e.g. categories populated by templates. It's a problem if such categories are deleted when they shouldn't be. PC78 (talk) 22:34, 30 September 2009 (UTC)
Eep. Thanks for the bump. With these two edits, I think your issue is resolved. The next update should reflect these changes. (I hope I didn't break anything....) --MZMcBride (talk) 13:37, 27 October 2009 (UTC)
I guess that does it, thanks! You should also exclude categories containing "-priority" from the report as well. PC78 (talk) 11:50, 28 October 2009 (UTC)
I don't see anything wrong with deleting the "uncategorised xyz" categories. Once they are empty, it would be just as much work to simply add the proper categories as the "uncategorised" category; there's no reason these should ever be populated again. VegaDark (talk) 00:01, 5 November 2009 (UTC)
Uncategorised articles are continually sorted into those categories; they shouldn't be deleted when empty. PC78 (talk) 00:43, 5 November 2009 (UTC)
Could I ask for a couple of minor tweaks, please? Now that we've got the number of articles down to a manageable size, could we drop the lower limit on the report from 10 to ... well, something less than 10 that won't leave us with tens of thousands of articles in the list. How many would be returned if we dropped the lower limit to 8? Or even 6? Also, could the list be sorted descending by default so that the most obvious violations are at the top? Many thanks, Black Kite 06:29, 1 October 2009 (UTC)
Shall we start with 6 as the limit? That would seem to bridge the gap between articles which are probably infringing the limit, and a list that's too long. Cheers, Black Kite 22:27, 3 October 2009 (UTC)
Weird; at least on one day there were a lot of these. I cleaned a lot of them up, but it's weird... MediaWiki didn't even indicate that the redirects existed when I went to the image page. –Drilnoth (T • C • L) 17:16, 5 October 2009 (UTC)
Yes, there's some sort of bug with image redirects not showing the "redirected from" text when using a foreign repo (or something). --MZMcBride (talk) 20:21, 5 October 2009 (UTC)
I think a report of all fully protected templates that aren't redirects and don't use {{Documentation}} would be useful. Even if they don't actually have any documentation, transcluding {{Documentation}} will allow all registered users to create it. I think once a month is probably enough for this one. עוד מישהו Od Mishehu 10:38, 14 October 2009 (UTC)
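A minimal sketch of what such a query might look like, assuming the standard page / page_restrictions / templatelinks tables and that full protection appears as pr_type = 'edit' with pr_level = 'sysop'; this only illustrates the conditions above and is not a tested report:

-- Sketch only: fully protected, non-redirect templates with no {{Documentation}} transclusion.
SELECT page_title
FROM page
JOIN page_restrictions ON pr_page = page_id
WHERE page_namespace = 10                -- Template:
  AND page_is_redirect = 0
  AND pr_type = 'edit'
  AND pr_level = 'sysop'
  AND NOT EXISTS (SELECT 1
                  FROM templatelinks
                  WHERE tl_from = page_id
                    AND tl_namespace = 10
                    AND tl_title = 'Documentation');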
Xeno said I should ask here...
The other day I was in a particularly impish mood, and thought it would be nice to revisit some old articles that haven't seen the light of day for a while... Go back, and check older articles from when ref, citation, etc. policies weren't as heavily enforced... Is there any way to get a list of articles that were created a long time ago, and have been untouched since (e.g. created more than 5 years ago, and have had less than a dozen edits in their lifetime, and/or haven't had any new edits in the last 2 years)? If so, I'm sure that number is in the thousands, so is there any way to then break said list down into separate lists by year or month/year? - Adolphus79 (talk) 02:00, 26 October 2009 (UTC)
Actually, the bot part is the only piece that seemed infeasible to me. Doing a MIN(rev_timestamp) for the old articles and doing a COUNT(*) for the number of revisions both seemed easy enough. I think I have something similar that I wrote for BLPs for east somewhere.... --MZMcBride (talk) 22:24, 27 October 2009 (UTC)
Bot edits wouldn't matter to me... if one of those dozen edits was by a bot, but the article has not been edited at all in the last 2 years, then the bot edits could very well be outdated... - Adolphus79 (talk) 04:43, 28 October 2009 (UTC)
(outdent) But it would skew the data: if the article hadn't been edited for 2 years and then SmackBot came along and fixed it, that would make the article look less "stale." Tim1357 (talk) 22:31, 28 October 2009 (UTC)
That's a good start... thanks... With WP:DUSTY, and as far as the SmackBot concern goes, what if we take off the recent-edits variable and look at just the articles created >5 years ago and with <15 edits? - Adolphus79 (talk) 00:23, 29 October 2009 (UTC)
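For what it's worth, a rough sketch of that refined query against the page and revision tables; the timestamp cutoff is an illustrative "five years before now" value, and an aggregate like this would be expensive to run against the live revision table:

-- Sketch only: articles created more than ~5 years ago with fewer than 15 revisions.
SELECT page_title,
       MIN(rev_timestamp) AS created,
       COUNT(*)           AS revisions
FROM page
JOIN revision ON rev_page = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
GROUP BY page_id
HAVING MIN(rev_timestamp) < '20041029000000'   -- illustrative 5-year cutoff
   AND COUNT(*) < 15;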
The query that I was running to generate this report was taking too long. I'm going to try another (hopefully smarter) query. If that doesn't work, you'll have to try a database dump. --MZMcBride (talk) 08:43, 15 November 2009 (UTC)
Indefinitely protected templates without many transclusions
To catch stale or unwarranted high-risk-template protections. Could someone run a query to see what it would look like? Thanks. -- zzuuzz (talk) 20:40, 3 November 2009 (UTC)
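A minimal sketch of one way to phrase it, assuming indefinite full protection is stored as pr_expiry = 'infinity' and picking an arbitrary threshold of 50 transclusions for "not many":

-- Sketch only: indefinitely fully-protected templates with few transclusions.
SELECT page_title,
       COUNT(tl_from) AS transclusions
FROM page
JOIN page_restrictions ON pr_page = page_id
LEFT JOIN templatelinks ON tl_namespace = 10 AND tl_title = page_title
WHERE page_namespace = 10
  AND pr_type = 'edit'
  AND pr_level = 'sysop'
  AND pr_expiry = 'infinity'
GROUP BY page_id
HAVING COUNT(tl_from) < 50               -- arbitrary "not many" threshold
ORDER BY transclusions;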
Normally, biographies of living persons should be directly in Category:Living people (not through a template), so that filter 189 works, for example. But some are not directly in it, approximately 1000 based on this search. I wonder if it is possible to retrieve those. Thanks, Cenarium (talk) 23:42, 8 November 2009 (UTC)
Hmmmm, if you had a list of all pages directly using the text string "category:living people" and a list of all pages in the category, it should be simple enough to do a comparison of the two, right? This involves database dumps, which there is now very limited support for (haven't announced it here yet). I'll see what I can do. I can definitely say it won't be automated anytime soon, but a one-time report for now should be simple enough. --MZMcBride (talk) 23:46, 8 November 2009 (UTC)
A template which still transcluded it has been modified to no longer do so. Actually, the difference between the counts may be entirely due to the delay in the indexation. Cenarium (talk) 04:57, 14 November 2009 (UTC)
Well, the category description page and the PAGESINCATEGORY magic word give you the stored value of category.cat_pages, which is often wrong for large categories. For example, compare:
mysql> SELECT * FROM category WHERE cat_title = 'Living_people';
+--------+---------------+-----------+-------------+-----------+------------+
| cat_id | cat_title     | cat_pages | cat_subcats | cat_files | cat_hidden |
+--------+---------------+-----------+-------------+-----------+------------+
|    173 | Living_people |    416368 |           3 |        12 |          0 |
+--------+---------------+-----------+-------------+-----------+------------+
1 row in set (0.00 sec)
mysql> SELECT COUNT(*) FROM page JOIN categorylinks ON page_id = cl_from WHERE cl_to = 'Living_people';
+----------+
| COUNT(*) |
+----------+
|   416396 |
+----------+
1 row in set (2.18 sec)
416368 vs. 416396. That's not too bad compared to some other categories, but it likely helps explain some of the discrepancy you notice. I don't know whether the search index number is accurate (or precise), so that's another factor that could be altering your perception of what's right. --MZMcBride (talk) 19:56, 14 November 2009 (UTC)
Uncategorized non-redirect pages in Wikipedia namespace orphaned in Wikipedia namespace with a single contributor
Such pages are often nonsense, test pages, abandoned pages, or pages never used for anything meaningful. By orphaned in the Wikipedia namespace, I mean with no link from that namespace, though there may still be links from other namespaces (this would catch, for example, forgotten AfDs linked from an article), if possible. This is along the lines of Orphaned article deletion discussions but larger in scope and with more restrictions. I've looked at recent page creations, and I don't think it'll include any big set of pages (spam reports, for example, are categorized), but if it does, we'd need to refine the conditions to get a usable report. Cenarium (talk) 04:57, 14 November 2009 (UTC)
Thanks, I've looked into the sample; a report seems worth it. I had forgotten to mention it shouldn't be transcluded into Wikipedia namespace either (it would exclude 'account suspensions', and many self-closed xfds). Cenarium (talk) 17:47, 15 November 2009 (UTC)
Needs a better (shorter) title, preferably something incorporating "single-author." And I'm not sure what you mean about Wikipedia transclusions. I'll need an example. --MZMcBride (talk) 05:15, 20 November 2009 (UTC)
The page should not be transcluded in another project page (in addition to not being linked from other project pages), because otherwise it would include many things like the 'account suspensions' above, all transcluded in Wikipedia:Account suspensions, as well as XfDs closed by the nominator before other editors could comment, and similar. But it may still be transcluded in other namespaces (typically, user). As for the name, it could be 'orphan single-author project pages'; it leaves out some of the specifics but gives the general idea. Cenarium (talk) 00:00, 21 November 2009 (UTC)
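A minimal sketch of those conditions against the usual tables (page, pagelinks, templatelinks, categorylinks, revision); the single-contributor test here counts distinct rev_user_text values, which is only an approximation, and the whole thing is illustrative rather than an optimized report query:

-- Sketch only: uncategorized, non-redirect Wikipedia: pages with no incoming links
-- or transclusions from the Wikipedia namespace and only one contributor.
SELECT p.page_title
FROM page p
WHERE p.page_namespace = 4               -- Wikipedia:
  AND p.page_is_redirect = 0
  AND NOT EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = p.page_id)
  AND NOT EXISTS (SELECT 1 FROM pagelinks pl
                  JOIN page lp ON lp.page_id = pl.pl_from
                  WHERE pl.pl_namespace = 4 AND pl.pl_title = p.page_title
                    AND lp.page_namespace = 4)
  AND NOT EXISTS (SELECT 1 FROM templatelinks tl
                  JOIN page tp ON tp.page_id = tl.tl_from
                  WHERE tl.tl_namespace = 4 AND tl.tl_title = p.page_title
                    AND tp.page_namespace = 4)
  AND (SELECT COUNT(DISTINCT rev_user_text) FROM revision WHERE rev_page = p.page_id) = 1;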
15,000 results (which is why the script kept failing when it tried to output to one page). Would you like paginated results or just the first 1,000? --MZMcBride (talk) 20:05, 27 November 2009 (UTC)
Additional note: I should mention that the query logic is a bit confusing for this, so while I'm reasonably sure this is the data you want, I'm not positive. Please be sure to double-check the results and let me know if there are issues. --MZMcBride (talk) 22:33, 28 November 2009 (UTC)
I expect it could be useful. I've nominated Template:Uncategorized template because it's really not practical, and database reports would be more usable and efficient. I don't think we need to impose restrictions, unless there are really too many, except that it should probably exclude templates with / in the title, like the report for unused templates. A monthly or biweekly report should be sufficient. Cenarium (talk) 18:46, 17 December 2009 (UTC)
There are a lot of uncategorized templates. The first 100 get to "Ar" in the alphabet. I'll just make a limited report (first 1000 entries)? --MZMcBride (talk) 19:31, 17 December 2009 (UTC)
Yes, I've been contemplating a "Potential biographies of dead people (X)." Can you do me a favor and list a few sample bios so I know whether the query I write is returning good data? --MZMcBride (talk) 01:35, 23 December 2009 (UTC)
Excluded those four words (plus "quintuplets"). I modified the schedule for the dead people lists to run twice a week instead of daily. --MZMcBride (talk) 06:37, 30 December 2009 (UTC)
Superimposed title coordinates
I'd be interested in seeing a list of articles that have multiple coordinates in the title area. An example would be this rev which gets one set of title coordinates from {{Infobox Place Ireland}} and another from {{Coord|...}}. This fairly common problem often results in illegible coordinates, and I don't know any systematic way to find instances of it. One approach might be to search for pages that transclude {{Coord/display/title}} more than once.
The immediate issue here is that multiple instances of a template are only recorded once. For example, if I have a page like this, the templatelinks look like this:
mysql> SELECT templatelinks.* FROM templatelinks JOIN page ON page_id = tl_from WHERE page_namespace = 2 AND page_title = 'MZMcBride/Sandbox_8';
+----------+--------------+----------------+
| tl_from  | tl_namespace | tl_title       |
+----------+--------------+----------------+
| 14871616 |           10 | Infobox        |
| 14871616 |           10 | Infobox/row    |
| 14871616 |           10 | Infobox_person |
+----------+--------------+----------------+
3 rows in set (0.00 sec)
As you can see, each template is only referenced to the page_id once, so if {{coord}} is used once or 200 times on a particular page, there's no easy way to count the transclusions. I'll keep looking for a way to do what you want to do, though really it would probably be best if all "auto-coords" in infoboxes and other similar templates were simply outright banned. --MZMcBride (talk) 03:30, 27 December 2009 (UTC)
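Since templatelinks can't count repeats, the obvious fallback is scanning page text from a dump or the API. A rough Python sketch of that check, assuming the wikitext is already in hand; the pattern only catches literal {{Coord|...|display=title}} calls, not coordinates emitted by infoboxes, so it undercounts:

import re

# Sketch only: count direct {{Coord|...|display=...title...}} calls in one page's wikitext.
# Coordinates generated inside infobox templates never appear in the raw wikitext,
# so this is a starting point rather than the full report.
TITLE_COORD = re.compile(
    r'\{\{\s*coord\s*\|[^{}]*display\s*=\s*[^{}|]*title[^{}]*\}\}',
    re.IGNORECASE,
)

def count_title_coords(wikitext):
    return len(TITLE_COORD.findall(wikitext))

# A page is a candidate if it has two or more such calls, or one such call
# plus an infobox known to emit title coordinates.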
That's a helpful list, though it has a lot of false positives. I'll study it and see if I get any ideas for how to reduce the F+ rate. As for Infoboxes that generate Coords, I happen to know of some good reasons why they do, and I don't expect we'll see the ban you desire any time soon. --Stepheng3 (talk) 20:31, 27 December 2009 (UTC)
What I was thinking were false positives are actually cases where both {{Coords}} generate identical text, so legibility is unaffected. From a maintenance standpoint, these transclusions ought to be merged, so the list is actually quite useful. Would you be so kind as to make this a regular report -- monthly perhaps? --Stepheng3 (talk) 20:41, 27 December 2009 (UTC)
A report that shows templates where the start of the name of the template is the same as the name of the editor who initially created it. For example, if I had created Template:WOSlinker userboxes then it would be included in the list. These could then be moved to userspace or renamed. -- WOSlinker (talk) 19:14, 29 December 2009 (UTC)
This sounds like a bitch to implement. Personally, I've never seen anyone do this; is there a sample page somewhere I can test with? --MZMcBride (talk) 21:15, 29 December 2009 (UTC)
Yeah, it's a bit more complicated than that. page.page_title stores spaces as underscores while user.user_name stores spaces as spaces. So you've got to account for that. You also originally requested that the report only look at page creators. Your query doesn't account for only focusing on page creators. (In addition, "rev_user_text+'%'" isn't valid, as far as I know, at least not in MySQL.) All of your examples also seem to focus on a "Template:Username/..." pattern, which I imagine is most common, though I bet there are some without slashes. I gave this some thought today and I'm fairly sure I know how to implement it. It'll just take another day or so. xeno: as long as you leave me a test case or two, you're free to clean all you like. --MZMcBride (talk) 22:18, 30 December 2009 (UTC)
I know that my query doesn't restrict it to just page creators. It's probably not too vital to restrict it to just the page creator if that helps to simplify the query. Thanks. -- WOSlinker (talk) 22:23, 30 December 2009 (UTC)
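A rough sketch of one way to write it, comparing against the first revision's rev_user_text and using REPLACE() to bridge the space/underscore mismatch mentioned above; note that '_' is itself a LIKE wildcard, so this prefix match is slightly looser than exact, and the correlated subquery would be slow against the full revision table:

-- Sketch only: templates whose title begins with the username of their first editor.
SELECT p.page_title,
       r.rev_user_text AS creator
FROM page p
JOIN revision r ON r.rev_page = p.page_id
WHERE p.page_namespace = 10
  AND r.rev_timestamp = (SELECT MIN(rev_timestamp)
                         FROM revision
                         WHERE rev_page = p.page_id)   -- first revision = page creator
  AND p.page_title LIKE CONCAT(REPLACE(r.rev_user_text, ' ', '_'), '%');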
Toolserver copy of English Wikipedia's database is corrupt
The Toolserver's copy of the English Wikipedia has gone corrupt (announcement here). This may cause quirky reports until this issue is resolved. --MZMcBride (talk) 03:08, 31 December 2009 (UTC)
User reports
A couple of reports that could be useful for tidying up userspace items are:
Redirects of Template: pages to User: pages
Transclusions of User: pages into articles or Template: pages.
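A minimal sketch of the two queries, assuming the redirect and templatelinks tables; namespace numbers 0, 2 and 10 are main, User: and Template: respectively:

-- Sketch only, first report: Template: pages that redirect into userspace.
SELECT p.page_title,
       rd_title AS target
FROM page p
JOIN redirect ON rd_from = p.page_id
WHERE p.page_namespace = 10
  AND rd_namespace = 2;

-- Sketch only, second report: User: pages transcluded in articles or templates.
SELECT DISTINCT tl_title AS user_page
FROM templatelinks
JOIN page host ON host.page_id = tl_from
WHERE tl_namespace = 2
  AND host.page_namespace IN (0, 10);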
For the first, "User template redirects"? For the second, "Cross-pollinated user templates"? "Transplanted user templates"? "Anachronistic user templates"? --MZMcBride (talk) 16:44, 16 January 2010 (UTC)
-- Where everything's been fixed and hasn't broken since and WildBot's the only contributor
explain select page_title, page_namespace
from page, revision rev
where page_id = rev.rev_page
  and rev_user = 11187190
  and rev_comment = "No ambiguous links left"                          -- page is fixed
  and not exists (select * from revision inner1
                  where inner1.rev_user = 11187190
                    and inner1.rev_page = page_id
                    and inner1.rev_timestamp > rev.rev_timestamp)      -- no subsequent edits
  and not exists (select * from revision inner2
                  where inner2.rev_user <> 11187190
                    and inner2.rev_page = page_id)                     -- no other contributors
order by page_namespace, page_title;
I had an idea for tracking down vandalism/spam/unencyclopedic content. It would be great to have a DB report that searches for the "!" symbol in the mainspace.
Conditions
All occurrences within 'single quotes', "double quotes", italics and bold are excluded.
Any ! symbol within [single brackets] that begin with a url after the [ is excluded. e.g.: [http://example.com Example Gaming! official site]
Anything inside <ref></ref> is excluded.
Certain articles should be added to a whitelist: e.g., Punctuation (I can compile a list).
I know, this will likely be turned down as too broad of a search. And we'll definitely get false positives. *shrug* I don't know much about db reports, but at least it's worth proposing – it would be really helpful if we could have something to detect those pesky "!" unencyclopedic comments. JamieS93 14:56, 7 February 2010 (UTC)
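A rough Python sketch of the filtering described above, working on one article's raw wikitext at a time; the exclusion rules are implemented only approximately (nested markup isn't handled), so false positives and negatives are expected, and the whitelist entry is just illustrative:

import re

# Sketch only: count '!' characters that survive the exclusions listed above.
EXCLUDE_PATTERNS = [
    r'<ref[^>]*>.*?</ref>',          # anything inside <ref>...</ref>
    r"'''''.*?'''''",                # bold italics
    r"'''.*?'''",                    # bold
    r"''.*?''",                      # italics
    r'".*?"',                        # "double quotes"
    r"'[^']*'",                      # 'single quotes' (very rough)
    r'\[(?:https?|ftp)://[^\]]*\]',  # external links like [http://example.com Example!]
]

WHITELIST = {'Punctuation'}          # illustrative; the real list would be hand-compiled

def count_suspicious_exclamations(title, wikitext):
    if title in WHITELIST:
        return 0
    cleaned = wikitext
    for pattern in EXCLUDE_PATTERNS:
        cleaned = re.sub(pattern, '', cleaned, flags=re.DOTALL)
    return cleaned.count('!')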
I hadn't thought of that, PC78. That does throw a wrench into things. I'm not sure how these reports are gathered, maybe the bot could not collect pages that have a === Fair use in [[ARTICLE NAME]] === header? Or something like that. All the non-template examples given on WP:FURG and WP:FURE have that, so that would probably weed out the majority of files that use non-template fair use rationales. I'm not even sure how large of a problem it would be if the files with non-template fair use rationales were included... Do many people use those? Killiondude (talk) 06:10, 1 December 2009 (UTC)
I've been told that dealing with pagetext is not the best way to go about doing things. Oh well. It would have been a great idea had we only used templated fair use rationales. Killiondude (talk) 00:32, 4 December 2009 (UTC)
Started working on this. Out of the first 500 files checked (that met the template requirements given above), 49 didn't contain "fair use" in their wikitext. Results: http://en.wikipedia.org/w/index.php?oldid=334828581 There are some other words / phrases / expressions that could be checked to make this list have fewer false positives. Please suggest them and I'll try to get this finished before the new year. Cheers! --MZMcBride (talk) 01:34, 30 December 2009 (UTC)
# Regular expressions whose presence in a file's wikitext suggests a fair use rationale.
fair_use_strings = [
    r'=.*(fair[ -]?use|non[ -]?free|rationale).*=',
    r'rationale for the fair use',
    r'qualifies as fair use',
    r'fair use in \[\[',
    r'\'\'\'fair use rationale\'\'\'',
    r'the doctrine of fair use',
    r'the purpose of this image',
    r'this low quality image',
    r'use of this image will not decrease',
    r'conforms with the requirements',
    r'is a low resolution screenshot',
    r'is a low resolution of the original',
    r'used here for purely encyclopedic and informational purposes',
    r'use of this low-resolution version',
    r'does not in any way limit the ability of the copyright',
    r'rationale for use on',
    r'image is suitable for fair use on',
]
File:1251MZMS.jpg "used under a claim of fair user" or some smaller string.
File:Admin logo.gif is this in here because of Category:Files copyrighted by the Wikimedia Foundation? Dunno what to do about this one, really. I think we're able to use Wikipedia/Wikimedia logos without a fair use rationale... if they were strictly fair use, we couldn't use them in the userspace. Perhaps exclude that category from this search?
I need to sleep now. But I noticed that many of the files from the first few hundred now had fair use rationales. Perhaps running this report again (maybe filtering with a few of the strings I listed here?) would cut back on the false positives or whatever. Thanks for your hard work. Killiondude (talk) 09:14, 26 January 2010 (UTC)
I hope you enjoyed your rest! I've updated the fair use strings list and bumped the limit on the report. I'm going to archive this thread if there's nothing further. --MZMcBride (talk) 22:53, 21 April 2010 (UTC)
@MZ: Is the duration of the sound file stored anywhere in the API or whatever you use? It looks like it says in two places on that example file the duration (in seconds) of it. If so, you could query for non-free sound files that have a really long (over 30 seconds or whatever) duration. Killiondude (talk) 00:35, 6 January 2010 (UTC)
At least in the database, it's stored as a blob:
mysql> SELECT * FROM image WHERE img_name = 'NobodyLovesMeLikeYou.ogg'\G
*************************** 1. row ***************************
img_name: NobodyLovesMeLikeYou.ogg
img_size: 333652
img_width: 0
img_height: 0
img_metadata: a:4:{s:7:"version";i:2;s:7:"streams";a:1:{i:435487386;a:8:{s:6:"serial";i:435487386;s:5:"group";i:0;s:4:"type";s:6:"Vorbis";s:6:"vendor";s:29:"Xiph.Org libVorbis I 20020717";s:6:"length";d:19.3836961451247162813160684891045093536376953125;s:4:"size";i:333652;s:6:"header";a:9:{s:14:"vorbis_version";i:0;s:14:"audio_channels";i:2;s:17:"audio_sample_rate";i:44100;s:15:"bitrate_maximum";i:0;s:15:"bitrate_nominal";i:160003;s:15:"bitrate_minimum";i:0;s:11:"blocksize_0";i:8;s:11:"blocksize_1";i:11;s:12:"framing_flag";i:0;}s:8:"comments";a:0:{}}}s:6:"length";d:19.3836961451247162813160684891045093536376953125;s:6:"offset";N;}
img_bits: 0
img_media_type: AUDIO
img_major_mime: application
img_minor_mime: ogg
img_description: fair use clip from The Flamingos' "Nobody Loves Me Like You"
img_user: 8351
img_user_text: TUF-KAT
img_timestamp: 20040211052420
img_sha1: g9tujggvyekmnhrajf7m1zu2twmk3nq
1 row in set (0.00 sec)
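The duration is inside that PHP-serialized img_metadata blob (the s:6:"length";d:... field), so a report could extract it with a bit of text matching rather than a full PHP unserializer. A rough Python sketch, assuming the Ogg metadata format shown above:

import re

# Sketch only: pull the duration in seconds out of the PHP-serialized img_metadata blob.
# Ogg audio metadata stores it as ...s:6:"length";d:<seconds>;...
LENGTH_RE = re.compile(r's:6:"length";d:([0-9.]+);')

def duration_seconds(img_metadata):
    match = LENGTH_RE.search(img_metadata)
    return float(match.group(1)) if match else None

# For the example row above this returns roughly 19.38; a report could then flag
# non-free sound files whose duration exceeds, say, 30 seconds.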
I don't see much utility in such a report, though that won't really stop me from generating it. I've put some sample results below. The minimum number of revisions is 1, and by "articles" I assumed you meant non-redirects in the article namespace. Due to the way the page and revision tables are structured, the list is the first 100 articles with 1 revision in alphabetical order.
If you still want a report, please choose an acceptable report title and I'll try to knock this out sometime this week. Cheers. --MZMcBride (talk) 02:56, 11 March 2010 (UTC)
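For reference, a minimal sketch of the single-revision query being discussed, assuming alphabetical order and a 100-row cap as in the sample:

-- Sketch only: non-redirect articles with exactly one revision, first 100 alphabetically.
SELECT page_title
FROM page
JOIN revision ON rev_page = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
GROUP BY page_id
HAVING COUNT(*) = 1
ORDER BY page_title
LIMIT 100;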
While occasionally this may be the result of a discussion being open that long, it's more likely because the nomination wasn't completed, or because it was only partially closed. I think that this should be done weekly. עוד מישהו Od Mishehu 06:58, 15 March 2010 (UTC)
Do you have a report name in mind? The results that I think you want are below. I'm not sure how reliable categorylinks.cl_timestamp is, but it's the only way I know of to generate this data. I also assumed 1 month was 30 days. --MZMcBride (talk) 07:16, 15 March 2010 (UTC)
List of entries looks good, except that I think that pages in Category:Articles for deletion should only be listed if they are in the article namespace. The cl_timestamp appears to be correct - I specifically checked that the time for File:TTC Bessarion - Digital Rendering.PNG looks correct; the time corresponds to the time frame of a template update which moved the Files for deletion pages to their current category. I think that the order should be alphabetical by page name. עוד מישהו Od Mishehu 08:17, 15 March 2010 (UTC)
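A minimal sketch of the refined version (article namespace only, ordered by title), relying on categorylinks.cl_timestamp as discussed and treating a month as 30 days:

-- Sketch only: articles that have been in Category:Articles for deletion for over 30 days.
SELECT page_title,
       cl_timestamp
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE cl_to = 'Articles_for_deletion'
  AND page_namespace = 0
  AND cl_timestamp < DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY page_title;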
Set to weekly and ordered by number of transclusions, descending. Though, it'll still be limited results and due to the way the table is ordered, it'll be only the first X% of the alphabet. --MZMcBride (talk) 18:54, 16 March 2010 (UTC)
Orphaned talk pages
I think a monthly report of orphaned talk pages would be useful. A good definition of an orphaned talk page would be:
A talk page of a non-existent page, which isn't in the User talk: namespace, and which isn't the subpage of an existing talk page on any level (i.e., if there's a page called "Talk:Foo", then a page called "Talk:Foo/bar/1/2/3" shouldn't be listed; if that's impractical, then not containing a slash would be good enough.) עוד מישהו Od Mishehu 17:59, 18 March 2010 (UTC)
This is the same thing you posted about here? It's on the list. MySQL doesn't really have a split string function, so I need to write a Pythonic function. --MZMcBride (talk) 21:15, 21 March 2010 (UTC)
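A rough sketch of what that Pythonic check might look like, with page_exists() standing in as a hypothetical helper (a database or API lookup), not a real library call:

# Sketch only: decide whether a talk page is "orphaned" per the definition above.
# page_exists(namespace, title) is a hypothetical helper, not a real API;
# the User talk: exclusion would be applied when selecting candidate pages.

def is_orphaned_talk(talk_namespace, title, page_exists):
    subject_namespace = talk_namespace - 1       # talk namespaces are subject namespace + 1
    if page_exists(subject_namespace, title):
        return False                             # the subject page exists
    # Walk up the subpage chain: Talk:Foo/bar/1/2/3 stays off the list
    # if Talk:Foo, Talk:Foo/bar, etc. exists at any level.
    parts = title.split('/')
    for i in range(1, len(parts)):
        if page_exists(talk_namespace, '/'.join(parts[:i])):
            return False
    return True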
I don't think there's any policy in place that covers that. Stewards would not act. Maybe arbcom motion? Maybe the account should just be unblocked, if the account becomes compromised they'll do that anyway. –xenotalk16:37, 22 March 2010 (UTC)
E-mail ArbCom. I seem to remember a similar situation arising previously. As far as I know, these accounts always have all rights removed. Though, as xeno notes, the directive has to come from ArbCom. --MZMcBride (talk) 17:39, 22 March 2010 (UTC)