You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Archive/2015 data warehouse experiments/2015-01-14 verifications

From Wikitech-static
Jump to navigation Jump to search

Using a new script for this, which is actively being updated. However, for the investigation below, I'll link to the specific queries used.

Edit Fact and Page Dimension

Checking records that should have made it from the revision and archive tables into the warehouse (line 8 in the script):

+--------+---------------------+--------------------+----------------------+
| wiki   | date_with_bad_match | value_in_warehouse | value_in_original_db |
+--------+---------------------+--------------------+----------------------+
| dewiki | 20140829            |              25278 |                25282 |
| dewiki | 20140830            |              24311 |                24312 |
| dewiki | 20140831            |              28807 |                28811 |
| dewiki | 20140901            |              30938 |                30941 |
| dewiki | 20140902            |              35705 |                35708 |
| dewiki | 20140903            |              40823 |                40825 |
| dewiki | 20140905            |              24393 |                24394 |
| dewiki | 20140906            |              23405 |                23412 |
| dewiki | 20140908            |              27503 |                27506 |
| dewiki | 20140910            |              29123 |                29125 |
| dewiki | 20140911            |              28001 |                28005 |
| dewiki | 20140912            |              30820 |                30823 |
| dewiki | 20140914            |              30595 |                30597 |
| dewiki | 20140915            |              28672 |                28675 |
| dewiki | 20140916            |              26216 |                26220 |
| dewiki | 20140917            |              27513 |                27516 |
| dewiki | 20140919            |              27570 |                27571 |
| dewiki | 20140920            |              24764 |                24779 |
| dewiki | 20140921            |              29747 |                29753 |
| dewiki | 20140922            |              29125 |                29128 |
| dewiki | 20140923            |              26579 |                26580 |
| dewiki | 20140924            |              27716 |                27724 |
| dewiki | 20140925            |              27318 |                27331 |
| dewiki | 20140926            |              26074 |                26108 |
| dewiki | 20140927            |              25270 |                25275 |
| dewiki | 20140928            |              27093 |                27097 |
| dewiki | 20140929            |              28185 |                28188 |
| dewiki | 20140930            |              26761 |                26764 |
| dewiki | 20141001            |                 13 |                   14 |  <- should be easiest to look at
| enwiki | 20140924            |             129260 |               129264 |
| enwiki | 20140927            |             121163 |               121165 |
+--------+---------------------+--------------------+----------------------+

The @from and @to parameters in the query were 20140829000000 and 20141001000100. In enwiki, most days match except September 24 and 27. In dewiki, all days do not match. Two additional problems were found. The page table has archive = 0 for all records, but it should be 1 for records coming from the archive table. The investigating query was formulated to account for this but the problem was ignored as it is not partial but total. The other minor problem was that the page table does not seem to have an index on page_id or the archive column. I believe something along those lines would be needed to speed up the join with the edit table. As it is, it makes the query used above take minutes instead of seconds without the join.

User Dimension

The bot flag was correctly set, as proved by the first four queries here. However, the last two queries in that script checked the log_action was imported correctly, and they found problems with a small minority of users in dewiki (warehouse has 702187 dewiki users) and enwiki (warehouse has 7660869 enwiki users):

+--------------+
| wrong_action |  <-- dewiki
+--------------+
|     7501     |
+--------------+

+--------------+
| wrong_action |  <-- enwiki
+--------------+
|     ????     |  <-- query crashed :(
+--------------+