You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/DataWarehouse/DataVerifications/2015-01-14: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Milimetric
No edit summary
 
imported>MarcoAurelio
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
Using a new script for this, which is actively [https://gerrit.wikimedia.org/r/#/c/185239/ being updated].  However, for the investigation below, I'll link to the specific queries used.
#REDIRECT [[Analytics/Archive/2015 data warehouse experiments/2015-01-14 verifications]]
 
== Edit Fact and Page Dimension ==
 
Checking [https://gerrit.wikimedia.org/r/#/c/185239/1/scripts/test-data-loading.sql records that should have made it from the revision and archive tables into the warehouse] (line 8 in the script):
 
<pre>
+--------+---------------------+--------------------+----------------------+
| wiki  | date_with_bad_match | value_in_warehouse | value_in_original_db |
+--------+---------------------+--------------------+----------------------+
| dewiki | 20140829            |              25278 |                25282 |
| dewiki | 20140830            |              24311 |                24312 |
| dewiki | 20140831            |              28807 |                28811 |
| dewiki | 20140901            |              30938 |                30941 |
| dewiki | 20140902            |              35705 |                35708 |
| dewiki | 20140903            |              40823 |                40825 |
| dewiki | 20140905            |              24393 |                24394 |
| dewiki | 20140906            |              23405 |                23412 |
| dewiki | 20140908            |              27503 |                27506 |
| dewiki | 20140910            |              29123 |                29125 |
| dewiki | 20140911            |              28001 |                28005 |
| dewiki | 20140912            |              30820 |                30823 |
| dewiki | 20140914            |              30595 |                30597 |
| dewiki | 20140915            |              28672 |                28675 |
| dewiki | 20140916            |              26216 |                26220 |
| dewiki | 20140917            |              27513 |                27516 |
| dewiki | 20140919            |              27570 |                27571 |
| dewiki | 20140920            |              24764 |                24779 |
| dewiki | 20140921            |              29747 |                29753 |
| dewiki | 20140922            |              29125 |                29128 |
| dewiki | 20140923            |              26579 |                26580 |
| dewiki | 20140924            |              27716 |                27724 |
| dewiki | 20140925            |              27318 |                27331 |
| dewiki | 20140926            |              26074 |                26108 |
| dewiki | 20140927            |              25270 |                25275 |
| dewiki | 20140928            |              27093 |                27097 |
| dewiki | 20140929            |              28185 |                28188 |
| dewiki | 20140930            |              26761 |                26764 |
| dewiki | 20141001            |                13 |                  14 |  <- should be easiest to look at
| enwiki | 20140924            |            129260 |              129264 |
| enwiki | 20140927            |            121163 |              121165 |
+--------+---------------------+--------------------+----------------------+
</pre>
 
The @from and @to parameters in the query were 20140829000000 and 20141001000100.  In enwiki, most days match except September 24 and 27.  In dewiki, all days do not match.  Two additional problems were found.  The page table has archive = 0 for all records, but it should be 1 for records coming from the archive table.  The investigating query was formulated to account for this but the problem was ignored as it is not partial but total.  The other minor problem was that the page table does not seem to have an index on page_id or the archive column.  I believe something along those lines would be needed to speed up the join with the edit table.  As it is, it makes the query used above take minutes instead of seconds without the join.
 
== User Dimension ==
 
The bot flag was correctly set, as proved by the [https://gerrit.wikimedia.org/r/#/c/185239/2/scripts/test-user-dimension.sql first four queries here].  However, the last two queries in that script checked the log_action was imported correctly, and they found problems with a small minority of users in dewiki (warehouse has 702187 dewiki users) and enwiki (warehouse has 7660869 enwiki users):
 
<pre>
+--------------+
| wrong_action |  <-- dewiki
+--------------+
|    7501    |
+--------------+
 
+--------------+
| wrong_action |  <-- enwiki
+--------------+
|    ????    |  <-- query crashed :(
+--------------+
</pre>

Latest revision as of 18:59, 13 July 2017