
Analytics/Web publication: Difference between revisions

imported>Wargo
m (Undo revision 1853780 by Dick balls cock (talk))
imported>Neil P. Quinn-WMF
(Add SWAP-specific instructions)
This page describes how to make '''safe, non-identifying''' datasets, notebooks, or other research products public on the web in the [https://analytics.wikimedia.org/published analytics.wikimedia.org/published] directory. For guidelines on how to formally release an open dataset (with metadata and persistent identifiers), please refer to [[Data releases]]. For regular, structured, and maintained datasets, please see [[Analytics#Datasets]].




If you're looking for data here, some of it may not be maintained or documented. If possible, please reach out to the authors of the data for help, or to [[Analytics/Team]]. If you're publishing data here, there are some guidelines in [https://analytics.wikimedia.org/datasets/README the README] on the server.
== Instructions ==
# Double-check that the dataset or notebook you want to publish is '''safe and non-identifying'''.
# Decide where you want to publish it. There are separate folders for notebooks and datasets; within those, you should browse the existing subfolders and decide where your file fits. For example, if you have <code>my-data-2020-01.tsv</code>, you may want to publish it as <code>datasets/one-off/my-data/my-data-2020-01.tsv</code>. Please try to use names that complete strangers viewing the website will understand!
# Make sure it's on one of the [[Analytics/Systems/Clients|Analytics clients]].  
# Copy it to the corresponding location within the <code>/srv/published/</code> folder on that machine. Create the intermediate folders if necessary. If you're using [[SWAP]], for security reasons you will not be able to access this file from the terminal in your browser. You'll need to SSH directly into the notebook host and move the file using the command line.


* Please name your folders in a friendly way; think of the strangers who will browse through this data.
Once you do this, a script that runs every 15 minutes will sync it to the website automatically. If you want to sync immediately, you can run the <code>published-sync</code> command manually.
* Take a look at https://wikitech.wikimedia.org/wiki/Analytics/Reportupdater for ongoing reports
* '''Always remember''': be careful about what you share here.
 
To share data via this server, just copy '''safe, non-identifying''' data to <code>/srv/published/</code> on any of the [[Analytics/Systems/Clients|Analytics clients]]. For example, [[Analytics/Reportupdater|reportupdater]] jobs copy their output to <code>/srv/published/datasets/periodic/reports</code>. As another example, on stat1007 the contents of <code>/srv/published/datasets</code> are synced to https://analytics.wikimedia.org/published/datasets/

Revision as of 04:43, 11 February 2020

This page describes how to make safe, non-identifying datasets, notebooks, or other research products public on the web in the analytics.wikimedia.org/published directory. For guidelines on how to formally release an open dataset (with metadata and persistent identifiers), please refer to Data releases. For regular, structured, and maintained datasets, please see Analytics#Datasets.

If you're looking for data here, some of it may not be maintained or documented. If possible, please reach out to the authors of the data for help, or to Analytics/Team. If you're publishing data here, there are some guidelines in the README on the server.

Instructions

  1. Double-check that the dataset or notebook you want to publish is safe and non-identifying.
  2. Decide where you want to publish it. There are separate folders for notebooks and datasets; within those, you should browse the existing subfolders and decide where your file fits. For example, if you have my-data-2020-01.tsv, you may want to publish it as datasets/one-off/my-data/my-data-2020-01.tsv. Please try to use names that complete strangers viewing the website will understand!
  3. Make sure it's on one of the Analytics clients.
  4. Copy it to the corresponding location within the /srv/published/ folder on that machine. Create the intermediate folders if necessary. If you're using SWAP, for security reasons you will not be able to access this file from the terminal in your browser. You'll need to SSH directly into the notebook host and move the file using the command line.
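
The copy step in the instructions above could be scripted roughly as follows. This is only a minimal sketch, not an official tool: the function name publish and the TOP_LEVEL_FOLDERS set are hypothetical, and the folder name "notebooks" is an assumption (only "datasets" appears in the paths quoted on this page). The script does not check that the data is safe and non-identifying; that still has to be done by hand.

```python
import shutil
from pathlib import Path

# Assumption: the two top-level folders are named exactly like this.
# "datasets" appears in the paths on this page; "notebooks" is a guess.
TOP_LEVEL_FOLDERS = {"datasets", "notebooks"}


def publish(src, relative_dest, published_root="/srv/published"):
    """Copy src to published_root/relative_dest, creating intermediate folders.

    Hypothetical helper: it only copies the file and never checks its contents.
    """
    relative = Path(relative_dest)
    # Refuse destinations that would land outside the published folder.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError("destination must be a relative path inside the published folder")
    if not relative.parts or relative.parts[0] not in TOP_LEVEL_FOLDERS:
        raise ValueError("destination must start with one of: %s" % sorted(TOP_LEVEL_FOLDERS))

    target = Path(published_root) / relative
    # "Create the intermediate folders if necessary."
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target)
    return target
```

For example, publish("my-data-2020-01.tsv", "datasets/one-off/my-data/my-data-2020-01.tsv") would place the file at the location used as an example in step 2. On SWAP this would have to be run from an SSH session on the notebook host, as described in step 4.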

Once you do this, a script that runs every 15 minutes will sync it to the website automatically. If you want to sync immediately, you can run the published-sync command manually.
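
To illustrate where a synced file ends up, here is a small sketch that maps a path under /srv/published/ to its public address under analytics.wikimedia.org/published, and that triggers the sync by hand. The function names public_url and sync_now are hypothetical, and the sketch assumes that published-sync is on the PATH and needs no arguments, which this page does not spell out.

```python
import subprocess
from pathlib import PurePosixPath

# Both values are taken from this page.
PUBLISHED_ROOT = PurePosixPath("/srv/published")
PUBLIC_BASE_URL = "https://analytics.wikimedia.org/published"


def public_url(local_path):
    """Return the public URL for a file located under /srv/published/.

    Raises ValueError if the path is not inside the published folder.
    """
    relative = PurePosixPath(local_path).relative_to(PUBLISHED_ROOT)
    return f"{PUBLIC_BASE_URL}/{relative.as_posix()}"


def sync_now():
    """Run the sync immediately instead of waiting for the periodic job.

    Assumption: published-sync takes no arguments and is on the PATH.
    """
    subprocess.run(["published-sync"], check=True)
```

With these definitions, public_url("/srv/published/datasets/one-off/my-data/my-data-2020-01.tsv") returns "https://analytics.wikimedia.org/published/datasets/one-off/my-data/my-data-2020-01.tsv". The file only becomes reachable at that address after the next sync has run.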