
User:Rush/analytics-tm/questions

From Wikitech-static
  • What is the rsync daemon for on stat1005?
    • stat boxes are all allowed to rsync to each other in /srv
    • rsync to thorium
    • rsynced to something in dumps?
    • datasets.wikimedia.org is more ad hoc. Aaron Halfaker has used it to publish data for papers and such.



        • definitely not going to expose it; tricky because it's basically a full shell in the prod network in a web browser. Shells that get launched automatically set the http_proxy env var, and pip is available for ad hoc packages
        • druid and hive and everything? a fancy web shell within the analytics cluster
        • notebooks are never purged (hive databases within hadoop too!)
              • try to make purging part of user offboarding


  • stat box users usually download stuff in hadoop-specific formats that look like binary files on the stat boxes
    • PI/PII issues!
    • snappy format


  • is netflow data really stopped or is there new data?
    • a daemon runs on rhenium collecting all netflow data unencrypted with IPs; it is aggregated, pushed to the kafka jumbo brokers, and loaded into Hive.

    • routers to rhenium: unencrypted
    • rhenium to kafka: unencrypted (but aggregated, so is it really a disclosure issue?)
      • aggregated between ASes?
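The disclosure question above hinges on what aggregation removes. A minimal sketch of collapsing per-IP flow records into per-AS-pair byte counts (field names and sample data are invented, not the actual netflow schema):

```python
from collections import defaultdict

# Hypothetical flow records; the real netflow schema may differ.
flows = [
    {"src_ip": "203.0.113.5", "dst_ip": "198.51.100.9", "src_as": 14907, "dst_as": 65001, "bytes": 1200},
    {"src_ip": "203.0.113.7", "dst_ip": "198.51.100.9", "src_as": 14907, "dst_as": 65001, "bytes": 800},
    {"src_ip": "192.0.2.10",  "dst_ip": "198.51.100.2", "src_as": 65002, "dst_as": 14907, "bytes": 500},
]

def aggregate_by_as(flows):
    """Sum bytes per (src_as, dst_as) pair, dropping the per-IP detail."""
    totals = defaultdict(int)
    for f in flows:
        totals[(f["src_as"], f["dst_as"])] += f["bytes"]
    return dict(totals)

print(aggregate_by_as(flows))
# {(14907, 65001): 2000, (65002, 14907): 500}
```

Once aggregated like this, individual IPs are gone, which is the basis of the "is it really a disclosure issue?" question.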
  • https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging says you need to be on stat1006 to be able to access the eventlogging db but later says "There are many Kafka tools with which you can read the EventLogging data streams. kafkacat is one that is installed on stat1005." The stat1006 login banner also says "stat1006 is a Statistics general compute node (non private data) (statistics::cruncher)"
    • stat1005 and stat1006 roles existed before we had hadoop
    • before hadoop or kafka we kept web request data on stat1005
    • stat1006 was a more public place, with more public data and the mediawiki analytics slaves
    • stat1006 could technically use a hadoop client to access it
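Whatever tool reads the EventLogging streams (kafkacat, a hadoop client, etc.), the messages themselves are JSON, so consuming them boils down to parsing lines like these. A sketch with made-up events (the field names here are illustrative, not the exact EventLogging capsule):

```python
import json

# Made-up EventLogging-style messages; the real capsule fields may differ.
lines = [
    '{"schema": "NavigationTiming", "wiki": "enwiki", "event": {"loadEventEnd": 1200}}',
    '{"schema": "ExampleSchema", "wiki": "dewiki", "event": {"action": "click"}}',
]

def events_for_schema(lines, schema):
    """Parse JSON messages and keep only the events for one schema."""
    return [json.loads(l)["event"] for l in lines if json.loads(l)["schema"] == schema]

print(events_for_schema(lines, "NavigationTiming"))
# [{'loadEventEnd': 1200}]
```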



  • when/where do things get pushed to dumps?
    • stat1005, via a FUSE HDFS mount, for rsync
  • does analytics consume directly from labsdb*?
    • yes
    • sqooped from prod ad hoc -- users who know how to use sqoop do it on their own
    • usually sqoops cu_changes
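Conceptually, a sqoop import is "select rows from a relational DB, write them out as files for Hadoop." A toy stand-in using sqlite3 in place of the prod MySQL (the table and columns are invented; the real cu_changes schema differs):

```python
import csv
import io
import sqlite3

# sqlite3 stands in for the prod MySQL; table/columns are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cu_changes (cuc_id INTEGER, cuc_user INTEGER, cuc_timestamp TEXT)")
db.executemany("INSERT INTO cu_changes VALUES (?, ?, ?)",
               [(1, 42, "20180101000000"), (2, 99, "20180101000100")])

# The "import" step: dump rows to a delimited file, which is roughly what
# sqoop writes into HDFS (it also supports binary formats like Avro).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
for row in db.execute("SELECT cuc_id, cuc_user, cuc_timestamp FROM cu_changes"):
    writer.writerow(row)

print(buf.getvalue())
```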
  • Is zookeeper shared between kafka deployments? is zookeeper colocated with etcd?
    • yes
    • yes
      • what all uses zookeeper? kafka, druid (a different one on the druid nodes), hadoop, burrow?
        • burrow runs on kafka tools, which is a ganeti VM; it pokes zookeeper and kafka and produces metrics about consumers to prometheus ("kafkamon1001" and 2001)

  • what is varnishkafka-statv?
    • mostly used by the perf team
    • simple eventlogging that only allows logging in statsd format
    • kafka topic
    • consumed by a python daemon somewhere that consumes the topic and submits metrics to statsd
      • On varnish, statsv is filtered by webrequest via ReqURL ~ "^/beacon/statsv\?"
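Putting those pieces together: a statsv beacon is just a request whose query string carries a statsd-style metric, and the Varnish rule above matches on the URL path. A sketch (the metric name is made up):

```python
import re
from urllib.parse import urlencode

# Build a statsv-style beacon URL whose query string carries a
# statsd-style metric. The metric name is made up for illustration.
query = urlencode({"MediaWiki.example.load": "1200ms"})
url = "/beacon/statsv?" + query

# The Varnish filter quoted above, expressed as a Python regex.
statsv_filter = re.compile(r"^/beacon/statsv\?")
print(bool(statsv_filter.match(url)))
# True
```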
  • Where is the DB hue uses?
    • a mysql database on analytics1003
  • What the heck does oozie do? :)
    • a scheduler (a cron-like thing) that runs as a daemon on analytics1003, with a DB
    • you submit what are called workflows to it, and they run with regularity
    • nice feature where it schedules jobs based on the existence of data (an inotify-in-hadoop kind of thing)
    • rerun on failures and SLA stuff
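Oozie's "run when the data exists" trigger can be pictured as a check like this. This is a conceptual sketch only: real oozie coordinators are configured in XML, and the partition path here is invented.

```python
import os
import tempfile

def ready_to_run(dataset_path):
    """Oozie-style data-availability check: only schedule the workflow
    once the expected input dataset has landed."""
    return os.path.exists(dataset_path)

# Demo with a temp file standing in for an HDFS partition's _SUCCESS flag.
with tempfile.TemporaryDirectory() as d:
    partition = os.path.join(d, "year=2018", "month=02", "_SUCCESS")
    print(ready_to_run(partition))   # False: data has not landed yet
    os.makedirs(os.path.dirname(partition))
    open(partition, "w").close()
    print(ready_to_run(partition))   # True: workflow can be submitted
```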


  • PI or PII backups or storage?
    • seems no
  • What databases are sqooped into Hadoop?