[Reflections] WiDS Taipei Conference 2019

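Most of the speakers at this conference shared experience from industry, but many of their ideas about handling data turned out to be close to what we deal with in ecology. I guess the world of data is connected even across different fields~ So even though none of the speakers had a biology background, I still came away with a lot!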

Below are my notes for each talk. They are pretty scattered XD, but I want to keep a record anyway.

01_Health Care Analytics

Elvena Fong, Health Data Analytics Program Manager

  • Challenges:
    1. Technology
      • Data access -> SQL
      • tools for analysis
    2. Access and tracking
      • A single point of contact for data requests -> each program manager
      • review data request
      • project tracking system to track publications that use the data
    3. Data cleaning
      • handling inconsistencies in the data
      • normalize data
    4. Knowledge Management
      • develop tools
      • build data analysis template
    5. Huge Dataset
      • use a subset to pretest / build the model
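
Points 3 and 5 above (cleaning up inconsistencies, then working on a subset first) can be sketched in a few lines. This is only my own illustration, not something shown in the talk: the `ward` field, the sample values, and both helper functions are hypothetical.

```python
import random

def normalize_label(raw: str) -> str:
    """Collapse case and whitespace differences so equal values compare equal."""
    return " ".join(raw.strip().lower().split())

def sample_subset(records, fraction, seed=0):
    """Return a reproducible random subset for pretesting on a huge dataset."""
    rng = random.Random(seed)  # fixed seed so the pretest can be repeated
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

# Hypothetical records with inconsistently typed category values.
raw_wards = ["ICU", " icu", "Icu ", "ER", "er"] * 200
records = [{"id": i, "ward": w} for i, w in enumerate(raw_wards)]

# Data cleaning: make the category values consistent.
for record in records:
    record["ward"] = normalize_label(record["ward"])

# Huge dataset: try the analysis on 10% of the rows first.
subset = sample_subset(records, fraction=0.1)
print(len(subset), sorted({r["ward"] for r in records}))
```

Fixing the seed is a deliberate choice here: if the pretest model looks odd, the same subset can be drawn again to investigate.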

It turns out everyone who tries to integrate data runs into the same difficulties QQ

02_The art and science of marketing in retail

Sharon Chai / 翟翎秀 Analytic Lead, Marketing Effectiveness

  • B corporation.

  • Single Channel Marketing vs. Multi-Channel Marketing

  • Start from the customer's point of view, not from the marketing method

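  • How this applies at 特生 (the Endemic Species Research Institute): evaluating the effectiveness of outreach activities and courses (after an activity, how many more volunteers? how many adopted survey plots? how much website traffic? how long does the engagement last?)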

  • data silos

  • Channel-driven focus (customer behavior)
  • non-linear consumer journeys
  • lack of visibility to key information

  • Trend analysis of sales: deseasonalize
    deseasonalize in R
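
To remind myself what deseasonalizing means, here is a minimal sketch of my own (not from the talk). It uses the simplest additive approach: subtract each seasonal position's average deviation from the overall mean. It assumes the series has no strong trend; proper decomposition methods estimate the trend first. The quarterly sales numbers are made up.

```python
from statistics import mean

def deseasonalize(series, period):
    """Subtract each seasonal position's average deviation from the overall mean."""
    overall = mean(series)
    # Average deviation of each position in the cycle (e.g. each quarter).
    seasonal = [mean(series[pos::period]) - overall for pos in range(period)]
    return [x - seasonal[i % period] for i, x in enumerate(series)]

# Hypothetical quarterly sales: a flat level of 100 plus a repeating pattern.
pattern = [10, -5, -15, 10]
sales = [100 + pattern[i % 4] for i in range(12)]

adjusted = deseasonalize(sales, period=4)
print(adjusted)
```

Because the made-up series is exactly level plus season, the adjusted values all come back to the flat level, which makes the effect easy to see.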

Finally, a few lines the speaker of this talk shared that I really like too:

Know who you are.
Know what you are committed to.
Think and live bigger than yourself.
Open your heart, having it get stomped on, and having the courage to do it again.

Of course, I'd still rather not get stomped on too many times…

My Data science journey - from physics to business

Tammy Yang
DT42 Numbers

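  • Why a particle physicist makes a good data analyst XD

    1. Capture data: understand the source of the data and where it comes from
    2. Create data pipeline: extract only the data you need and discard the rest
    3. Simulation: use simulated data to produce the model (script) you need
    4. Understand experimental error: know what experimental error (bias) each step of the data collection process introduces
    5. Background modeling
    6. Signal event selection (machine learning)
    7. Estimate uncertainty (systematic error): the error introduced during the analysis itself (filtering data, modeling)
      Could any data-filtering step also be throwing away some information (background)?
    8. Results: are the results reasonable, and under what circumstances can they be explained?
  • Data analysis = there is data (something that has been quantified) that can be analyzed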

  • Issue priority
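    Give it a number! An index. Ideally everyone who walks in would have an indicator over their head showing how important their issue is XD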

    • Give different weight to reporter
    • Weight based on importance(from reporter)
    • Prioritize by milestone (when it needs to be resolved)
    • Parse keyword in description
    • Let each party own their own tags
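
The bullets above could be combined into a single priority number roughly like this. It is a sketch of my own, not the speaker's actual formula: the weight tables, the `Issue` fields, and the way the terms are added up are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical weights; a real team would tune these.
REPORTER_WEIGHT = {"customer": 3.0, "manager": 2.0, "engineer": 1.0}
KEYWORD_WEIGHT = {"crash": 5.0, "data loss": 8.0, "typo": 0.5}

@dataclass
class Issue:
    description: str
    reporter_role: str
    importance: int   # 1 (low) to 5 (high), as stated by the reporter
    milestone: date   # when the issue needs to be resolved

def priority(issue: Issue, today: date) -> float:
    """Combine reporter weight, stated importance, deadline, and keywords."""
    reporter = REPORTER_WEIGHT.get(issue.reporter_role, 1.0)
    days_left = max((issue.milestone - today).days, 1)
    urgency = 30.0 / days_left  # closer milestones score higher
    text = issue.description.lower()
    keywords = sum(w for key, w in KEYWORD_WEIGHT.items() if key in text)
    return reporter * issue.importance + urgency + keywords

today = date(2019, 5, 1)
login_crash = Issue("App crash on login", "customer", 4, date(2019, 5, 11))
footer_typo = Issue("Fix typo in footer", "engineer", 2, date(2019, 7, 30))

ranked = sorted([footer_typo, login_crash],
                key=lambda issue: priority(issue, today), reverse=True)
print([issue.description for issue in ranked])
```

Per-party tags are left out of the sketch; they could be added as one more weighted term.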

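And then the engineers underneath will start breeding bugs lol (and with no bugs left to breed, they'll just have to breed importance scores instead (kidding))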

Hypothesis Testing with Python

Mosky Liu

  • The p-value indicates how compatible the collected data are with the null hypothesis
  • accept/reject != true/false

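  • Adding notches to a boxplot makes it easy to judge whether groups differ
    Notch ranges do not overlap -> the medians likely differ (though running a test is still the safer bet XD)

Example R code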

p + geom_boxplot(notch = TRUE)   # notches mark a roughly 95% confidence interval around the median
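
To see what the notch is actually checking, here is a small Python sketch of my own. It uses the usual notch rule, median ± 1.58 × IQR / sqrt(n), which gives a roughly 95% interval for comparing medians. The two groups are made-up numbers, and the quartile convention may differ slightly from what a plotting library uses.

```python
from math import sqrt
from statistics import median, quantiles

def notch_interval(values):
    """Return (low, high) of the notch: median +/- 1.58 * IQR / sqrt(n)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / sqrt(len(values))
    m = median(values)
    return m - half_width, m + half_width

def notches_overlap(a, b):
    """True when the two notch intervals share at least one point."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical measurements for two groups.
group_a = [12, 14, 15, 15, 16, 17, 18, 19, 21]
group_b = [22, 24, 25, 26, 27, 27, 28, 30, 31]

print(notch_interval(group_a), notch_interval(group_b))
print("overlap:", notches_overlap(group_a, group_b))
```

Non-overlapping notches are only a visual hint, so a proper test is still needed before reporting a difference.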

  • Actual negative rate

  • Complete a test

    1. Hypothesis -> what test
      Settle the hypothesis first; only then can you decide which test to use to verify it
    2. actual negative rate -> decide alpha, beta
      Only after settling the actual negative rate can you decide alpha and beta
    3. alpha, beta, raw effect size -> decide sample size
      Only with alpha, beta, and the raw effect size fixed can you work out the minimum sample size needed
    4. collect a sample as large as possible
    5. understand and preprocess the sample
      plot, missing data, outlier, transform, etc.
    6. Test
    7. Report fully
      not just the p-value!!!
  • Recommended reading
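
Steps 2 and 3 of the checklist above can be made concrete with a little arithmetic. This is my own sketch, not code from the talk, and it rests on two assumptions: that "actual negative rate" means the share of tested hypotheses where the null is really true, and that the test is a two-sided comparison of two group means under the normal approximation (a t-based calculation gives slightly larger numbers). The function names and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def positive_predictive_value(alpha, beta, actual_negative_rate):
    """Share of significant results that are real effects."""
    true_pos = (1 - beta) * (1 - actual_negative_rate)
    false_pos = alpha * actual_negative_rate
    return true_pos / (true_pos + false_pos)

def sample_size_per_group(alpha, beta, raw_effect, sd):
    """Minimum n per group: 2 * ((z_alpha/2 + z_beta) * sd / effect)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(1 - beta)        # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sd / raw_effect) ** 2)

# Step 2: if 90% of tested ideas have no real effect, how trustworthy is
# a significant result at alpha = 0.05, beta = 0.2?
ppv = positive_predictive_value(alpha=0.05, beta=0.2, actual_negative_rate=0.9)

# Step 3: detect a raw difference of 5 units when the standard deviation is 10.
n = sample_size_per_group(alpha=0.05, beta=0.2, raw_effect=5.0, sd=10.0)
print(round(ppv, 2), n)
```

With these inputs only about 64% of significant results would be real effects, which shows why the actual negative rate should inform the choice of alpha and beta before any sample is collected.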

A data project is born - from initiative, to resource, to impact

Megan Sun, Project Manager, OLX Group

  • Data is an end-to-end journey

  • The lifecycle of a data project

    1. Initiative
    2. planning
    3. resource
    4. developing/ implementation
    5. promote/ maintain
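  • Understand the problem to be solved. For each team collaborating on the project: what do they need from it? What do they care about (what is in it for them)? And what cannot be compromised while the project is running?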

  • Promotion focus or Prevention focus