Friday, September 20, 2013

2013 Predictive Analytics World Conference Notes

Thursday Sessions
Darrell Issa. 
  •  Came in late.  Mostly answering questions about medical care.  

David Williams-IG at USPS
  •  Some good quotes about the necessary changes for government especially with respect to managing and using data, including:
  • "Be everywhere businessman or be gone"
  • "No army in the world is big enough to stop an idea whose time is come" - Victor Hugo
  • Middleman higher cost than manufacturing
  • Government drowning in data
  • Future of governance
    • must use data
    • can't be bottlenecked
    • need interdependencies
  • Problems-data in silos, and we're buying data back from vendors
  • Solution:  essentially sharing data and databases.
  • Not sure how this works for IC
Deloitte plenary
  • Incredibly boring 
  • Bottom line is that they made changes to data storage, use, and applications that ultimately saved the Navy about 19 million dollars.
  • Nice job Deloitte.

Risk Assessment in the Brokerage Industry
  • Deloitte once again.
  • Risk assessment in the brokerage industry.
  • Giulio Girardi
  • PhD in economics
  • Foreign speaker
  • Begins: 4500 registered broker-dealer firms.
  • OCIE responsible for inspections & examinations.
  • Limited resources for so many firms
  • Broker-dealer firms are classified into seven peer groups based on filing criteria and business practices.
  • Predictive analytics are then run within each group.
  • Firms within each group are ranked in some way.
  • Methodology
    • Gather data
    • Develop hypotheses and identify criteria (predictors)
    • Test hypotheses by building model
    • Three areas of risk assessment:
      • financial and operational
      • workforce
      • firm structure and supervision
    • They simply sum the scores on four criteria (total score of 12)
    • In future, they may no longer use a linear model because one criterion could be a better predictor than others.
    • Ultimately have a high, medium, and low risk designation.
    • These labels depend on criteria (no hard cut off)
    • Some proportion of firms they like to label as high, medium and low
    • Also, they look for breaks in the data
    • It's a continuous feedback model.
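A rough sketch of how the summed-score ranking above might look in Python. The talk did not give details, so everything specific here is an assumption: four criteria scored 0 to 3 each (which gives the total of 12), made-up firm names and scores, and hypothetical 20%/40% shares for the high and medium labels.

```python
def total_score(scores):
    """Sum the per-criterion scores (assumed: four criteria, 0-3 each)."""
    return sum(scores)


def label_peer_group(firms, high_share=0.2, medium_share=0.4):
    """Rank firms within one peer group and label them by rank position.

    firms: dict mapping firm name -> list of criterion scores.
    high_share / medium_share: hypothetical target proportions, not cutoffs
    on the score itself (the notes say there is no hard cutoff).
    """
    ranked = sorted(firms, key=lambda name: total_score(firms[name]), reverse=True)
    n_high = round(len(ranked) * high_share)
    n_medium = round(len(ranked) * medium_share)
    labels = {}
    for rank, name in enumerate(ranked):
        if rank < n_high:
            labels[name] = "high"
        elif rank < n_high + n_medium:
            labels[name] = "medium"
        else:
            labels[name] = "low"
    return labels


# Illustrative peer group; names and scores are invented.
peer_group = {
    "Firm A": [3, 2, 3, 3],
    "Firm B": [1, 0, 1, 1],
    "Firm C": [2, 2, 1, 2],
    "Firm D": [0, 1, 0, 0],
    "Firm E": [3, 3, 2, 1],
}
labels = label_peer_group(peer_group)
```

Because the labels come from rank within the peer group, the same total score could land in a different tier in a different group, which matches the "no hard cut off" note.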
Text mining Case studies
  • Textbook-Practical Text Mining
  • Dr. Andrew Fast
  • ESPN coach success predictor person
  • Big Data 3 Vs: volume, variety, velocity
  • Variety
    • Big data system is a system that integrates information from varied sources for deeper and broader understanding (Sue Feldman, CEO of Synthexis)
    • Combine structured and unstructured text for more power
    • Issues include complete foreign keys, keys across data entered manually.
    • Need to improve support for second users of the data
  • Goal is really to identify structured data from unstructured text.
  • examples presented include getting SSNs, phone numbers, etc. from text in places where the postal worker failed to collect it.
  • Lesson 1--expand your data by using extra data sets.
  • Lesson 2-expand your query. (e.g. theft, stolen, opened, lost, not delivered, missing)
    • Group documents with similar content
    • Example given of a group who needed to find a document, but couldn't really describe it.
    • He says you can use the entire document as the query.  
    • strategy called cosine similarity.
    • Lesson 3-multiply your efforts
  • Case study: SSA Disability approval
    • Pain-approval process is up to 2 years
    • Goal-fast track easy cases
    • challenge-free-text on disability application
    • Results-20% of approvals possible immediately
    • Highlight patterns of language likely to indicate abuse
    • uncover indicators mentioned in comments (financial stress)
    • look at supervisor notes and other oversight information (personnel risk)
    • Lesson 4: combine approaches
    • Text mining can be viewed from many perspectives
    • no single view provides complete solution
    • must consider entire beast to get best solution
    • Blind men and elephant analogy
  • Finding elephants
    • Bigger data-which zipcodes have complained about cash4gold
    • query expansion-mail theft complaints
    • More like this-finding recipes for WMD
    • Each text mining area provides a different trade-off between power and generality.
    • Document classification is most powerful
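To make the "use the entire document as the query" idea concrete, here is a minimal cosine similarity sketch in Python. It assumes plain word counts as the document vectors (the talk did not say which weighting was used), and the sample complaints are invented.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(query_doc, corpus):
    """Score every document against the query document, best match first."""
    q = term_counts(query_doc)
    scored = [(cosine_similarity(q, term_counts(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented example documents.
corpus = [
    "package was stolen from the porch after delivery",
    "my package was opened and items are missing",
    "request to change mailing address",
]
query = "package missing, possibly stolen or opened"
ranked = more_like_this(query, corpus)
```

Documents that share no words with the query score exactly 0, so they sink to the bottom of the ranking.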

After Lunch Plenary-John Elder of Elder Research.
  • General lessons we can learn from black box trading
  • Investment modeling
  • Started company based on success with slim advantage over other hedge funds.  Had a lot of success and closed fund at peak of success because numbers said they would no longer have an edge.
  • Success is possible.
  • Huge reward-data plentiful, but noisy (Bloomberg earnings); market efficient; pockets of inefficiency; skill is almost indistinguishable from luck; system can change overnight
  • Discipline of partially solving issues has improved much of our other work.
  • Most failures as a company have been in stock market.
  • "WE FOUND SOMETHING"
  • New hedge fund investment system.  
  • Down to two parameters.  
  • Data challenges, leaks from the future-predicting interest rates
  • Have to hire someone to break your stuff.
  • Data analysts are like artists, they love their models.
  • But people just don't think of everything.
  • Tough to build something idiot-proof because idiots are so ingenious
  • Look for things that work TOO WELL.  Issues most likely exist.
  • Model goal: get computer to feel like you do.
  • Careful about maximizing accuracy because all errors are not equal.
  • Resampling to evaluate accuracy (e.g. cross-validation)
  • Train V models on different data subsets.
  • Test each on unseen data
  • Use the distribution of results to judge how realistic the model is.
  • What the world needs is a one-armed statistician. On the one hand.... No other hand.
  • What's the chance I could get a result like this by chance? That is the essential question for any statistical test
  • 5 lessons learned
    • 1. Assess cost and potential rewards (small improvements may lead to large rewards, later technology may matter, custom error metrics may be worth the trouble).
    • 2. Must have access to domain knowledge.
    • 3. Data is going to be flawed, but don't let it stop you.  Don't wait for data warehouse.
    • 4. Work extremely hard to break your model.  Need outside help, resampling is essential, visualize failure; need to reward breaking.
    • 5. Share the work and share the reward because that will grow the pie.
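The resampling steps in these notes (train V models on different subsets, test each on unseen data, look at the distribution of results) can be sketched as V-fold cross-validation. This is only an illustration: the threshold "model" and the synthetic data are stand-ins I made up, not anything shown in the talk.

```python
import random
import statistics


def v_fold_indices(n, v, seed=0):
    """Shuffle the row indices and deal them into v disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]


def fit_threshold(train):
    """Toy model: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def accuracy(threshold, test):
    """Share of held-out rows that the threshold classifies correctly."""
    hits = sum((x > threshold) == (y == 1) for x, y in test)
    return hits / len(test)


def cross_validate(data, v=5, seed=0):
    """Train on v-1 folds, score on the held-out fold, repeat v times."""
    scores = []
    for held_out in v_fold_indices(len(data), v, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


# Synthetic two-class data, purely for illustration.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(50)]

scores = cross_validate(data, v=5)
print("mean accuracy:", statistics.mean(scores))
print("spread (stdev):", statistics.stdev(scores))
```

The spread across folds is the useful part: a single accuracy number says nothing about how much of it is luck.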
Plenary Panel led by Dean Silverman (IRS)
  • Developing an analytics framework and measuring success
  • Roles of data analysis, evangelist, storage, and something else...
  • He's more on the data evangelist side.
Accenture
  • Advanced analytics deliver insight for improved sales.
  • What doing to put Postal Service into 21st century.
  • 560000 employees
  • over 200000 vehicles
  • 36400 outlets, larger than McDonald's, Walmart, and Starbucks combined
  • 584 million pieces of mail a day to over 150 million residences, PO boxes, and businesses.
  • Sales responsible for 48 billion of 66 B total sales revenue for USPS.
  • 700+ sales reps, whereas USPS has more than 4000.
  • Problem-declining revenue with lower mail volume
  • Limited ability to hire to boost sales
  • Need to become more efficient.
  • No single view of customer, no data driving decisions.
  • Solutions-platform (bring data to one platform, single view of customer), process (build models), and third thing didn't get... sales?
  • How put everything in one central location?
  • Talk about predictive analytics and sales effectiveness
  • Salesmen were using gut decisions
  • Accenture lady, southern.  Designed model, processed data, built model, implemented model, and assessed.
  • logistic and linear regression worked best for this project
  • probability that a sale would occur (logistic)
  • estimated revenue from a sale (linear)
  • Total sales are up significantly
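One plausible way to use the two models together is to multiply the probability of a sale by the estimated revenue and rank leads by that expected value. The talk did not say the models were combined this way, so treat it as my assumption; the coefficients, feature names, and leads below are all hypothetical.

```python
import math

# Hypothetical coefficients; in practice both models would be fitted to sales data.
LOGIT_INTERCEPT = -2.0
LOGIT_COEFS = {"prior_purchases": 0.8, "monthly_volume_k": 0.05}
LINEAR_INTERCEPT = 500.0
LINEAR_COEFS = {"prior_purchases": 200.0, "monthly_volume_k": 150.0}


def sale_probability(lead):
    """Logistic model: probability that a sale occurs."""
    z = LOGIT_INTERCEPT + sum(c * lead[k] for k, c in LOGIT_COEFS.items())
    return 1.0 / (1.0 + math.exp(-z))


def estimated_revenue(lead):
    """Linear model: revenue if the sale happens."""
    return LINEAR_INTERCEPT + sum(c * lead[k] for k, c in LINEAR_COEFS.items())


def expected_value(lead):
    """Probability-weighted revenue, used here to rank leads."""
    return sale_probability(lead) * estimated_revenue(lead)


# Invented leads.
leads = [
    {"name": "Lead A", "prior_purchases": 0, "monthly_volume_k": 10},
    {"name": "Lead B", "prior_purchases": 3, "monthly_volume_k": 40},
    {"name": "Lead C", "prior_purchases": 1, "monthly_volume_k": 5},
]
ranked = sorted(leads, key=expected_value, reverse=True)
```

A ranking like this gives a sales rep a reason, other than gut feel, to call one lead before another.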

Hudson Hollister-Open Data Reforms
  • Founder and Executive Director of Data Transparency Coalition.
  • Washington policymakers are getting their act together.
  • Want open data in structured formats for everybody.
  • 7 buckets represent fed gov
  • federal spending (inconsistent formats, lack of identifiers, complex reporting structure; DATA Act will lead to transformation, leaving implications for analytics).  5 people in OMB understand how MAX budget works.  He hasn't met them.  THERE IS NO DATA GOVERNANCE IN FEDERAL SPENDING.  Treasury Department is asked by law now to provide identifiers and more structure.  Senate sponsor, Mark Warner, GIPROMA, some act, says performance and spending can be done on a program by program basis.  
  • management (subject matter experts, but unstructured).  Open Data policy by Obama commands all departments to create a data inventory.  Default should be open data (defined by seven things).  Roadblocks include the SMEs.  Most stuff not going to be recognized as important or necessary for this effort.  DOCUMENTS ARE DATA.
  • financial regulation (financial regulators do not coordinate.  Collect overlapping information.)  FIT Act requires SEC to have same standards for finance regulation.  OFR has authority to force all regulators to adopt standards in regulation.  
  • general regulation (same issues).  Will our enemies have the same access?  Yes.  Is an issue.
  • tax.  Standardized formats for tax returns, making TurboTax possible.  KUDOS to IRS for doing that in the 90's.  Only exception is nonprofits, and they are brought in through XML, but they put it into TIFF documents.  Obama proposed changing this in 2012 budget.  Unfortunately, no member of Congress has stepped up to propose this.
  • legislation and the code.  Need to structure this so we can take advantage of searching and analyzing.  Boehner and Cantor say we have to look for XML.
  • judicial.  Diverse formats.  Some briefs in WordPerfect :).  
  • Need to replace pdfs with page breaks.... What does that mean???
  • Imagine if we could combine all of these data...
  • We can tie together everything.  Benefits far outweigh the disadvantages.  
  • Prospect of automating all reporting is huge benefit.  
  • Will eliminate so many compliance lawyers and paper and solve a lot of problems.
  • DATA TRANSPARENCY COALITION
  • OMB setting up its own analytics office.  
  • Much further along in other nations than it is in the US.  UK is several years ahead.  
  • theODI.com--they certify datasets as open.



2013 Predictive Analytics Conference Friday Afternoon

Predictive Analytics in Medicare-Kelly Gent

  • Models in credit card fraud
    • Rule-in FLA with charge in CA
    • Anomaly, 3 tvs in one day
    • predictive model-charges for multiple tvs out of state after a one dollar charge on wednesday
    • social network
    • charges at address known to be used by bad actor
  • Traditional analytics approach
    • run a model, use top tier and run an investigation
  • Currently running all models mentioned.  Building a good data set.
  • This group is required to publish a fraud prevention report to congress.
  • They stopped, prevented, or identified 115 million in improper payments, which is a 3 to 1 savings.
  • 536 leads for new investigations
  • New info for 511 existing investigations
  • Models are working
  • What is the command center?
    • Center for detection and investigation driving integrity and innovation
    • Paradigm shift
    • Introduces mission
    • Speeds up actions
  • Old way
    • Have a lead
    • do an investigation
    • Take some action, i.e. remove provider or overpayment
    • Savings
    • LOTS OF people involved
  • New approach
    • Identify better leads faster.
    • Introduced command center to bring all people together in room, turn off blackberries, and solve problems.
  • What's next for the FRAUD PREVENTION SYSTEM
    • Evaluating feasibility of expanding analytics in Medicaid
      • There are 56 Medicaid programs, which is a problem.  
    • Activities to analyze feasibility
      • focus groups with state Medicaid agencies
      • evaluate outcomes of introducing post-payment Medicaid data into FPS (e.g. if fraudulent in Medicare, will also be for Medicaid)
      • Also providing technical assistance
    • Prevention-partnership is designed to share info and best practices to improve detection and prevention.  
    • Lots of partners 
    • 11 partners contributed to first information exchange.
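The credit card examples at the top of this talk (rule, anomaly, known bad address) are easy to picture as small checks over a transaction list. A toy Python sketch follows; the field names, the threshold of two purchases per day, and the sample transactions are all invented for illustration and are not how the Fraud Prevention System itself works.

```python
from collections import Counter


def rule_flags(txns, home_state):
    """Rule: flag any charge made outside the cardholder's home state."""
    return [t for t in txns if t["state"] != home_state]


def anomaly_flags(txns, max_per_day=2):
    """Anomaly: flag (day, item) pairs bought more than max_per_day times."""
    counts = Counter((t["day"], t["item"]) for t in txns)
    return {key for key, n in counts.items() if n > max_per_day}


def bad_address_flags(txns, bad_addresses):
    """Network-style check: flag charges shipped to a known bad address."""
    return [t for t in txns if t["ship_to"] in bad_addresses]


# Invented transactions for a cardholder who lives in Florida.
txns = [
    {"day": "Wed", "item": "test charge", "state": "FL", "ship_to": "home"},
    {"day": "Thu", "item": "tv", "state": "CA", "ship_to": "123 Elm St"},
    {"day": "Thu", "item": "tv", "state": "CA", "ship_to": "123 Elm St"},
    {"day": "Thu", "item": "tv", "state": "CA", "ship_to": "123 Elm St"},
    {"day": "Thu", "item": "groceries", "state": "FL", "ship_to": "home"},
]

out_of_state = rule_flags(txns, home_state="FL")
bursts = anomaly_flags(txns)
bad_ship = bad_address_flags(txns, {"123 Elm St"})
```

A predictive model would go one step further and learn which combinations of such signals (for example, a one dollar test charge followed by several out-of-state TVs) actually preceded confirmed fraud.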
Industry Expert Panel

Daniel Porter-pinpointing the persuadables, convincing the right voters to support Barack Obama
  • Big Obama supporter
  • People were writing him off.
  • Pundits ranged from everywhere
  • Nate Silver said Obama was toast
  • challenge was how to persuade people to vote for Obama
  • simulated election based on different turnout scenarios
  • Under each scenario, Obama could not win unless he changed people's minds
  • How to persuade voters that the president was a better choice than Mitt Romney?
  • 2 schools of thought-election a referendum
  • important for campaign to make sure it was a choice
  • hope and change was 2008
  • How make sure message doesn't backfire?
  • How determine which voters the campaign hopes to reach?
  • Targeting swing voters nothing new.
  • Targeting independents
  • From campaign manager-measure everything
  • Mandate to bring analytics to every facet of massive operation in just one year
  • Models included
    • support
    • turnout
    • generic national support
    • contactability
    • many others
  • Persuasion challenge
    • not trying to measure who is likely to support Obama
    • not trying to measure who is undecided
    • not trying to measure who cares about what issue
    • Trying to measure who is likely to change his or her mind from voting for Romney to voting for Obama
  • How did Democrats do persuasion before 2012
    • Built support models for all registered voters essentially probability a voter would support a democrat
    • Messages were tested in focus groups with small numbers of voters
    • Small sample size issue.
    • Basing it on what people like, but not on what is persuasion, the true outcome of interest.
    • For persuasion, targeted those who had middle support scores, or people who were independents
  • Prior to 2012, they went after those who had a middle score, i.e. on the fence
  • middle person means they don't have strong partisan characteristics
  • Many people not interested in politics
  • Apathetic
  • Low turnout
  • In 2012, they could easily differentiate supporters from non-supporters
  • Persuasion modeling had early promise
  • benefit of reelection is that you know who is the nominee.



Thursday, September 19, 2013

2013 Predictive Analytics World Friday Notes

Gene Dodaro-data analytics for government oversight
  • Talk about how to put gov't on a more sustainable path.
  • They advise congress on how to improve performance
  • Driven by mandates and requests from congress.
  • Fact-based organization-analysis hinges on reviewing, compiling, analyzing data
  • Made 380 recommendations to reduce overlap.
  • 108 billion in improper payments (some agencies haven't even reported).
  • Health care is particularly worrisome.  5 to 8% of GDP in next few years.
  • Number of people 65 and older will double in next few years.
  • Overpaying people by tens of billions.

Panel on Working Horizontally: Analytics as a bridge
  • Question asked about analytics just being repackaged as something we have always done.
  • Essentially agreed except that the repackaging simply helps to brand the usefulness of using data in making decisions.

Using Social Data for Public Sector Analytics
  • Will Mayo and Rebecca Goolsby
  • Information moving rapidly.  
  • Shared example of tweet from San Francisco flight and news story didn't come in for 10 minutes beyond that tweet.
  • Used social media in public sector in response to Hurricane Sandy (e.g. finding where a bunch of trees fell down)
  • Weather alerts via twitter where otherwise equipment is unavailable
  • Need to use consented data and that available in private sector.                                                                        


Big Data is not new-no such thing
  • Goal is to talk about what is data to your organization
  • Mid-2011 was when the google search term exponentially increased
  • No such thing as big data, but there is a lot of data out here
  • 234 million e-mails per minute.
  • 2.5 quintillion bytes of data created with 90% of world's data created in last two years alone.
  • Data governance expert Peter Aiken estimates 80% of the data is not useful.
  • Is any of it valuable?
  • Big data can be defined as data sets that are too large for you to handle.
  • Origin of Big Data
    • Some credit John Mashey, chief scientist at Silicon Graphics in the 1990s.
    • He was using the label for a range of issues, essentially that the boundaries of computing keep advancing.  
  • Nate Silver labels big data as a fashionable word.  It is when we deny our role in the process of data-driven predictions that the odds of failure rise. (The Signal and the Noise book)
  • Mark Madsen says big data isn't hype but it is being hyped, and says the reality is that big data is about new models for data processing.
  • Sue Feldman says it is a set of technologies that solve complex information economically and includes volume, variety, and velocity.
  • The Hype
    • Gartner shows graph of hype cycle.  Big Data is two to five years from top of hype cycle.  
  • Business is about making decisions, data can help
  • We gotta do the hard work to figure out the value of business data.
  • Need to be willing to experiment and willing to be wrong.
  • Valuable data is not always big!!!!!!!!!!!!!!!!!!!!
  • Analytics can scale in a number of ways.  
  • BIG DATA INITIATIVES--frighten presenter
  • Technology looking for a problem.  Cart before the horse.
  • Need to first understand the problem before we shotgun some big data initiative.
  • Kudzoo????
  • Data Vs
    • Vacant is a new one.  Available.  Availability of data.  People not wanting to share data.
    • Volume.  Big data implies volume.  Easiest one from analytical view to solve.  1 terabyte to 8 megabytes of useful information.  You can shrink that data.  Get rid of data rot.  
    • Velocity
    • Variety-biggest challenge.  Connecting data that wasn't designed to be combined.  Fuzzy matching.  
    • Value-different colored bubble on slide.  He says it's a different kind of question.
    • Veracity, what is the truth of the data and who says what the truth of that data is?
    • Vitality-is the system able to adjust around the data.
    • Variability-changing environment around us.  
  • CRISP-DM process for data mining.
    • Goal definition to business understanding to data understanding to data preparation to modeling to evaluation to deployment to knowledge application.  Operationalize it.  
  • Background-at least two components to any analytic architecture.
    • data storage (databases, dw, spreadsheets)
    • Analytical processing (dashboards, models, metrics, etc.)
  • They'd rather have the raw data, vice the aggregate.
  • Current standard is to combine storage with in-core analytics.  Limited interaction between storage and processing
  • In-database-effort to improve sall data by pushing analytics into data storage.  Avoids data transfer from database to analytics. (relational databases, Teradata, SAS, etc.).
  • Computational: 9 million records with 2000 attributes?  Big data?
  • What if you wanted to test 17 drug interactions on multiple morbidity outcomes?
  • It is about the data-treating data as an asset implies systems are designed to support it
  • An asset is a resource controlled by org-data has a value.
  • You may have a lot of data
    • Value first through 'test and learn'
    • Data governance to maintain flexibility
    • Use technology to operationalize.
  • It's like an irritating fly buzzing around your head-----big data.
  • If they push back and say we need analytics
  • He says we should push back and ask what problem do you want to solve?
  • New OMB analytics guy said all our stuff needs to be structured. 
    • A lot of knowledge in that business is in the people.  Not captured in the data.
    • problems are much more important to define. 
  • Apps
    • ENTERPRISE MINER FROM SAS SAVES A LOT OF TIME
    • HADOOP goes well for Google in text analysis, I believe.
    • Oracle does good things.
  • Don't need to be a data scientist to do big data
    • Rather have a heart surgeon do surgery on my heart.
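On the fuzzy matching mentioned under Variety: a minimal Python sketch of linking names across data sets that were not designed to be combined. It uses the standard library's difflib as a stand-in (the speaker did not name a method), and the 0.85 threshold and sample names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0 and 1 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(name, candidates, threshold=0.85):
    """Return the closest candidate, or None if nothing clears the threshold."""
    best = max(candidates, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else None


# Invented examples.
match = fuzzy_match("Jon Smith", ["John Smith", "Jane Doe"])
no_match = fuzzy_match("Acme Corp", ["Zebra LLC"])
```

The threshold is the judgment call: set it too low and unrelated records get merged, too high and manual-entry typos stop matching.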

David Jakubek.  Case Study-data to decisions: building data analytics capability in the Department of Defense.
  • Today is international talk like a pirate day.  Okay....
  • Told a really long joke.  Not worth the time imo