I'm an R guy myself and have done some machine learning work, so I'm thinking less about how to get the data and more about what to do with it once you've got it. What metrics are you going to use to define success? W-L, efficiency, yards, touchdowns? You'd also need to categorize and account for the type of offense they ran in high school compared to what we're running (or what other college programs run, if it were to expand). You may have thought about this already and have it mapped out, but you'd need both high school and college stats to make an effective algorithm, and the more data you pull in, the better the algorithm gets. I can't help too much with the web scraping/API portions, but if it's in a database somewhere I can do a lot with it.
IF there's a way to get this to work and it shows accurate results, there could be mega $$ to be made. There are always going to be intangibles that can't be picked up in stats, but this could also be used for coaches too. You could pull any quarterback that's played under Chaney and measure year-to-year growth in certain areas, and compare it to that of other coaches if you wanted. My guess is that certain metrics (efficiency/completion percentage) won't change a ton between high school and college after taking offensive system into account, but you never know. It'd be extremely interesting to look at.
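To make the coach comparison idea concrete, here's a rough Python sketch (not my usual R, but easy to run anywhere) of averaging season-over-season change in one metric per coach. Everything in it is an assumption for illustration: the row layout, the function name `yearly_growth_by_coach`, and the numbers, which are made-up placeholders and not real stats.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (quarterback, coach, season, completion_pct).
# All names and numbers are made-up placeholders, not real stats.
seasons = [
    ("QB A", "Coach X", 2015, 55.0),
    ("QB A", "Coach X", 2016, 60.0),
    ("QB A", "Coach X", 2017, 63.0),
    ("QB B", "Coach Y", 2015, 58.0),
    ("QB B", "Coach Y", 2016, 59.0),
]


def yearly_growth_by_coach(rows):
    """Average season-over-season change in a metric, grouped by coach."""
    # Collect each quarterback's history under a given coach.
    by_qb = defaultdict(list)
    for qb, coach, year, value in rows:
        by_qb[(qb, coach)].append((year, value))

    # Compute the change between consecutive seasons only.
    deltas = defaultdict(list)
    for (_qb, coach), history in by_qb.items():
        history.sort()
        for (y0, v0), (y1, v1) in zip(history, history[1:]):
            if y1 == y0 + 1:
                deltas[coach].append(v1 - v0)

    return {coach: mean(changes) for coach, changes in deltas.items()}


growth = yearly_growth_by_coach(seasons)
```

A raw average like this ignores offensive system, level of competition, and sample size, so it would only be a starting point before any real modeling.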
Random thought stream of possible variables you'd need to start to get a full picture:
High School:
- Years started
- Level of play (3A, 6A, etc)
- State (Higher competition at lower levels in states like Florida, Texas, etc.)
- Measurements (Height mostly, weight maybe)
- Stats (Yards, touchdowns, interceptions, sacks, rushing yards, fumbles, anything that's available)
- Type of system
- Injuries?
- W-L Record
- 3rd Down stats
- Championships
College:
- College class (Fr, So, etc)
- Conference
- Individual school
- Stats & 3rd down stats
- Measurements
- Type of system/OC/QB coach
- W-L Record
- Championships/Bowl wins
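
If it helps to see how the two lists could end up as one table for modeling, here's a minimal Python sketch that pairs each college season with the same player's high school profile. The class names, field names, and the `same_system` flag are all hypothetical, and only a handful of the variables above are included to keep it short.

```python
from dataclasses import dataclass, asdict


@dataclass
class HighSchoolProfile:
    # Hypothetical subset of the high school variables listed above.
    player_id: str
    years_started: int
    classification: str  # level of play, e.g. "3A" or "6A"
    state: str
    height_in: float
    system: str
    comp_pct: float


@dataclass
class CollegeSeason:
    # Hypothetical subset of the college variables listed above.
    player_id: str
    class_year: str  # "Fr", "So", etc.
    conference: str
    school: str
    system: str
    comp_pct: float


def build_training_rows(hs_profiles, college_seasons):
    """Join each college season to that player's high school profile."""
    hs_by_id = {p.player_id: p for p in hs_profiles}
    rows = []
    for season in college_seasons:
        hs = hs_by_id.get(season.player_id)
        if hs is None:
            continue  # no high school data, so nothing to learn from
        row = {"player_id": season.player_id}
        row.update({f"hs_{k}": v for k, v in asdict(hs).items() if k != "player_id"})
        row.update({f"col_{k}": v for k, v in asdict(season).items() if k != "player_id"})
        # Flag whether the offensive system carried over from high school.
        row["same_system"] = hs.system == season.system
        rows.append(row)
    return rows
```

Each row then has `hs_` columns as inputs and `col_` columns as candidate outcomes, which is roughly the shape most modeling tools want.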