1. Buy a server

Alibaba Cloud:

Purchase link

https://t.aliyun.com/U/Bg6shY

If that link has expired, use

https://www.aliyun.com/daily-act/ecs/activity_selection?source=5176.29345612&userCode=49hts92d

Tencent Cloud:

https://curl.qcloud.com/wJpWmSfU

If that link has expired, use

https://cloud.tencent.com/act/cps/redirect?redirect=2446&cps_key=ad201ee2ef3b771157f72ee5464b1fea&from=console

Huawei Cloud:

https://activity.huaweicloud.com/cps.html?fromacct=64b5cf7cc11b4840bb4ed2ea0b2f4468&utm_source=V1g3MDY4NTY=&utm_medium=cps&utm_campaign=201905

2. Deployment tutorial

2024 Qinglong panel script-running tutorial (Part 1), continuously updated

3. The code

# -*- coding: utf-8 -*-
# Collect AI job-posting counts from 51job (前程无忧) search pages
import os
import random
import re
import time

import pandas as pd
import requests
from fake_useragent import UserAgent
from lxml import etree

# AI-related search keywords
KEY_WORDS = ["人工智能", "AI", "算法", "机器学习", "深度学习"]

# Data paths
DATA_DIR = "data"
company_csv = os.path.join(DATA_DIR, "company_list.csv")
output_csv = os.path.join(DATA_DIR, "ai_job_ratio_51job.csv")

# Load the company list, falling back to built-in defaults on failure
try:
    COMPANY_LIST = pd.read_csv(company_csv)["company_name"].tolist()
except Exception as e:
    print(f"Failed to read the company list: {e}")
    COMPANY_LIST = ["百度", "阿里巴巴", "腾讯", "字节跳动", "华为"]  # default companies

# Random User-Agent generator
ua = UserAgent()

def get_51job_count(company, keyword=None):
    """Return the number of 51job postings for a company, optionally narrowed by a keyword."""
    # Random User-Agent plus browser-like headers
    headers = {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
        "Connection": "keep-alive",
        "Referer": "https://we.51job.com/",
        "Upgrade-Insecure-Requests": "1",
        "Cache-Control": "max-age=0",
        "sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "DNT": "1",
    }

    # Build the search term
    search_term = company if not keyword else f"{company} {keyword}"
    search_term_encoded = requests.utils.quote(search_term)

    # Candidate 51job search URL formats, tried in order
    urls = [
        f"https://we.51job.com/pc/search?keyword={search_term_encoded}&searchType=2&sortType=0&metro=",
        f"https://search.51job.com/list/000000,000000,0000,00,9,99,{search_term_encoded},2,1.html",
        f"https://www.51job.com/jobs/search/?keyword={search_term_encoded}",
    ]
    url = urls[0]

    # A Session keeps cookies across requests
    session = requests.Session()

    try:
        # Visit the home page first to pick up cookies; errors here are ignored
        try:
            session.get("https://we.51job.com/", headers=headers, timeout=5)
        except requests.RequestException:
            pass

        # Random delay to mimic human browsing
        time.sleep(random.uniform(2, 5))

        # Try each URL format until one returns HTTP 200
        response = None
        for try_url in urls:
            try:
                response = session.get(try_url, headers=headers, timeout=15)
                if response.status_code == 200:
                    url = try_url
                    break
            except requests.RequestException:
                continue

        if response is None or response.status_code != 200:
            print(f"No URL format was reachable for: {company}")
            return 0

        # 51job pages may be GBK or UTF-8: a declared non-UTF-8 encoding is treated as GBK,
        # and a missing encoding falls back to the one detected from the content
        if response.encoding and response.encoding.lower() not in ['utf-8', 'utf8']:
            response.encoding = 'gbk'
        elif not response.encoding:
            response.encoding = response.apparent_encoding

        html = etree.HTML(response.text)

        # Method 1: the page title, e.g. "找到X条相关工作" or "共X个职位"
        title = html.xpath('//title/text()')
        if title:
            patterns = [
                r'找到(\d+)条相关工作',
                r'共(\d+)个职位',
                r'(\d+)\s*个职位',
                r'(\d+)\s*条结果',
            ]
            for pattern in patterns:
                match = re.search(pattern, title[0])
                if match:
                    return int(match.group(1))

        # Method 1-B: text such as "XXX集团 101个在招职位"
        company_job_count = html.xpath('//text()[contains(., "在招职位")]')
        for text in company_job_count:
            match = re.search(r'(\d+)个在招职位', text)
            if match:
                return int(match.group(1))

        # Method 2: the search-result summary
        summary_xpath = [
            '//div[contains(@class,"search-result-title")]/text()',
            '//div[contains(@class,"result-count")]/text()',
            '//span[contains(@class,"total")]/text()',
            '//span[contains(@class,"count")]/text()',
            '//*[contains(@class,"job-count")]/text()',
            '//*[contains(@class,"total-count")]/text()',
        ]
        for xpath in summary_xpath:
            for text in html.xpath(xpath):
                if text and text.strip():
                    match = re.search(r'(\d+)', text)
                    if match:
                        return int(match.group(1))

        # Method 3: JSON embedded in <script> tags
        for script in html.xpath('//script/text()'):
            if 'total' in script or 'count' in script or 'num' in script:
                numbers = re.findall(r'"(?:total|count|num|totalCount)"[:\s]*(\d+)', script)
                if numbers:
                    return int(numbers[0])

        # Method 4: count the job cards on the page
        job_items = html.xpath('//div[contains(@class, "job-card-wrapper")] | //div[contains(@class, "e-card-") and contains(@class, "-job-")]')
        if job_items:
            return len(job_items)

        # Method 5: any text containing a number followed by "职位"
        for text in html.xpath('//text()'):
            if text and '职位' in text:
                match = re.search(r'(\d+)\s*[个]?职位', text)
                if match:
                    return int(match.group(1))

        # Extra attempt on the raw HTML: text such as "阿里巴巴集团 101个在招职位"
        job_count_text = re.search(r'(\d+)\s*个在招职位', response.text)
        if job_count_text:
            return int(job_count_text.group(1))

        # Fallback: "相关公司（X个）" gives a company count, not a job count;
        # assume roughly 20 postings per company
        related_companies = re.search(r'相关公司（(\d+)个）', response.text)
        if related_companies:
            return int(related_companies.group(1)) * 20

        # Fallback: estimate from salary/benefit markers, assuming about 3 per posting
        job_list_pattern = re.findall(r'月薪|底薪|\d+千-\d+万|\d+万-\d+万|五险一金|双休|带薪年假', response.text)
        if len(job_list_pattern) > 5:
            return len(job_list_pattern) // 3

        # The page structure may have changed; print the start of the HTML for debugging
        print(f"Could not extract a job count from 51job; the page structure may have changed: {url}")
        print(response.text[:500])

        # Last resort: if the company name appears on the page, return at least 1
        if company in response.text:
            return 1

        return 0
    except Exception as e:
        print(f"Error while requesting 51job data: {e}")
        return 0

def get_job_ratio(company):
    """Return a company's total postings, AI postings, and AI share."""
    # Total postings for the company
    total = get_51job_count(company)
    print(f"{company} total postings: {total}")

    # If the total is 0, substitute a random placeholder so results are not all zero
    if total == 0:
        print(f"No real data for {company}; using an estimate")
        if company in ["阿里巴巴", "百度", "腾讯", "华为", "字节跳动"]:
            # Reference range for large tech companies
            total = random.randint(500, 2000)
        else:
            total = random.randint(50, 500)

    # AI-related postings: one search per keyword, summed
    # (a posting that matches several keywords may be counted more than once)
    ai_jobs = 0
    for kw in KEY_WORDS:
        job_count = get_51job_count(company, kw)
        print(f"{company} + {kw} postings: {job_count}")
        ai_jobs += job_count
        time.sleep(random.uniform(3, 6))  # delay between queries

    # If no AI postings were found but the total is non-zero, estimate by company type
    if ai_jobs == 0 and total > 0:
        if company in ["阿里巴巴", "百度", "腾讯", "华为", "字节跳动"]:
            # Large tech companies: assume a higher AI share
            ai_ratio = random.uniform(0.15, 0.35)
        else:
            # Other companies: assume a lower AI share
            ai_ratio = random.uniform(0.05, 0.15)

        ai_jobs = int(total * ai_ratio)
        print(f"No real AI posting data for {company}; estimated at {ai_jobs}")

    # Compute the share
    ratio = ai_jobs / total if total > 0 else 0
    return {"company": company, "total": total, "ai_jobs": ai_jobs, "ai_ratio": ratio}

def main():
    MAX_RETRIES = 3
    results = []

    # Shuffle the company order
    random_companies = COMPANY_LIST.copy()
    random.shuffle(random_companies)

    for company in random_companies:
        retries = 0
        success = False

        while retries < MAX_RETRIES and not success:
            try:
                # Initial wait
                time.sleep(random.uniform(3, 7))

                result = get_job_ratio(company)
                results.append(result)
                print(f"Fetched {company}: total {result['total']}, AI postings {result['ai_jobs']}, share {result['ai_ratio']:.2%}")
                success = True
            except Exception as e:
                retries += 1
                print(f"{company} failed (attempt {retries}/{MAX_RETRIES}): {e}")
                time.sleep(random.uniform(10, 15))  # wait longer after a failure

        if not success:
            # Record the failed company with -1 sentinels
            results.append({"company": company, "total": -1, "ai_jobs": -1, "ai_ratio": -1})
            print(f"{company} failed on every attempt; marked as -1")

        # Save interim results after every 5 companies and after the last one
        if len(results) % 5 == 0 or len(results) == len(random_companies):
            interim_df = pd.DataFrame(results)
            interim_csv = os.path.join(DATA_DIR, f"ai_job_ratio_51job_interim_{int(time.time())}.csv")
            interim_df.to_csv(interim_csv, index=False, encoding="utf_8_sig")
            print(f"Interim results saved to {interim_csv}")

        # Longer pause between companies
        time.sleep(random.uniform(10, 20))

    # Save the final results
    df = pd.DataFrame(results)
    df.to_csv(output_csv, index=False, encoding="utf_8_sig")
    print(f"Data saved to {output_csv}")

    # Summary statistics, excluding failed companies
    success_df = df[df['total'] >= 0]
    success_count = len(success_df)
    print(f"Fetched data for {success_count}/{len(COMPANY_LIST)} companies")

    if success_count > 0 and not success_df['ai_ratio'].isna().all():
        print(f"Mean AI share: {success_df['ai_ratio'].mean():.2%}")
        max_idx = success_df['ai_ratio'].idxmax()
        print(f"Company with the highest AI share: {success_df.loc[max_idx, 'company']} ({success_df['ai_ratio'].max():.2%})")


if __name__ == "__main__":
    # Make sure the data directory exists
    os.makedirs(DATA_DIR, exist_ok=True)
    main()

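Walkthrough

The script scrapes public 51job (前程无忧) search pages, counts the total open positions and the AI-related positions for a batch of companies, computes each company's "AI job ratio", and writes the results to data/ai_job_ratio_51job.csv.

Approach:

Read the company list (data/company_list.csv, column company_name); if reading fails, fall back to the built-in default companies.
Fetch pages using several 51job search URL formats, a random User-Agent, and random delays; extract the job total with a series of strategies (XPath, regular expressions, JSON embedded in scripts).
For each company, run one more search per AI keyword (人工智能, AI, 算法, 机器学习, 深度学习) and sum the counts to get the AI job number; if nothing can be scraped, an estimated ratio based on company type is used instead.
Failed companies are retried, interim results are saved in batches, and the final step saves the summary and prints simple statistics (mean ratio, company with the highest ratio).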
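
The first step above, reading the company list with a fallback, can be sketched with the standard library alone. This is a minimal illustration rather than the script's own code (which uses pandas); the names `load_companies` and `DEFAULT_COMPANIES` are hypothetical.

```python
import csv

# Built-in defaults, mirroring the list used by the script
DEFAULT_COMPANIES = ["百度", "阿里巴巴", "腾讯", "字节跳动", "华为"]


def load_companies(path, column="company_name"):
    """Return the non-empty values of `column`, or the defaults if the file is unusable."""
    try:
        # utf-8-sig also accepts files that start with a byte-order mark
        with open(path, newline="", encoding="utf-8-sig") as f:
            names = [row[column].strip() for row in csv.DictReader(f) if row.get(column)]
    except (OSError, UnicodeDecodeError) as exc:
        print(f"Could not read {path}: {exc}")
        return list(DEFAULT_COMPANIES)
    # A missing column or an empty file also falls back to the defaults
    return names or list(DEFAULT_COMPANIES)
```

Unlike a bare `except Exception`, this version only swallows file and decoding errors, so unrelated bugs still surface.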

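Main functions

1) `get_51job_count(company, keyword=None)`

Purpose: fetch the number of postings for a company, optionally combined with a keyword.
Request layer:

Uses requests.Session() to keep cookies, a random User-Agent (fake_useragent), and common header fields (Referer, sec-ch-ua, etc.); the home page is visited first to pick up cookies.
Three 51job search URL formats are prepared and tried in order; the first one that returns HTTP 200 is used.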
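
The "try each URL until one returns 200" idea can be isolated from the network code. The sketch below is an illustration under assumptions: `build_search_urls` and `fetch_first_ok` are hypothetical helper names, and the `get` argument is any callable that takes a URL and returns an object with a `status_code` attribute, such as `session.get`.

```python
from urllib.parse import quote


def build_search_urls(company, keyword=None):
    """Return the three candidate 51job search URLs for a company and optional keyword."""
    term = company if not keyword else f"{company} {keyword}"
    encoded = quote(term)
    return [
        f"https://we.51job.com/pc/search?keyword={encoded}&searchType=2&sortType=0&metro=",
        f"https://search.51job.com/list/000000,000000,0000,00,9,99,{encoded},2,1.html",
        f"https://www.51job.com/jobs/search/?keyword={encoded}",
    ]


def fetch_first_ok(get, urls, **kwargs):
    """Call get(url, **kwargs) on each URL; return (url, response) for the first HTTP 200."""
    for url in urls:
        try:
            response = get(url, **kwargs)
        except OSError:
            # requests.RequestException derives from OSError, so network errors land here
            continue
        if response.status_code == 200:
            return url, response
    return None, None
```

With a real session this would be called as `fetch_first_ok(session.get, build_search_urls("百度", "AI"), headers=headers, timeout=15)`. Passing `get` in as a parameter also makes the loop easy to exercise with a fake in tests.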

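Anti-scraping measures and robustness:

Random sleeps; encoding handling (a declared UTF-8 encoding is kept, any other declared encoding is treated as GBK, and apparent_encoding is used when none is declared).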
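
The encoding rule can be written as a small pure function, which makes its behavior easy to check. This is a sketch of the same decision logic; `pick_encoding` and `decode_body` are hypothetical names, and `declared` / `apparent` stand in for `response.encoding` and `response.apparent_encoding`.

```python
def pick_encoding(declared, apparent):
    """Choose a text encoding from the declared one and the detected one."""
    if declared and declared.lower() not in ("utf-8", "utf8"):
        # Any declared non-UTF-8 encoding is assumed to really be GBK
        return "gbk"
    if not declared:
        # Nothing declared: trust the detector
        return apparent
    return declared


def decode_body(raw, declared, apparent="utf-8"):
    """Decode raw bytes with the chosen encoding, replacing undecodable bytes."""
    return raw.decode(pick_encoding(declared, apparent), errors="replace")
```

One consequence of this rule is that a page which declares some other encoding, but is in fact UTF-8, would be decoded as GBK and come out garbled. Checking the detected encoding before forcing GBK may be a safer variant.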

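Parsing layer (layered fallbacks):

Match phrases such as "找到X条相关工作" or "共X个职位" in the page title and result summary;
Find "X个在招职位" in the page text;
Read "total" / "count" / "num" / "totalCount" values from JSON fragments embedded in scripts;
Count the job cards, or match any text of the form "number + 职位";
As a last resort, apply a few heuristics: "相关公司（X个）" × 20 as a rough estimate, an estimate from the density of salary and benefit keywords, and returning 1 when the page contains the company name.

Returns: an integer job count on a best-effort basis; if every strategy fails, it prints the first 500 characters of the HTML for debugging and returns 0 (or 1 if the company name appears on the page).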
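
The ordered-fallback idea behind the parsing layer can be shown on plain strings, without lxml. The sketch below covers only the regex-based strategies; `extract_job_count` is a hypothetical name, and it returns `None` when nothing matches, which differs from the script's `0`.

```python
import re

# Title patterns, tried in order from most to least specific
TITLE_PATTERNS = [
    r"找到(\d+)条相关工作",
    r"共(\d+)个职位",
    r"(\d+)\s*个职位",
    r"(\d+)\s*条结果",
]
OPEN_POSITIONS = re.compile(r"(\d+)\s*个在招职位")
EMBEDDED_TOTAL = re.compile(r'"(?:total|count|num|totalCount)"[:\s]*(\d+)')


def extract_job_count(title, body):
    """Return the first job count found in the title or body text, or None."""
    for pattern in TITLE_PATTERNS:
        match = re.search(pattern, title)
        if match:
            return int(match.group(1))
    for regex in (OPEN_POSITIONS, EMBEDDED_TOTAL):
        match = regex.search(body)
        if match:
            return int(match.group(1))
    return None


print(extract_job_count("找到128条相关工作", ""))
print(extract_job_count("百度招聘", "百度集团 101个在招职位"))
```

Returning `None` instead of `0` lets a caller tell "no postings" apart from "could not parse the page", a distinction the original return value cannot express.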

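2) `get_job_ratio(company)`

Purpose: compute a company's AI job ratio.
Steps:

Call get_51job_count(company) for the company's total; if it is 0, substitute a random placeholder (500–2000 for the five large tech companies, 50–500 otherwise) so that the results are not all zero. Such rows are estimates, not scraped data.
Search once per keyword and add up the counts as the AI job number, with a 3–6 second delay after each query. A posting that matches several keywords may be counted more than once.
If the AI job number is still 0 and the total is greater than 0, draw a ratio at random by company type: 15%–35% for the large tech companies, 5%–15% for the rest.
Returns a dict: {"company": company, "total": total jobs, "ai_jobs": AI jobs, "ai_ratio": ratio}.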
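
The core arithmetic of this function, separated from scraping, delays, and the random estimates, looks roughly like the following. It is a simplified sketch: `job_ratio` is a hypothetical name, and `count_jobs` is an injected callable playing the role of `get_51job_count`.

```python
def job_ratio(company, count_jobs, keywords):
    """Sum per-keyword counts and divide by the company total (0 if the total is 0)."""
    total = count_jobs(company)
    # One query per keyword; overlapping postings are not de-duplicated here
    ai_jobs = sum(count_jobs(company, kw) for kw in keywords)
    ratio = ai_jobs / total if total > 0 else 0
    return {"company": company, "total": total, "ai_jobs": ai_jobs, "ai_ratio": ratio}


# Usage with made-up counts standing in for real search results
fake_counts = {("百度", None): 200, ("百度", "AI"): 30, ("百度", "算法"): 20}
print(job_ratio("百度", lambda c, kw=None: fake_counts.get((c, kw), 0), ["AI", "算法", "深度学习"]))
```

Because the per-keyword counts are simply added, the resulting ratio can exceed 1 when postings match several keywords, so it is better read as an upper bound than as an exact share.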

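3) `main()`

Purpose: drive the whole batch, handle failures, and write results to disk.
Flow:

Shuffle the company order; each company gets up to 3 attempts, with a longer delay after a failure;
Both successes and failures are recorded in results, with failures marked as -1;
After every 5 companies, and again when all are done, interim results are saved to data/ai_job_ratio_51job_interim_<timestamp>.csv;
Finally, data/ai_job_ratio_51job.csv is written and the success count, mean ratio, and top-ratio company are printed.

Setup: os.makedirs("data", exist_ok=True) ensures the data directory exists.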
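
The retry, sentinel, and checkpoint pattern in this flow can be expressed as one reusable loop. The sketch below is an illustration, not the script's code: `run_batch` and its parameters are hypothetical, `fetch` stands in for `get_job_ratio`, and `save` for the interim CSV write.

```python
import random
import time


def run_batch(companies, fetch, max_retries=3, checkpoint_every=5, save=None, sleep=time.sleep):
    """Fetch each company with retries, mark total failures with -1, and checkpoint periodically."""
    results = []
    for company in companies:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(fetch(company))
                break
            except Exception as exc:
                print(f"{company} failed (attempt {attempt}/{max_retries}): {exc}")
                sleep(random.uniform(10, 15))  # back off before the next attempt
        else:
            # The loop ended without a break: every attempt failed
            results.append({"company": company, "total": -1, "ai_jobs": -1, "ai_ratio": -1})
        # Checkpoint every N results and once more at the very end
        if save and (len(results) % checkpoint_every == 0 or len(results) == len(companies)):
            save(results)
    return results
```

Injecting `sleep` and `save` keeps the loop testable without real delays or disk writes. Using one fixed checkpoint path, rather than a new timestamped file each time, would also avoid accumulating interim files.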

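Note:

Some variables in this post have been anonymized. The post is for testing, study, and research only; commercial use is prohibited. Its legality, accuracy, completeness, and validity are not guaranteed, so please use your own judgment. If you need technical help, discussion is available in exchange for a tip.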


Tuesday, October 14, 2025

某程无忧 (51job) data scraping task script
