Scraping Weather Data with Python

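Introduction

This article uses Pandas and BeautifulSoup to scrape weather data from the Tianqi Houbao site (tianqihoubao.com) for a given city in China over a given range of years. Each record contains the date, weather conditions, temperature, and wind force and direction.

The project source code is open source. To get the project address, reply 爬虫 ("crawler") in the chat box of the author's official account (公众号).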

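Code Walkthrough

Project structure

main.py: the main program; run it to start the crawl.
README.md: usage notes explaining how to run the program.
Weather data for guangzhou in 2011 and 2013.csv: the weather data file produced by the crawl.
weather.py: the code that crawls one month or one year of weather data. It imports the following packages.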

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

Crawling one month of weather data

# Crawl one monthly page of weather data
def crawling_weather_data(url):
    print("Fetching: " + url)
    # Request the page
    html = requests.get(url)
    # Decode as GBK so the Chinese text is not garbled
    html = html.content.decode('gbk')
    # Parse with Python's built-in HTML parser
    soup = BeautifulSoup(html, 'html.parser')
    tr_list = soup.find_all('tr')  # collect every <tr> row

    dates, conditions, temperature, wind = [], [], [], []
    # Start at index 1 to skip the header row and keep only data rows
    for text in tr_list[1:]:
        # Split the row text on whitespace into a list of tokens
        sub_data = text.text.split()
        # Token 0 is the date
        dates.append(sub_data[0])
        # The remaining tokens follow the cell order in the HTML.
        # sub_data[1:3] takes tokens 1 and 2 (the end index is exclusive);
        # join concatenates them into a single string
        conditions.append(''.join(sub_data[1:3]))
        temperature.append(''.join(sub_data[3:6]))
        wind.append(''.join(sub_data[6:10]))

    # Build the table column by column
    weather = pd.DataFrame()
    weather['日期'] = dates
    weather['天气状况'] = conditions
    weather['气温'] = temperature
    weather['风力风向'] = wind
    return weather
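
The token slicing in crawling_weather_data can be checked offline, without touching the site. The sketch below applies the same split-and-join logic to a hand-written table. The HTML snippet and the helper name parse_rows are assumptions made for illustration; they are not part of the project, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking one header row and one data row;
# the markup of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td>日期</td><td>天气状况</td><td>气温</td><td>风力风向</td></tr>
  <tr>
    <td>2011年01月01日</td>
    <td>晴 /多云</td>
    <td>18℃ / 3℃</td>
    <td>北风 3-4级 /北风 3-4级</td>
  </tr>
</table>
"""

def parse_rows(html):
    """Return a list of (date, conditions, temperature, wind) tuples."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr')[1:]:  # skip the header row
        tokens = tr.text.split()        # split the row text on whitespace
        rows.append((
            tokens[0],
            ''.join(tokens[1:3]),
            ''.join(tokens[3:6]),
            ''.join(tokens[6:10]),
        ))
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows[0])
```

Because the slices are fixed positions, a row whose cells split into a different number of tokens would end up in the wrong columns, so it is worth spot-checking the output for each new city.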

Crawling one year of weather data

# Crawl one year of weather data
def get_year_weather_data(year, city):
    print("Start crawling year " + year + ", start time: " + datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

    weather_data_of_month = []
    # Loop over months 1 to 12
    for month in range(1, 13):
        # Zero-pad the month in the URL, e.g. 01 for January
        url = 'http://www.tianqihoubao.com/lishi/{}/month/{}{:02d}.html'.format(city, year, month)
        weather_data_of_month.append(crawling_weather_data(url))

    print("Crawl finished, end time: " + datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    # Concatenate the 12 monthly tables and rebuild the index
    return pd.concat(weather_data_of_month).reset_index(drop=True)
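
The monthly URLs follow a fixed pattern, so they can be generated and inspected before any request is made. This is a minimal sketch; the helper name month_urls and the constant BASE_URL are hypothetical, while the URL pattern itself is the one used in get_year_weather_data.

```python
# URL pattern taken from get_year_weather_data; {month:02d} zero-pads the month
BASE_URL = 'http://www.tianqihoubao.com/lishi/{city}/month/{year}{month:02d}.html'

def month_urls(city, year):
    """Build the 12 monthly page URLs for one city and one year."""
    return [BASE_URL.format(city=city, year=year, month=m) for m in range(1, 13)]

urls = month_urls('guangzhou', '2011')
print(urls[0])   # first month, ends with 201101.html
print(urls[-1])  # last month, ends with 201112.html
```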

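Main program

Change city, start_year, and end_year to fetch weather data for a different city or range of years. As written, the loop uses range(start_year, end_year), so end_year itself is not crawled.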

from cn.yujian95.crawler.weather.weather import get_year_weather_data
import pandas as pd

# City to crawl
city = 'guangzhou'
# First year to crawl
start_year = 2011
# End of the year range
end_year = 2013

print("------------Program started------------")
weather_data_of_year = []
# Loop over the year range; range() stops before end_year
for year in range(start_year, end_year):
    weather_data_of_year.append(get_year_weather_data(str(year), city))
# Concatenate the yearly data and rebuild the index
result = pd.concat(weather_data_of_year).reset_index(drop=True)
# Save as CSV without the index column; UTF-8 encoding avoids garbled Chinese text
file_name = 'Weather data for ' + city + ' in ' + str(start_year) + ' and ' + str(end_year) + '.csv'
result.to_csv(file_name, index=False, encoding='utf-8')
print("------------Program finished------------")
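
Since range() stops before its second argument, it can help to list the years a given setting actually covers before starting a long crawl. The helper below is a hypothetical sketch, not part of the project; the inclusive flag is an assumption added for illustration.

```python
def years_to_crawl(start_year, end_year, inclusive=False):
    """List the years a range-based loop visits."""
    stop = end_year + 1 if inclusive else end_year
    return list(range(start_year, stop))

print(years_to_crawl(2011, 2013))                  # [2011, 2012]
print(years_to_crawl(2011, 2013, inclusive=True))  # [2011, 2012, 2013]
```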

Results

When the crawl finishes, the output file is written to the current directory by default (that is, the weather directory).

![image-20201213215328960](Python 爬取天气数据/image-20201213215328960.png)

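The file contents look like this (columns: date, weather conditions, temperature, wind force and direction). See Weather data for guangzhou in 2011 and 2013.csv for the full data: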

日期,天气状况,气温,风力风向
2011年01月01日,晴/多云,18℃/3℃,北风3-4级/北风3-4级
2011年01月02日,多云/多云,19℃/9℃,无持续风向微风/无持续风向微风
2011年01月03日,小雨/小雨,11℃/5℃,无持续风向微风/无持续风向微风
2011年01月04日,小雨/阴,8℃/5℃,无持续风向微风/无持续风向微风
2011年01月05日,阴/小雨,14℃/7℃,无持续风向微风/无持续风向微风
2011年01月06日,小雨/阴,9℃/4℃,北风4-5级/北风3-4级
2011年01月07日,多云/多云,12℃/5℃,北风3-4级/无持续风向微风
2011年01月08日,多云/多云,15℃/7℃,无持续风向微风/无持续风向微风
2011年01月09日,多云/阴,13℃/6℃,北风3-4级/北风3-4级
...
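
The columns are stored as text, so some parsing is needed before any numeric analysis. The sketch below loads three rows copied from the sample above and splits the temperature column into two integer columns. The new column names (date, high_c, low_c) are hypothetical, and reading the first value as the high and the second as the low is an assumption about the data.

```python
import io
import pandas as pd

# Rows copied from the sample output above
SAMPLE_CSV = """日期,天气状况,气温,风力风向
2011年01月01日,晴/多云,18℃/3℃,北风3-4级/北风3-4级
2011年01月02日,多云/多云,19℃/9℃,无持续风向微风/无持续风向微风
2011年01月03日,小雨/小雨,11℃/5℃,无持续风向微风/无持续风向微风
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Parse the date text into real timestamps
df['date'] = pd.to_datetime(df['日期'], format='%Y年%m月%d日')

# Pull the two numbers out of strings such as "18℃/3℃"
temps = df['气温'].str.extract(r'(-?\d+)℃/(-?\d+)℃').astype(int)
df['high_c'] = temps[0]  # assumed to be the high
df['low_c'] = temps[1]   # assumed to be the low

print(df[['date', 'high_c', 'low_c']])
```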

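Tools

Pandas

Pandas is a core data analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. It aims to be the essential high-level building block for practical data analysis in Python, and its longer-term goal is to become the most powerful and flexible open-source data analysis tool available in any language.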
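
In this project Pandas is mainly used to build small tables and stack them. The sketch below shows why the code calls reset_index(drop=True) after pd.concat. The two frames stand in for two crawl results and reuse rows from the sample output; the variable names are hypothetical.

```python
import pandas as pd

# Two small frames standing in for two separate crawl results
first_batch = pd.DataFrame({'日期': ['2011年01月01日', '2011年01月02日'],
                            '气温': ['18℃/3℃', '19℃/9℃']})
second_batch = pd.DataFrame({'日期': ['2011年01月03日'],
                             '气温': ['11℃/5℃']})

# concat keeps each frame's own row labels, so labels repeat
stacked = pd.concat([first_batch, second_batch])
print(stacked.index.tolist())   # [0, 1, 0]

# reset_index(drop=True) replaces them with a fresh 0..n-1 index
combined = stacked.reset_index(drop=True)
print(combined.index.tolist())  # [0, 1, 2]
```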

BeautifulSoup

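BeautifulSoup4 is a staple library for web scraping. Its main job is to extract data from web pages: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, the standard-library parser (html.parser, the one used in this article) is used. The lxml parser is generally faster and more capable, so it is recommended when it is available.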
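
One way to prefer lxml without making it a hard dependency is to fall back to the standard-library parser when lxml is missing. This is a minimal sketch; the helper name make_soup is hypothetical and is not part of the project.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser
        return BeautifulSoup(html, 'html.parser')

soup = make_soup('<table><tr><td>晴</td></tr></table>')
print(soup.find('td').text)
```

Different parsers can build slightly different trees from malformed HTML, so it is safer to re-check the extracted rows after switching parsers.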
