nada

[Python] JSON 파일 불러오기 :: pd.read_json( ) 본문

Python/Pandas

[Python] JSON 파일 불러오기 :: pd.read_json( )

ds-nada 2023. 8. 9. 00:06

JSON 파일 불러오기 :: pd.read_json( )

Reference Pandas In Action
  • JSON(Jave Script Object Notation)
    • 텍스트 데이터를 저장하고 전송하기 위한 형식
    • 키 - 값 쌍으로 구성
    • Python의 딕셔너리 객체와 유사
  • 린터(Linter)
    • 각 키 - 값 쌍을 별도의 줄에 배치하여 JSON 응답을 가독성 있는 형식으로 나타냄

< JSON 파일 불러오기 >

pd.read_csv(
    path_or_buf = None,
)
  • Option
    • path_or_buf : 파일 경로 및 파일 이름
import pandas as pd

nobel = pd.read_json("./Data/nobel.json")
nobel
  prizes
0 {'year': '2019', 'category': 'chemistry', 'lau...
1 {'year': '2019', 'category': 'economics', 'lau...
2 {'year': '2019', 'category': 'literature', 'la...
3 {'year': '2019', 'category': 'peace', 'laureat...
4 {'year': '2019', 'category': 'physics', 'overa...
... ...
641 {'year': '1901', 'category': 'chemistry', 'lau...
642 {'year': '1901', 'category': 'literature', 'la...
643 {'year': '1901', 'category': 'peace', 'laureat...
644 {'year': '1901', 'category': 'physics', 'laure...
645 {'year': '1901', 'category': 'medicine', 'laur...

646 rows × 1 columns

< Result >
prizes에 중첩된 딕셔너리가 존재

< 평탄화(Flattening)\ or *정규화(Normalizing) >*

pd.json_normalize(
    data,
    record_path = None,
    meta = None,
)
  • 중첩된 데이터 레코드를 단일 1차원 리스트로 변형하는 과정
  • Option
    • data : 직렬화되지 않은 JSON 객체
    • record_path : 레코드 목록에 대한 각 개체의 경로
    • meta : 결과 테이블의 각 레코드에 대한 메타데이터
# `prizes`데이터 중 첫번째 최상위 딕셔너리 키(`year`, `category`, `laureates`)를 추출
pd.json_normalize(data = nobel['prizes'][0])
  year category laureates
0 2019 chemistry [{'id': '976', 'firstname': 'John', 'surname':...

< Result >
laureates에 여전히 중첩된 딕셔너리가 존재

# 중첩된 `laureates`레코드를 정규화
pd.json_normalize(
    data = nobel['prizes'][0],
    record_path = 'laureates'
)
  id firstname surname motivation share
0 976 John Goodenough "for the development of lithium-ion batteries" 3
1 977 M. Stanley Whittingham "for the development of lithium-ion batteries" 3
2 978 Akira Yoshino "for the development of lithium-ion batteries" 3

< Result >
→ 새로운 열로 확장했지만 기존의 yearcategory열이 사라짐

# 최상위 키 - 값 쌍을 유지 (`year`, `category`)
pd.json_normalize(
    data = nobel['prizes'][0],
    record_path = 'laureates',
    meta = ['year', 'category']
)
  id firstname surname motivation share year category
0 976 John Goodenough "for the development of lithium-ion batteries" 3 2019 chemistry
1 977 M. Stanley Whittingham "for the development of lithium-ion batteries" 3 2019 chemistry
2 978 Akira Yoshino "for the development of lithium-ion batteries" 3 2019 chemistry
# Error
pd.json_normalize(
    data = nobel['prizes'],
    record_path = 'laureates',
    meta = ['year', 'category']
)
Result
    ---------------------------------------------------------------------------
  KeyError                                  Traceback (most recent call last)
  File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:399, in _json_normalize.<locals>._pull_field(js, spec, extract_record)
      398     else:
  --> 399         result = result[spec]
      400 except KeyError as e:

  KeyError: 'laureates'

  The above exception was the direct cause of the following exception:

  KeyError                                  Traceback (most recent call last)
  Cell In[6], line 2
        1 # Error
  ----> 2 pd.json_normalize(
        3     data = nobel['prizes'],
        4     record_path = 'laureates',
        5     meta = ['year', 'category']
        6 )

  File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:518, in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
      515                 meta_vals[key].append(meta_val)
      516             records.extend(recs)
  --> 518 _recursive_extract(data, record_path, {}, level=0)
      520 result = DataFrame(records)
      522 if record_prefix is not None:

  File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:500, in _json_normalize.<locals>._recursive_extract(data, path, seen_meta, level)
      498 else:
      499     for obj in data:
  --> 500         recs = _pull_records(obj, path[0])
      501         recs = [
      502             nested_to_record(r, sep=sep, max_level=max_level)
      503             if isinstance(r, dict)
      504             else r
      505             for r in recs
      506         ]
      508         # For repeating the metadata later

  File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:422, in _json_normalize.<locals>._pull_records(js, spec)
      416 def _pull_records(js: dict[str, Any], spec: list | str) -> list:
      417     """
      418     Internal function to pull field for records, and similar to
      419     _pull_field, but require to return list. And will raise error
      420     if has non iterable value.
      421     """
  --> 422     result = _pull_field(js, spec, extract_record=True)
      424     # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
      425     # null, otherwise return an empty list
      426     if not isinstance(result, list):

  File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:402, in _json_normalize.<locals>._pull_field(js, spec, extract_record)
      400 except KeyError as e:
      401     if extract_record:
  --> 402         raise KeyError(
      403             f"Key {e} not found. If specifying a record_path, all elements of "
      404             f"data should have the path."
      405         ) from e
      406     elif errors == "ignore":
      407         return np.nan

  KeyError: "Key 'laureates' not found. If specifying a record_path, all elements of data should have the path."

< Result >
→ Error : prizes Series에 있는 딕셔너리 중 일부는 laureates라는 키가 없기 때문..

dictionary.setdefault(
    key,
    value
)
  • 딕셔너리 키에 대한 기본 값을 할당
  • 딕셔너리에 키가 없는 경우에는 키 - 값 쌍을 할당
  • 딕셔너리에 키가 있는 경우 기존 값을 반환
def add_laureates_key(entry):
    entry.setdefault('laureates', [])

# prizes에 있는 딕셔너리 자체를 변경하므로 기존의 Series를 덮어쓸 필요가 없음
nobel['prizes'].apply(add_laureates_key)
0      None
1      None
2      None
3      None
4      None
       ... 
641    None
642    None
643    None
644    None
645    None
Name: prizes, Length: 646, dtype: object
# 완성된 JSON 파일 불러오기
winners = pd.json_normalize(
    data = nobel['prizes'],
    record_path = 'laureates',
    meta = ['year', 'category']
)
winners
  id firstname surname motivation share year category
0 976 John Goodenough "for the development of lithium-ion batteries" 3 2019 chemistry
1 977 M. Stanley Whittingham "for the development of lithium-ion batteries" 3 2019 chemistry
2 978 Akira Yoshino "for the development of lithium-ion batteries" 3 2019 chemistry
3 982 Abhijit Banerjee "for their experimental approach to alleviatin... 3 2019 economics
4 983 Esther Duflo "for their experimental approach to alleviatin... 3 2019 economics
... ... ... ... ... ... ... ...
945 569 Sully Prudhomme "in special recognition of his poetic composit... 1 1901 literature
946 462 Henry Dunant "for his humanitarian efforts to help wounded ... 2 1901 peace
947 463 Frédéric Passy "for his lifelong work for international peace... 2 1901 peace
948 1 Wilhelm Conrad Röntgen "in recognition of the extraordinary services ... 1 1901 physics
949 293 Emil von Behring "for his work on serum therapy, especially its... 1 1901 medicine

950 rows × 7 columns